Workflow: gatk4W-spark.cwl

Fetched 2024-11-27 20:33:28 GMT

Author: AMBARISH KUMAR er.ambarish@gmail.com & ambari73_sit@jnu.ac.in This is a proposed standard operating procedure for genomic variant detection using GATK4. It is hoped to be effective and useful for getting SARS-CoV-2 genome variants. It uses Illumina RNASEQ reads and genome sequence.

children parents
Workflow as SVG
  • Selected
  • Default Values
  • Nested Workflows
  • Tools
  • Inputs/Outputs

Inputs

ID Type Title Doc
sampleName String
rnaseq_left_reads File
rnaseq_right_reads File
sars_cov_2_reference_genome File
sars_cov_2_reference_2bit_genome File

Steps

ID Runs Label Doc
select_snps
GATK-SelectVariants.cwl (CommandLineTool)

Select a subset of variants from a VCF file

<p>This tool makes it possible to select a subset of variants based on various criteria in order to facilitate certain analyses. Examples of such analyses include comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or troubleshooting some unexpected results, to name a few.</p>

<p> There are many different options for selecting subsets of variants from a larger callset: <ul> <li>Extract one or more samples from a callset based on either a complete sample name or a pattern match.</li> <li>Specify criteria for inclusion that place thresholds on annotation values, e.g. \"DP > 1000\" (depth of coverage greater than 1000x), \"AF < 0.25\" (sites with allele frequency less than 0.25). These criteria are written as \"JEXL expressions\", which are documented in the <a href=\"https://www.broadinstitute.org/gatk/guide/article?id=1255\">article about using JEXL expressions</a>.</li> <li>Provide concordance or discordance tracks in order to include or exclude variants that are also present in other given callsets.</li> <li>Select variants based on criteria like their type (e.g. INDELs only), evidence of mendelian violation, filtering status, allelicity, etc.</li> </ul> </p>

<p>There are also several options for recording the original values of certain annotations which are recalculated when one subsets the new callset, trims alleles, etc.</p>

<h3>Input</h3> <p> A variant call set in VCF format from which a subset can be selected. </p>

<h3>Output</h3> <p> A new VCF file containing the selected subset of variants. </p>

* <h3>Usage examples</h3> <h4>Select SNPs</h4> <pre> gatk SelectVariants \ -R Homo_sapiens_assembly38.fasta \ -V input.vcf \ --select-type-to-include SNP \ -O output.vcf </pre>

<h4>Query Chromosome 20 Variants from a GenomicsDB</h4> <pre> gatk SelectVariants \ -R Homo_sapiens_assembly38.fasta \ -V gendb://genomicsDB \ -L 20 \ -O output.chr20.vcf </pre>

select_indels
GATK-SelectVariants.cwl (CommandLineTool)

Select a subset of variants from a VCF file

<p>This tool makes it possible to select a subset of variants based on various criteria in order to facilitate certain analyses. Examples of such analyses include comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or troubleshooting some unexpected results, to name a few.</p>

<p> There are many different options for selecting subsets of variants from a larger callset: <ul> <li>Extract one or more samples from a callset based on either a complete sample name or a pattern match.</li> <li>Specify criteria for inclusion that place thresholds on annotation values, e.g. \"DP > 1000\" (depth of coverage greater than 1000x), \"AF < 0.25\" (sites with allele frequency less than 0.25). These criteria are written as \"JEXL expressions\", which are documented in the <a href=\"https://www.broadinstitute.org/gatk/guide/article?id=1255\">article about using JEXL expressions</a>.</li> <li>Provide concordance or discordance tracks in order to include or exclude variants that are also present in other given callsets.</li> <li>Select variants based on criteria like their type (e.g. INDELs only), evidence of mendelian violation, filtering status, allelicity, etc.</li> </ul> </p>

<p>There are also several options for recording the original values of certain annotations which are recalculated when one subsets the new callset, trims alleles, etc.</p>

<h3>Input</h3> <p> A variant call set in VCF format from which a subset can be selected. </p>

<h3>Output</h3> <p> A new VCF file containing the selected subset of variants. </p>

* <h3>Usage examples</h3> <h4>Select SNPs</h4> <pre> gatk SelectVariants \ -R Homo_sapiens_assembly38.fasta \ -V input.vcf \ --select-type-to-include SNP \ -O output.vcf </pre>

<h4>Query Chromosome 20 Variants from a GenomicsDB</h4> <pre> gatk SelectVariants \ -R Homo_sapiens_assembly38.fasta \ -V gendb://genomicsDB \ -L 20 \ -O output.chr20.vcf </pre>

split_alignments
GATK-SplitNCigarReads.cwl (CommandLineTool)

Splits reads that contain Ns in their cigar string (e.g. spanning splicing events in RNAseq data).

Identifies all N cigar elements and creates k+1 new reads (where k is the number of N cigar elements). The first read includes the bases that are to the left of the first N element, while the part of the read that is to the right of the N (including the Ns) is hard clipped and so on for the rest of the new reads. Used for post-processing RNA reads aligned against the full reference.

<h3>Input</h3> <p> BAM file </p>

<h3>Output</h3> <p> BAM file with reads split at N CIGAR elements and CIGAR strings updated. </p>

update_read_group
picard_AddOrReplaceReadGroups.cwl (CommandLineTool)

Assigns all the reads in a file to a single new read-group.

<h3>Summary</h3> Many tools (Picard and GATK for example) require or assume the presence of at least one <code>RG</code> tag, defining a \"read-group\" to which each read can be assigned (as specified in the <code>RG</code> tag in the SAM record). This tool enables the user to assign all the reads in the INPUT to a single new read-group. For more information about read-groups, see the <a href='https://www.broadinstitute.org/gatk/guide/article?id=6472'> GATK Dictionary entry.</a> <br /> This tool accepts as INPUT BAM and SAM files or URLs from the <a href=\"http://ga4gh.org/#/documentation\">Global Alliance for Genomics and Health (GA4GH)</a>. <h3>Caveats</h3> The value of the tags must adhere (according to the <a href=\"https://samtools.github.io/hts-specs/SAMv1.pdf\">SAM-spec</a>) with the regex <pre>#READGROUP_ID_REGEX</pre> (one or more characters from the ASCII range 32 through 126). In particular <code>&lt;Space&gt;</code> is the only non-printing character allowed. <br/> The program enables only the wholesale assignment of all the reads in the INPUT to a single read-group. If your file already has reads assigned to multiple read-groups, the original <code>RG</code> value will be lost. Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups

index_split_alignments
samtools_index.cwl (CommandLineTool)

Indexing BAM.

create_sequence_dictionary
picard_CreateSequenceDictionary.cwl (CommandLineTool)

Create a SAM/BAM file from a fasta containing reference sequence. The output SAM file contains a header but no SAMRecords, and the header contains only sequence records.

align_rnaseq_reads_to_genome
bowtie2_align.cwl (CommandLineTool)

Tool runs bowtie aligner to align input FASTQ file(s) to reference genome

filer_out_low_quality_variants
GATK-VariantFiltration.cwl (CommandLineTool)

Filter variant calls based on INFO and/or FORMAT annotations

<p> This tool is designed for hard-filtering variant calls based on certain criteria. Records are hard-filtered by changing the value in the FILTER field to something other than PASS. Filtered records will be preserved in the output unless their removal is requested in the command line. </p>

<h3>Inputs</h3> <ul> <li>A VCF of variant calls to filter.</li> <li>One or more filtering expressions and corresponding filter names.</li> </ul>

<h3>Output</h3> <p> A filtered VCF in which passing variants are annotated as PASS and failing variants are annotated with the name(s) of the filter(s) they failed. </p>

<h3>Usage example</h3> <pre> gatk VariantFiltration \ -R reference.fasta \ -V input.vcf.gz \ -O output.vcf.gz \ --filter-name \"my_filter1\" \ --filter-expression \"AB < 0.2\" \ --filter-name \"my_filter2\" \ --filter-expression \"MQ0 > 50\" </pre>

<h3>Note</h3> <p> Composing filtering expressions can range from very simple to extremely complicated depending on what you're trying to do. <p> Compound expressions (ones that specify multiple conditions connected by &&, AND, ||, or OR, and reference multiple attributes) require special consideration. By default, variants that are missing one or more of the attributes referenced in a compound expression are treated as PASS for the entire expression, even if the variant would satisfy the filter criteria for another part of the expression. This can lead to unexpected results if any of the attributes referenced in a compound expression are present for some variants, but missing for others. <p> It is strongly recommended that such expressions be provided as individual arguments, each referencing a single attribute and specifying a single criteria. This ensures that all of the individual expression are applied to each variant, even if a given variant is missing values for some of the expression conditions. <p> As an example, multiple individual expressions provided like this: <pre> gatk VariantFiltration \ -R reference.fasta \ -V input.vcf.gz \ -O output.vcf.gz \ --filter-name \"my_filter1\" \ --filter-expression \"AB < 0.2\" \ --filter-name \"my_filter2\" \ --filter-expression \"MQ0 > 50\" </pre>

are preferable to a single compound expression such as this:

<pre> gatk VariantFiltration \ -R reference.fasta \ -V input.vcf.gz \ -O output.vcf.gz \ --filter-name \"my_filter\" \ --filter-expression \"AB < 0.2 || MQ0 > 50\" </pre> See this <a href=\"https://www.broadinstitute.org/gatk/guide/article?id=1255\">article about using JEXL expressions</a> for more information.

index_reference_genome_with_bowtie2
bowtie2_build.cwl (CommandLineTool)

Tool runs bowtie2-build to generate indices from input FASTA files

index_reference_genome_with_samtools
samtools_faidx.cwl (CommandLineTool)
call_plausible_haplotypes_and_detect_variants
GATK-HaplotypeCaller.cwl (CommandLineTool)

Call germline SNPs and indels via local re-assembly of haplotypes

<p>The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.</p>

<p>In the GVCF workflow used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate GVCF (not to be used in final analysis), which can then be used in GenotypeGVCFs for joint genotyping of multiple samples in a very efficient way. The GVCF workflow enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).</p>

<p>In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithms used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use Mutect2 instead.</p>

<p>Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers, on the condition that the input read data has previously been processed according to our recommendations as documented <a href='https://software.broadinstitute.org/gatk/documentation/article?id=4067'>here</a>.</p>

<h3>How HaplotypeCaller works</h3>

<br /> <h4><a href='https://software.broadinstitute.org/gatk/documentation/article?id=4147'>1. Define active regions </a></h4>

<p>The program determines which regions of the genome it needs to operate on (active regions), based on the presence of evidence for variation.

<br /> <h4><a href='https://software.broadinstitute.org/gatk/documentation/article?id=4146'>2. Determine haplotypes by assembly of the active region </a></h4>

<p>For each active region, the program builds a De Bruijn-like graph to reassemble the active region and identifies what are the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites. </p>

<br /> <h4><a href='https://software.broadinstitute.org/gatk/documentation/article?id=4441'>3. Determine likelihoods of the haplotypes given the read data </a></h4>

<p>For each active region, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data. </p>

<br /> <h4><a href='https://software.broadinstitute.org/gatk/documentation/article?id=4442'>4. Assign sample genotypes </a></h4>

<p>For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample. </p>

<h3>Input</h3> <p> Input bam file(s) from which to make variant calls </p>

<h3>Output</h3> <p> Either a VCF or GVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (Best Practice) or hard-filtering before use in downstream analyses. If using the GVCF workflow, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis. </p>

<h3>Caveats</h3> <ul> <li>We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk.</li> </ul>

<h3>Special note on ploidy</h3> <p>This tool is able to handle many non-diploid use cases; the desired ploidy can be specified using the -ploidy argument. Note however that very high ploidies (such as are encountered in large pooled experiments) may cause performance challenges including excessive slowness. We are working on resolving these limitations.</p>

<h3>Additional Notes</h3> <ul> <li>When working with PCR-free data, be sure to set `-pcr_indel_model NONE` (see argument below).</li> <li>When running in `-ERC GVCF` or `-ERC BP_RESOLUTION` modes, the confidence threshold is automatically set to 0. This cannot be overridden by the command line. The threshold can be set manually to the desired level in the next step of the workflow (GenotypeGVCFs)</li> <li>We recommend using a list of intervals to speed up analysis. See <a href='https://software.broadinstitute.org/gatk/documentation/article?id=4133'>this document</a> for details.</li> </ul>

Outputs

ID Type Label Doc
snps File
indels File
Permalink: https://w3id.org/cwl/view/git/530e671c113e91fe6b9114eda32103bb980450c7/gatk4W-spark.cwl