Workflow: GATK-complete-WES-Workflow-h3abionet.cwl

Fetched 2025-05-02 21:32:02 GMT

# H3ABioNet GATK Germline Workflow

# Overview

A [GATK best-practices](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) germline workflow designed to work with GATK 3.5 (Van der Auwera et al., 2013). For more information see our [GitHub](https://github.com/h3abionet/h3agatk) site.

# Workflow Summary

![pipeline](https://raw.githubusercontent.com/h3abionet/h3agatk/master/workflows/GATK/gatk_germline_small.png)

# Workflow Tool Details

## FastQC

FastQC is used as an initial QC step in which the input files are checked for the usual metrics, such as:

- Read length
- Reads distribution
- GC content
- ...

## Trimmomatic

Trimmomatic is the entry point of the pipeline. It is used to clean up the reads in the input fastq files by removing any sequencing adaptors.

## BWA

[BWA](http://bio-bwa.sourceforge.net) is used to align the paired-end reads from the input fastq files (Li, 2013). Specifically, we use `bwa mem`, as recommended by the [GATK best-practices](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS). BWA produces a SAM file containing the reads aligned against the human reference genome (hg19, GATK bundle build 2.8). Because the downstream GATK tools require properly formatted Read Group information, we add 'toy' Read Group information to the output SAM file by default during alignment, using the flag `-R '@RG\tID:foo\tSM:bar\tLB:library1'`.

## SAMtools

[SAMtools](http://www.htslib.org) (Li et al., 2009) is used several times in the pipeline:

1. Convert BWA's output from SAM format to BAM format
2. Sort the reads in the BAM file generated in step 1
3. Index the BAM file for use by the following tools

## Picard

[Picard tools](https://broadinstitute.github.io/picard/) are used to mark duplicated reads in the aligned and sorted BAM file, which makes the files lighter and less prone to errors in the downstream steps of the pipeline.

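
The BWA, SAMtools and Picard steps described above can be sketched as plain command lines. The following is a minimal illustration only, not the workflow's actual CWL tool definitions: the function name, file names, the `picard.jar` path and the thread count are assumptions made for this example, and the commands are only assembled, never executed. The read group string is the 'toy' value quoted in the BWA section.

```python
import shlex

# 'Toy' read group quoted in the BWA section; the \t sequences are passed
# literally and interpreted by bwa itself.
READ_GROUP = r"@RG\tID:foo\tSM:bar\tLB:library1"


def alignment_commands(reference, fastq_1, fastq_2, prefix,
                       threads=4, picard_jar="picard.jar"):
    """Return (argv, stdout_path) pairs for the alignment stage.

    stdout_path is the file that standard output would be redirected to,
    or None when the tool writes its own output file. All file names
    derived from `prefix` are hypothetical.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup_bam = f"{prefix}.dedup.bam"
    metrics = f"{prefix}.dedup.metrics"
    return [
        # Align paired-end reads; bwa mem writes SAM to standard output.
        (["bwa", "mem", "-t", str(threads), "-R", READ_GROUP,
          reference, fastq_1, fastq_2], sam),
        # 1. Convert SAM to BAM.
        (["samtools", "view", "-b", "-o", bam, sam], None),
        # 2. Sort the BAM by coordinate.
        (["samtools", "sort", "-@", str(threads), "-o", sorted_bam, bam], None),
        # 3. Index the sorted BAM.
        (["samtools", "index", sorted_bam], None),
        # Mark duplicates with Picard (legacy KEY=value argument syntax).
        (["java", "-jar", picard_jar, "MarkDuplicates",
          f"INPUT={sorted_bam}", f"OUTPUT={dedup_bam}",
          f"METRICS_FILE={metrics}", "ASSUME_SORTED=true",
          "CREATE_INDEX=true"], None),
    ]


# Print the commands as shell lines for inspection.
for argv, stdout_path in alignment_commands(
        "ucsc.hg19.fasta", "sample_1.fastq", "sample_2.fastq", "sample"):
    line = shlex.join(argv)
    if stdout_path:
        line += " > " + shlex.quote(stdout_path)
    print(line)
```

Returning argument lists rather than shell strings keeps the read group intact without extra quoting, and the lists can be passed to `subprocess.run` unchanged if the tools are installed.
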
## GATK

[Genome Analysis Tool Kit](https://software.broadinstitute.org/gatk), referred to as GATK (DePristo et al., 2011), is used to process the data through multiple steps as described by the [GATK best-practices](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) (see the figure below).

![GATK best-practices pipeline](https://raw.githubusercontent.com/h3abionet/h3agatk/master/workflows/GATK/gatk.png)

The GATK steps are the following:

1. Indel Realignment:
    1. [Realign Target Creator](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php)
    2. [Indel Realigner](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php)
2. Mark Duplicates (a Picard step)
3. Base Quality Score Recalibration (BQSR):
    1. [Base Recalibrator](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php)
    2. [Print Reads](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php)
4. [Haplotype Caller](https://software.broadinstitute.org/gatk/documentation/tooldocs/)
5. Variant Quality Score Recalibration (VQSR):
    1. [Variant Recalibrator](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php)
    2. [Apply Recalibration](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php)

## SnpEff

SnpEff is used in this pipeline to annotate the variant calls (Cingolani et al., 2012). The annotation is extensive and uses a multi-database approach to provide the user with as much information about the called variants as possible.

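
As a rough illustration of the GATK steps listed in the GATK section, the sketch below assembles GATK 3-style command lines (`java -jar <jar> -T <tool> ...`) for indel realignment, BQSR and HaplotypeCaller. It is a sketch under stated assumptions rather than the workflow's own tool wrappers: the helper names and all file names are hypothetical, the Picard Mark Duplicates step is represented only by its assumed output `dedup_bam`, and the VQSR step is omitted because its resource arguments depend on the call set.

```python
def gatk3(gatk_jar, tool, reference, *extra):
    """Build a GATK 3-style argument list for one walker (hypothetical helper)."""
    return ["java", "-jar", gatk_jar, "-T", tool, "-R", reference, *extra]


def gatk_commands(gatk_jar, reference, bam, dedup_bam, dbsnp, known_indels, prefix):
    """Return argument lists for the GATK stage; nothing is executed.

    `dedup_bam` stands for the output of the Picard Mark Duplicates step,
    which runs between indel realignment and BQSR and is not built here.
    """
    targets = f"{prefix}.realign.intervals"
    realigned = f"{prefix}.realigned.bam"
    recal_table = f"{prefix}.recal.table"
    recal_bam = f"{prefix}.recal.bam"
    vcf = f"{prefix}.raw.vcf"
    return [
        # 1.1 Find intervals that need local realignment around indels.
        gatk3(gatk_jar, "RealignerTargetCreator", reference,
              "-I", bam, "-known", known_indels, "-o", targets),
        # 1.2 Realign reads over those intervals.
        gatk3(gatk_jar, "IndelRealigner", reference,
              "-I", bam, "-known", known_indels,
              "-targetIntervals", targets, "-o", realigned),
        # 3.1 Model systematic base-quality errors using known sites.
        gatk3(gatk_jar, "BaseRecalibrator", reference,
              "-I", dedup_bam, "-knownSites", dbsnp, "-o", recal_table),
        # 3.2 Write a BAM with recalibrated base qualities.
        gatk3(gatk_jar, "PrintReads", reference,
              "-I", dedup_bam, "-BQSR", recal_table, "-o", recal_bam),
        # 4. Call variants on the recalibrated BAM.
        gatk3(gatk_jar, "HaplotypeCaller", reference,
              "-I", recal_bam, "--dbsnp", dbsnp, "-o", vcf),
    ]


for argv in gatk_commands("GenomeAnalysisTK.jar", "ucsc.hg19.fasta",
                          "sample.sorted.bam", "sample.dedup.bam",
                          "dbsnp.vcf", "mills.indels.vcf", "sample"):
    print(" ".join(argv))
```
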
## BAMStat

[BAMStats](http://bamstats.sourceforge.net) is a simple software tool built on the Picard Java API (2) that calculates and graphically displays various metrics derived from SAM/BAM files that are of value in QC assessments.


Inputs

| ID | Type | Title | Doc |
|---|---|---|---|
| dbsnp | File | | vcf file containing SNP variations used for Haplotype caller |
| reads | File[] (Optional) | | files containing the paired end reads in fastq format required for bwa-mem |
| tmpdir | String (Optional) | | temporary dir for picard |
| gatk_jar | File | | Jar executable of the GATK tool |
| covariate | String[] (Optional) | | required for base recalibrator |
| reference | File | | reference human genome file |
| bwa_threads | Integer | | number of threads |
| snpf_genome | String | | |
| gatk_threads | Integer | | number of threads |
| resource_1kg | File | | |
| resource_omni | File | | |
| snpf_data_dir | Directory | | |
| bwa_read_group | String | | read group |
| resource_dbsnp | File | | |
| resource_mills | File | | |
| bwa_output_name | String | | name of bwa-mem output file |
| resource_hapmap | File | | |
| snpf_nodownload | Boolean | | |
| known_variant_db | File[] (Optional) | | array of known variant files for realign target creator |
| samtools_threads | Integer | | number of threads |
| filter_expression | String | | |
| samtools-index-bai | Boolean | | boolean set to output bam file from samtools view |
| samtools-view-isbam | Boolean | | boolean set to output bam file from samtools view |
| output_samtools-sort | String | | output file name for bam file generated by samtools sort |
| output_samtools-view | String | | output file name for bam file generated by samtools view |
| samtools-view-sambam | String (Optional) | | temporary dir for picard |
| snpeff_java_mem_opts | String[] (Optional) | | memory options passed to the java command run for snpEff |
| uncompressed_reference | File | | reference human genome file |
| output_RefDictionaryFile | String | | output file name for picard create dictionary command from picard toolkit |
| outputFileName_PrintReads | String | | name of PrintReads command output file |
| readSorted_MarkDuplicates | String | | set to be true showing that reads are sorted already |
| createIndex_MarkDuplicates | String | | set to be true to create .bai file from Picard Mark Duplicates |
| metricsFile_MarkDuplicates | String | | metric file generated by MarkDuplicates command listing duplicates |
| depth_omitIntervalStatistics | Boolean (Optional) | | Do not calculate per-interval statistics |
| outputFileName_IndelRealigner | String | | name of indelRealigner output file |
| outputFileName_MarkDuplicates | String | | output file name generated as a result of Markduplicates command from picard toolkit |
| outputFileName_HaplotypeCaller | String | | name of Haplotype caller command output file |
| depth_omitDepthOutputAtEachBase | Boolean (Optional) | | Do not output depth of coverage at each base |
| outputFileName_BaseRecalibrator | String | | name of BaseRecalibrator output file |
| removeDuplicates_MarkDuplicates | String | | set to be true |
| depth_outputfile_DepthOfCoverage | String (Optional) | | name of the output report basename |
| outputFileName_RealignTargetCreator | String | | name of realignTargetCreator output file |
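
To show how the inputs listed above map onto a CWL job file, here is a minimal sketch that builds a partial job object and serialises it as JSON, a format CWL runners accept for job files. Only a handful of the inputs are filled in, every path and value is a hypothetical placeholder, and the `cwl_file`, `cwl_dir` and `missing_inputs` helpers are names invented for this example. A real run would need all non-optional inputs.

```python
import json


def cwl_file(path):
    """CWL File object for a job file."""
    return {"class": "File", "path": path}


def cwl_dir(path):
    """CWL Directory object for a job file."""
    return {"class": "Directory", "path": path}


def missing_inputs(job, required_ids):
    """Return the required input IDs that the job object does not define."""
    return [input_id for input_id in required_ids if input_id not in job]


# Partial job object; all paths and values below are placeholders.
job = {
    "reference": cwl_file("/data/ref/ucsc.hg19.fasta"),
    "dbsnp": cwl_file("/data/ref/dbsnp_138.hg19.vcf"),
    "gatk_jar": cwl_file("/opt/gatk/GenomeAnalysisTK.jar"),
    "reads": [cwl_file("/data/sample_1.fastq"),
              cwl_file("/data/sample_2.fastq")],
    "snpf_data_dir": cwl_dir("/data/snpeff"),
    "bwa_threads": 4,
    "gatk_threads": 4,
    "samtools_threads": 4,
    "bwa_read_group": r"@RG\tID:foo\tSM:bar\tLB:library1",
    "bwa_output_name": "sample.sam",
    "snpf_nodownload": True,
}

# A few of the non-optional IDs from the table, used as a sanity check.
required_subset = ["reference", "dbsnp", "gatk_jar", "resource_mills",
                   "resource_hapmap", "filter_expression"]
print("still missing:", missing_inputs(job, required_subset))

# Serialise the job object; the result can be saved as a .json job file.
job_json = json.dumps(job, indent=2)
```

Checking for missing IDs before launching the runner gives an earlier and more readable failure than a validation error raised after the workflow has been loaded.
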

Steps

| ID | Runs | Label | Doc |
|---|---|---|---|
| SnpVQSR | | | |
| IndelFilter | | | |
| HaplotypeCaller | | | |

Outputs

| ID | Type | Label | Doc |
|---|---|---|---|
| output_bamstat | File | | |
| output_printReads | File | | |
| output_HaplotypeCaller | File | | |
| output_SnpVQSR_recal_File | File | | |
| output_SnpVQSR_annotated_snps | File | | |
| output_IndelFilter_annotated_indels | File | | |

Permalink: https://w3id.org/cwl/view/git/3f7a70fac81d7b70362c1587a6142e373a97d0ad/workflows/GATK/GATK-complete-WES-Workflow-h3abionet.cwl