CWL Workflow: GATK-complete-WES-Workflow-h3abionet.cwl

Workflow: GATK-complete-WES-Workflow-h3abionet.cwl

Fetched 2025-05-02 21:32:02 GMT

Verified with cwltool version 3.1.20221201130942

# H3ABioNet GATK Germline Workflow # Overview A [GATK best-practices](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) germline workflow designed to work with GATK 3.5 (Van der Auwera et al., 2013). For more information see our [GitHub](https://github.com/h3abionet/h3agatk) site. # Workflow Summary ![pipeline](https://raw.githubusercontent.com/h3abionet/h3agatk/master/workflows/GATK/gatk_germline_small.png) # Workflow Tool Details ## FastQC FastQC is used as an initial QC step where the input files are checked for usual metrics such as: - Read length - Reads distribution - GC content - ... ## Trimmomatic Trimmomatic is the entry point of the pipeline, it is used to cleanup the reads in the input fastq files from any sequencing adaptors. ## BWA [BWA](http://bio-bwa.sourceforge.net) is used to align the reads from the the input fastq files -paired-ends- (Li, 2013). We use specifically `bwa mem` as recommended by the [GATK best-practices](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS). BWA produces a SAM file containing the aligned reads against the human reference genome (hg19, GATK bundle build 2.8). As GATK tools downstream requires properly formatted Read Group information. We add by default 'toy' Read Group information while processing the alignment to the output SAM file. we specifically use the flag `-R '@RG\tID:foo\tSM:bar\tLB:library1'`. ## SAMtools [SAMtools](http://www.htslib.org) (Li et al., 2009) are used few times in the pipeline: 1. Convert BWA's output from a SAM format to a BAM format 2. Sort the reads in the generated BAM file in step 1 (above) 3. Indexing the BAM file for the following tools to use ## Picard [Picard tools](https://broadinstitute.github.io/picard/) are used to mark duplicated reads in the aligned and sorted BAM file, making thus the files lighter and less prone to errors in the downstream steps of the pipeline. ## GATK [Genome Analysis Tool Kit](https://software.broadinstitute.org/gatk) refered to as GATK (DePristo et al., 2011) is used to process the data throught multiple steps as described by the [GATK best-practices](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) (i.e. figure bellow). ![GATK best-practices pipeline](https://raw.githubusercontent.com/h3abionet/h3agatk/master/workflows/GATK/gatk.png) The GATK steps are the following: 1. Indel Realignment: 1. [Realign Target Creator](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php) 2. [Indel Realigner](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php) 2. Mark Duplicates (a picard step) 3. Base Quality Score Recalibration (BQSR): 1. [Base Recalibrator](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php) 2. [Print Reads](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php) 4. [Haplotype Caller](https://software.broadinstitute.org/gatk/documentation/tooldocs/) 5. Variant Quality Score Recalibration (VQSR): 1. [Variant Recalibrator](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php) 2. [Apply Recalibration](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php) ## SnpEff SNPEff is used in this pipeline to annotate the variant calls (Cingolani et al., 2012). The annotation is extensive and uses multi-database approach to provide the user with as much information about the called variants as possible. ## BAMStat [BAMStats](http://bamstats.sourceforge.net), is a simple software tool built on the Picard Java API (2), which can calculate and graphically display various metrics derived from SAM/BAM files of value in QC assessments.

Selected
|
Default Values
Nested Workflows
Tools
Inputs/Outputs

This workflow is Open Source and may be reused according to the terms of: MIT License

Note that the tools invoked by the workflow may have separate licenses.

Inputs

ID	Type	Doc
dbsnp	File	vcf file containing SNP variations used for Haplotype caller
reads	File[] (Optional)	files containing the paired end reads in fastq format required for bwa-mem
tmpdir	String (Optional)	temporary dir for picard
gatk_jar	File	Jar executable of the GATK tool
covariate	String[] (Optional)	required for base recalibrator
reference	File	reference human genome file
bwa_threads	Integer	number of threads
snpf_genome	String
gatk_threads	Integer	number of threads
resource_1kg	File
resource_omni	File
snpf_data_dir	Directory
bwa_read_group	String	read group
resource_dbsnp	File
resource_mills	File
bwa_output_name	String	name of bwa-mem output file
resource_hapmap	File
snpf_nodownload	Boolean
known_variant_db	File[] (Optional)	array of known variant files for realign target creator
samtools_threads	Integer	number of threads
filter_expression	String
samtools-index-bai	Boolean	boolean set to output bam file from samtools view
samtools-view-isbam	Boolean	boolean set to output bam file from samtools view
output_samtools-sort	String	output file name for bam file generated by samtools sort
output_samtools-view	String	output file name for bam file generated by samtools view
samtools-view-sambam	String (Optional)	temporary dir for picard
snpeff_java_mem_opts	String[] (Optional)	memory options passed to the java command run for snpEff
uncompressed_reference	File	reference human genome file
output_RefDictionaryFile	String	output file name for picard create dictionary command from picard toolkit
outputFileName_PrintReads	String	name of PrintReads command output file
readSorted_MarkDuplicates	String	set to be true showing that reads are sorted already
createIndex_MarkDuplicates	String	set to be true to create .bai file from Picard Mark Duplicates
metricsFile_MarkDuplicates	String	metric file generated by MarkDupicates command listing duplicates
depth_omitIntervalStatistics	Boolean (Optional)	Do not calculate per-interval statistics
outputFileName_IndelRealigner	String	name of indelRealigner output file
outputFileName_MarkDuplicates	String	output file name generated as a result of Markduplicates command from picard toolkit
outputFileName_HaplotypeCaller	String	name of Haplotype caller command output file
depth_omitDepthOutputAtEachBase	Boolean (Optional)	Do not output depth of coverage at each base
outputFileName_BaseRecalibrator	String	name of BaseRecalibrator output file
removeDuplicates_MarkDuplicates	String	set to be true
depth_outputfile_DepthOfCoverage	String (Optional)	name of the output report basename
outputFileName_RealignTargetCreator	String	name of realignTargetCreator output file

Steps

ID	Runs	Label	Doc
SnpVQSR	GATK-Sub-Workflow-h3abionet-snp.cwl (Workflow)
IndelFilter	GATK-Sub-Workflow-h3abionet-indel-no-vqsr.cwl (Workflow)
HaplotypeCaller	GATK-Sub-Workflow-Workflow-h3abionet-haplotype.cwl (Workflow)

Outputs

ID	Type	Label	Doc
output_bamstat	File
output_printReads	File
output_HaplotypeCaller	File
output_SnpVQSR_recal_File	File
output_SnpVQSR_annotated_snps	File
output_IndelFilter_annotated_indels	File

Permalink:

https://w3id.org/cwl/view/git/3f7a70fac81d7b70362c1587a6142e373a97d0ad/workflows/GATK/GATK-complete-WES-Workflow-h3abionet.cwl