Workflow: somatic_exome: exome alignment and somatic variant detection
somatic_exome is designed to process mutant/wildtype H. sapiens exome sequencing data. It features BQSR-corrected alignments, variant detection with four callers, and VEP-style annotations. Structural variants are detected via manta and cnvkit. In addition, QC metrics are run, including somalier concordance metrics. Example input file: analysis_workflows/example_data/somatic_exome.yaml
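The example input file mentioned above is a YAML job file. CWL runners generally also accept the same structure as JSON, so a partial input object can be sketched in Python as shown here. This is only an illustration: the helper names `cwl_file` and `cwl_directory`, every path, and both sample names are hypothetical placeholders, and the object covers just a handful of the inputs this workflow expects.

```python
import json


def cwl_file(path):
    """Encode a path as a CWL File object (hypothetical helper)."""
    return {"class": "File", "path": path}


def cwl_directory(path):
    """Encode a path as a CWL Directory object (hypothetical helper)."""
    return {"class": "Directory", "path": path}


# Partial job object; all paths and names are placeholders, not real data.
job = {
    "reference": cwl_file("/data/ref/genome.fa"),
    "vep_cache_dir": cwl_directory("/data/vep_cache"),
    "bqsr_intervals": ["chr1", "chr2"],      # String[] input
    "tumor_sample_name": "tumor_01",         # String input
    "normal_sample_name": "normal_01",
    "target_interval_padding": 100,          # Integer input
    "annotate_coding_only": True,            # Boolean (Optional) input
}

# Serialize the object; the result could be written to a job file.
job_json = json.dumps(job, indent=2, sort_keys=True)
print(job_json)
```

A complete job file would need a value for every input that has neither the Optional marker nor a default in the workflow definition.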
Inputs
ID | Type | Title | Doc |
---|---|---|---|
mills | File | mills: File specifying common polymorphic indels from Mills et al. | mills provides known polymorphic indels recommended by GATK for a variety of tools, including the BaseRecalibrator. This file is part of the GATK resource bundle (http://www.broadinstitute.org/gatk/guide/article?id=1213). Essentially, it is a list of known indels originally discovered by Mills et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1557762/). The file should be in VCF format and tabix-indexed. |
docm_vcf | File | The set of alleles that GATK HaplotypeCaller will use to force-call regardless of evidence | |
omni_vcf | File | ||
trimming | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/trimming_options.yml#trimming_options (Optional) | ||
vep_pick |
configures how vep will annotate genomic features that each variant overlaps; for a detailed description of each option see https://useast.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_allele_gene_eg |
||
dbsnp_vcf | File | dbsnp_vcf: File specifying common polymorphic indels from dbSNP | dbsnp_vcf provides known indels recommended by GATK for a variety of tools, including the BaseRecalibrator. This file is part of the GATK resource bundle (http://www.broadinstitute.org/gatk/guide/article?id=1213). Essentially, it is a list of known indels from dbSNP. The file should be in VCF format and tabix-indexed. |
reference | File | reference: Reference fasta file for a desired assembly | reference contains the nucleotide sequence of the entire genome for a given assembly (GRCh37, GRCh38, etc.) in FASTA format. This is what reads will be aligned to. Appropriate files can be found on Ensembl at https://ensembl.org/info/data/ftp/index.html When providing the reference, the secondary files corresponding to the reference indices must be located in the same directory as the reference itself. These files can be created with samtools faidx, bwa index, and picard CreateSequenceDictionary. |
tumor_name | String (Optional) | tumor_name: String specifying the name of the MT sample | tumor_name provides the string by which the MT sample will be referred to in the various outputs, for example the VCF files. |
normal_name | String (Optional) | normal_name: String specifying the name of the WT sample | normal_name provides the string by which the WT sample will be referred to in the various outputs, for example the VCF files. |
known_indels | File | known_indels: File specifying common polymorphic indels from 1000G | known_indels provides known indels recommended by GATK for a variety of tools, including the BaseRecalibrator. This file is part of the GATK resource bundle (http://www.broadinstitute.org/gatk/guide/article?id=1213). Essentially, it is a list of known indels from the 1000 Genomes Phase I indel calls. The file should be in VCF format and tabix-indexed. |
somalier_vcf | File | a vcf file of known polymorphic sites for somalier to compare normal and tumor samples for identity; sites files can be found at: https://github.com/brentp/somalier/releases | |
manta_non_wgs | Boolean (Optional) | toggles on or off manta settings for WES vs. WGS mode for structural variant detection | |
synonyms_file | File (Optional) | synonyms_file allows the use of different chromosome identifiers in vep inputs or annotation files (cache, database, GFF, custom file, fasta). File should be tab-delimited with the primary identifier in column 1 and the synonym in column 2. | |
vep_cache_dir | Directory | path to the vep cache directory, available at: https://useast.ensembl.org/info/docs/tools/vep/script/vep_cache.html#pre | |
bait_intervals | File | bait_intervals: interval_list file of baits used in the sequencing experiment | bait_intervals is an interval_list corresponding to the baits used in the sequencing reagent. These are essentially the coordinates of the regions for which probes could be designed in the reagent. Typically the reagent provider has this information available in BED format, and it can be converted to an interval_list with Picard BedToIntervalList. AstraZeneca also maintains a repo of baits for common sequencing reagents, available at https://github.com/AstraZeneca-NGS/reference_data |
bqsr_intervals | String[] | bqsr_intervals: Array of strings specifying regions for base quality score recalibration | bqsr_intervals provides an array of genomic intervals over which to apply GATK base quality score recalibration. Typically intervals are given for entire chromosomes (e.g. chr1, chr2, etc.); these names should match the format used in the reference file. |
cle_vcf_filter | Boolean | ||
known_variants | File (Optional) | Previously discovered variants to be flagged in this pipeline's output VCF | |
tumor_sequence | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/sequence_data.yml#sequence_data[] | tumor_sequence: yml file specifying the location of MT sequencing data | tumor_sequence is a yml file used to pass information about the sequencing data for a single sample (e.g. fastq files). If more than one fastq file exists for a sample, as in the case of multiple instrument data, the sequence tag is simply repeated with the additional data (see example input file). Note that ID and SM are required in the @RG field. |
normal_sequence | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/sequence_data.yml#sequence_data[] | normal_sequence: yml file specifying the location of WT sequencing data | normal_sequence is a yml file used to pass information about the sequencing data for a single sample (e.g. fastq files). If more than one fastq file exists for a sample, as in the case of multiple instrument data, the sequence tag is simply repeated with the additional data (see example input file). Note that ID and SM are required in the @RG field. |
varscan_p_value | Float (Optional) | ||
target_intervals | File | target_intervals: interval_list file of targets used in the sequencing experiment | target_intervals is an interval_list corresponding to the targets for the capture reagent. BED files with this information can be converted to interval_lists with Picard BedToIntervalList. In general, for a WES reagent, bait_intervals and target_intervals are the same. |
summary_intervals | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/labelled_file.yml#labelled_file[] | ||
tumor_sample_name | String | ||
manta_call_regions | File (Optional) | bgzip-compressed, tabix-indexed BED file specifying regions to which manta structural variant analysis is limited | |
normal_sample_name | String | ||
per_base_intervals | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/labelled_file.yml#labelled_file[] | per_base_intervals: additional intervals over which to summarize coverage/QC at a per-base resolution | per_base_intervals is a list of regions (in interval_list format) over which to summarize coverage/QC at a per-base resolution. |
pindel_insert_size | Integer | ||
vep_ensembl_species | String | ensembl species - Must be present in the cache directory. Examples: homo_sapiens or mus_musculus | |
vep_ensembl_version | String | ensembl version - Must be present in the cache directory. Example: 95 | |
vep_to_table_fields | String[] | VEP fields in final output | |
annotate_coding_only | Boolean (Optional) | if set to true, vep only returns consequences that fall in the coding regions of transcripts | |
filter_docm_variants | Boolean (Optional) | ||
manta_output_contigs | Boolean (Optional) | if set to true, configures manta to output assembled contig sequences in the final VCF files | |
mutect_scatter_count | Integer | ||
per_target_intervals | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/labelled_file.yml#labelled_file[] | per_target_intervals: additional intervals over which to summarize coverage/QC at a per-target resolution | per_target_intervals is a list of regions (in interval_list format) over which to summarize coverage/QC at a per-target resolution. |
strelka_cpu_reserved | Integer (Optional) | ||
varscan_min_coverage | Integer (Optional) | ||
varscan_min_var_freq | Float (Optional) | ||
vep_ensembl_assembly | String | genome assembly to use in vep. Examples: GRCh38 or GRCm38 | |
varscan_strand_filter | Integer (Optional) | ||
vep_custom_annotations | https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/types/vep_custom_annotation.yml#vep_custom_annotation[] | custom type, check types directory for input format | |
qc_minimum_base_quality | Integer (Optional) | ||
target_interval_padding | Integer | target_interval_padding: number of bp flanking each target region in which to allow variant calls | The effective coverage of capture products generally extends out beyond the actual regions targeted. This parameter allows variants to be called in these wingspan regions, extending this many base pairs from each side of the target regions. |
varscan_max_normal_freq | Float (Optional) | ||
variants_to_table_fields | String[] | The names of one or more standard VCF fields or INFO fields to include in the output table | |
qc_minimum_mapping_quality | Integer (Optional) | ||
filter_somatic_llr_threshold | Float | Sets the stringency (log-likelihood ratio) used to filter out non-somatic variants. Typical values are 10=high stringency, 5=normal, 3=low stringency. Low stringency may be desirable when read depths are low (as in WGS) or when tumor samples are impure. | |
mutect_artifact_detection_mode | Boolean | ||
filter_somatic_llr_tumor_purity | Float | Sets the purity of the tumor used in the somatic llr filter, which removes non-somatic variants. Probably only needs to be adjusted for low-purity tumors (< 50%). Range is 0 to 1. | |
picard_metric_accumulation_level | String | ||
variants_to_table_genotype_fields | String[] | The name of a genotype field to include in the output table | |
mutect_max_alt_alleles_in_normal_count | Integer (Optional) | ||
mutect_max_alt_allele_in_normal_fraction | Float (Optional) | ||
filter_somatic_llr_normal_contamination_rate | Float | Sets the fraction of tumor present in the normal sample (range 0 to 1), used in the somatic llr filter. Useful for heavily contaminated adjacent normals. | |
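Several of the inputs above carry constraints that can be checked before a run: filter_somatic_llr_tumor_purity and filter_somatic_llr_normal_contamination_rate must lie between 0 and 1, and the @RG field for each sample must contain ID and SM. The following is a minimal pre-flight sketch of such checks. The function names are hypothetical and not part of the workflow, and the read group is assumed to be available as a single tab-separated string.

```python
def check_fraction(name, value):
    """Return value if it lies in the closed range [0, 1], else raise ValueError."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value


def read_group_tags(read_group):
    """Parse an @RG string into a {tag: value} dict.

    Assumes tab-separated TAG:VALUE fields. A literal backslash-t, as often
    written in YAML job files, is treated as a tab.
    """
    fields = read_group.replace("\\t", "\t").split("\t")
    tags = {}
    for field in fields:
        if field == "@RG" or ":" not in field:
            continue  # skip the record type marker and malformed fields
        tag, value = field.split(":", 1)
        tags[tag] = value
    return tags


def missing_read_group_tags(read_group, required=("ID", "SM")):
    """List the required tags that are absent or empty in the read group."""
    tags = read_group_tags(read_group)
    return [tag for tag in required if not tags.get(tag)]
```

For example, `missing_read_group_tags("@RG\tID:rg1\tLB:lib1")` reports that SM is missing, and `check_fraction("filter_somatic_llr_tumor_purity", 1.5)` raises a `ValueError`.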
Steps
ID | Runs | Label | Doc |
---|---|---|---|
manta | ../tools/manta_somatic.cwl (CommandLineTool) | Set up and execute manta | |
cnvkit | ../tools/cnvkit_batch.cwl (CommandLineTool) | | |
concordance | ../tools/concordance.cwl (CommandLineTool) | Concordance checking between Tumor and Normal BAM | |
detect_variants | detect_variants.cwl (Workflow) | Detect Variants workflow | |
tumor_index_cram | ../tools/index_cram.cwl (CommandLineTool) | samtools index cram | |
normal_index_cram | ../tools/index_cram.cwl (CommandLineTool) | samtools index cram | |
tumor_bam_to_cram | ../tools/bam_to_cram.cwl (CommandLineTool) | BAM to CRAM conversion | |
normal_bam_to_cram | ../tools/bam_to_cram.cwl (CommandLineTool) | BAM to CRAM conversion | |
pad_target_intervals | ../tools/interval_list_expand.cwl (CommandLineTool) | expand interval list regions by a given number of basepairs | |
tumor_alignment_and_qc | alignment_exome.cwl (Workflow) | exome alignment with qc | |
normal_alignment_and_qc | alignment_exome.cwl (Workflow) | exome alignment with qc | |
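The pad_target_intervals step expands each target region by target_interval_padding base pairs on both sides. The arithmetic can be sketched as below, assuming 1-based inclusive coordinates as used by the interval_list format. This is an illustration of the idea only, not the implementation in interval_list_expand.cwl; the function name and the optional `contig_lengths` argument are hypothetical, and overlapping padded intervals are deliberately left unmerged.

```python
def pad_intervals(intervals, padding, contig_lengths=None):
    """Expand (contig, start, end) intervals by `padding` bp on each side.

    Coordinates are 1-based and inclusive. Starts are clamped to 1; ends are
    clamped to the contig length when `contig_lengths` provides one.
    Overlapping results are not merged.
    """
    if padding < 0:
        raise ValueError("padding must be non-negative")
    padded = []
    for contig, start, end in intervals:
        new_start = max(1, start - padding)
        new_end = end + padding
        if contig_lengths is not None and contig in contig_lengths:
            new_end = min(new_end, contig_lengths[contig])
        padded.append((contig, new_start, new_end))
    return padded


# Hypothetical targets and contig length, for illustration only.
targets = [("chr1", 50, 200), ("chr2", 1000, 1100)]
print(pad_intervals(targets, 100, {"chr1": 250}))
```

Clamping matters for targets near contig ends, where naive padding would produce coordinates below 1 or beyond the end of the sequence.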
Outputs
ID | Type | Label | Doc |
---|---|---|---|
final_tsv | File | ||
final_vcf | File | ||
cn_diagram | File (Optional) | ||
tumor_cram | File | ||
normal_cram | File | ||
vep_summary | File | ||
all_candidates | File | ||
cn_scatter_plot | File (Optional) | ||
tumor_flagstats | File | ||
diploid_variants | File (Optional) | ||
intervals_target | File (Optional) | ||
normal_flagstats | File | ||
small_candidates | File | ||
somatic_variants | File (Optional) | ||
tumor_hs_metrics | File | ||
docm_filtered_vcf | File | ||
normal_hs_metrics | File | ||
final_filtered_vcf | File | ||
reference_coverage | File (Optional) | ||
mutect_filtered_vcf | File | ||
pindel_filtered_vcf | File | ||
tumor_only_variants | File (Optional) | ||
intervals_antitarget | File (Optional) | ||
strelka_filtered_vcf | File | ||
varscan_filtered_vcf | File | ||
mutect_unfiltered_vcf | File | ||
pindel_unfiltered_vcf | File | ||
tumor_target_coverage | File | ||
normal_target_coverage | File | ||
strelka_unfiltered_vcf | File | ||
tumor_bin_level_ratios | File | ||
tumor_segmented_ratios | File | ||
varscan_unfiltered_vcf | File | ||
tumor_summary_hs_metrics | File[] | ||
normal_summary_hs_metrics | File[] | ||
tumor_antitarget_coverage | File | ||
tumor_insert_size_metrics | File | ||
tumor_per_base_hs_metrics | File[] | ||
tumor_verify_bam_id_depth | File | ||
normal_antitarget_coverage | File | ||
normal_insert_size_metrics | File | ||
normal_per_base_hs_metrics | File[] | ||
normal_verify_bam_id_depth | File | ||
tumor_per_target_hs_metrics | File[] | ||
tumor_snv_bam_readcount_tsv | File | ||
tumor_verify_bam_id_metrics | File | ||
normal_per_target_hs_metrics | File[] | ||
normal_snv_bam_readcount_tsv | File | ||
normal_verify_bam_id_metrics | File | ||
somalier_concordance_metrics | File | ||
tumor_indel_bam_readcount_tsv | File | ||
tumor_mark_duplicates_metrics | File | ||
normal_indel_bam_readcount_tsv | File | ||
normal_mark_duplicates_metrics | File | ||
somalier_concordance_statistics | File | ||
tumor_alignment_summary_metrics | File | ||
tumor_per_base_coverage_metrics | File[] | ||
normal_alignment_summary_metrics | File | ||
normal_per_base_coverage_metrics | File[] | ||
tumor_per_target_coverage_metrics | File[] | ||
normal_per_target_coverage_metrics | File[] |
https://w3id.org/cwl/view/git/18d8efdc4c97c1c9222f603f529b909b36fa42e7/definitions/pipelines/somatic_exome.cwl