Workflow: somatic_exome: exome alignment and somatic variant detection
somatic_exome processes mutant/wildtype H. sapiens exome sequencing data. It features BQSR-corrected alignments, four-caller variant detection, and VEP-style annotations. Structural variants are detected with manta and cnvkit. QC metrics are also computed, including somalier concordance metrics. Example input file: analysis_workflows/example_data/somatic_exome.yaml
Inputs
ID | Type | Title | Doc |
---|---|---|---|
docm_vcf | File | The set of alleles that GATK HaplotypeCaller will force-call regardless of evidence | |
omni_vcf | File | ||
trimming | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/trimming_options.yml#trimming_options (Optional) | ||
vep_pick | | configures how vep will annotate genomic features that each variant overlaps; for a detailed description of each option see https://useast.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_allele_gene_eg | |
reference | File | reference: Reference fasta file for a desired assembly | reference contains the nucleotide sequence for a given assembly (GRCh37, GRCh38, etc.) in fasta format for the entire genome. This is what reads will be aligned to. Appropriate files can be found on ensembl at https://ensembl.org/info/data/ftp/index.html. When providing the reference, the secondary files corresponding to the reference indices must be located in the same directory as the reference itself. These files can be created with samtools faidx, bwa index, and picard CreateSequenceDictionary. |
tumor_name | String (Optional) | tumor_name: String specifying the name of the MT sample | tumor_name is the string used to refer to the MT sample in the various outputs, for example the VCF files. |
normal_name | String (Optional) | normal_name: String specifying the name of the WT sample | normal_name is the string used to refer to the WT sample in the various outputs, for example the VCF files. |
somalier_vcf | File | a vcf file of known polymorphic sites for somalier to compare normal and tumor samples for identity; sites files can be found at: https://github.com/brentp/somalier/releases | |
manta_non_wgs | Boolean (Optional) | toggles manta between its WES and WGS settings for structural variant detection | |
scatter_count | Integer | scatters each supported variant detector (varscan, pindel, mutect) into this many parallel jobs | |
synonyms_file | File (Optional) | synonyms_file allows the use of different chromosome identifiers in vep inputs or annotation files (cache, database, GFF, custom file, fasta). File should be tab-delimited with the primary identifier in column 1 and the synonym in column 2. | |
vep_cache_dir | Directory | path to the vep cache directory, available at: https://useast.ensembl.org/info/docs/tools/vep/script/vep_cache.html#pre | |
bait_intervals | File | bait_intervals: interval_list file of baits used in the sequencing experiment | bait_intervals is an interval_list corresponding to the baits used in the sequencing reagent. These are essentially coordinates for regions you were able to design probes for in the reagent. Typically the reagent provider has this information available in bed format, and it can be converted to an interval_list with Picard BedToIntervalList. AstraZeneca also maintains a repo of baits for common sequencing reagents, available at https://github.com/AstraZeneca-NGS/reference_data |
bqsr_intervals | String[] | bqsr_intervals: Array of strings specifying regions for base quality score recalibration | bqsr_intervals provides an array of genomic intervals for which to apply GATK base quality score recalibrations. Typically intervals are given for entire chromosomes (chr1, chr2, etc.); these names should match the format in the reference file. |
cle_vcf_filter | Boolean | ||
tumor_sequence | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/sequence_data.yml#sequence_data[] | tumor_sequence: MT sequencing data and readgroup information | tumor_sequence represents the sequencing data for the MT sample as either FASTQs or BAMs with accompanying readgroup information. Note that in the @RG field, ID and SM are required. |
normal_sequence | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/sequence_data.yml#sequence_data[] | normal_sequence: WT sequencing data and readgroup information | normal_sequence represents the sequencing data for the WT sample as either FASTQs or BAMs with accompanying readgroup information. Note that in the @RG field, ID and SM are required. |
varscan_p_value | Float (Optional) | ||
bqsr_known_sites | File[] | bqsr_known_sites: One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. | Known polymorphic indels recommended by GATK for a variety of tools including the BaseRecalibrator. This is part of the GATK resource bundle available at http://www.broadinstitute.org/gatk/guide/article?id=1213. Files should be in vcf format and tabix indexed. |
target_intervals | File | target_intervals: interval_list file of targets used in the sequencing experiment | target_intervals is an interval_list corresponding to the targets for the capture reagent. Bed files with this information can be converted to interval_lists with Picard BedToIntervalList. In general, for an exome (WES) reagent, bait_intervals and target_intervals are the same. |
summary_intervals | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/labelled_file.yml#labelled_file[] | ||
tumor_sample_name | String | ||
manta_call_regions | File (Optional) | bgzip-compressed, tabix-indexed BED file specifying regions to which manta structural variant analysis is limited | |
normal_sample_name | String | ||
per_base_intervals | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/labelled_file.yml#labelled_file[] | per_base_intervals: additional intervals over which to summarize coverage/QC at a per-base resolution | per_base_intervals is a list of regions (in interval_list format) over which to summarize coverage/QC at a per-base resolution. |
pindel_insert_size | Integer | ||
validated_variants | File (Optional) | An optional VCF with variants that will be flagged as 'VALIDATED' if found in this pipeline's main output VCF | |
vep_ensembl_species | String | ensembl species - Must be present in the cache directory. Examples: homo_sapiens or mus_musculus | |
vep_ensembl_version | String | ensembl version - Must be present in the cache directory. Example: 95 | |
vep_to_table_fields | String[] | VEP fields in final output | |
annotate_coding_only | Boolean (Optional) | if set to true, vep only returns consequences that fall in the coding regions of transcripts | |
filter_docm_variants | Boolean (Optional) | ||
manta_output_contigs | Boolean (Optional) | if set to true, configures manta to output assembled contig sequences in the final VCF files | |
per_target_intervals | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/labelled_file.yml#labelled_file[] | per_target_intervals: additional intervals over which to summarize coverage/QC at a per-target resolution | per_target_intervals is a list of regions (in interval_list format) over which to summarize coverage/QC at a per-target resolution. |
strelka_cpu_reserved | Integer (Optional) | ||
varscan_min_coverage | Integer (Optional) | ||
varscan_min_var_freq | Float (Optional) | ||
vep_ensembl_assembly | String | genome assembly to use in vep. Examples: GRCh38 or GRCm38 | |
varscan_strand_filter | Integer (Optional) | ||
vep_custom_annotations | https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/types/vep_custom_annotation.yml#vep_custom_annotation[] | custom type, check types directory for input format | |
qc_minimum_base_quality | Integer (Optional) | ||
target_interval_padding | Integer | target_interval_padding: number of bp flanking each target region in which to allow variant calls | The effective coverage of capture products generally extends out beyond the actual regions targeted. This parameter allows variants to be called in these wingspan regions, extending this many base pairs from each side of the target regions. |
varscan_max_normal_freq | Float (Optional) | ||
variants_to_table_fields | String[] | The names of one or more standard VCF fields or INFO fields to include in the output table | |
cnvkit_target_average_size | Integer (Optional) | approximate size of split target bins for CNVkit; if not set, a suitable window size will be set by CNVkit automatically | |
qc_minimum_mapping_quality | Integer (Optional) | ||
filter_somatic_llr_threshold | Float | Sets the stringency (log-likelihood ratio) used to filter out non-somatic variants. Typical values are 10=high stringency, 5=normal, 3=low stringency. Low stringency may be desirable when read depths are low (as in WGS) or when tumor samples are impure. | |
mutect_artifact_detection_mode | Boolean | ||
filter_somatic_llr_tumor_purity | Float | Sets the purity of the tumor used in the somatic llr filter, used to remove non-somatic variants. Probably only needs to be adjusted for low-purity tumors (< 50%). Range is 0 to 1. | |
picard_metric_accumulation_level | String | ||
variants_to_table_genotype_fields | String[] | The names of one or more genotype fields to include in the output table | |
mutect_max_alt_alleles_in_normal_count | Integer (Optional) | ||
mutect_max_alt_allele_in_normal_fraction | Float (Optional) | ||
filter_somatic_llr_normal_contamination_rate | Float | Sets the fraction of tumor present in the normal sample (range 0 to 1), used in the somatic llr filter. Useful for heavily contaminated adjacent normals. | |
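A few constraints stated in the table above can be checked before a run is submitted. The sketch below is a hypothetical pre-flight check, not part of the workflow. It assumes the inputs file has already been loaded into a Python dict, and that each entry of tumor_sequence and normal_sequence carries its @RG string under a key named `readgroup` (an assumption about the sequence_data type, which this page does not spell out). The function names are illustrative.

```python
# Inputs the table documents as fractions in the range 0 to 1.
UNIT_RANGE_INPUTS = (
    "filter_somatic_llr_tumor_purity",
    "filter_somatic_llr_normal_contamination_rate",
)


def readgroup_tags(readgroup):
    """Return the set of tag names (ID, SM, ...) found in an @RG string."""
    # Accept real tabs as well as the two-character sequence backslash-t.
    fields = readgroup.replace("\\t", "\t").split("\t")
    return {field.split(":", 1)[0] for field in fields if ":" in field}


def check_inputs(inputs):
    """Return a list of problems found in a somatic_exome inputs mapping."""
    problems = []
    for key in UNIT_RANGE_INPUTS:
        value = inputs.get(key)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be between 0 and 1, got {value}")
    for key in ("tumor_sequence", "normal_sequence"):
        for index, entry in enumerate(inputs.get(key, [])):
            # "readgroup" is an assumed field name, see the note above.
            tags = readgroup_tags(entry.get("readgroup", ""))
            missing = {"ID", "SM"} - tags
            if missing:
                names = ", ".join(sorted(missing))
                problems.append(f"{key}[{index}] readgroup lacks {names}")
    return problems
```

Calling `check_inputs` on the loaded mapping returns an empty list when none of these checks fail. It does not verify file paths, secondary index files, or any other input.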
Steps
ID | Runs | Label | Doc |
---|---|---|---|
manta | ../tools/manta_somatic.cwl (CommandLineTool) | Set up and execute manta | |
cnvkit | ../tools/cnvkit_batch.cwl (CommandLineTool) | Note: cnvkit batch is a complex command that is capable of running all or part of the cnvkit internal pipeline, depending on the combination of inputs provided to it. In order to take advantage of this, most inputs to this cwl are optional, so that different workflows can use different forms of the command while still using a single cwl file. For further reading, see the relevant cnvkit docs at https://cnvkit.readthedocs.io/en/stable/quickstart.html#build-a-reference-from-normal-samples-and-infer-tumor-copy-ratios and https://cnvkit.readthedocs.io/en/stable/pipeline.html#batch. In our pipelines, the command form is mainly determined by the components of the reference input. The somatic_exome cwl pipeline provides a fasta file and a normal bam, which causes the batch pipeline to construct a copy number reference (.cnn file) based on the normal bam. The germline_wgs cwl pipeline does not provide a normal bam; instead it passes a cnn reference file as an optional input. This file is intended to be manually generated from a reference normal sample for use in the pipeline. If it is not provided, cnvkit will automatically generate a flat reference file. | |
concordance | ../tools/concordance.cwl (CommandLineTool) | Concordance checking between Tumor and Normal BAM | |
detect_variants | detect_variants.cwl (Workflow) | Detect Variants workflow | |
tumor_index_cram | ../tools/index_cram.cwl (CommandLineTool) | samtools index cram | |
normal_index_cram | ../tools/index_cram.cwl (CommandLineTool) | samtools index cram | |
tumor_bam_to_cram | ../tools/bam_to_cram.cwl (CommandLineTool) | BAM to CRAM conversion | |
normal_bam_to_cram | ../tools/bam_to_cram.cwl (CommandLineTool) | BAM to CRAM conversion | |
pad_target_intervals | ../tools/interval_list_expand.cwl (CommandLineTool) | expand interval list regions by a given number of basepairs | |
tumor_alignment_and_qc | alignment_exome.cwl (Workflow) | exome alignment with qc | |
normal_alignment_and_qc | alignment_exome.cwl (Workflow) | exome alignment with qc | |
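The pad_target_intervals step expands interval_list regions by a given number of base pairs, which is presumably where target_interval_padding takes effect. A minimal sketch of that idea follows. It is an illustration only, not the implementation behind interval_list_expand.cwl, and it assumes the intervals are already parsed into (contig, start, end) tuples with 1-based inclusive coordinates, as in the interval_list format. The function name is hypothetical.

```python
def pad_intervals(intervals, padding):
    """Extend each (contig, start, end) interval by `padding` bp on both sides."""
    padded = []
    for contig, start, end in intervals:
        # Coordinates are 1-based, so the padded start cannot go below 1.
        # The end is not clamped because contig lengths are not known here.
        padded.append((contig, max(1, start - padding), end + padding))
    return padded


targets = [("chr1", 50, 200), ("chr2", 1000, 1100)]
print(pad_intervals(targets, 100))
# prints [('chr1', 1, 300), ('chr2', 900, 1200)]
```

This sketch neither merges intervals that overlap after padding nor clamps the end to the contig length.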
Outputs
ID | Type | Label | Doc |
---|---|---|---|
final_tsv | File | ||
final_vcf | File | ||
cn_diagram | File (Optional) | ||
tumor_cram | File | ||
normal_cram | File | ||
vep_summary | File | ||
all_candidates | File | ||
cn_scatter_plot | File (Optional) | ||
tumor_flagstats | File | ||
diploid_variants | File (Optional) | ||
intervals_target | File (Optional) | ||
normal_flagstats | File | ||
small_candidates | File | ||
somatic_variants | File (Optional) | ||
tumor_hs_metrics | File | ||
docm_filtered_vcf | File | ||
normal_hs_metrics | File | ||
final_filtered_vcf | File | ||
reference_coverage | File (Optional) | ||
mutect_filtered_vcf | File | ||
pindel_filtered_vcf | File | ||
tumor_only_variants | File (Optional) | ||
intervals_antitarget | File (Optional) | ||
strelka_filtered_vcf | File | ||
varscan_filtered_vcf | File | ||
mutect_unfiltered_vcf | File | ||
pindel_unfiltered_vcf | File | ||
tumor_target_coverage | File | ||
normal_target_coverage | File | ||
strelka_unfiltered_vcf | File | ||
tumor_bin_level_ratios | File | ||
tumor_segmented_ratios | File | ||
varscan_unfiltered_vcf | File | ||
tumor_summary_hs_metrics | File[] | ||
normal_summary_hs_metrics | File[] | ||
tumor_antitarget_coverage | File | ||
tumor_insert_size_metrics | File | ||
tumor_per_base_hs_metrics | File[] | ||
tumor_verify_bam_id_depth | File | ||
normal_antitarget_coverage | File | ||
normal_insert_size_metrics | File | ||
normal_per_base_hs_metrics | File[] | ||
normal_verify_bam_id_depth | File | ||
tumor_per_target_hs_metrics | File[] | ||
tumor_snv_bam_readcount_tsv | File | ||
tumor_verify_bam_id_metrics | File | ||
normal_per_target_hs_metrics | File[] | ||
normal_snv_bam_readcount_tsv | File | ||
normal_verify_bam_id_metrics | File | ||
somalier_concordance_metrics | File | ||
tumor_indel_bam_readcount_tsv | File | ||
tumor_mark_duplicates_metrics | File | ||
normal_indel_bam_readcount_tsv | File | ||
normal_mark_duplicates_metrics | File | ||
somalier_concordance_statistics | File | ||
tumor_alignment_summary_metrics | File | ||
tumor_per_base_coverage_metrics | File[] | ||
normal_alignment_summary_metrics | File | ||
normal_per_base_coverage_metrics | File[] | ||
tumor_per_target_coverage_metrics | File[] | ||
normal_per_target_coverage_metrics | File[] | ||
https://w3id.org/cwl/view/git/a08de598edc04f340fdbff76c9a92336a7702022/definitions/pipelines/somatic_exome.cwl