CWL Workflow: qc_collapsed_bam

Fetched 2025-12-23 06:00:17 GMT

Verified with cwltool version 3.1.20221201130942

This workflow is Open Source and may be reused according to the terms of: Apache License 2.0

Note that the tools invoked by the workflow may have separate licenses.

Inputs

ID	Type	Title	Doc
maf	File
json	Boolean (Optional)		Also output data in JSON format.
plot	Boolean (Optional)		Also output plots of the data.
json_1	Boolean (Optional)
prefix	String (Optional)
bed_file	File (Optional)
vcf_file	File
reference	File
sample_sex	String (Optional)
sample_name	String
sample_group	String (Optional)
collapsed_bam	File[]	collapsed_bam
major_threshold	Float (Optional)
minor_threshold	Float (Optional)		Minor contamination threshold for bad sample.
coverage_threshold	Integer (Optional)		Samples with Y chromosome coverage above this value will be considered male.
pool_a_bait_intervals	File (Optional)	pool_a_bait_intervals	Optional set of intervals over which to restrict analysis.
pool_b_bait_intervals	File (Optional)	pool_b_bait_intervals	Optional set of intervals over which to restrict analysis.
group_reads_by_umi_bam	File[]	group_reads_by_umi_bam	Input BAM file generated by GroupReadsByUmi.
hsmetrics_coverage_cap	Integer (Optional)
pool_a_target_intervals	File	pool_a_target_intervals
pool_b_target_intervals	File	pool_b_target_intervals
hsmetrics_minimum_base_quality	Integer (Optional)
hsmetrics_minimum_mapping_quality	Integer (Optional)
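
The Inputs table above can be turned into a job file for a CWL runner. The following is a minimal Python sketch, not part of the workflow itself: it lists the inputs that the table does not mark as Optional, builds a job object, and reports any required input that is missing. The helper names `cwl_file` and `missing_inputs`, the sample name, and every file path are hypothetical placeholders chosen for illustration.

```python
import json

# Inputs that the Inputs table does not mark as "(Optional)".
REQUIRED_INPUTS = (
    "maf",
    "vcf_file",
    "reference",
    "sample_name",
    "collapsed_bam",
    "group_reads_by_umi_bam",
    "pool_a_target_intervals",
    "pool_b_target_intervals",
)


def cwl_file(path):
    """Return a CWL File object for the given path (hypothetical helper)."""
    return {"class": "File", "path": path}


def missing_inputs(job):
    """Return the required input IDs that are absent from the job object."""
    return [name for name in REQUIRED_INPUTS if name not in job]


# All values below are placeholders, not real data.
job = {
    "sample_name": "sample_1",
    "maf": cwl_file("input.maf"),
    "vcf_file": cwl_file("sites.vcf"),
    "reference": cwl_file("reference.fasta"),
    "collapsed_bam": [cwl_file("sample_1.collapsed.bam")],        # File[]
    "group_reads_by_umi_bam": [cwl_file("sample_1.grouped.bam")],  # File[]
    "pool_a_target_intervals": cwl_file("pool_a_targets.interval_list"),
    "pool_b_target_intervals": cwl_file("pool_b_targets.interval_list"),
    "json": True,  # optional: also output data in JSON format
    "plot": True,  # optional: also output plots of the data
}

job_text = json.dumps(job, indent=2)
print("missing required inputs:", missing_inputs(job))
print(job_text)
```

CWL runners such as cwltool accept a JSON or YAML job file, so the printed JSON could be saved to a file and supplied alongside the workflow document.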

Steps

ID	Runs	Label	Doc
biometrics_minor	access_qc__packed.cwl#biometrics_minor.cwl_2 (CommandLineTool)
bam_qc_stats_pool_a	access_qc__packed.cwl#bam_qc_stats.cwl (Workflow)	bam_qc_stats
bam_qc_stats_pool_b	access_qc__packed.cwl#bam_qc_stats.cwl (Workflow)	bam_qc_stats
biometrics_sexmismatch	access_qc__packed.cwl#biometrics_sexmismatch.cwl_2 (CommandLineTool)
biometrics_major_0_2_13	access_qc__packed.cwl#biometrics_major.cwl_2 (CommandLineTool)
biometrics_extract_0_2_13	access_qc__packed.cwl#biometrics_extract.cwl (CommandLineTool)
getbasecountsmultisample_1_2_5	access_qc__packed.cwl#getbasecountsmultisample_1.2.5.cwl (CommandLineTool)	getbasecountsmultisample_1.2.5
fgbio_collect_duplex_seq_metrics_1_2_0	access_qc__packed.cwl#fgbio_collect_duplex_seq_metrics_1.2.0.cwl (CommandLineTool)	fgbio_collect_duplex_seq_metrics_1.2.0	Collects a suite of metrics to QC duplex sequencing data. Inputs: The input to this tool must be a BAM file that is either: 1. The exact BAM output by the 'GroupReadsByUmi' tool (in the sort order it was produced in); 2. A BAM file that has MI tags present on all reads (usually set by 'GroupReadsByUmi') and has been sorted with 'SortBam' into 'TemplateCoordinate' order. Calculation of metrics may be restricted to a set of regions using the '--intervals' parameter. This can significantly affect results, as off-target reads in duplex sequencing experiments often have very different properties than on-target reads due to the lack of enrichment. Several metrics are calculated related to the fraction of tag families that have duplex coverage. The definition of "duplex" is controlled by the '--min-ab-reads' and '--min-ba-reads' parameters. The default is to treat any tag family with at least one observation of each strand as a duplex, but this could be made more stringent, e.g. by setting '--min-ab-reads=3 --min-ba-reads=3'. If different thresholds are used then '--min-ab-reads' must be the higher value. Outputs: The following output files are produced: 1. <output>.family_sizes.txt: metrics on the frequency of different types of families of different sizes; 2. <output>.duplex_family_sizes.txt: metrics on the frequency of duplex tag families by the number of observations from each strand; 3. <output>.duplex_yield_metrics.txt: summary QC metrics produced using 5%, 10%, 15%...100% of the data; 4. <output>.umi_counts.txt: metrics on the frequency of observations of UMIs within reads and tag families; 5. <output>.duplex_qc.pdf: a series of plots generated from the preceding metrics files for visualization; 6. <output>.duplex_umi_counts.txt: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the '--duplex-umi-counts' option is used, as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present. Within the metrics files the prefixes 'CS', 'SS' and 'DS' are used to mean: CS: tag families where membership is defined solely on matching genome coordinates and strand; SS: single-stranded tag families where membership is defined by genome coordinates, strand and UMI, i.e. 50/A and 50/B are considered different tag families; DS: double-stranded tag families where membership is collapsed across single-stranded tag families from the same double-stranded source molecule, i.e. 50/A and 50/B become one family. Requirements: For plots to be generated, R must be installed and the ggplot2 package installed with suggested dependencies. Successfully executing the following in R will ensure a working installation: install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies=TRUE)
fgbio_collect_duplex_seq_metrics_1_2_1	access_qc__packed.cwl#fgbio_collect_duplex_seq_metrics_1.2.0.cwl (CommandLineTool)	fgbio_collect_duplex_seq_metrics_1.2.0	Collects a suite of metrics to QC duplex sequencing data. Inputs: The input to this tool must be a BAM file that is either: 1. The exact BAM output by the 'GroupReadsByUmi' tool (in the sort order it was produced in); 2. A BAM file that has MI tags present on all reads (usually set by 'GroupReadsByUmi') and has been sorted with 'SortBam' into 'TemplateCoordinate' order. Calculation of metrics may be restricted to a set of regions using the '--intervals' parameter. This can significantly affect results, as off-target reads in duplex sequencing experiments often have very different properties than on-target reads due to the lack of enrichment. Several metrics are calculated related to the fraction of tag families that have duplex coverage. The definition of "duplex" is controlled by the '--min-ab-reads' and '--min-ba-reads' parameters. The default is to treat any tag family with at least one observation of each strand as a duplex, but this could be made more stringent, e.g. by setting '--min-ab-reads=3 --min-ba-reads=3'. If different thresholds are used then '--min-ab-reads' must be the higher value. Outputs: The following output files are produced: 1. <output>.family_sizes.txt: metrics on the frequency of different types of families of different sizes; 2. <output>.duplex_family_sizes.txt: metrics on the frequency of duplex tag families by the number of observations from each strand; 3. <output>.duplex_yield_metrics.txt: summary QC metrics produced using 5%, 10%, 15%...100% of the data; 4. <output>.umi_counts.txt: metrics on the frequency of observations of UMIs within reads and tag families; 5. <output>.duplex_qc.pdf: a series of plots generated from the preceding metrics files for visualization; 6. <output>.duplex_umi_counts.txt: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the '--duplex-umi-counts' option is used, as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present. Within the metrics files the prefixes 'CS', 'SS' and 'DS' are used to mean: CS: tag families where membership is defined solely on matching genome coordinates and strand; SS: single-stranded tag families where membership is defined by genome coordinates, strand and UMI, i.e. 50/A and 50/B are considered different tag families; DS: double-stranded tag families where membership is collapsed across single-stranded tag families from the same double-stranded source molecule, i.e. 50/A and 50/B become one family. Requirements: For plots to be generated, R must be installed and the ggplot2 package installed with suggested dependencies. Successfully executing the following in R will ensure a working installation: install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies=TRUE)
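
The duplex definition described in the fgbio_collect_duplex_seq_metrics doc above can be illustrated with a short Python sketch. It is not fgbio's implementation. It assumes that the AB strand is whichever strand of a tag family has more reads, which is one reading of why '--min-ab-reads' must be the higher value; the function name `is_duplex` and its parameter names are hypothetical. The defaults of 1 mirror the stated default of at least one observation of each strand.

```python
def is_duplex(strand_a_reads, strand_b_reads, min_ab_reads=1, min_ba_reads=1):
    """Decide whether a tag family counts as duplex under the given thresholds.

    Assumption (not confirmed by the source): AB is the strand with more
    reads and BA the strand with fewer, so the thresholds apply to the
    larger and smaller counts respectively.
    """
    if min_ab_reads < min_ba_reads:
        # The doc states that min-ab-reads must be the higher value.
        raise ValueError("min_ab_reads must be >= min_ba_reads")
    ab = max(strand_a_reads, strand_b_reads)
    ba = min(strand_a_reads, strand_b_reads)
    return ab >= min_ab_reads and ba >= min_ba_reads


# Default thresholds: one read on each strand is enough.
print(is_duplex(5, 1))
# A family with reads on only one strand is not duplex.
print(is_duplex(5, 0))
# Stricter thresholds, as in '--min-ab-reads=3 --min-ba-reads=3'.
print(is_duplex(3, 2, min_ab_reads=3, min_ba_reads=3))
```

Raising both thresholds to 3 rejects the last family because its weaker strand has only two reads.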

Outputs

Permalink: https://w3id.org/cwl/view/git/248e7c3edaff48e1b97a7931d66aa3b23ce97f54/access_qc__packed.cwl?part=qc_collapsed_bam.cwl

ID	Type	Label
fillout_maf	File
biometrics_major_csv	File
biometrics_minor_csv	File[]
biometrics_major_json	File (Optional)
biometrics_major_plot	File (Optional)
biometrics_minor_json	File[] (Optional)
biometrics_minor_plot	File[] (Optional)
biometrics_extract_pickle	File
biometrics_sexmismatch_csv	File[]
biometrics_minor_sites_plot	File[] (Optional)
biometrics_sexmismatch_json	File[] (Optional)
gatk_collect_hs_metrics_txt_pool_a	File[]	gatk_collect_hs_metrics_txt_pool_a
gatk_collect_hs_metrics_txt_pool_b	File[]	gatk_collect_hs_metrics_txt_pool_b
gatk_collect_insert_size_metrics_txt_pool_a	File[]	gatk_collect_insert_size_metrics_txt_pool_a
gatk_collect_insert_size_metrics_txt_pool_b	File[]	gatk_collect_insert_size_metrics_txt_pool_b
fgbio_collect_duplex_seq_metrics_duplex_pool_a	File[] (Optional)	fgbio_collect_duplex_seq_metrics_duplex_pool_a
fgbio_collect_duplex_seq_metrics_duplex_qc_pool_a	File[] (Optional)	fgbio_collect_duplex_seq_metrics_duplex_qc_pool_a
fgbio_collect_duplex_seq_metrics_duplex_qc_pool_b	File[] (Optional)	fgbio_collect_duplex_seq_metrics_duplex_qc_pool_b
gatk_collect_alignment_summary_metrics_txt_pool_a	File[]	gatk_collect_alignment_summary_metrics_txt_pool_a
gatk_collect_alignment_summary_metrics_txt_pool_b	File[]	gatk_collect_alignment_summary_metrics_txt_pool_b
fgbio_collect_duplex_seq_metrics_umi_counts_pool_a	File[]	fgbio_collect_duplex_seq_metrics_umi_counts_pool_a
fgbio_collect_duplex_seq_metrics_umi_counts_pool_b	File[]	fgbio_collect_duplex_seq_metrics_umi_counts_pool_b
fgbio_collect_duplex_seq_metrics_family_size_pool_a	File[]	fgbio_collect_duplex_seq_metrics_family_size_pool_a
fgbio_collect_duplex_seq_metrics_family_size_pool_b	File[]	fgbio_collect_duplex_seq_metrics_family_size_pool_b
gatk_collect_hs_metrics_per_base_coverage_txt_pool_a	File[]	gatk_collect_hs_metrics_per_base_coverage_txt_pool_a
gatk_collect_hs_metrics_per_base_coverage_txt_pool_b	File[]	gatk_collect_hs_metrics_per_base_coverage_txt_pool_b
gatk_collect_insert_size_metrics_histogram_pdf_pool_a	File[]	gatk_collect_insert_size_metrics_histogram_pdf_pool_a
gatk_collect_insert_size_metrics_histogram_pdf_pool_b	File[]	gatk_collect_insert_size_metrics_histogram_pdf_pool_b
gatk_collect_hs_metrics_per_target_coverage_txt_pool_a	File[]	gatk_collect_hs_metrics_per_target_coverage_txt_pool_a
gatk_collect_hs_metrics_per_target_coverage_txt_pool_b	File[]	gatk_collect_hs_metrics_per_target_coverage_txt_pool_b
fgbio_collect_duplex_seq_metrics_duplex_umi_counts_pool_b	File[] (Optional)	fgbio_collect_duplex_seq_metrics_duplex_umi_counts_pool_b
fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_a	File[]	fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_a
fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_b	File[]	fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_b
fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_a	File[]	fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_a
fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_b	File[]	fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_b