Workflow: qc_collapsed_bam

Fetched 2023-01-04 18:16:20 GMT

Inputs

ID Type Title Doc
maf File
json Boolean (Optional)

Also output data in JSON format.

plot Boolean (Optional)

Also output plots of the data.

json_1 Boolean (Optional)
prefix String (Optional)
bed_file File (Optional)
vcf_file File
reference File
sample_sex String (Optional)
sample_name String
sample_group String (Optional)
collapsed_bam File[] collapsed_bam
major_threshold Float (Optional)
minor_threshold Float (Optional)

Minor contamination threshold for bad sample.

coverage_threshold Integer (Optional)

Samples with Y chromosome coverage above this value will be considered male.

pool_a_bait_intervals File (Optional) pool_a_bait_intervals

Optional set of intervals over which to restrict analysis.

pool_b_bait_intervals File (Optional) pool_b_bait_intervals

Optional set of intervals over which to restrict analysis.

group_reads_by_umi_bam File[] group_reads_by_umi_bam

Input BAM file generated by GroupReadsByUmi.

hsmetrics_coverage_cap Integer (Optional)
pool_a_target_intervals File pool_a_target_intervals
pool_b_target_intervals File pool_b_target_intervals
hsmetrics_minimum_base_quality Integer (Optional)
hsmetrics_minimum_mapping_quality Integer (Optional)
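
The inputs above that are not marked Optional can be collected into a CWL job file. The following is a minimal Python sketch of such a file, written as JSON. Every file path and the sample name are hypothetical placeholders, and the helper `cwl_file` is introduced only for illustration; it is not part of the workflow.

```python
import json


def cwl_file(path):
    """Build a CWL File object for a job file (the path is a placeholder)."""
    return {"class": "File", "path": path}


# Required inputs, taken from the Inputs listing (those not marked Optional).
# All paths and the sample name below are hypothetical.
job = {
    "maf": cwl_file("variants.maf"),
    "vcf_file": cwl_file("sites.vcf"),
    "reference": cwl_file("reference.fasta"),
    "sample_name": "SAMPLE_1",
    "collapsed_bam": [cwl_file("sample_1.collapsed.bam")],
    "group_reads_by_umi_bam": [cwl_file("sample_1.grouped.bam")],
    "pool_a_target_intervals": cwl_file("pool_a.targets.interval_list"),
    "pool_b_target_intervals": cwl_file("pool_b.targets.interval_list"),
}

# Optional inputs can be added as needed, e.g. the JSON and plot switches.
job["json"] = True
job["plot"] = True

job_json = json.dumps(job, indent=2, sort_keys=True)
print(job_json)
```

The printed JSON could be saved to a file and passed to a CWL runner as the job document alongside the workflow.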

Steps

ID Runs Label Doc
biometrics_minor access_qc__packed.cwl#biometrics_minor.cwl_2 (CommandLineTool)
bam_qc_stats_pool_a bam_qc_stats
bam_qc_stats_pool_b bam_qc_stats
biometrics_sexmismatch access_qc__packed.cwl#biometrics_sexmismatch.cwl_2 (CommandLineTool)
biometrics_major_0_2_13 access_qc__packed.cwl#biometrics_major.cwl_2 (CommandLineTool)
biometrics_extract_0_2_13 access_qc__packed.cwl#biometrics_extract.cwl (CommandLineTool)
getbasecountsmultisample_1_2_5 access_qc__packed.cwl#getbasecountsmultisample_1.2.5.cwl (CommandLineTool) getbasecountsmultisample_1.2.5
fgbio_collect_duplex_seq_metrics_1_2_0 access_qc__packed.cwl#fgbio_collect_duplex_seq_metrics_1.2.0.cwl (CommandLineTool) fgbio_collect_duplex_seq_metrics_1.2.0

Collects a suite of metrics to QC duplex sequencing data.

Inputs
------
The input to this tool must be a BAM file that is either:
1. The exact BAM output by the 'GroupReadsByUmi' tool (in the sort order it was produced in)
2. A BAM file that has MI tags present on all reads (usually set by 'GroupReadsByUmi') and has been sorted with 'SortBam' into 'TemplateCoordinate' order.

Calculation of metrics may be restricted to a set of regions using the '--intervals' parameter. This can significantly affect results, as off-target reads in duplex sequencing experiments often have very different properties than on-target reads due to the lack of enrichment.

Several metrics are calculated related to the fraction of tag families that have duplex coverage. The definition of "duplex" is controlled by the '--min-ab-reads' and '--min-ba-reads' parameters. The default is to treat any tag family with at least one observation of each strand as a duplex, but this could be made more stringent, e.g. by setting '--min-ab-reads=3 --min-ba-reads=3'. If different thresholds are used then '--min-ab-reads' must be the higher value.

Outputs
-------
The following output files are produced:
1. <output>.family_sizes.txt: metrics on the frequency of different types of families of different sizes
2. <output>.duplex_family_sizes.txt: metrics on the frequency of duplex tag families by the number of observations from each strand
3. <output>.duplex_yield_metrics.txt: summary QC metrics produced using 5%, 10%, 15%...100% of the data
4. <output>.umi_counts.txt: metrics on the frequency of observations of UMIs within reads and tag families
5. <output>.duplex_qc.pdf: a series of plots generated from the preceding metrics files for visualization
6. <output>.duplex_umi_counts.txt: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the '--duplex-umi-counts' option is used, as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present.
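
The duplex definition above can be illustrated with a short Python sketch. It is not fgbio's implementation: the function `is_duplex` is hypothetical, the read counts are toy numbers, and it assumes that the strand with more reads is compared against `min_ab_reads` and the other strand against `min_ba_reads`, which is one reading of the rule that '--min-ab-reads' must be the higher value.

```python
def is_duplex(ab_reads, ba_reads, min_ab_reads=1, min_ba_reads=1):
    """Return True if a tag family counts as duplex under the thresholds.

    Assumption (not confirmed by the text): the strand with more reads is
    checked against min_ab_reads and the other against min_ba_reads.
    """
    if min_ab_reads < min_ba_reads:
        raise ValueError("min_ab_reads must be >= min_ba_reads")
    larger = max(ab_reads, ba_reads)
    smaller = min(ab_reads, ba_reads)
    return larger >= min_ab_reads and smaller >= min_ba_reads


# (AB reads, BA reads) per tag family; toy numbers for illustration only.
families = [(5, 4), (3, 1), (2, 0), (1, 1)]

# Default: at least one observation of each strand.
default_calls = [is_duplex(ab, ba) for ab, ba in families]

# More stringent: at least three observations of each strand.
strict_calls = [
    is_duplex(ab, ba, min_ab_reads=3, min_ba_reads=3) for ab, ba in families
]

print(default_calls)
print(strict_calls)
```

Under the default thresholds only the family with no reads on one strand is excluded; with both thresholds at 3 only the (5, 4) family remains a duplex.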

Within the metrics files the prefixes 'CS', 'SS' and 'DS' are used to mean:
* CS: tag families where membership is defined solely on matching genome coordinates and strand
* SS: single-stranded tag families where membership is defined by genome coordinates, strand and UMI; i.e. 50/A and 50/B are considered different tag families
* DS: double-stranded tag families where membership is collapsed across single-stranded tag families from the same double-stranded source molecule; i.e. 50/A and 50/B become one family
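
The three family definitions above differ only in the key used to group reads. The Python sketch below is a toy model of that idea, not fgbio's code: the read tuples, coordinates and the helper `family_sizes` are made up for illustration, and the identifiers follow the "50/A", "50/B" form used in the text.

```python
from collections import defaultdict

# Toy reads: (chrom, start, end, strand, molecule_id). Coordinates are made up.
reads = [
    ("chr1", 100, 250, "+", "50/A"),
    ("chr1", 100, 250, "+", "50/A"),
    ("chr1", 100, 250, "-", "50/B"),
    ("chr1", 100, 250, "+", "51/A"),
]


def family_sizes(reads, key):
    """Count reads per family, where `key` maps a read to its family key."""
    sizes = defaultdict(int)
    for read in reads:
        sizes[key(read)] += 1
    return dict(sizes)


# CS: genome coordinates and strand only.
cs = family_sizes(reads, lambda r: (r[0], r[1], r[2], r[3]))

# SS: coordinates, strand and the full identifier, so 50/A and 50/B differ.
ss = family_sizes(reads, lambda r: (r[0], r[1], r[2], r[3], r[4]))

# DS: drop the /A or /B suffix so both strands of a molecule collapse.
ds = family_sizes(reads, lambda r: (r[0], r[1], r[2], r[4].split("/")[0]))

print(sorted(cs.values()), sorted(ss.values()), sorted(ds.values()))
```

In this toy data the three reads of molecule 50 form two SS families (50/A and 50/B) but a single DS family.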

Requirements
------------
For plots to be generated, R must be installed along with the ggplot2 package and its suggested dependencies. Successfully executing the following in R will ensure a working installation:

install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies=TRUE)

fgbio_collect_duplex_seq_metrics_1_2_1 access_qc__packed.cwl#fgbio_collect_duplex_seq_metrics_1.2.0.cwl (CommandLineTool) fgbio_collect_duplex_seq_metrics_1.2.0

Outputs

ID Type Label Doc
fillout_maf File
biometrics_major_csv File
biometrics_minor_csv File[]
biometrics_major_json File (Optional)
biometrics_major_plot File (Optional)
biometrics_minor_json File[] (Optional)
biometrics_minor_plot File[] (Optional)
biometrics_extract_pickle File
biometrics_sexmismatch_csv File[]
biometrics_minor_sites_plot File[] (Optional)
biometrics_sexmismatch_json File[] (Optional)
gatk_collect_hs_metrics_txt_pool_a File[] gatk_collect_hs_metrics_txt_pool_a
gatk_collect_hs_metrics_txt_pool_b File[] gatk_collect_hs_metrics_txt_pool_b
gatk_collect_insert_size_metrics_txt_pool_a File[] gatk_collect_insert_size_metrics_txt_pool_a
gatk_collect_insert_size_metrics_txt_pool_b File[] gatk_collect_insert_size_metrics_txt_pool_b
fgbio_collect_duplex_seq_metrics_duplex_pool_a File[] (Optional) fgbio_collect_duplex_seq_metrics_duplex_pool_a
fgbio_collect_duplex_seq_metrics_duplex_qc_pool_a File[] (Optional) fgbio_collect_duplex_seq_metrics_duplex_qc_pool_a
fgbio_collect_duplex_seq_metrics_duplex_qc_pool_b File[] (Optional) fgbio_collect_duplex_seq_metrics_duplex_qc_pool_b
gatk_collect_alignment_summary_metrics_txt_pool_a File[] gatk_collect_alignment_summary_metrics_txt_pool_a
gatk_collect_alignment_summary_metrics_txt_pool_b File[] gatk_collect_alignment_summary_metrics_txt_pool_b
fgbio_collect_duplex_seq_metrics_umi_counts_pool_a File[] fgbio_collect_duplex_seq_metrics_umi_counts_pool_a
fgbio_collect_duplex_seq_metrics_umi_counts_pool_b File[] fgbio_collect_duplex_seq_metrics_umi_counts_pool_b
fgbio_collect_duplex_seq_metrics_family_size_pool_a File[] fgbio_collect_duplex_seq_metrics_family_size_pool_a
fgbio_collect_duplex_seq_metrics_family_size_pool_b File[] fgbio_collect_duplex_seq_metrics_family_size_pool_b
gatk_collect_hs_metrics_per_base_coverage_txt_pool_a File[] gatk_collect_hs_metrics_per_base_coverage_txt_pool_a
gatk_collect_hs_metrics_per_base_coverage_txt_pool_b File[] gatk_collect_hs_metrics_per_base_coverage_txt_pool_b
gatk_collect_insert_size_metrics_histogram_pdf_pool_a File[] gatk_collect_insert_size_metrics_histogram_pdf_pool_a
gatk_collect_insert_size_metrics_histogram_pdf_pool_b File[] gatk_collect_insert_size_metrics_histogram_pdf_pool_b
gatk_collect_hs_metrics_per_target_coverage_txt_pool_a File[] gatk_collect_hs_metrics_per_target_coverage_txt_pool_a
gatk_collect_hs_metrics_per_target_coverage_txt_pool_b File[] gatk_collect_hs_metrics_per_target_coverage_txt_pool_b
fgbio_collect_duplex_seq_metrics_duplex_umi_counts_pool_b File[] (Optional) fgbio_collect_duplex_seq_metrics_duplex_umi_counts_pool_b
fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_a File[] fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_a
fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_b File[] fgbio_collect_duplex_seq_metrics_duplex_family_size_pool_b
fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_a File[] fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_a
fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_b File[] fgbio_collect_duplex_seq_metrics_duplex_yield_metrics_pool_b
Permalink: https://w3id.org/cwl/view/git/248e7c3edaff48e1b97a7931d66aa3b23ce97f54/access_qc__packed.cwl?part=qc_collapsed_bam.cwl