Workflow: 04-quantification-pe-unstranded.cwl

Fetched 2023-01-09 02:52:52 GMT

RNA-seq 04 quantification


Inputs

ID Type Title Doc

nthreads Integer

annotation_file File

GTF annotation file

input_bam_files File[]

input_genome_sizes File

rsem_reference_files Directory

RSEM genome reference files - generated with the rsem-prepare-reference command

input_transcripts_bam_files File[]

bamtools_forward_filter_file File

JSON filter file for forward strand used in bamtools (see bamtools-filter command)

bamtools_reverse_filter_file File

JSON filter file for reverse strand used in bamtools (see bamtools-filter command)
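
The inputs above map onto a CWL job document, which may be written as JSON. The following is a minimal sketch of such a document built in Python; the input IDs and types come from the table above, while every path and the thread count are hypothetical placeholders.

```python
import json


def file_obj(path):
    """Build a CWL File object for a job document."""
    return {"class": "File", "path": path}


# All paths below are hypothetical placeholders, not files shipped with the workflow.
job = {
    "nthreads": 4,
    "annotation_file": file_obj("annotation.gtf"),
    "input_bam_files": [file_obj("sample1.bam")],
    "input_genome_sizes": file_obj("genome.sizes"),
    "rsem_reference_files": {"class": "Directory", "path": "rsem_reference"},
    "input_transcripts_bam_files": [file_obj("sample1.transcriptome.bam")],
    "bamtools_forward_filter_file": file_obj("forward_filter.json"),
    "bamtools_reverse_filter_file": file_obj("reverse_filter.json"),
}

# Serialize to JSON; the result can be saved and passed to a CWL runner as the job file.
job_json = json.dumps(job, indent=2)
print(job_json)
```

Array inputs (File[]) are given as lists, so several samples can be quantified in one run by adding more File objects to both BAM lists.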

Steps

ID Runs Label Doc
basename
../utils/basename.cwl (ExpressionTool)

split_bams

Split reads in a BAM file by strands and index forward and reverse output BAM files
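
Splitting paired-end reads by strand comes down to inspecting SAM FLAG bits. The sketch below is not the workflow's bamtools filter; it only illustrates the logic, assuming by default a library in which the second mate carries the orientation of the transcript (the `first_mate_is_sense` parameter is a hypothetical switch for the opposite convention).

```python
REVERSE = 0x10      # read is aligned to the reverse strand
FIRST_MATE = 0x40   # read is the first mate of the pair


def fragment_strand(flag, first_mate_is_sense=False):
    """Return '+' or '-' for the strand assigned to a paired-end read.

    Assumption: one mate has the same orientation as the fragment's strand
    and the other mate has the opposite orientation.
    """
    read_is_reverse = bool(flag & REVERSE)
    is_first = bool(flag & FIRST_MATE)
    # Decide whether this read is the mate that reports the strand directly.
    sense_mate = is_first if first_mate_is_sense else not is_first
    on_minus = read_is_reverse if sense_mate else not read_is_reverse
    return "-" if on_minus else "+"


# Flags 99/147 and 83/163 are the two properly paired orientations.
for flag in (99, 147, 83, 163):
    print(flag, fragment_strand(flag))
```

Both mates of a pair receive the same strand, so a pair is never split across the forward and reverse output files.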

bw2bdg-minus
../quant/bigWigToBedGraph.cwl (CommandLineTool)

bigWigToBedGraph - Convert from bigWig to bedGraph format.
usage: bigWigToBedGraph in.bigWig out.bedGraph
options:
  -chrom=chr1 - if set, restrict output to given chromosome
  -start=N - if set, restrict output to only that over start
  -end=N - if set, restrict output to only that under end
  -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs

featurecounts
../quant/subread-featurecounts.cwl (CommandLineTool)

featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoters, gene bodies, genomic bins and chromosomal locations. It can be used to count both RNA-seq and genomic DNA-seq reads.
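
featureCounts writes a tab-delimited table with a leading comment line, six annotation columns (Geneid, Chr, Start, End, Strand, Length) and one count column per input BAM file. The following sketch reads such a table into a dictionary; the sample rows are made up for illustration only.

```python
import csv

# Illustrative table in the featureCounts layout; gene names and counts are invented.
SAMPLE = (
    "# illustrative header comment\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.bam\n"
    "geneA\tchr1\t100\t500\t+\t401\t12\n"
    "geneB\tchr2\t900\t1500\t-\t601\t0\n"
)


def read_counts(text):
    """Map each gene ID to a {sample column: count} dictionary."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    counts = {}
    for row in reader:
        # Every column after the six annotation columns is a per-sample count.
        sample_columns = reader.fieldnames[6:]
        counts[row["Geneid"]] = {s: int(row[s]) for s in sample_columns}
    return counts


counts = read_counts(SAMPLE)
print(counts)
```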

rsem-calc-expr
../quant/rsem-calculate-expression.cwl (CommandLineTool)

In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments. RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified.

Users may use an alternative aligner by specifying one of the --sam and --bam options and providing an alignment file in the specified format. However, users should make sure that they align against the indices generated by 'rsem-prepare-reference' and that the alignment file satisfies the requirements mentioned in the ARGUMENTS section.

One simple way to make the alignment file satisfy RSEM's requirements (assuming the aligner places the mates of a paired-end read adjacent to each other) is to use the 'convert-sam-for-rsem' script. This script only accepts SAM format files as input. If a BAM format file is obtained, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command:

convert-sam-for-rsem /ref/mouse_125 input.sam -o input_for_rsem.sam

For details, please refer to the documentation page of 'convert-sam-for-rsem'.

The SAM/BAM format RSEM uses is v1.4, and it is compatible with older SAM/BAM formats. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires that SEQ and QUAL are not '*'.

The user must run 'rsem-prepare-reference' with the appropriate reference before using this program.

For single-end data, it is strongly recommended that the user provide the fragment length distribution parameters (--fragment-length-mean and --fragment-length-sd). For paired-end data, RSEM will automatically learn a fragment length distribution from the data.

Please note that some of the default values for the Bowtie parameters are not the same as those defined for Bowtie itself.

The temporary directory and all intermediate files will be removed when RSEM finishes unless '--keep-intermediate-files' is specified.

With the '--calc-pme' option, posterior mean estimates will be calculated in addition to maximum likelihood estimates.

With the '--calc-ci' option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates.
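
The alignment requirements stated above (no 0x100 bit in FLAG, SEQ and QUAL not '*') can be checked per SAM record. This is a minimal sketch of such a check, not part of RSEM; the function name and the example records are hypothetical.

```python
SECONDARY = 0x100  # FLAG bit that RSEM cannot recognize, per the text above


def rsem_problems(sam_line):
    """Return a list of reasons why one SAM alignment line would not suit RSEM."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    seq = fields[9]         # column 10: SEQ
    qual = fields[10]       # column 11: QUAL
    problems = []
    if flag & SECONDARY:
        problems.append("FLAG has 0x100 set")
    if seq == "*":
        problems.append("SEQ is '*'")
    if qual == "*":
        problems.append("QUAL is '*'")
    return problems


# Two invented records: the first is acceptable, the second violates all three rules.
good = "r1\t99\ttx1\t10\t255\t4M\t=\t50\t44\tACGT\tIIII"
bad = "r2\t355\ttx1\t10\t255\t4M\t=\t50\t44\t*\t*"
print(rsem_problems(good))
print(rsem_problems(bad))
```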

bdg2bw-raw-plus
../quant/bedGraphToBigWig.cwl (CommandLineTool)

Tool: bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig format.

bamcoverage-plus
../quant/deeptools-bamcoverage.cwl (CommandLineTool)

Example usage:
  $ bamCoverage -b reads.bam -o coverage.bw

This tool takes an alignment of reads or fragments as input (BAM file) and generates a coverage track (bigWig or bedGraph) as output. The coverage is calculated as the number of reads per bin, where bins are short consecutive counting windows of a defined size. It is possible to extend the length of the reads to better reflect the actual fragment length. *bamCoverage* offers normalization by scaling factor, Reads Per Kilobase per Million mapped reads (RPKM), and 1x depth (reads per genome coverage, RPGC).

Required arguments:

--bam BAM file, -b BAM file
  BAM file to process (default: None)

Output:

--outFileName FILENAME, -o FILENAME
  Output file name. (default: None)

--outFileFormat {bigwig,bedgraph}, -of {bigwig,bedgraph}
  Output file type. Either "bigwig" or "bedgraph". (default: bigwig)

Optional arguments:

--help, -h
  show this help message and exit

--scaleFactor SCALEFACTOR
  A number by which the coverage values are multiplied. When combined with a normalization option such as --normalizeTo1x, the computed scaling factor is multiplied by this value. (default: 1.0)

--MNase
  Determine nucleosome positions from MNase-seq data. Only 3 nucleotides at the center of each fragment are counted. The fragment ends are defined by the two mate reads. Only fragment lengths between 130 - 200 bp are considered to avoid dinucleosomes or other artifacts. *NOTE*: Requires paired-end data. A bin size of 1 is recommended. (default: False)

--filterRNAstrand {forward,reverse}
  Selects RNA-seq reads (single-end or paired-end) in the given strand. (default: None)

--version
  show program's version number and exit

--binSize INT bp, -bs INT bp
  Size of the bins, in bases, for the output of the bigwig/bedgraph file. (default: 50)

--region CHR:START:END, -r CHR:START:END
  Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time. The format is chr:start:end, for example --region chr10 or --region chr10:456700:891000. (default: None)

--blackListFileName BED file, -bl BED file
  A BED file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. (default: None)

--numberOfProcessors INT, -p INT
  Number of processors to use. Type "max/2" to use half the maximum number of processors or "max" to use all available processors. (default: max/2)

--verbose, -v
  Set to see processing messages. (default: False)

Read coverage normalization options:

--normalizeTo1x EFFECTIVE GENOME SIZE LENGTH
  Report read coverage normalized to 1x sequencing depth (also known as Reads Per Genomic Content (RPGC)). Sequencing depth is defined as: (total number of mapped reads * fragment length) / effective genome size. The scaling factor used is the inverse of the sequencing depth computed for the sample to match the 1x coverage. To use this option, the effective genome size has to be indicated after the option. The effective genome size is the portion of the genome that is mappable. Large fractions of the genome are stretches of NNNN that should be discarded. Also, if repetitive regions were not included in the mapping of reads, the effective genome size needs to be adjusted accordingly. Common values are: mm9: 2,150,570,000; hg19: 2,451,960,000; dm3: 121,400,000 and ce10: 93,260,000. See Table 2 of http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030377 or http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html for several effective genome sizes. (default: None)

--ignoreForNormalization IGNOREFORNORMALIZATION [IGNOREFORNORMALIZATION ...]
  A list of space-delimited chromosome names containing those chromosomes that should be excluded for computing the normalization. This is useful when considering samples with unequal coverage across chromosomes, like male samples. A usage example is --ignoreForNormalization chrX chrM. (default: None)

--skipNonCoveredRegions, --skipNAs
  This parameter determines if non-covered regions (regions without overlapping reads) in a BAM file should be skipped. The default is to treat those regions as having a value of zero. The decision to skip non-covered regions depends on the interpretation of the data. Non-covered regions may represent, for example, repetitive regions that should be skipped. (default: False)

--smoothLength INT bp
  The smooth length defines a window, larger than the binSize, to average the number of reads. For example, if the --binSize is set to 20 and the --smoothLength is set to 60, then, for each bin, the average of the bin and its left and right neighbors is considered. Any value smaller than --binSize will be ignored and no smoothing will be applied. (default: None)

Read processing options:

--extendReads [INT bp], -e [INT bp]
  This parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. *NOTE*: This feature is generally NOT recommended for spliced-read data, such as RNA-seq, as it would extend reads over skipped regions. *Single-end*: Requires a user specified value for the final fragment length. Reads that already exceed this fragment length will not be extended. *Paired-end*: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. The input of a fragment length value is optional. If no value is specified, it is estimated from the data (mean of the fragment size of all mate reads). (default: False)

--ignoreDuplicates
  If set, reads that have the same orientation and start position will be considered only once. If reads are paired, the mate's position also has to coincide to ignore a read. (default: False)

--minMappingQuality INT
  If set, only reads that have a mapping quality score of at least this are considered. (default: None)

--centerReads
  By adding this option, reads are centered with respect to the fragment length. For paired-end data, the read is centered at the fragment length defined by the two ends of the fragment. For single-end data, the given fragment length is used. This option is useful to get a sharper signal around enriched regions. (default: False)

--samFlagInclude INT
  Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will be also considered for the coverage. (default: None)

--samFlagExclude INT
  Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use --samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (default: None)
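
The two normalizations named in the help text reduce to short formulas. The sketch below restates them in Python: the 1x scaling factor follows the definition given under --normalizeTo1x, and the RPKM function uses the usual definition of reads per kilobase of bin per million mapped reads. The function names and the read counts in the example are illustrative assumptions, not deepTools code.

```python
def sequencing_depth(mapped_reads, fragment_length, effective_genome_size):
    """(total number of mapped reads * fragment length) / effective genome size."""
    return mapped_reads * fragment_length / effective_genome_size


def scale_factor_1x(mapped_reads, fragment_length, effective_genome_size):
    """Inverse of the sequencing depth, so that scaled coverage averages 1x."""
    return 1.0 / sequencing_depth(mapped_reads, fragment_length, effective_genome_size)


def rpkm(reads_in_bin, mapped_reads, bin_size_bp):
    """Reads per kilobase of bin per million mapped reads."""
    return reads_in_bin / ((mapped_reads / 1e6) * (bin_size_bp / 1000.0))


HG19_EFFECTIVE_SIZE = 2_451_960_000  # hg19 value quoted in the help text

# Hypothetical sample: 12,259,800 mapped reads with 100 bp fragments.
print(scale_factor_1x(12_259_800, 100, HG19_EFFECTIVE_SIZE))
# Hypothetical bin: 10 reads in a 50 bp bin, 20 million mapped reads.
print(rpkm(10, 20_000_000, 50))
```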

bdg2bw-raw-minus
../quant/bedGraphToBigWig.cwl (CommandLineTool)

Tool: bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig format.

negate_minus_bdg
../quant/negate-minus-strand-bedgraph.cwl (CommandLineTool)

Negate minus strand bedGraph values.
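
Negating the minus-strand signal lets both strands be displayed on one axis, with reverse-strand coverage drawn below zero. The sketch below shows the transformation on bedGraph lines; it is an illustration and not the contents of negate-minus-strand-bedgraph.cwl, and passing track, browser and comment lines through unchanged is an assumption.

```python
def negate_bedgraph_line(line):
    """Flip the sign of the value column of one bedGraph line."""
    stripped = line.rstrip("\n")
    # Assumption: header and comment lines are passed through unchanged.
    if not stripped or stripped.startswith(("track", "browser", "#")):
        return stripped
    chrom, start, end, value = stripped.split("\t")
    negated = -float(value)
    if negated == 0:
        negated = 0.0  # avoid printing "-0.0" for empty regions
    return "\t".join([chrom, start, end, str(negated)])


lines = ["chr1\t0\t50\t3", "chr1\t50\t100\t0", "chr1\t100\t150\t2.5"]
negated_lines = [negate_bedgraph_line(ln) for ln in lines]
print("\n".join(negated_lines))
```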

bamcoverage-minus
../quant/deeptools-bamcoverage.cwl (CommandLineTool)

Same tool and documentation as bamcoverage-plus above.

bdg2bw-norm-minus
../quant/bedGraphToBigWig.cwl (CommandLineTool)

Tool: bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig format.

bedsort-norm-minus
../quant/bedSort.cwl (CommandLineTool)

bedSort - Sort a .bed file by chrom,chromStart
usage: bedSort in.bed out.bed
in.bed and out.bed may be the same.
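
Sorting by chrom and then chromStart can be sketched in a few lines. This is an illustration of the ordering rather than the bedSort implementation; comparing chromosome names as plain strings (so that chr10 sorts before chr2) is an assumption.

```python
def bed_sort(lines):
    """Sort BED or bedGraph lines by chromosome name, then by numeric start."""
    def sort_key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[1]))

    # Blank lines are dropped before sorting.
    return sorted((ln for ln in lines if ln.strip()), key=sort_key)


unsorted = [
    "chr2\t300\t400\t1",
    "chr1\t500\t600\t2",
    "chr1\t50\t100\t4",
    "chr10\t0\t10\t1",
]
sorted_lines = bed_sort(unsorted)
print("\n".join(sorted_lines))
```

bedGraphToBigWig expects sorted input, which is why the sort steps sit before the bigWig conversions.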

negate_minus_bdg_norm
../quant/negate-minus-strand-bedgraph.cwl (CommandLineTool)

Negate minus strand bedGraph values.

bedsort_genomecov_plus
../quant/bedSort.cwl (CommandLineTool)

bedSort - Sort a .bed file by chrom,chromStart
usage: bedSort in.bed out.bed
in.bed and out.bed may be the same.

bedsort_genomecov_minus
../quant/bedSort.cwl (CommandLineTool)

bedSort - Sort a .bed file by chrom,chromStart
usage: bedSort in.bed out.bed
in.bed and out.bed may be the same.

bedtools_genomecov_plus
../map/bedtools-genomecov.cwl (CommandLineTool)

Tool: bedtools genomecov (aka genomeCoverageBed)
Version: v2.25.0
Summary: Compute the coverage of a feature file among a genome.

Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>

Options:

-ibam The input file is in BAM format. Note: BAM _must_ be sorted by position

-d Report the depth at each genome position (with one-based coordinates). Default behavior is to report a histogram.

-dz Report the depth at each genome position (with zero-based coordinates). Reports only non-zero positions. Default behavior is to report a histogram.

-bg Report depth in BedGraph format. For details, see: genome.ucsc.edu/goldenPath/help/bedgraph.html

-bga Report depth in BedGraph format, as above (-bg). However with this option, regions with zero coverage are also reported. This allows one to quickly extract all regions of a genome with 0 coverage by applying: "grep -w 0$" to the output.

-split Treat "split" BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).

-strand Calculate coverage of intervals from a specific strand. With BED files, requires at least 6 columns (strand is column 6). - (STRING): can be + or -

-5 Calculate coverage of 5' positions (instead of entire interval).

-3 Calculate coverage of 3' positions (instead of entire interval).

-max Combine all positions with a depth >= max into a single bin in the histogram. Irrelevant for -d and -bedGraph - (INTEGER)

-scale Scale the coverage by a constant factor. Each coverage value is multiplied by this factor before being reported. Useful for normalizing coverage by, e.g., reads per million (RPM). - Default is 1.0; i.e., unscaled. - (FLOAT)

-trackline Adds a UCSC/Genome-Browser track line definition in the first line of the output. - See here for more details about track line definition: http://genome.ucsc.edu/goldenPath/help/bedgraph.html - NOTE: When adding a trackline definition, the output BedGraph can be easily uploaded to the Genome Browser as a custom track, BUT CAN NOT be converted into a BigWig file (w/o removing the first line).

-trackopts Writes additional track line definition parameters in the first line. - Example: -trackopts 'name="My Track" visibility=2 color=255,30,30' Note the use of single-quotes if you have spaces in your parameters. - (TEXT)

Notes:

(1) The genome file should be tab delimited and structured as follows: <chromName><TAB><chromSize>

For example, Human (hg19):
chr1 249250621
chr2 243199373
...
chr18_gl000207_random 4262

(2) The input BED (-i) file must be grouped by chromosome. A simple "sort -k 1,1 <BED> > <BED>.sorted" will suffice.

(3) The input BAM (-ibam) file must be sorted by position. A "samtools sort <BAM>" should suffice.

Tips: One can use the UCSC Genome Browser's MySQL database to extract chromosome sizes. For example, H. sapiens:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
  "select chrom, size from hg19.chromInfo" > hg19.genome
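
The -bg and -scale options described above can be illustrated with a small sweep over interval boundaries. This is a sketch of the idea under simple assumptions (zero-based, half-open intervals held in memory), not the bedtools implementation; the function name is hypothetical.

```python
from collections import defaultdict


def bedgraph_coverage(intervals, scale=1.0):
    """Return (chrom, start, end, depth) runs for regions with non-zero depth.

    Each depth is multiplied by `scale`, mirroring the -scale option.
    """
    # Record +1 where an interval starts and -1 where it ends.
    events = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in intervals:
        events[chrom][start] += 1
        events[chrom][end] -= 1

    runs = []
    for chrom in sorted(events):
        depth = 0
        prev = None
        for pos in sorted(events[chrom]):
            if depth > 0 and pos > prev:
                value = depth * scale
                last = runs[-1] if runs else None
                # Extend the previous run when it is adjacent and has equal depth.
                if last and last[0] == chrom and last[2] == prev and last[3] == value:
                    runs[-1] = (chrom, last[1], pos, value)
                else:
                    runs.append((chrom, prev, pos, value))
            depth += events[chrom][pos]
            prev = pos
    return runs


reads = [("chr1", 0, 10), ("chr1", 5, 15), ("chr2", 0, 5), ("chr2", 5, 10)]
coverage = bedgraph_coverage(reads)
print(coverage)
```

For reads-per-million scaling, `scale` would be set to 1e6 divided by the number of mapped reads.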

bedtools_genomecov_minus
../map/bedtools-genomecov.cwl (CommandLineTool)

Same tool and documentation as bedtools_genomecov_plus above.

Outputs

ID Type Label Doc
bam_plus_files File[]

BAM files containing only reads in the forward (plus) strand.

bam_minus_files File[]

BAM files containing only reads in the reverse (minus) strand.

rsem_genes_files File[]

RSEM genes files

bw_raw_plus_files File[]

Raw bigWig files from BAM files containing only reads in the forward (plus) strand.

bw_norm_plus_files File[]

RPKM-normalized bigWig files from BAM files containing only reads in the forward (plus) strand.

bw_raw_minus_files File[]

Raw bigWig files from BAM files containing only reads in the reverse (minus) strand.

bw_norm_minus_files File[]

RPKM-normalized bigWig files from BAM files containing only reads in the reverse (minus) strand.

rsem_isoforms_files File[]

RSEM isoforms files

featurecounts_counts File[]

Read count files produced by featureCounts.

Permalink: https://w3id.org/cwl/view/git/8aabde14169421a7115c5cd48c4740b3a7bd818f/v1.0/RNA-seq_pipeline/04-quantification-pe-unstranded.cwl