CWL Workflow: CLIP-Seq pipeline for single-read experiment NNNNG

Fetched 2023-01-08 22:13:38 GMT

Verified with cwltool version 3.1.20221201130942

Cross-Linking ImmunoPrecipitation
=================================

`CLIP` (`cross-linking immunoprecipitation`) is a method used in molecular biology that combines UV cross-linking with immunoprecipitation in order to analyse protein interactions with RNA or to precisely locate RNA modifications (e.g. m6A). (Uhl|Houwaart|Corrado|Wright|Backofen|2017)(Ule|Jensen|Ruggiu|Mele|2003)(Sugimoto|König|Hussain|Zupan|2012)(Zhang|Darnell|2011) (Ke| Alemu| Mertens| Gantman|2015) CLIP-based techniques can be used to map RNA binding protein binding sites or RNA modification sites (Ke| Alemu| Mertens| Gantman|2015)(Ke| Pandya-Jones| Saito| Fak|2017) of interest on a genome-wide scale, thereby increasing the understanding of post-transcriptional regulatory networks.

The identification of sites where RNA-binding proteins (RNABPs) interact with target RNAs opens the door to understanding the vast complexity of RNA regulation. UV cross-linking and immunoprecipitation (CLIP) is a transformative technology in which RNAs purified from _in vivo_ cross-linked RNA-protein complexes are sequenced to reveal footprints of RNABP:RNA contacts. CLIP combined with high-throughput sequencing (HITS-CLIP) is a generalizable strategy to produce transcriptome-wide maps of RNA binding with higher accuracy and resolution than standard RNA immunoprecipitation (RIP) profiling or purely computational approaches. The application of CLIP to Argonaute proteins has expanded the utility of this approach to mapping binding sites for microRNAs and other small regulatory RNAs. Finally, recent advances in data analysis take advantage of cross-link–induced mutation sites (CIMS) to refine RNA-binding maps to single-nucleotide resolution. Once IP conditions are established, HITS-CLIP takes ~8 d to prepare RNA for sequencing. Established pipelines for data analysis, including those for CIMS, take 3–4 d.

Workflow
--------

CLIP begins with the in-vivo cross-linking of RNA-protein complexes using ultraviolet light (UV). Upon UV exposure, covalent bonds are formed between proteins and nucleic acids that are in close proximity. (Darnell|2012) The cross-linked cells are then lysed, and the protein of interest is isolated via immunoprecipitation. In order to allow for sequence specific priming of reverse transcription, RNA adapters are ligated to the 3' ends, while radiolabeled phosphates are transferred to the 5' ends of the RNA fragments. The RNA-protein complexes are then separated from free RNA using gel electrophoresis and membrane transfer. Proteinase K digestion is then performed in order to remove protein from the RNA-protein complexes. This step leaves a peptide at the cross-link site, allowing for the identification of the cross-linked nucleotide. (König| McGlincy| Ule|2012) After ligating RNA linkers to the RNA 5' ends, cDNA is synthesized via RT-PCR. High-throughput sequencing is then used to generate reads containing distinct barcodes that identify the last cDNA nucleotide. Interaction sites can be identified by mapping the reads back to the transcriptome.


This workflow is Open Source and may be reused according to the terms of: Apache License 2.0

Note that the tools invoked by the workflow may have separate licenses.

Inputs

ID	Type	Title	Doc
adapter	String	Adapter sequence to be trimmed	Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will try to auto-detect whether the Illumina universal, Nextera transposase or Illumina small RNA adapter sequence was used. Also see '--illumina', '--nextera' and '--small_rna'. If no adapter can be detected within the first 1 million sequences of the first file specified Trim Galore defaults to '--illumina'.
species	String	Species string for clipper (hg38, mm10)	species: one of ce10 ce11 dm3 hg19 GRCh38 mm9 mm10
threads	Integer (Optional)	Number of threads	Number of threads for those steps that support multithreading
bc_pattern	String	Barcode pattern
fastq_file	File [FASTQ]	FASTQ input file	Reads data in a FASTQ format, received after single end sequencing
clip_3p_end	Integer (Optional)	Clip from 3p end	Number of bases to clip from the 3p end
clip_5p_end	Integer (Optional)	Clip from 5p end	Number of bases to clip from the 5p end
exclude_chr	String (Optional)	Chromosome to be excluded in rpkm calculation	Chromosome to be excluded in rpkm calculation
extract_method		UMI extract method 'string' or 'regex'	How to extract the umi +/- cell barcodes, Choose from 'string' or 'regex'
annotation_file	File [TSV]	Annotation file	Tab-separated annotation file
chrom_length_file	File [Textual format]	Chromosomes length file	Chromosomes length file
star_indices_folder	Directory	STAR indices folder	Path to STAR generated indices
bowtie_indices_folder	Directory	BowTie Ribosomal Indices	Path to Bowtie generated indices
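
The inputs listed above can be collected into a job file for a CWL runner. The following is a minimal sketch in Python that builds such a job as JSON. All file and folder paths are hypothetical placeholders, and the adapter value is only an example sequence. The `bc_pattern` value `NNNNG` is taken from the workflow title and is an assumption about what this input would normally hold. `extract_method` has no Type listed in the table, so treating it as required is also an assumption.

```python
import json

# Input IDs copied from the Inputs table. IDs whose Type is marked
# "(Optional)" are listed in OPTIONAL; everything else is treated as required.
# extract_method has no Type in the table, so "required" is an assumption.
REQUIRED = [
    "adapter", "species", "bc_pattern", "fastq_file", "extract_method",
    "annotation_file", "chrom_length_file", "star_indices_folder",
    "bowtie_indices_folder",
]
OPTIONAL = ["threads", "clip_3p_end", "clip_5p_end", "exclude_chr"]


def cwl_file(path):
    """CWL job-file representation of a File input."""
    return {"class": "File", "path": path}


def cwl_dir(path):
    """CWL job-file representation of a Directory input."""
    return {"class": "Directory", "path": path}


def build_job(**values):
    """Return the job dict, rejecting missing or unknown input IDs."""
    missing = sorted(set(REQUIRED) - set(values))
    unknown = sorted(set(values) - set(REQUIRED) - set(OPTIONAL))
    if missing or unknown:
        raise ValueError(f"missing inputs: {missing}; unknown inputs: {unknown}")
    return values


job = build_job(
    adapter="AGATCGGAAGAGC",          # placeholder adapter sequence
    species="mm10",                   # one of the values listed for `species`
    bc_pattern="NNNNG",               # assumption, taken from the workflow title
    extract_method="string",          # 'string' or 'regex' per the table
    fastq_file=cwl_file("reads.fastq"),                    # hypothetical path
    annotation_file=cwl_file("annotation.tsv"),            # hypothetical path
    chrom_length_file=cwl_file("chrom.sizes"),             # hypothetical path
    star_indices_folder=cwl_dir("star_indices"),           # hypothetical path
    bowtie_indices_folder=cwl_dir("bowtie_ribo_indices"),  # hypothetical path
    threads=4,
)

job_json = json.dumps(job, indent=2)
print(job_json)
```

Saved to a file, the printed JSON could be passed to a CWL runner together with the workflow document, for example `cwltool clipseq-se.cwl job.json`.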

Steps

ID	Runs	Label	Doc
clipper	../tools/clipper.cwl (CommandLineTool)		CLIPper is a tool to define peaks in your CLIP-seq dataset. CLIPper was developed in the Yeo Lab at the University of California, San Diego. Usage: clipper --bam CLIP-seq_reads.srt.bam --species hg19 --outfile CLIP-seq_reads.srt.peaks.bed
bamtobed	../tools/bedtools-bamtobed.cwl (CommandLineTool)	bedtools-bamtobed	Tool: bedtools bamtobed (aka bamToBed) Version: v2.26.0 Summary: Converts BAM alignments to BED6 or BEDPE format. Usage: bedtools bamtobed [OPTIONS] -i <bam> Options: -bedpe Write BEDPE format. - Requires BAM to be grouped or sorted by query. -mate1 When writing BEDPE (-bedpe) format, always report mate one as the first BEDPE "block". -bed12 Write "blocked" BED format (aka "BED12"). Forces -split. http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1 -split Report "split" BAM alignments as separate BED entries. Splits only on N CIGAR operations. -splitD Split alignments based on N and D CIGAR operators. Forces -split. -ed Use BAM edit distance (NM tag) for BED score. - Default for BED is to use mapping quality. - Default for BEDPE is to use the minimum of the two mapping qualities for the pair. - When -ed is used with -bedpe, the total edit distance from the two mates is reported. -tag Use other NUMERIC BAM alignment tag for BED score. - Default for BED is to use mapping quality. Disallowed with BEDPE output. -color An R,G,B string for the color used with BED12 format. Default is (255,0,0). -cigar Add the CIGAR string to the BED entry as a 7th column.
dedup_umi	../tools/umi_tools-dedup.cwl (CommandLineTool)	Deduplicate BAM files based on the first mapping co-ordinate and the UMI attached to the read	dedup.py - Deduplicate reads that are coded with a UMI ========================================================= :Author: Ian Sudbery, Tom Smith :Release: $Id$ :Date: |today| :Tags: Python UMI Purpose ------- The purpose of this command is to deduplicate BAM files based on the first mapping co-ordinate and the UMI attached to the read. Selecting the representative read --------------------------------- The following criteria are applied to select the read that will be retained from a group of duplicated reads: 1. The read with the lowest number of mapping coordinates (see --multimapping-detection-method option) 2. The read with the highest mapping quality Otherwise a read is chosen at random.
tagstopeak	../tools/clip-toolkit-tag2peak.cwl (CommandLineTool)		detecting peaks from CLIP data Usage: tag2peak.pl [options] <tag.bed> <peak.bed> <tag.bed> : BED file of unique CLIP tags, input <peak.bed>: BED file of called peaks, output Options: -big : big input file -ss : separate the two strands --valley-seeking : find candidate peaks by valley seeking --valley-depth [float] : depth of valley if valley seeking (0.9) --out-boundary [string]: output cluster boundaries --out-half-PH [string]: output half peak height boundaries --dbkey [string]: species to retrieve the default gene bed file (mm10|hg19) --gene [string]: custom gene bed file for scan statistics (will override --dbkey) --use-expr : use expression levels given in the score column in the custom gene bed file for normalization -p [float] : threshold of p-value to call peak (0.01) --multi-test : do Bonferroni multiple test correction -minPH [int] : min peak height (2) -maxPH [int] : max peak height to calculate p-value(-1, no limit if < 0) --skip-out-of-range-peaks: skip peaks with PH > maxPH -gap [int] : merge cluster peaks closer than the gap (-1, no merge if < 0) --prefix [string]: prefix of peak id (Peak) -c [dir] : cache dir --keep-cache : keep cache when the job done -v : verbose
trim_fastq	../tools/trimgalore.cwl (CommandLineTool)		Tool runs Trimgalore - the wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files. `default_log_name` function returns names for generated log files (for both paired-end and single-end cases). `trim_galore` itself doesn't support setting custom names for output files. For paired-end data processing both `input_file_pair` and `paired` should be set. If either of them is not set, the other one becomes unset automatically.
extract_umi	../tools/umi_tools-extract.cwl (CommandLineTool)	Extract UMI barcode from a read and add it to the read name	extract.py - Extract UMI from fastq ==================================================== :Author: Ian Sudbery, Tom Smith :Release: $Id$ :Date: |today| :Tags: Python UMI Purpose ------- Extract UMI barcode from a read and add it to the read name, leaving any sample barcode in place. Can deal with paired end reads and UMIs split across the paired ends. Can also optionally extract cell barcodes and append these to the read name also. See the section below for an explanation for how to encode the barcode pattern(s) to specify the position of the UMI +/- cell barcode. Filtering and correcting cell barcodes -------------------------------------- umi_tools extract can optionally filter cell barcodes (--filter-cell-barcode) against a user-supplied whitelist (--whitelist). If a whitelist is not available for your data, e.g. if you have performed droplet-based scRNA-Seq, you can use the whitelist tool. Cell barcodes which do not match the whitelist (user-generated or automatically generated) can also be optionally corrected using the --error-correct-cell option. The whitelist should be in the following format (tab-separated): AAAAAA AGAAAA AAAATC AAACAT AAACTA AAACTN,GAACTA AAATAC AAATCA GAATCA AAATGT AAAGGT,CAATGT Where column 1 is the whitelisted cell barcodes and column 2 is the list (comma-separated) of other cell barcodes which should be corrected to the barcode in column 1. If the --error-correct-cell option is not used, this column will be ignored. Any additional columns in the whitelist input, such as the counts columns from the output of umi_tools whitelist, will be ignored.
star_aligner	../tools/star-alignreads.cwl (CommandLineTool)		Tool runs STAR alignReads. `default_output_name_prefix` function returns output files prefix if `outFileNamePrefix` is not set. By default prefix is equal to basename of `readFilesIn`.
bam_to_bigwig	../subworkflows/bam-bedgraph-bigwig.cwl (Workflow)		Workflow converts input BAM file into bigWig and bedGraph files
extract_fastq	../tools/extract-fastq.cwl (CommandLineTool)		Tool to decompress input FASTQ file Bash script's logic: - disable case sensitive glob check - check if root name of input file already includes '.fastq' or '.fq' extension. If yes, set DEFAULT_EXT to "" - check file type, decompress if needed - return 1, if file type is not recognized This script also works if the input file doesn't have any extension at all
island_intersect	../tools/iaintersect.cwl (CommandLineTool)		Tool assigns each peak obtained from MACS2 to a gene and region (upstream, promoter, exon, intron, intergenic) `default_output_filename` function returns output filename with suffix set as `ext` argument. Function is called when either `output_filename` or `log_filename` inputs are not provided.
samtools_sort_index1	../tools/samtools-sort-index.cwl (CommandLineTool)		Tool to sort and index input BAM/SAM/CRAM. If input `trigger` is set to `true` or isn't set at all (`true` is used by default), run `samtools sort` and `samtools index`, return sorted BAM and BAI/CSI index file. If input `trigger` is set to `false`, return unchanged `sort_input` (BAM/SAM/CRAM) and index (BAI/CSI, if provided in `secondaryFiles`) files, previously staged into output directory. Before executing `baseCommand`, `sort_input` and `secondaryFiles` (if provided) are staged into directory set as docker parameter `--workdir` (tool's output directory), using `InitialWorkDirRequirement`. Setting `writable: true` makes cwl-runner make copies of the `sort_input` and `secondaryFiles` (if provided) and mount them to docker container with `rw` mode as part of `--workdir` (if set to false, the files staged into output directory will be mounted to docker container separately with `ro` mode). Because both `samtools sort` and `samtools index` can overwrite files with the same names (and in case of `samtools sort` even the input file can be overwritten), we don't need to rename any of the staged files. Trigger logic is implemented in two bash scripts set by default as `bash_script_sort` and `bash_script_index` inputs. For both of them, if the first argument $0 (which is `trigger` input) is true, run `samtools sort/index` with the rest of the arguments. If $0 is not true, skip `samtools sort/index` and return `sort_input` and `secondaryFiles` (if provided) staged into output directory. Input `trigger` is Boolean, but returns String, because of `valueFrom` field. The `valueFrom` is used, because if `trigger` is false, cwl-runner doesn't append this argument at all to the `baseCommand` - new feature of CWL v1.0.2. Alternatively, `prefix` field could be used, but it would require changing the script logic. If using `sort_output_filename`, the output file extension should be `.bam`, because `samtools sort` defines the output file format based on the file extension. If `.sam` is used as output filename, it cannot be usefully indexed by `samtools index`. `default_bam` function is used to generate output filename for `samtools sort` if input `sort_output_filename` is not set or when `trigger` is false and we need to return `sort_input` and `secondaryFiles` (if provided) files staged into output directory. Output filename is generated based on `sort_input` basename with `.bam` extension by default. `ext` function is used to return the index file extension (BAI/CSI) based on `csi` and `bai` inputs according to the following logic: `csi` && `bai` => BAI; !`csi` && !`bai` => BAI; `csi` && !`bai` => CSI
samtools_sort_index2	../tools/samtools-sort-index.cwl (CommandLineTool)		Tool to sort and index input BAM/SAM/CRAM. If input `trigger` is set to `true` or isn't set at all (`true` is used by default), run `samtools sort` and `samtools index`, return sorted BAM and BAI/CSI index file. If input `trigger` is set to `false`, return unchanged `sort_input` (BAM/SAM/CRAM) and index (BAI/CSI, if provided in `secondaryFiles`) files, previously staged into output directory. Before executing `baseCommand`, `sort_input` and `secondaryFiles` (if provided) are staged into directory set as docker parameter `--workdir` (tool's output directory), using `InitialWorkDirRequirement`. Setting `writable: true` makes cwl-runner make copies of the `sort_input` and `secondaryFiles` (if provided) and mount them to docker container with `rw` mode as part of `--workdir` (if set to false, the files staged into output directory will be mounted to docker container separately with `ro` mode). Because both `samtools sort` and `samtools index` can overwrite files with the same names (and in case of `samtools sort` even the input file can be overwritten), we don't need to rename any of the staged files. Trigger logic is implemented in two bash scripts set by default as `bash_script_sort` and `bash_script_index` inputs. For both of them, if the first argument $0 (which is `trigger` input) is true, run `samtools sort/index` with the rest of the arguments. If $0 is not true, skip `samtools sort/index` and return `sort_input` and `secondaryFiles` (if provided) staged into output directory. Input `trigger` is Boolean, but returns String, because of `valueFrom` field. The `valueFrom` is used, because if `trigger` is false, cwl-runner doesn't append this argument at all to the `baseCommand` - new feature of CWL v1.0.2. Alternatively, `prefix` field could be used, but it would require changing the script logic. If using `sort_output_filename`, the output file extension should be `.bam`, because `samtools sort` defines the output file format based on the file extension. If `.sam` is used as output filename, it cannot be usefully indexed by `samtools index`. `default_bam` function is used to generate output filename for `samtools sort` if input `sort_output_filename` is not set or when `trigger` is false and we need to return `sort_input` and `secondaryFiles` (if provided) files staged into output directory. Output filename is generated based on `sort_input` basename with `.bam` extension by default. `ext` function is used to return the index file extension (BAI/CSI) based on `csi` and `bai` inputs according to the following logic: `csi` && `bai` => BAI; !`csi` && !`bai` => BAI; `csi` && !`bai` => CSI
ribosomal_bowtie_aligner	../tools/bowtie-alignreads.cwl (CommandLineTool)		Tool maps input raw reads files to reference genome using Bowtie. `default_output_filename` function returns default name for SAM output and log files. In case when `sam` and `output_filename` inputs are not set, default filename will have `.sam` extension but format may not correspond to the SAM specification. To set output filename manually use `output_filename` input. Default output filename is based on `output_filename` or basename of `upstream_filelist`, `downstream_filelist` or `crossbow_filelist` file (if array, the first file in array is taken). If function is called without arguments and `output_filename` input is set, it will be returned from the function. For single-end input data any of the `upstream_filelist` or `downstream_filelist` inputs can be used. Log filename (`log_file` output) is generated by `default_output_filename` function with ex='.bw'. `indices_folder` defines folder to contain Bowtie indices. Based on the first found file with `rev.1.ebwt` or `rev.1.ebwtl` extension, bowtie index prefix is returned from input's `valueFrom` field.
fastx_quality_stats_after	../tools/fastx-quality-stats.cwl (CommandLineTool)		Tool calculates statistics based on FASTQ file quality scores. If `output_filename` is not provided, the `default_output_filename` function is called to return the default output file name, generated as `input_file` basename + `.fastxstat` extension.
stats_and_transformations	clipseq-se.cwl#stats_and_transformations/32f63e1a-b24b-4a39-abfe-2c6397a47cb9 (CommandLineTool)
tagstopeak_transformations	clipseq-se.cwl#tagstopeak_transformations/dfe700a3-c0b5-4156-a5a3-383ed26e5397 (CommandLineTool)
fastx_quality_stats_original	../tools/fastx-quality-stats.cwl (CommandLineTool)		Tool calculates statistics based on FASTQ file quality scores. If `output_filename` is not provided, the `default_output_filename` function is called to return the default output file name, generated as `input_file` basename + `.fastxstat` extension.
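
The `extract_umi` and `dedup_umi` steps in the table above move a UMI from the read into the read name and later collapse reads that share a first mapping coordinate and UMI. The following Python sketch illustrates that idea only. It is not the umi_tools implementation: it understands just the `N` (UMI base) and `X` (base kept on the read) pattern characters, keeps the read with the highest mapping quality, and breaks ties by keeping the first read seen instead of choosing at random. The `Read` and `Aln` record types and the example data are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal records for a FASTQ read and an aligned read.
Read = namedtuple("Read", "name seq qual")
Aln = namedtuple("Aln", "name chrom pos strand mapq")


def extract_umi(read, pattern):
    """Move bases at 'N' positions of `pattern` from the read start into the name.

    Simplified 'string' extraction: 'N' marks a UMI base, 'X' marks a base
    that stays on the read. Other pattern characters are not modelled here.
    """
    if set(pattern) - {"N", "X"}:
        raise ValueError("this sketch only understands 'N' and 'X'")
    if len(read.seq) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    umi = "".join(base for base, p in zip(read.seq, pattern) if p == "N")
    kept = [i for i, p in enumerate(pattern) if p == "X"]
    n = len(pattern)
    seq = "".join(read.seq[i] for i in kept) + read.seq[n:]
    qual = "".join(read.qual[i] for i in kept) + read.qual[n:]
    return Read(f"{read.name}_{umi}", seq, qual)


def dedup(alignments):
    """Keep one alignment per (chrom, strand, first coordinate, UMI) group.

    The representative is the alignment with the highest mapping quality;
    ties keep the first one seen (a deterministic stand-in for random choice).
    """
    best = {}
    for aln in alignments:
        umi = aln.name.rsplit("_", 1)[1]  # UMI was appended after the last "_"
        key = (aln.chrom, aln.strand, aln.pos, umi)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    return list(best.values())


trimmed = extract_umi(Read("r1", "ACGTTTGGCC", "IIIIHHHHHH"), "NNNN")
print(trimmed)

alignments = [
    Aln("r1_ACGT", "chr1", 100, "+", 30),
    Aln("r2_ACGT", "chr1", 100, "+", 255),  # same position and UMI as r1
    Aln("r3_TTTT", "chr1", 100, "+", 10),   # same position, different UMI
]
unique = dedup(alignments)
print([a.name for a in unique])
```

In this example `r1` and `r2` share a position and UMI, so only the one with the higher mapping quality is kept, while `r3` survives because its UMI differs.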

Outputs

ID	Type	Label	Doc
bigwig	File [bigWig]	BigWig file	Generated BigWig file
dedup_log	File	deduped CLIP log file	deduped CLIP log file
error_log	File	clipped error log file	clipped error log file
peaks_bed	File
output_bed	File
atdp_result	File [TSV]	Fake ATDP results for BioWardrobe	Average Tag Density generated results
bambai_pair	File [BAM]	Deduped BAM alignment file	Coordinate sorted BAM file and BAI index file (+index BAI)
clipper_bed	File
extract_log	File	clipped extract log file	clipped extract log file
star_sj_log	File (Optional) [Textual format]	STAR sj log	STAR SJ.out.tab
trim_report	File [Textual format]	trimm report	TrimGalore generated log
dedup_output	File	deduped CLIP file
get_stat_log	File (Optional) [Textual format]	Old Bowtie, STAR and GEEP combined log	Processed and combined Bowtie & STAR aligner and GEEP logs
star_out_log	File (Optional) [Textual format]	STAR log out	STAR Log.out
clipper_pickle	File
star_final_log	File [Textual format]	STAR final log	STAR Log.final.out
dedup_error_log	File	deduped CLIP error log file	deduped CLIP error log file
star_stdout_log	File (Optional) [Textual format]	STAR stdout log	STAR Log.std.out
star_progress_log	File (Optional) [Textual format]	STAR progress log	STAR Log.progress.out
transformed_peaks	File [TSV]	Transformed peaks Mimics MACS2
iaintersect_result	File [TSV]	Island intersect results	Iaintersect generated results
get_formatted_stats	File (Optional) [Textual format]	Bowtie, STAR and GEEP mapping stats	Processed and combined Bowtie & STAR aligner and GEEP logs
rebosomal_bowtie_log	File [Textual format]	Bowtie alignment log	Bowtie alignment log file
fastx_statistics_after	File [Textual format]	FASTQ statistics	fastx_quality_stats generated FASTQ file quality statistics file
fastx_statistics_original	File [Textual format]	FASTQ statistics	fastx_quality_stats generated FASTQ file quality statistics file
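
After a run, a CWL runner such as cwltool reports the workflow outputs as a JSON object keyed by output ID. The sketch below, in Python, checks such an object against the Outputs table: every output not marked "(Optional)" is expected to be present as a `File` object. The example result is fabricated purely to exercise the check and does not come from a real run.

```python
# Output IDs copied from the Outputs table; the six outputs marked
# "(Optional)" there are listed separately and are allowed to be absent.
REQUIRED_OUTPUTS = [
    "bigwig", "dedup_log", "error_log", "peaks_bed", "output_bed",
    "atdp_result", "bambai_pair", "clipper_bed", "extract_log", "trim_report",
    "dedup_output", "clipper_pickle", "star_final_log", "dedup_error_log",
    "transformed_peaks", "iaintersect_result", "rebosomal_bowtie_log",
    "fastx_statistics_after", "fastx_statistics_original",
]
OPTIONAL_OUTPUTS = [
    "star_sj_log", "get_stat_log", "star_out_log", "star_stdout_log",
    "star_progress_log", "get_formatted_stats",
]


def missing_outputs(result):
    """Return required output IDs that are absent or not File objects."""
    missing = []
    for output_id in REQUIRED_OUTPUTS:
        value = result.get(output_id)
        if not isinstance(value, dict) or value.get("class") != "File":
            missing.append(output_id)
    return missing


# Fabricated example result: all required outputs present except peaks_bed,
# and one optional output reported as null.
example = {
    output_id: {"class": "File", "location": f"file:///tmp/{output_id}"}
    for output_id in REQUIRED_OUTPUTS
}
del example["peaks_bed"]
example["star_sj_log"] = None

print(missing_outputs(example))
```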

Permalink: https://w3id.org/cwl/view/git/7518b100d8cbc80c8be32e9e939dfbb27d6b4361/workflows/clipseq-se.cwl