Explore Workflows
Graph | Name | Retrieved From | View |
---|---|---|---|
DESeq2 (LRT) - differential gene expression analysis using likelihood ratio test
Runs DESeq2 using LRT (Likelihood Ratio Test)
=============================================

The LRT examines two models for the counts: a full model with a certain number of terms and a reduced model in which some of the terms of the full model are removed. The test determines whether the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero. The LRT is therefore useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.

When one performs a likelihood ratio test, the p values and the test statistic (the stat column) are values for the test that removes all of the variables which are present in the full design and not in the reduced design. This tests the null hypothesis that all the coefficients from these variables and levels of these factors are equal to zero. The likelihood ratio test p values therefore represent a test of all the variables and all the levels of factors which are among these variables. However, the results table only has space for one column of log fold change, so a single variable and a single comparison is shown (among the potentially multiple log fold changes which were tested in the likelihood ratio test). In other words, the p value is for the likelihood ratio test of all the variables and all the levels, while the log fold change is a single comparison from among those variables and levels.

**Technical notes**

1. At least two biological replicates are required for every compared category.
2. The metadata file describes relations between the compared experiments, for example

```
,time,condition
DH1,day5,WT
DH2,day5,KO
DH3,day7,WT
DH4,day7,KO
DH5,day7,KO
```

where `time`, `condition`, `day5`, `day7`, `WT`, `KO` should be single words (without spaces) and `DH1`, `DH2`, `DH3`, `DH4`, `DH5` correspond to the experiment aliases set in the **RNA-Seq experiments** input.
3. Design and reduced formulas should start with **~** and include categories or, optionally, their interactions from the metadata file header. See details in the DESeq2 manual [here](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions) and [here](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#likelihood-ratio-test).
4. The contrast should be set based on your metadata file header and available categories in the form `Factor Numerator Denominator`, where `Factor` is a column name from the metadata file, `Numerator` is the category from the metadata file to be used as the numerator in the fold change calculation, and `Denominator` is the category to be used as the denominator. For example, `condition WT KO`. |
https://github.com/datirium/workflows.git
Path: workflows/deseq-lrt.cwl Branch/Commit ID: 9850a859de1f42d3d252c50e15701928856fe774 |
||
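The metadata and contrast conventions from the DESeq2 (LRT) technical notes can be checked before a run. The following is a minimal sketch, not part of the workflow itself: the helper names `parse_metadata`, `parse_contrast`, and `replicates_per_level` are hypothetical and only illustrate the `Factor Numerator Denominator` format and the replicate requirement using the example metadata shown above.

```python
import csv
import io
from collections import Counter

# Example metadata copied from the technical notes; the first column holds experiment aliases.
METADATA = """,time,condition
DH1,day5,WT
DH2,day5,KO
DH3,day7,WT
DH4,day7,KO
DH5,day7,KO
"""


def parse_metadata(text):
    """Return (factors, samples): header columns and a per-alias mapping of factor levels."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    factors = rows[0][1:]  # skip the empty alias column header
    samples = {row[0]: dict(zip(factors, row[1:])) for row in rows[1:]}
    return factors, samples


def parse_contrast(contrast, factors, samples):
    """Split 'Factor Numerator Denominator' and check it against the metadata."""
    parts = contrast.split()
    if len(parts) != 3:
        raise ValueError("contrast must be 'Factor Numerator Denominator'")
    factor, numerator, denominator = parts
    if factor not in factors:
        raise ValueError(f"unknown factor: {factor}")
    levels = {sample[factor] for sample in samples.values()}
    for level in (numerator, denominator):
        if level not in levels:
            raise ValueError(f"unknown level for {factor}: {level}")
    return factor, numerator, denominator


def replicates_per_level(factor, samples):
    """Count how many experiments fall into each level of a factor."""
    return Counter(sample[factor] for sample in samples.values())


factors, samples = parse_metadata(METADATA)
contrast = parse_contrast("condition WT KO", factors, samples)
replicates = replicates_per_level("condition", samples)
```

With the example metadata, each level of `condition` has at least two experiments, which matches the first technical note.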
ChIP-Seq pipeline paired-end
The original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **ChIP-Seq** basic analysis workflow for a **paired-end** experiment. A [FASTQ](http://maq.sourceforge.net/fastq.shtml) input file has to be provided. The pipeline produces a sorted BAM file along with its BAI index file, quality statistics of the input FASTQ file, coverage by estimated fragments as a BigWig file, peak calling data in the form of narrowPeak or broadPeak files, islands with the assigned nearest genes and region type, and data for the average tag density plot.

The workflow starts with step *fastx\_quality\_stats* from FASTX-Toolkit to calculate quality statistics for the input FASTQ file. At the same time `bowtie` is used to align reads from the input FASTQ file to the reference genome (*bowtie\_aligner*). The output of this step is an unsorted SAM file, which is sorted and indexed by `samtools sort` and `samtools index` (*samtools\_sort\_index*). Depending on the workflow's input parameters, the indexed and sorted BAM file can be processed by `samtools rmdup` (*samtools\_rmdup*) to get rid of duplicated reads. If removing duplicates is not required, the original BAM and BAI files are returned. Otherwise step *samtools\_sort\_index\_after\_rmdup* repeats `samtools sort` and `samtools index` on the BAM and BAI files without duplicates.

Next, `macs2 callpeak` performs peak calling (*macs2\_callpeak*), and the following step (*macs2\_island\_count*) reports the number of islands and the estimated fragment size. If the latter is less than 80 bp (hardcoded in the workflow), `macs2 callpeak` is rerun with a forced fixed fragment size value (*macs2\_callpeak\_forced*). It is also possible to force MACS2 to use a preset fragment size in the first place. The next step (*macs2\_stat*) defines which islands and estimated fragment size should be used in the workflow output: either those from the *macs2\_island\_count* step or those from the *macs2\_island\_count\_forced* step. If the input trigger of this step is set to True, the *macs2\_callpeak\_forced* step was run and returned results different from those of the *macs2\_callpeak* step, so *macs2\_stat* returns [fragments\_new, fragments\_old, islands\_new]. If the trigger is False, the step returns [fragments\_old, fragments\_old, islands\_old]. Here the suffix "old" denotes results obtained from the *macs2\_island\_count* step and the suffix "new" denotes results from the *macs2\_island\_count\_forced* step.

The following two steps (*bamtools\_stats* and *bam\_to\_bigwig*) calculate coverage from the BAM file and save it in BigWig format. For that purpose `bamtools stats` returns the number of mapped reads, which is then used as a scaling factor by `bedtools genomecov` when it calculates coverage and saves it as a bedGraph file, which is then sorted and converted to BigWig format by the bedGraphToBigWig tool from the UCSC utilities. Step *get\_stat* returns a text file with statistics in the form of [TOTAL, ALIGNED, SUPRESSED, USED] read counts. Step *island\_intersect* assigns the nearest genes and regions to the islands obtained from *macs2\_callpeak\_forced*. Step *average\_tag\_density* calculates data for the average tag density plot from the BAM file. |
https://github.com/datirium/workflows.git
Path: workflows/chipseq-pe.cwl Branch/Commit ID: 10ce6e113f749c7bd725e426445220c3bdc5ddf1 |
||
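The selection logic described for the *macs2\_stat* step of the ChIP-Seq paired-end pipeline can be sketched in a few lines. This is an illustration under stated assumptions rather than the workflow's actual code: the function names, argument names, and the constant `MIN_FRAGMENT_SIZE` are hypothetical, and only the 80 bp threshold and the two returned triples come from the description.

```python
# Threshold in bp that the description calls hardcoded in the workflow.
MIN_FRAGMENT_SIZE = 80


def needs_forced_rerun(estimated_fragment_size):
    """True when the estimated fragment size is too small and peak calling is repeated."""
    return estimated_fragment_size < MIN_FRAGMENT_SIZE


def macs2_stat(trigger, fragments_old, islands_old, fragments_new, islands_new):
    """Pick which fragment size and island count are reported.

    "old" values come from macs2_island_count, "new" values from
    macs2_island_count_forced, as in the description.
    """
    if trigger:
        return [fragments_new, fragments_old, islands_new]
    return [fragments_old, fragments_old, islands_old]


# Hypothetical numbers: a 60 bp estimate triggers the forced rerun.
rerun = needs_forced_rerun(60)
reported = macs2_stat(rerun, fragments_old=60, islands_old=1200,
                      fragments_new=150, islands_new=1500)
```

In this sketch the trigger is derived only from the fragment size check; the description states that the actual trigger also reflects whether the forced run returned results different from the first run.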
AltAnalyze Build Reference Indices
AltAnalyze Build Reference Indices ================================== |
https://github.com/datirium/workflows.git
Path: workflows/altanalyze-prepare-genome.cwl Branch/Commit ID: 10ce6e113f749c7bd725e426445220c3bdc5ddf1 |
||
RNA-Seq pipeline paired-end stranded mitochondrial
A slightly changed version of the original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **RNA-Seq** basic analysis for a **strand specific paired-end** experiment. Additional steps were added to map data to the mitochondrial chromosome only and then merge the output. Experiment files in [FASTQ](http://maq.sourceforge.net/fastq.shtml) format, either compressed or not, can be used. The current workflow should be used only with paired-end strand specific RNA-Seq data. It performs the following steps:

1. `STAR` to align reads from the input FASTQ file according to the predefined reference indices; generate an unsorted BAM file and an alignment statistics file
2. `fastx_quality_stats` to analyze the input FASTQ file and generate a quality statistics file
3. `samtools sort` to generate a coordinate sorted BAM(+BAI) file pair from the unsorted BAM file obtained in step 1 (after running STAR)
5. Generate a BigWig file from the sorted BAM file
6. Map the input FASTQ file to predefined rRNA reference indices using Bowtie to define the level of rRNA contamination; export the resulting statistics to a file
7. Calculate isoform expression levels for the sorted BAM file and GTF/TAB annotation file using the `GEEP` reads-counting utility; export results to a file |
https://github.com/datirium/workflows.git
Path: workflows/rnaseq-pe-dutp-mitochondrial.cwl Branch/Commit ID: 7fb8a1ebf8145791440bc2fed9c5f2d78a19d04c |
||
Spliced RNAseq workflow
Workflow for Spliced RNAseq data Steps: - workflow_illumina_quality: - FastQC (Read Quality Control) - fastp (Read Trimming) - STAR (Read mapping) - featurecounts (transcript read counts) - kallisto (transcript [pseudo]counts) |
https://git.wageningenur.nl/unlock/cwl.git
Path: cwl/workflows/workflow_RNAseq_Spliced.cwl Branch/Commit ID: b9097b82e6ab6f2c9496013ce4dd6877092956a0 |
||
Trim Galore RNA-Seq pipeline paired-end
The original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **RNA-Seq** basic analysis for a **paired-end** experiment. The corresponding input [FASTQ](http://maq.sourceforge.net/fastq.shtml) files have to be provided. The current workflow should be used only with paired-end RNA-Seq data. It performs the following steps:

1. Trim adapters from the input FASTQ files
2. Use STAR to align reads from the input FASTQ files according to the predefined reference indices; generate an unsorted BAM file and an alignment statistics file
3. Use fastx_quality_stats to analyze the input FASTQ files and generate quality statistics files
4. Use samtools sort to generate a coordinate sorted BAM(+BAI) file pair from the unsorted BAM file obtained in step 2 (after running STAR)
5. Generate a BigWig file from the sorted BAM file
6. Map the input FASTQ files to predefined rRNA reference indices using Bowtie to define the level of rRNA contamination; export the resulting statistics to a file
7. Calculate isoform expression levels for the sorted BAM file and GTF/TAB annotation file using the GEEP reads-counting utility; export results to a file |
https://github.com/datirium/workflows.git
Path: workflows/trim-rnaseq-pe.cwl Branch/Commit ID: 7fb8a1ebf8145791440bc2fed9c5f2d78a19d04c |
||
EMG assembly for paired end Illumina
|
https://github.com/proteinswebteam/ebi-metagenomics-cwl.git
Path: workflows/emg-pipeline-v4-assembly-metaSPAdes.cwl Branch/Commit ID: 25129f55226dee595ef941edc24d3c44414e0523 |
||
Trim Galore ChIP-Seq pipeline single-read
This ChIP-Seq pipeline is based on the original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **ChIP-Seq** basic analysis workflow for a **single-read** experiment with Trim Galore.

### Data Analysis

SciDAP starts from the .fastq files which most DNA cores and commercial NGS companies return. Starting from raw data allows us to ensure that all experiments have been processed in the same way and simplifies the deposition of data to GEO upon publication. The data can be uploaded from the user's computer, downloaded directly from an FTP server of the core facility by providing a URL, or downloaded from GEO by providing an SRA accession number.

Our current pipelines include the following steps:

1. Trimming the adapters with TrimGalore. This step is particularly important when the reads are long and the fragments are short, resulting in sequencing adapters at the end of the read. If the adapter is not removed, the read will not map. TrimGalore can recognize standard adapters, such as Illumina or Nextera/Tn5 adapters.
2. QC
3. (Optional) Trimming adapters on the 5' or 3' end by the specified number of bases.
4. Mapping reads with Bowtie. Only uniquely mapped reads with fewer than 3 mismatches are used in the downstream analysis. Results are saved as a .bam file.
5. (Optional) Removal of duplicates (reads/pairs of reads mapping to exactly the same location). This step is used to remove reads overamplified in PCR. Unfortunately, it may also remove "good" reads. We usually do not remove duplicates unless the library is heavily duplicated. Please note that MACS2 will remove 'excessive' duplicates during peak calling in a smart way (those not supported by other nearby reads).
6. Peak calling by MACS2. Optionally, it is possible to specify the read extension length for MACS2 to use if the length determined automatically is wrong.
7. Generation of BigWig coverage files for display in the browser. The coverage shows the number of fragments at each base in the genome, normalized to the number of millions of mapped reads. In the case of PE sequencing the fragments are real, but in the case of single reads the fragments are estimated by extending reads to the average fragment length found by MACS2 or specified by the user in step 6.

### Details

_Trim Galore_ is a wrapper around [Cutadapt](https://github.com/marcelm/cutadapt) and [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data.

The workflow outputs a coordinate sorted BAM file along with its BAI index file, quality statistics of the input FASTQ file, read coverage in the form of a BigWig file, peak calling data in the form of narrowPeak or broadPeak files, islands with the assigned nearest genes and region type, and data for the average tag density plot (based on the BAM file).

The workflow starts with step *fastx\_quality\_stats* from FASTX-Toolkit to calculate quality statistics for the input FASTQ file. At the same time `bowtie` is used to align reads from the input FASTQ file to the reference genome (*bowtie\_aligner*). The output of this step is an unsorted SAM file, which is sorted and indexed by `samtools sort` and `samtools index` (*samtools\_sort\_index*). Based on the workflow's input parameters, the indexed and sorted BAM file can be processed by `samtools rmdup` (*samtools\_rmdup*) to get rid of duplicated reads. If removing duplicates is not required, the original input BAM and BAI files are returned. Otherwise step *samtools\_sort\_index\_after\_rmdup* repeats `samtools sort` and `samtools index` on the BAM and BAI files.

Right after that, `macs2 callpeak` performs peak calling (*macs2\_callpeak*). Based on the returned outputs, the next step (*macs2\_island\_count*) calculates the number of islands and the estimated fragment size. If the latter is less than 80 bp (hardcoded in the workflow), `macs2 callpeak` is rerun with a forced fixed fragment size value (*macs2\_callpeak\_forced*). If the workflow input parameters were set from the very beginning to force peak calling with a fixed fragment size, this step is skipped and the original peak calling results are saved. In the next step the workflow again calculates the number of islands and the estimated fragment size (*macs2\_island\_count\_forced*) for the data obtained from the *macs2\_callpeak\_forced* step. If that step was skipped, the results from the *macs2\_island\_count\_forced* step are equal to the ones obtained from the *macs2\_island\_count* step.

The next step (*macs2\_stat*) defines which islands and estimated fragment size should be used in the workflow output: either those from the *macs2\_island\_count* step or those from the *macs2\_island\_count\_forced* step. If the input trigger of this step is set to True, the *macs2\_callpeak\_forced* step was run and returned results different from those of the *macs2\_callpeak* step, so *macs2\_stat* returns [fragments\_new, fragments\_old, islands\_new]. If the trigger is False, the step returns [fragments\_old, fragments\_old, islands\_old]. Here the suffix "old" denotes results obtained from the *macs2\_island\_count* step and the suffix "new" denotes results from the *macs2\_island\_count\_forced* step.

The following two steps (*bamtools\_stats* and *bam\_to\_bigwig*) calculate coverage from the input BAM file and save it in BigWig format. For that purpose `bamtools stats` returns the number of mapped reads, which is then used as a scaling factor by `bedtools genomecov` when it calculates coverage and saves it in bedGraph format. The bedGraph file is then sorted and converted to BigWig format by the bedGraphToBigWig tool from the UCSC utilities. Step *get\_stat* returns a text file with statistics in the form of [TOTAL, ALIGNED, SUPRESSED, USED] read counts. Step *island\_intersect* assigns genes and regions to the islands obtained from *macs2\_callpeak\_forced*. Step *average\_tag\_density* calculates data for the average tag density plot from the BAM file. |
https://github.com/datirium/workflows.git
Path: workflows/trim-chipseq-se.cwl Branch/Commit ID: 581156366f91861bd4dbb5bcb59f67d468b32af3 |
||
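The coverage normalization described for the Trim Galore ChIP-Seq pipeline (fragments per base, scaled to millions of mapped reads, with single reads extended to the estimated fragment length) can be illustrated on a toy region. This is a sketch under assumptions: the function names are hypothetical, coordinates are taken as 0-based half-open, reads are assumed to be extended towards their 3' end, and the real workflow delegates this computation to `bedtools genomecov` rather than to Python code.

```python
def scale_factor(mapped_reads):
    """Scaling so that coverage is expressed per million mapped reads."""
    return 1_000_000 / mapped_reads


def extend_read(start, end, strand, fragment_length):
    """Extend a read to fragment_length towards its 3' end (assumed direction)."""
    if strand == "+":
        return start, start + fragment_length
    return end - fragment_length, end


def normalized_coverage(reads, fragment_length, mapped_reads, region_length):
    """Per-base coverage of extended reads over a toy region, scaled per million reads."""
    scale = scale_factor(mapped_reads)
    coverage = [0.0] * region_length
    for start, end, strand in reads:
        frag_start, frag_end = extend_read(start, end, strand, fragment_length)
        # Clamp the fragment to the region boundaries.
        for pos in range(max(frag_start, 0), min(frag_end, region_length)):
            coverage[pos] += scale
    return coverage


# Hypothetical input: two 2 bp reads, a 4 bp fragment length, 2 million mapped reads.
toy_reads = [(0, 2, "+"), (2, 4, "+")]
toy_coverage = normalized_coverage(toy_reads, fragment_length=4,
                                   mapped_reads=2_000_000, region_length=8)
```

Each fragment contributes 0.5 here because two million reads were mapped, so positions covered by both extended reads reach 1.0.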
ChIP-Seq pipeline single-read
# ChIP-Seq basic analysis workflow for single-read data

Reads are aligned to the reference genome with [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml). Results are saved as coordinate sorted [BAM](http://samtools.github.io/hts-specs/SAMv1.pdf) alignment and index BAI files. Optionally, PCR duplicates can be removed. To obtain coverage in [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format, average fragment length is calculated by [MACS2](https://github.com/taoliu/MACS), and individual reads are extended to this length in the 3' direction. Areas of enrichment identified by MACS2 are saved in ENCODE [narrow peak](http://genome.ucsc.edu/FAQ/FAQformat.html#format12) or [broad peak](https://genome.ucsc.edu/FAQ/FAQformat.html#format13) formats. Called peaks together with the nearest genes are saved in TSV format.

In addition to basic statistics (number of total/mapped/multi-mapped/unmapped/duplicate reads), the pipeline generates several quality control measures. Base frequency plots are used to estimate adapter contamination, a frequent occurrence in low-input ChIP-Seq experiments. The expected distinct reads count from [Preseq](http://smithlabresearch.org/software/preseq/) can be used to estimate read redundancy for a given sequencing depth. Average tag density profiles can be used to estimate ChIP enrichment for promoter proximal histone modifications. Different parameters can be used for different antibodies (calling broad or narrow peaks). Additionally, users can elect to use a BAM file from another experiment as a control for MACS2 peak calling.

## Cite as

*Kartashov AV, Barski A. BioWardrobe: an integrated platform for analysis of epigenomics and transcriptomics data. Genome Biol. 2015;16(1):158. Published 2015 Aug 7. [doi:10.1186/s13059-015-0720-3](https://www.ncbi.nlm.nih.gov/pubmed/26248465)*

## Software versions

- Bowtie 1.2.0
- Samtools 1.4
- Preseq 2.0
- MACS2 2.1.1.20160309
- Bedtools 2.26.0
- UCSC userApps v358

## Inputs

| ID | Label | Description | Required | Default | Upstream analyses |
| --- | --- | --- | :---: | --- | --- |
| **fastq\_file** | FASTQ file | Single-read sequencing data in FASTQ format (fastq, fq, bzip2, gzip, zip) | + | | |
| **indices\_folder** | Genome indices | Directory with the genome indices generated by Bowtie | + | | genome\_indices/bowtie\_indices |
| **annotation\_file** | Genome annotation file | Genome annotation file in TSV format | + | | genome\_indices/annotation |
| **genome\_size** | Effective genome size | The length of the mappable genome (hs, mm, ce, dm or number, for example 2.7e9) | + | | genome\_indices/genome\_size |
| **chrom\_length** | Chromosome lengths file | Chromosome lengths file in TSV format | + | | genome\_indices/chrom\_length |
| **broad\_peak** | Call broad peaks | Make MACS2 call broad peaks by linking nearby highly enriched regions | + | | |
| **control\_file** | Control ChIP-Seq single-read experiment | Indexed BAM file from the ChIP-Seq single-read experiment to be used as a control for MACS2 peak calling | | Null | control\_file/bambai\_pair |
| **exp\_fragment\_size** | Expected fragment size | Expected fragment size for read extension towards the 3' end if *force\_fragment\_size* was set to True or if the fragment size calculated by MACS2 was less than 80 bp | | 150 | |
| **force\_fragment\_size** | Force peak calling with expected fragment size | Make MACS2 skip building the shifting model and use the expected fragment size for read extension towards the 3' end | | False | |
| **clip\_3p\_end** | Clip from 3' end | Number of base pairs to clip from 3' end | | 0 | |
| **clip\_5p\_end** | Clip from 5' end | Number of base pairs to clip from 5' end | | 0 | |
| **remove\_duplicates** | Remove PCR duplicates | Remove PCR duplicates from sorted BAM file | | False | |
| **threads** | Number of threads | Number of threads for those steps that support multithreading | | 2 | |

## Outputs

| ID | Label | Description | Required | Visualization |
| --- | --- | --- | :---: | --- |
| **fastx\_statistics** | FASTQ quality statistics | FASTQ quality statistics in TSV format | + | *Base Frequency* and *Quality Control* plots in *QC Plots* tab |
| **bambai\_pair** | Aligned reads | Coordinate sorted BAM alignment and index BAI files | + | *Nucleotide Sequence Alignments* track in *IGV Genome Browser* tab |
| **bigwig** | Genome coverage | Genome coverage in bigWig format | + | *Genome Coverage* track in *IGV Genome Browser* tab |
| **iaintersect\_result** | Gene annotated peaks | MACS2 peak file annotated with nearby genes | + | *Peak Coordinates* table in *Peak Calling* tab |
| **atdp\_result** | Average Tag Density Plot | Average Tag Density Plot file in TSV format | + | *Average Tag Density Plot* in *QC Plots* tab |
| **macs2\_called\_peaks** | Called peaks | Called peaks file with 1-based coordinates in XLS format | + | |
| **macs2\_narrow\_peaks** | Narrow peaks | Called peaks file in ENCODE narrow peak format | | *Narrow peaks* track in *IGV Genome Browser* tab |
| **macs2\_broad\_peaks** | Broad peaks | Called peaks file in ENCODE broad peak format | | *Broad peaks* track in *IGV Genome Browser* tab |
| **preseq\_estimates** | Expected Distinct Reads Count Plot | Expected distinct reads count file from Preseq in TSV format | | *Expected Distinct Reads Count Plot* in *QC Plots* tab |
| **workflow\_statistics** | Workflow execution statistics | Overall workflow execution statistics from bowtie\_aligner and samtools\_rmdup steps | + | *Overview* tab and experiment's preview |
| **bowtie\_log** | Read alignment log | Read alignment log file from Bowtie | + | |

|
https://github.com/datirium/workflows.git
Path: workflows/chipseq-se.cwl Branch/Commit ID: 9850a859de1f42d3d252c50e15701928856fe774 |
||
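The Inputs table of the ChIP-Seq single-read workflow maps naturally onto a CWL job file. The sketch below builds one as JSON from Python. The input IDs and default values are taken from the table; all file paths are hypothetical placeholders, and the value types (for example `genome_size` as a string, `indices_folder` as a `Directory`) are assumptions inferred from the descriptions rather than confirmed from the workflow source.

```python
import json

# Inputs marked "+" in the Required column of the Inputs table.
REQUIRED = ["fastq_file", "indices_folder", "annotation_file",
            "genome_size", "chrom_length", "broad_peak"]

job = {
    # Required inputs; every path below is a hypothetical placeholder.
    "fastq_file": {"class": "File", "path": "/data/sample.fastq.gz"},
    "indices_folder": {"class": "Directory", "path": "/refs/bowtie_indices"},
    "annotation_file": {"class": "File", "path": "/refs/annotation.tsv"},
    "genome_size": "2.7e9",  # the table also lists hs, mm, ce, dm
    "chrom_length": {"class": "File", "path": "/refs/chrom_length.tsv"},
    "broad_peak": False,
    # Optional inputs, set to the defaults listed in the table.
    "exp_fragment_size": 150,
    "force_fragment_size": False,
    "clip_3p_end": 0,
    "clip_5p_end": 0,
    "remove_duplicates": False,
    "threads": 2,
}

# Check that nothing required was left out before serializing.
missing = [key for key in REQUIRED if key not in job]
job_json = json.dumps(job, indent=2)
```

Saved to a file, such a document could be passed to a CWL runner together with `chipseq-se.cwl`; the optional `control_file` input is omitted here because its default is Null.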
RNA-Seq pipeline single-read stranded mitochondrial
A slightly changed version of the original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **RNA-Seq** basic analysis for a **strand specific single-read** experiment. Additional steps were added to map data to the mitochondrial chromosome only and then merge the output. Experiment files in [FASTQ](http://maq.sourceforge.net/fastq.shtml) format, either compressed or not, can be used. The current workflow should be used only with single-read strand specific RNA-Seq data. It performs the following steps:

1. `STAR` to align reads from the input FASTQ file according to the predefined reference indices; generate an unsorted BAM file and an alignment statistics file
2. `fastx_quality_stats` to analyze the input FASTQ file and generate a quality statistics file
3. `samtools sort` to generate a coordinate sorted BAM(+BAI) file pair from the unsorted BAM file obtained in step 1 (after running STAR)
5. Generate a BigWig file from the sorted BAM file
6. Map the input FASTQ file to predefined rRNA reference indices using Bowtie to define the level of rRNA contamination; export the resulting statistics to a file
7. Calculate isoform expression levels for the sorted BAM file and GTF/TAB annotation file using the `GEEP` reads-counting utility; export results to a file |
https://github.com/datirium/workflows.git
Path: workflows/rnaseq-se-dutp-mitochondrial.cwl Branch/Commit ID: c9e7f3de7f6ba38ee663bd3f9649e8d7dbac0c86 |