Explore Workflows


Graph Name Retrieved From View
workflow graph Variant calling germline paired-end

A workflow for the Broad Institute's best practices GATK4 germline variant calling pipeline.

## __Outputs__

#### Primary Output files:
- bqsr2_indels.vcf, filtered and recalibrated indels (IGV browser)
- bqsr2_snps.vcf, filtered and recalibrated SNPs (IGV browser)
- bqsr2_snps.ann.vcf, filtered and recalibrated SNPs with effect annotations

#### Secondary Output files:
- sorted_dedup_reads.bam, sorted deduplicated alignments (IGV browser)
- raw_indels.vcf, first pass indel calls
- raw_snps.vcf, first pass SNP calls

#### Reports:
- overview.md (input list, alignment metrics, variant counts)
- insert_size_histogram.pdf
- recalibration_plots.pdf
- snpEff_summary.html

## __Inputs__

#### General Info
- Sample short name/Alias: unique name for sample
- Experimental condition: condition, variable, etc. name (e.g. "control" or "20C 60min")
- Cells: name of cells used for the sample
- Catalog No.: vendor catalog number if available
- BWA index: BWA index sample that contains reference genome FASTA with associated indices.
- SNPEFF database: Name of SNPEFF database to use for SNP effect annotation.
- Read 1 file: First FASTQ file (generally contains "R1" in the filename)
- Read 2 file: Paired FASTQ file (generally contains "R2" in the filename)

#### Advanced
- Ploidy: number of copies per chromosome (default should be 2)
- SNP filters: see Step 6 Notes: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/
- Indel filters: see Step 7 Notes: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/

#### SNPEFF notes:
Get the list of snpEff databases by starting a container with `docker run --rm -ti gatk4-dev /bin/bash` and then running `java -jar $SNPEFF_JAR databases`. Use the first column of the output as the SNPEFF input (e.g. "hg38").
- hg38, Homo_sapiens (UCSC), http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_hg38.zip
- mm10, Mus_musculus, http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_mm10.zip
- dm6.03, Drosophila_melanogaster, http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_dm6.03.zip
- Rnor_6.0.86, Rattus_norvegicus, http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_Rnor_6.0.86.zip
- R64-1-1.86, Saccharomyces_cerevisiae, http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_R64-1-1.86.zip

### __Data Analysis Steps__
1. Trim the adapters with TrimGalore.
   - This step is particularly important when the reads are long and the fragments are short, which leaves sequencing adapters at the ends of reads. If the adapter is not removed, the read may fail to map. TrimGalore can recognize standard adapters, such as Illumina or Nextera/Tn5 adapters.
2. Generate quality control statistics of the trimmed, unmapped sequence data.
3. Run the germline variant calling pipeline, a custom wrapper script implementing Steps 1 - 17 of the Broad Institute's best practices GATK4 germline variant calling pipeline (https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/).

### __References__
1. https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/
2. https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-
3. https://software.broadinstitute.org/software/igv/VCF
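The database download links listed above all follow one naming pattern, so the link for a given SNPEFF database name can be derived from the name alone. The sketch below assumes that pattern holds for every v4_3 database, which the listing itself only shows for five of them; `snpeff_database_url` is a hypothetical helper and not part of the workflow.

```python
# Base location shared by the five database links listed in the description.
SNPEFF_BASE_URL = "http://downloads.sourceforge.net/project/snpeff/databases/v4_3"


def snpeff_database_url(database_name: str) -> str:
    """Build the v4_3 download URL for a snpEff database name such as "hg38".

    Assumption: the URL pattern seen in the listed examples applies to the
    requested database as well.
    """
    name = database_name.strip()
    if not name:
        raise ValueError("database name must not be empty")
    return f"{SNPEFF_BASE_URL}/snpEff_v4_3_{name}.zip"


if __name__ == "__main__":
    # The names are the first-column values quoted in the description.
    for name in ("hg38", "mm10", "dm6.03", "Rnor_6.0.86", "R64-1-1.86"):
        print(name, snpeff_database_url(name))
```

The name passed in should be the first-column value printed by `java -jar $SNPEFF_JAR databases`, as the SNPEFF notes describe.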

https://github.com/datirium/workflows.git

Path: workflows/vc-germline-pe.cwl

Branch/Commit ID: 57863b6131d8262c5ce864adaf8e4038401e71a2

workflow graph Cellranger Reanalyze

Cellranger Reanalyze
====================

https://github.com/datirium/workflows.git

Path: workflows/cellranger-reanalyze.cwl

Branch/Commit ID: cc6fa135d04737fdde3b4414d6e214cf8c812f6e

workflow graph kmer_seq_entry_extract_wnode

https://github.com/ncbi/pgap.git

Path: task_types/tt_kmer_seq_entry_extract_wnode.cwl

Branch/Commit ID: 3e7a3c1cc1ed5164ae0a51a96f20d7c480d1d70b

workflow graph DESeq - differential gene expression analysis for spike-in normalized RNA-Seq

# Differential gene expression analysis

This differential gene expression (DGE) analysis takes as input samples from two experimental conditions that have been processed with a spike-in normalized RNA-Seq workflow (see list of "Upstream workflows" at top of file). Size factor estimation and its application for normalization are disabled in this version of the DESeq workflow; all other aspects are the same.

DESeq estimates variance-mean dependence in count data from high-throughput sequencing assays, then tests for DGE based on a model which assumes a negative binomial distribution of gene expression (aligned read count per gene).

### Experimental Setup and Results Interpretation

The workflow design uses as its fold change (FC) calculation: condition 1 (c1, e.g. treatment) over condition 2 (c2, e.g. control). In other words: `FC == (c1/c2)`

Therefore:
- if FC<1 the log2(FC) is <0 (negative), meaning expression in condition1<condition2 (gene is downregulated in c1)
- if FC>1 the log2(FC) is >0 (positive), meaning expression in condition1>condition2 (gene is upregulated in c1)

For example, if you have input TREATMENT samples as condition 1 and CONTROL samples as condition 2, a positive L2FC for a gene indicates that expression of the gene in TREATMENT is greater (upregulated) compared to CONTROL. Next, threshold the p-adjusted values with your FDR (false discovery rate) cutoff to determine whether the change may be considered significant.

It is important to note when DESeq1 or DESeq2 is used in our DGE analysis workflow. If a user inputs only a single sample per condition, DESeq1 is used for calculating DGE. In this experimental setup there are no repeated measurements per gene per condition, so biological variability in each condition cannot be captured and the output p-values are assumed to be purely "technical". On the other hand, if more than one sample is input per condition, DESeq2 is used. In this case, biological variability per gene within each condition is available to be incorporated into the model, and the resulting p-values are assumed to be "biological". Additionally, DESeq2 fold change is "shrunk" to account for sample variability, and as Michael Love (DESeq maintainer) puts it, "it looks at the largest fold changes that are not due to low counts and uses these to inform a prior distribution. So the large fold changes from genes with lots of statistical information are not shrunk, while the imprecise fold changes are shrunk. This allows you to compare all estimated LFC across experiments, for example, which is not really feasible without the use of a prior".

In either case, the null hypothesis (H0) tested for each gene is that its expression does not differ between conditions. A smaller p-value means the observed difference would be less likely to arise by random chance if H0 were true, so below a certain threshold (traditionally <0.05) H0 should be rejected. Additionally, due to the many thousands of independent hypotheses being tested (each gene representing an independent test), the p-values attained by the Wald test are adjusted using the Benjamini and Hochberg method by default. These "padj" values should be used for determination of significance (a reasonable value here would be <0.10, i.e. below a 10% FDR).

Further Analysis: Output from the DESeq workflow may be used as input to the GSEA (Gene Set Enrichment Analysis) workflow for identifying enriched marker gene sets between conditions.

### DESeq1

High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. Simon Anders and Wolfgang Huber propose a method based on the negative binomial distribution, with variance and mean linked by local regression, and present an implementation, [DESeq](http://www.bioconductor.org/packages/3.8/bioc/html/DESeq.html), as an R/Bioconductor package.

### DESeq2

In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. [DESeq2](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) is a method for differential analysis of count data that uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

### __References__
- Anders S, Huber W (2010). “Differential expression analysis for sequence count data.” Genome Biology, 11, R106. doi: 10.1186/gb-2010-11-10-r106, http://genomebiology.com/2010/11/10/R106/.
- Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
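The fold change convention and the Benjamini and Hochberg adjustment described above can be illustrated with a small, self-contained sketch. It is not the DESeq or DESeq2 implementation (which also fits dispersions, shrinks fold changes and applies independent filtering); it only shows the arithmetic of `FC == (c1/c2)` and of a plain BH adjustment. The function names and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def log2_fold_change(c1_expression: float, c2_expression: float) -> float:
    """log2(c1/c2): positive when the gene is higher in condition 1."""
    if c1_expression <= 0 or c2_expression <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(c1_expression / c2_expression)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Treatment (c1) at 8 vs control (c2) at 2: upregulated in c1.
    print(log2_fold_change(8.0, 2.0))
    padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
    # Keep genes below a 10% FDR, as suggested in the description.
    print([p < 0.10 for p in padj])
```

Thresholding the adjusted values rather than the raw p-values is what controls the false discovery rate across the many per-gene tests.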

https://github.com/datirium/workflows.git

Path: workflows/deseq-for-spikein.cwl

Branch/Commit ID: cc6fa135d04737fdde3b4414d6e214cf8c812f6e

workflow graph Filter ChIP/ATAC peaks for Tag Density Profile or Motif Enrichment analyses

Filters ChIP/ATAC peaks with the nearest genes assigned for Tag Density Profile or Motif Enrichment analyses
============================================================================================================

The tool filters output from any ChIP/ATAC pipeline to create a file with regions of interest for Tag Density Profile or Motif Enrichment analyses. Peaks with duplicated coordinates are discarded.
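As a rough illustration of the last point, the sketch below drops peaks whose coordinates have already been seen. It assumes that "duplicated coordinates" means an identical chromosome, start and end, and that the first occurrence is kept; the description does not say which copy survives, or whether all copies are removed. The `Peak` type and its fields are hypothetical and not taken from the workflow.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Peak(NamedTuple):
    """Hypothetical minimal peak record with its assigned nearest gene."""
    chrom: str
    start: int
    end: int
    gene: str


def drop_duplicate_coordinates(peaks: Iterable[Peak]) -> List[Peak]:
    """Keep the first peak for each (chrom, start, end); discard the rest."""
    seen: Set[Tuple[str, int, int]] = set()
    unique: List[Peak] = []
    for peak in peaks:
        key = (peak.chrom, peak.start, peak.end)
        if key in seen:
            continue  # same coordinates as an earlier peak
        seen.add(key)
        unique.append(peak)
    return unique


if __name__ == "__main__":
    peaks = [
        Peak("chr1", 100, 200, "GeneA"),
        Peak("chr1", 100, 200, "GeneB"),  # duplicate coordinates
        Peak("chr2", 100, 200, "GeneC"),
    ]
    for peak in drop_duplicate_coordinates(peaks):
        print(peak)
```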

https://github.com/datirium/workflows.git

Path: workflows/filter-peaks-for-heatmap.cwl

Branch/Commit ID: c6bfa0de917efb536dd385624fc7702e6748e61d

workflow graph Cell Ranger ARC Aggregate

Cell Ranger ARC Aggregate
=========================

https://github.com/datirium/workflows.git

Path: workflows/cellranger-arc-aggr.cwl

Branch/Commit ID: 00ea05e22788029370898fd4c17798b11edf0e57

workflow graph kmer_cache_retrieve

https://github.com/ncbi/pgap.git

Path: task_types/tt_kmer_cache_retrieve.cwl

Branch/Commit ID: 9ff3e17888a15f4691ba82380472317214e20a1c

workflow graph Single-Cell ATAC-Seq Dimensionality Reduction Analysis

Single-Cell ATAC-Seq Dimensionality Reduction Analysis

Removes noise and confounding sources of variation by reducing dimensionality of chromatin accessibility data from the outputs of “Single-Cell Multiome ATAC and RNA-Seq Filtering Analysis” pipelines. The results of this workflow are primarily used in “Single-Cell ATAC-Seq Cluster Analysis” or “Single-Cell WNN Cluster Analysis” pipelines.

https://github.com/datirium/workflows.git

Path: workflows/sc-atac-reduce.cwl

Branch/Commit ID: cc6fa135d04737fdde3b4414d6e214cf8c812f6e

workflow graph Trim Galore RNA-Seq pipeline single-read strand specific

Note: should be updated

The original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **RNA-Seq** basic analysis for a **single-end** experiment. A corresponding input [FASTQ](http://maq.sourceforge.net/fastq.shtml) file has to be provided.

The current workflow should be used only with single-end RNA-Seq data. It performs the following steps:
1. Trim adapters from the input FASTQ file
2. Use STAR to align reads from the input FASTQ file according to the predefined reference indices; generate an unsorted BAM file and an alignment statistics file
3. Use fastx_quality_stats to analyze the input FASTQ file and generate a quality statistics file
4. Use samtools sort to generate a coordinate sorted BAM(+BAI) file pair from the unsorted BAM file obtained in step 2 (after running STAR)
5. Generate a BigWig file from the sorted BAM file
6. Map the input FASTQ file to predefined rRNA reference indices using Bowtie to determine the level of rRNA contamination; export the resulting statistics to a file
7. Calculate isoform expression levels for the sorted BAM file and GTF/TAB annotation file using the GEEP reads-counting utility; export the results to a file
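The steps above are numbered for reading, but only some of them depend on each other. The sketch below encodes the dependencies stated in the list (sorting needs the STAR output; BigWig generation and isoform counting need the sorted BAM) and derives a valid execution order with Python's standard `graphlib`. The step names are hypothetical labels, and the edge from trimming to STAR is an assumption: the list itself says STAR reads the input FASTQ file.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the set of steps whose output it needs.
STEP_DEPENDENCIES: Dict[str, Set[str]] = {
    "trim_adapters": set(),
    "star_align": {"trim_adapters"},  # assumption, see the note above
    "fastx_quality_stats": set(),     # reads the input FASTQ directly
    "samtools_sort": {"star_align"},  # needs the unsorted BAM from STAR
    "bam_to_bigwig": {"samtools_sort"},
    "bowtie_rrna": set(),             # reads the input FASTQ directly
    "geep_isoforms": {"samtools_sort"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one order in which every step follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, step in enumerate(execution_order(STEP_DEPENDENCIES), 1):
        print(position, step)
```

Steps without a dependency between them, such as the quality statistics and the rRNA contamination check, could run in parallel with the alignment branch.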

https://github.com/datirium/workflows.git

Path: workflows/trim-rnaseq-se-dutp.cwl

Branch/Commit ID: 564156a9e1cc7c3679a926c479ba3ae133b1bfd4

workflow graph Deprecated. RNA-Seq pipeline single-read strand specific

Note: should be updated

The original [BioWardrobe's](https://biowardrobe.com) [PubMed ID:26248465](https://www.ncbi.nlm.nih.gov/pubmed/26248465) **RNA-Seq** basic analysis for a **strand specific single-read** experiment. A corresponding input [FASTQ](http://maq.sourceforge.net/fastq.shtml) file has to be provided.

The current workflow should be used only with single-read RNA-Seq data. It performs the following steps:
1. Use STAR to align reads from the input FASTQ file according to the predefined reference indices; generate an unsorted BAM file and an alignment statistics file
2. Use fastx_quality_stats to analyze the input FASTQ file and generate a quality statistics file
3. Use samtools sort to generate a coordinate sorted BAM(+BAI) file pair from the unsorted BAM file obtained in step 1 (after running STAR)
4. Generate a BigWig file from the sorted BAM file
5. Map the input FASTQ file to predefined rRNA reference indices using Bowtie to determine the level of rRNA contamination; export the resulting statistics to a file
6. Calculate isoform expression levels for the sorted BAM file and GTF/TAB annotation file using the GEEP reads-counting utility; export the results to a file

https://github.com/datirium/workflows.git

Path: workflows/rnaseq-se-dutp.cwl

Branch/Commit ID: cc6fa135d04737fdde3b4414d6e214cf8c812f6e