Workflow: DESeq - differential gene expression analysis

Fetched 2023-01-03 19:15:29 GMT

Differential gene expression analysis
=====================================

Differential gene expression analysis based on the negative binomial distribution

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution.

DESeq1
------

High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. Simon Anders and Wolfgang Huber propose a method based on the negative binomial distribution, with variance and mean linked by local regression, and present an implementation, [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html), as an R/Bioconductor package.

DESeq2
------

In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. [DESeq2](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) is a method for differential analysis of count data that uses shrinkage estimation for dispersions and fold changes to improve the stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.


Inputs

ID Type Title Doc
alias String Experiment short name/Alias
threads Integer (Optional) Number of threads

Number of threads for those steps that support multithreading

group_by Group by

Grouping method for features: isoforms, genes or common tss

batch_file File (Optional) [Textual format] Headerless TSV/CSV file for multi-factor analysis. First column - experiment names from conditions 1 and 2, second column - batch name

Metadata file for multi-factor analysis. Headerless TSV/CSV file. First column - names from --ua and --ta, second column - batch name. Default: None

rpkm_cutoff Float (Optional) Minimum RPKM cutoff. Applied before running DESeq

Minimum threshold for RPKM filtering. Default: 5

alias_cond_1 String (Optional) Alias for condition 1, aka 'untreated' (letters and numbers only)

Name to be displayed for condition 1, aka 'untreated' (letters and numbers only)

alias_cond_2 String (Optional) Alias for condition 2, aka 'treated' (letters and numbers only)

Name to be displayed for condition 2, aka 'treated' (letters and numbers only)

rpkm_genes_cond_1 File[] (Optional) [CSV] RNA-Seq experiments (condition 1, aka 'untreated')

CSV/TSV input files grouped by genes (condition 1, aka 'untreated')

rpkm_genes_cond_2 File[] (Optional) [CSV] RNA-Seq experiments (condition 2, aka 'treated')

CSV/TSV input files grouped by genes (condition 2, aka 'treated')

sample_names_cond_1 String[] (Optional) Sample names for RNA-Seq experiments (condition 1, aka 'untreated')

Aliases for RNA-Seq experiments (condition 1, aka 'untreated'), used in the legends of generated plots. The order corresponds to that of rpkm_isoforms_cond_1

sample_names_cond_2 String[] (Optional) Sample names for RNA-Seq experiments (condition 2, aka 'treated')

Aliases for RNA-Seq experiments (condition 2, aka 'treated'), used in the legends of generated plots. The order corresponds to that of rpkm_isoforms_cond_2

rpkm_isoforms_cond_1 File[] (Optional) [CSV] RNA-Seq experiments (condition 1, aka 'untreated')

CSV/TSV input files grouped by isoforms (condition 1, aka 'untreated')

rpkm_isoforms_cond_2 File[] (Optional) [CSV] RNA-Seq experiments (condition 2, aka 'treated')

CSV/TSV input files grouped by isoforms (condition 2, aka 'treated')

rpkm_common_tss_cond_1 File[] (Optional) [CSV] RNA-Seq experiments (condition 1, aka 'untreated')

CSV/TSV input files grouped by common TSS (condition 1, aka 'untreated')

rpkm_common_tss_cond_2 File[] (Optional) [CSV] RNA-Seq experiments (condition 2, aka 'treated')

CSV/TSV input files grouped by common TSS (condition 2, aka 'treated')
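
The inputs above can be collected into a CWL job document. The following is a minimal sketch in Python that builds such a document as JSON, using the input IDs from the table. The helper names (`file_entry`, `build_job`), the mapping from `group_by` to an input ID prefix, and the exact strings accepted by `group_by` are assumptions made for illustration and are not confirmed by the workflow description.

```python
import json

# Assumed mapping from the group_by choice to the input ID prefix.
# The exact values accepted by group_by are an assumption.
PREFIX = {
    "isoforms": "rpkm_isoforms",
    "genes": "rpkm_genes",
    "common tss": "rpkm_common_tss",
}


def file_entry(path):
    """Return a CWL File object for use in a job document."""
    return {"class": "File", "path": path}


def build_job(alias, untreated, treated, group_by="genes", batch_file=None):
    """Build a job dict from (sample_name, path) pairs per condition.

    Keeping names and paths paired guarantees that the sample name
    order matches the file order.
    """
    prefix = PREFIX[group_by]
    job = {
        "alias": alias,
        "group_by": group_by,
        prefix + "_cond_1": [file_entry(path) for _, path in untreated],
        prefix + "_cond_2": [file_entry(path) for _, path in treated],
        "sample_names_cond_1": [name for name, _ in untreated],
        "sample_names_cond_2": [name for name, _ in treated],
    }
    if batch_file is not None:
        job["batch_file"] = file_entry(batch_file)
    return job


if __name__ == "__main__":
    # Hypothetical sample names and file paths.
    job = build_job(
        "example",
        untreated=[("u1", "u1.genes.csv"), ("u2", "u2.genes.csv")],
        treated=[("t1", "t1.genes.csv"), ("t2", "t2.genes.csv")],
    )
    print(json.dumps(job, indent=2))
```

The printed JSON could then be saved to a file and passed to a CWL runner together with the workflow.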

Steps

ID Runs Label Doc
deseq
../tools/deseq-advanced.cwl (CommandLineTool)

The tool runs a DESeq/DESeq2 script similar to the original one from BioWardrobe. The untreated_files and treated_files input files should have the following header (case-sensitive):

<RefseqId,GeneId,Chrom,TxStart,TxEnd,Strand,TotalReads,Rpkm> - CSV
<RefseqId\tGeneId\tChrom\tTxStart\tTxEnd\tStrand\tTotalReads\tRpkm> - TSV

The format of each input file is identified from its extension:

*.csv - CSV
*.tsv - TSV

Files with any other extension are read as CSV by default.
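
The extension rule and the expected header can be illustrated with a short Python sketch. It is not the tool's own code: the function names are hypothetical, and two behaviours are assumptions, namely that the extension match is case-sensitive and that the header must contain exactly the eight listed columns in that order.

```python
import csv
import os

# Header columns as listed in the tool description (case-sensitive).
EXPECTED_HEADER = [
    "RefseqId", "GeneId", "Chrom", "TxStart",
    "TxEnd", "Strand", "TotalReads", "Rpkm",
]


def delimiter_for(path):
    """Pick the field delimiter from the file extension.

    Only ".tsv" selects tab; everything else falls back to CSV.
    """
    extension = os.path.splitext(path)[1]
    return "\t" if extension == ".tsv" else ","


def read_expression(path):
    """Read one expression file into a list of dict rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter_for(path))
        # Assumption: the header must match exactly, with no extra columns.
        if reader.fieldnames != EXPECTED_HEADER:
            raise ValueError(f"unexpected header in {path}: {reader.fieldnames}")
        return list(reader)
```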

The row order of the output file matches the row order of the first CSV/TSV file in the untreated group. Output is always saved in TSV format.

The output file includes only the rows present in all input files. Rows are intersected by RefseqId, GeneId, Chrom, TxStart, TxEnd and Strand.
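
The row intersection and ordering rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: each input file is assumed to be already loaded as a list of dict rows, the key combination is assumed to be unique within a file, and `intersect_rows` is a hypothetical name.

```python
# Columns that identify a row across input files, per the description.
KEY_COLUMNS = ("RefseqId", "GeneId", "Chrom", "TxStart", "TxEnd", "Strand")


def row_key(row):
    """Build the tuple used to match a row across files."""
    return tuple(row[column] for column in KEY_COLUMNS)


def intersect_rows(tables):
    """Return the keys shared by every table.

    `tables` is a list of tables, each a list of dict rows. The first
    table is taken to be the first file of the untreated group, and
    the result keeps its row order.
    """
    shared = {row_key(row) for row in tables[0]}
    for table in tables[1:]:
        shared &= {row_key(row) for row in table}
    return [row_key(row) for row in tables[0] if row_key(row) in shared]
```

A key that is missing from any one file is dropped from the result, so adding more input files can only shrink the output.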

DESeq/DESeq2 always compares untreated_vs_treated groups. Normalized read counts and phenotype table are exported as GCT and CLS files for GSEA downstream analysis.
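
For readers unfamiliar with the GSEA file formats, the following Python sketch writes a categorical CLS phenotype file and a GCT (version 1.2) expression file. It follows the publicly documented layouts of those formats; the details of what the tool itself writes (class names, description column, number formatting) are not given here, so the sample names and values below are hypothetical.

```python
def cls_text(labels):
    """Render per-sample class labels as categorical CLS text.

    Line 1: sample count, class count, and the constant 1.
    Line 2: "#" followed by the class names in order of first appearance.
    Line 3: one class label per sample.
    """
    classes = list(dict.fromkeys(labels))  # unique, order preserved
    lines = [
        f"{len(labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(labels),
    ]
    return "\n".join(lines) + "\n"


def gct_text(sample_names, rows):
    """Render rows of (name, description, values) as GCT v1.2 text."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description"] + list(sample_names)),
    ]
    for name, description, values in rows:
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical samples: two untreated followed by two treated.
    print(cls_text(["untreated", "untreated", "treated", "treated"]), end="")
    print(gct_text(["u1", "u2", "t1", "t2"], [("GENE1", "na", [10.5, 12.0, 30.25, 28.0])]), end="")
```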

Outputs

ID Type Label Doc
plot_pca File (Optional) [PNG] PCA plot for variance stabilized count data

PCA plot for variance stabilized count data. Values are now approximately homoskedastic (have constant variance along the range of mean values)

plot_pca_pdf File (Optional) [PDF] PCA plot for variance stabilized count data

PCA plot for variance stabilized count data. Values are now approximately homoskedastic (have constant variance along the range of mean values)

diff_expr_file File [TSV] Differentially expressed features grouped by isoforms, genes or common TSS

DESeq generated file of differentially expressed features grouped by isoforms, genes or common TSS in TSV format

phenotypes_file File [Textual format] Phenotype data file in CLS format. Compatible with GSEA

DESeq generated file with phenotypes in CLS format. Compatible with GSEA

deseq_stderr_log File [Textual format] DESeq stderr log

DESeq stderr log

deseq_stdout_log File [Textual format] DESeq stdout log

DESeq stdout log

plot_lfc_vs_mean File (Optional) [PNG] Plot of normalised mean versus log2 fold change

Plot of the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples

read_counts_file File [GCT/Res format] Normalized read counts in GCT format. Compatible with GSEA

DESeq generated file with normalized read counts in GCT format. Compatible with GSEA

gene_expr_heatmap File (Optional) [PNG] Heatmap of the 30 most highly expressed features

Heatmap showing the expression data of the 30 most highly expressed features grouped by isoforms, genes or common TSS, based on the variance stabilisation transformed data

plot_lfc_vs_mean_pdf File (Optional) [PDF] Plot of normalised mean versus log2 fold change

Plot of the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples

gene_expr_heatmap_pdf File (Optional) [PDF] Heatmap of the 30 most highly expressed features

Heatmap showing the expression data of the 30 most highly expressed features grouped by isoforms, genes or common TSS, based on the variance stabilisation transformed data

Permalink: https://w3id.org/cwl/view/git/4360fb2e778ecee42e5f78f83b78c65ab3a2b1df/workflows/deseq.cwl