Workflow: Generate genome indices for STAR & bowtie

Fetched 2023-07-20 17:29:56 GMT

Creates indices for:

* [STAR](https://github.com/alexdobin/STAR) v2.5.3a (03/17/2017), PMID: [23104886](https://www.ncbi.nlm.nih.gov/pubmed/23104886)
* [bowtie](http://bowtie-bio.sourceforge.net/tutorial.shtml) v1.2.0 (12/30/2016)

It performs the following steps:

1. Runs `STAR --runMode genomeGenerate` to generate indices from [FASTA](http://zhanglab.ccmb.med.umich.edu/FASTA/) and [GTF](http://mblab.wustl.edu/GTF2.html) input files; returns results as an array of files
2. Outputs indices as a [Directory](http://www.commonwl.org/v1.0/CommandLineTool.html#Directory) data type
3. Separates the *chrNameLength.txt* file from the Directory output
4. Runs `bowtie-build` to generate indices; requires a genome [FASTA](http://zhanglab.ccmb.med.umich.edu/FASTA/) file as input and returns results as a group of main and secondary files
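As a rough illustration of the two indexing steps, the command lines could be assembled as below. This is a minimal sketch, not the workflow's actual tool wrappers: the function names and the file names in the usage lines are assumptions made for illustration. The STAR options shown (`--runMode genomeGenerate`, `--genomeDir`, `--genomeFastaFiles`, `--sjdbGTFfile`, `--runThreadN`) are standard STAR options, and `bowtie-build` takes the reference FASTA followed by the index basename.

```python
from typing import List, Optional


def star_genome_generate_cmd(
    fasta: str,
    gtf: str,
    genome_dir: str = "star_indices",
    threads: Optional[int] = None,
) -> List[str]:
    """Build (but do not run) a STAR genomeGenerate command line."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
    ]
    # Only pass a thread count when the caller supplied one.
    if threads is not None:
        cmd += ["--runThreadN", str(threads)]
    return cmd


def bowtie_build_cmd(fasta: str, index_basename: str) -> List[str]:
    """Build (but do not run) a bowtie-build command line."""
    return ["bowtie-build", fasta, index_basename]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(star_genome_generate_cmd("genome.fa", "annotation.gtf", threads=4)))
    print(" ".join(bowtie_build_cmd("genome.fa", "genome")))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` without quoting concerns.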


Inputs

ID Type Title Doc
genome String Genome type

Genome type, such as mm10, hg19, hg38, etc

threads Integer (Optional) Number of threads to run tools

Number of threads for those steps that support multithreading

cytoband File [TSV] Compressed cytoBand file for IGV browser

Compressed tab-separated cytoBand file for IGV browser

genome_file File [2bit] Reference genome file (*.2bit, *.fasta, *.fa, *.fa.gz, *.fasta.gz)

Reference genome file (*.2bit, *.fasta, *.fa, *.fa.gz, *.fasta.gz). All chromosomes are included

genome_label String (Optional) Genome label
annotation_tab File [TSV] Compressed tsv.gz annotation file

Compressed tab-separated annotation file. Doesn't include chrM

genome_details String (Optional) Genome details
chromosome_list String[] (Optional) Chromosome list to be included into the reference genome FASTA file

Filter chromosomes while extracting FASTA from 2bit

fasta_ribosomal File (Optional) [FASTA] Ribosomal DNA file (*.fasta, *.fa)

Ribosomal DNA file (*.fasta, *.fa). Default: hg19

genome_description String (Optional) Genome description
genome_sa_sparse_d Integer (Optional) Suffix array sparsity for reference genome and mitochondrial DNA indices

Suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction

effective_genome_size String Effective genome size

MACS2 effective genome sizes: hs, mm, ce, dm or number, for example 2.7e9
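The shortcut-or-number convention can be sketched as a small resolver. This is an illustrative helper, not part of the workflow: the function name is hypothetical, and the numeric values attached to the shortcuts are the ones given in the MACS2 documentation, so they are worth checking against the MACS2 version actually in use.

```python
# Shortcut values as listed in the MACS2 documentation (an assumption to
# verify against the installed MACS2 version).
MACS2_GENOME_SIZES = {"hs": 2.7e9, "mm": 1.87e9, "ce": 9e7, "dm": 1.2e8}


def resolve_effective_genome_size(value: str) -> float:
    """Turn a shortcut (hs, mm, ce, dm) or a numeric string into a number."""
    key = value.strip().lower()
    if key in MACS2_GENOME_SIZES:
        return MACS2_GENOME_SIZES[key]
    size = float(key)  # raises ValueError for anything that is not a number
    if size <= 0:
        raise ValueError(f"effective genome size must be positive: {value!r}")
    return size


if __name__ == "__main__":
    for example in ("hs", "mm", "2.7e9"):
        print(example, resolve_effective_genome_size(example))
```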

genome_chr_bin_n_bits Integer (Optional) Number of bins allocated for each chromosome of reference genome

If you are using a genome with a large (>5,000) number of references (chromosomes/scaffolds), you may need to reduce `--genomeChrBinNbits` to reduce RAM consumption. For a genome with a large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]). Default: 18
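The scaling rule above can be written out directly. This is a minimal sketch: the function name is hypothetical, and rounding the logarithm to the nearest integer is an assumption, since the formula itself does not say how to round.

```python
import math


def genome_chr_bin_n_bits(genome_length: int, n_references: int, read_length: int) -> int:
    """min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength)))."""
    if genome_length <= 0 or n_references <= 0 or read_length <= 0:
        raise ValueError("all arguments must be positive")
    per_reference = genome_length / n_references
    # Assumption: round to the nearest integer.
    return min(18, round(math.log2(max(per_reference, read_length))))


if __name__ == "__main__":
    # A 3 Gb assembly split into 100,000 scaffolds, 100 bp reads.
    print(genome_chr_bin_n_bits(3_000_000_000, 100_000, 100))
    # A 3 Gb assembly with 25 chromosomes stays at the cap of 18.
    print(genome_chr_bin_n_bits(3_000_000_000, 25, 100))
```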

genome_sa_index_n_bases Integer (Optional) Length of SA pre-indexing string for reference genome indices

Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter `--genomeSAindexNbases` must be scaled down to min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome, this is equal to 9; for a 100 kilobase genome, this is equal to 7. Default: 14
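The scaling formula and its two worked examples can be checked with a few lines. This is a sketch under one assumption: the result is rounded to the nearest integer, which is what reproduces the quoted values of 9 and 7. The function name is hypothetical.

```python
import math


def genome_sa_index_n_bases(genome_length: int) -> int:
    """min(14, log2(GenomeLength) / 2 - 1), rounded to the nearest integer."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return min(14, round(math.log2(genome_length) / 2 - 1))


if __name__ == "__main__":
    print(genome_sa_index_n_bases(1_000_000))      # 1 megabase genome
    print(genome_sa_index_n_bases(100_000))        # 100 kilobase genome
    print(genome_sa_index_n_bases(3_000_000_000))  # large genome, capped at 14
```

For 1,000,000 bases, log2 is about 19.93, so 19.93/2 - 1 is about 8.97, which rounds to 9. For 100,000 bases the same arithmetic gives about 7.30, which rounds to 7.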

limit_genome_generate_ram Long (Optional) Limit maximum available RAM (bytes) for reference genome indices generation

Maximum available RAM (bytes) for genome generation. Default 31000000000

mitochondrial_annotation_tab File [TSV] Compressed tsv.gz mitochondrial DNA annotation file

Compressed mitochondrial DNA tab-separated annotation file. Includes only chrM

genome_sa_index_n_bases_mitochondrial Integer (Optional) Length of SA pre-indexing string for mitochondrial DNA indices

Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter `--genomeSAindexNbases` must be scaled down to min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome, this is equal to 9; for a 100 kilobase genome, this is equal to 7. Default: 14
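To tie the inputs together, a job file for this workflow could be generated as sketched below. The input IDs and their required/optional status come from the table above; every file name and value is a hypothetical placeholder, and the helper names are assumptions. CWL runners accept job files in JSON, with files written as objects carrying `class: File` and a `path`.

```python
import json
from typing import Any, Dict

# Inputs not marked "(Optional)" in the table above.
REQUIRED_INPUTS = (
    "genome",
    "cytoband",
    "genome_file",
    "annotation_tab",
    "effective_genome_size",
    "mitochondrial_annotation_tab",
)


def cwl_file(path: str) -> Dict[str, str]:
    """Describe a file the way a CWL job document expects."""
    return {"class": "File", "path": path}


def build_job(**inputs: Any) -> Dict[str, Any]:
    """Return the inputs as a job dict, failing early if a required one is missing."""
    missing = [name for name in REQUIRED_INPUTS if name not in inputs]
    if missing:
        raise ValueError("missing required inputs: " + ", ".join(missing))
    return dict(inputs)


# All file names below are hypothetical placeholders.
job = build_job(
    genome="hg19",
    cytoband=cwl_file("cytoBand.txt.gz"),
    genome_file=cwl_file("hg19.2bit"),
    annotation_tab=cwl_file("refgene.tsv.gz"),
    mitochondrial_annotation_tab=cwl_file("refgene_chrM.tsv.gz"),
    effective_genome_size="hs",
    chromosome_list=["chr1", "chr2"],  # optional
    threads=4,                         # optional
)

if __name__ == "__main__":
    print(json.dumps(job, indent=2))
```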

Steps

ID Runs Label Doc
index_fasta
../tools/samtools-faidx.cwl (CommandLineTool)

Generates an FAI index file for the input FASTA file. The output file has the same basename as the input file, but with an updated `.fai` extension. `samtools faidx` writes the output file alongside the input file. To prevent the tool from failing, `input_file` should be staged into the output directory using `"writable": true`. Setting `writable: true` makes cwl-runner copy the input file and mount it into the Docker container in `rw` mode as part of `--workdir` (if set to false, the file staged into the output directory is mounted into the Docker container separately in `ro` mode).

extract_fasta
../tools/ucsc-twobit-to-fa.cwl (CommandLineTool)

twoBitToFa - Convert all or part of a .2bit file to FASTA. Outputs only those chromosomes that are set in the chr_list input. The tool will fail if chr_list includes chromosomes that are absent from the 2bit file. If a gz file is provided, gunzip is used instead of twoBitToFa. If a FASTA file is provided, nothing is done.
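The branching described above can be sketched as follows. This is only an illustration of the decision logic, not the tool's implementation: the function names are hypothetical, and choosing the branch by file extension is an assumption.

```python
from typing import Iterable, List


def extraction_action(filename: str) -> str:
    """Pick what to do with the reference genome file, based on its extension."""
    name = filename.lower()
    if name.endswith(".gz"):
        return "gunzip"        # compressed FASTA: decompress only
    if name.endswith(".2bit"):
        return "twoBitToFa"    # 2bit: convert, optionally filtered by chr_list
    if name.endswith((".fa", ".fasta")):
        return "none"          # plain FASTA: nothing to do
    raise ValueError(f"unsupported reference genome file: {filename}")


def check_chr_list(chr_list: Iterable[str], available: Iterable[str]) -> List[str]:
    """Fail early if chr_list names chromosomes that the 2bit file lacks."""
    known = set(available)
    requested = list(chr_list)
    missing = [name for name in requested if name not in known]
    if missing:
        raise ValueError("chromosomes absent from 2bit file: " + ", ".join(missing))
    return requested


if __name__ == "__main__":
    for example in ("hg19.2bit", "genome.fa.gz", "genome.fasta"):
        print(example, "->", extraction_action(example))
```

Checking `chr_list` before conversion mirrors the documented failure mode, but reports it before any conversion is attempted.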

extract_cytoband
genome-indices.cwl#extract_cytoband/0897e575-8a23-4a6c-8928-1c29451a1d59 (CommandLineTool)
prepare_annotation
genome-indices.cwl#prepare_annotation/881c8fff-6b2f-468f-b297-f36ab9f71602 (CommandLineTool)
sort_annotation_bed
../tools/linux-sort.cwl (CommandLineTool)

Tool sorts data from `unsorted_file` by key

`default_output_filename` function returns file name identical to `unsorted_file`, if `output_filename` is not provided.

star_generate_indices
../tools/star-genomegenerate.cwl (CommandLineTool)

Tool returns directory with indices generated by STAR. If genome_dir input is not provided, use default output directory name star_indices. Output chr_name_length should not be moved outside the indices folder.

bowtie_generate_indices
../tools/bowtie-build.cwl (CommandLineTool)

Tool runs bowtie-build. Unsupported parameters: -c - reference sequences given on cmd line (as <seq_in>)

annotation_bed_to_bigbed
../tools/ucsc-bedtobigbed.cwl (CommandLineTool)

Tool converts bed file to bigBed

Before running `baseCommand`, the following files are created in the Docker working directory (using `InitialWorkDirRequirement`):

* `narrowpeak.as` - default BED file structure template for ENCODE narrowPeak format
* `broadpeak.as` - default BED file structure template for ENCODE broadPeak format

`default_output_filename` function returns default output file name based on `input_bed` basename with `*.bb` extension if `output_filename` is not provided.

`get_bed_type` function returns default BED file type if `bed_type` is not provided. Depending on `input_bed` file extension the following values are returned:

* `*.narrowpeak` --> bed6+4
* `*.broadpeak` --> bed6+3
* else --> null (`bedToBigBed` will use its own default value)

`get_bed_template` function returns default BED file template if `bed_template` is not provided. Depending on `input_bed` file extension the following values are returned:

* `*.narrowpeak` --> narrowpeak.as (previously staged into Docker working directory)
* `*.broadpeak` --> broadpeak.as (previously staged into Docker working directory)
* else --> null (`bedToBigBed` will use its own default value)
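The extension-based defaults described for `get_bed_type` and `get_bed_template` can be re-expressed in Python as below. This is a sketch of the documented mapping only, not the tool's own expression code. Comparing extensions case-insensitively is an assumption, and `None` stands in for null.

```python
import os
from typing import Optional

# extension -> (default BED type, default template file)
_BED_DEFAULTS = {
    ".narrowpeak": ("bed6+4", "narrowpeak.as"),
    ".broadpeak": ("bed6+3", "broadpeak.as"),
}


def _extension(input_bed: str) -> str:
    # Assumption: extensions are compared case-insensitively.
    return os.path.splitext(input_bed)[1].lower()


def get_bed_type(input_bed: str, bed_type: Optional[str] = None) -> Optional[str]:
    """Return bed_type if given, else the default for the file extension."""
    if bed_type is not None:
        return bed_type
    defaults = _BED_DEFAULTS.get(_extension(input_bed))
    return defaults[0] if defaults else None


def get_bed_template(input_bed: str, bed_template: Optional[str] = None) -> Optional[str]:
    """Return bed_template if given, else the default for the file extension."""
    if bed_template is not None:
        return bed_template
    defaults = _BED_DEFAULTS.get(_extension(input_bed))
    return defaults[1] if defaults else None


if __name__ == "__main__":
    for name in ("peaks.narrowpeak", "peaks.broadpeak", "annotation.bed"):
        print(name, get_bed_type(name), get_bed_template(name))
```

For a plain `.bed` annotation file, as used in this workflow, both functions return `None`, leaving `bedToBigBed` to apply its own defaults.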

convert_annotation_to_bed
genome-indices.cwl#convert_annotation_to_bed/d890ffab-ac59-41f7-982e-eb92029f3673 (CommandLineTool)
ribosomal_generate_indices
../tools/bowtie-build.cwl (CommandLineTool)

Tool runs bowtie-build. Unsupported parameters: -c - reference sequences given on cmd line (as <seq_in>)

extract_mitochondrial_fasta
../tools/ucsc-twobit-to-fa.cwl (CommandLineTool)

twoBitToFa - Convert all or part of a .2bit file to FASTA. Outputs only those chromosomes that are set in the chr_list input. The tool will fail if chr_list includes chromosomes that are absent from the 2bit file. If a gz file is provided, gunzip is used instead of twoBitToFa. If a FASTA file is provided, nothing is done.

mitochondrial_generate_indices
../tools/star-genomegenerate.cwl (CommandLineTool)

Tool returns directory with indices generated by STAR. If genome_dir input is not provided, use default output directory name star_indices. Output chr_name_length should not be moved outside the indices folder.

Outputs

ID Type Label Doc
annotation File [TSV] TSV annotation file

Tab-separated annotation file. Includes reference genome and mitochondrial DNA annotations

genome_size String Effective genome size

MACS2 effective genome sizes: hs, mm, ce, dm or number, for example 2.7e9

chrom_length File [Textual format] Genome chromosome length file

Genome chromosome length file

fasta_output File [FASTA] Reference genome FASTA file

Reference genome FASTA file. Includes only selected chromosomes

star_indices Directory STAR genome indices

STAR generated genome indices folder

annotation_bed File [BED] Sorted BED annotation file

Sorted BED annotation file

annotation_gtf File [GTF] GTF annotation file

GTF annotation file. Includes reference genome and mitochondrial DNA annotations

bowtie_indices Directory Bowtie genome indices

Bowtie generated genome indices folder

cytoband_output File [TSV] CytoBand file for IGV browser

Tab-separated cytoBand file for IGV browser

fasta_fai_output File [TSV] FAI index for genome FASTA file

Tab-separated FAI index file
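For readers consuming this output, an FAI index for a FASTA file has five tab-separated columns per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. A minimal sketch of reading sequence lengths from it follows. The function name and the sample records are made up for illustration.

```python
from typing import Dict


def read_fai_lengths(fai_text: str) -> Dict[str, int]:
    """Map sequence name -> length from the text of a .fai file."""
    lengths: Dict[str, int] = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Columns: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        name, length, *_rest = line.split("\t")
        lengths[name] = int(length)
    return lengths


# Made-up records for two short sequences, for illustration only.
example_fai = "chrA\t100\t6\t60\t61\nchrB\t50\t114\t50\t51\n"

if __name__ == "__main__":
    print(read_fai_lengths(example_fai))
```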

ribosomal_indices Directory Bowtie ribosomal DNA indices

Bowtie generated ribosomal DNA indices folder

annotation_bed_tbi File [bigBed] Sorted bigBed annotation file

Sorted bigBed annotation file

mitochondrial_indices Directory STAR mitochondrial DNA indices

STAR generated mitochondrial DNA indices folder

star_indices_stderr_log File STAR stderr log for genome indices

STAR generated stderr log for genome indices

star_indices_stdout_log File STAR stdout log for genome indices

STAR generated stdout log for genome indices

bowtie_indices_stderr_log File Bowtie stderr log for genome indices

Bowtie generated stderr log for genome indices

bowtie_indices_stdout_log File Bowtie stdout log for genome indices

Bowtie generated stdout log for genome indices

ribosomal_indices_stderr_log File Bowtie stderr log for ribosomal DNA indices

Bowtie generated stderr log for ribosomal DNA indices

ribosomal_indices_stdout_log File Bowtie stdout log for ribosomal DNA indices

Bowtie generated stdout log for ribosomal DNA indices

mitochondrial_indices_stderr_log File STAR stderr log for mitochondrial DNA indices

STAR generated stderr log for mitochondrial DNA indices

mitochondrial_indices_stdout_log File STAR stdout log for mitochondrial DNA indices

STAR generated stdout log for mitochondrial DNA indices

Permalink: https://w3id.org/cwl/view/git/a8eaf61c809d76f55780b14f2febeb363cf6373f/workflows/genome-indices.cwl