CWL Workflow: Generate genome indices for STAR & bowtie

Fetched 2023-07-20 17:29:56 GMT

Verified with cwltool version 3.1.20230201224320

Creates indices for:

* [STAR](https://github.com/alexdobin/STAR) v2.5.3a (03/17/2017) PMID: [23104886](https://www.ncbi.nlm.nih.gov/pubmed/23104886)
* [bowtie](http://bowtie-bio.sourceforge.net/tutorial.shtml) v1.2.0 (12/30/2016)

It performs the following steps:

1. `STAR --runMode genomeGenerate` generates indices from [FASTA](http://zhanglab.ccmb.med.umich.edu/FASTA/) and [GTF](http://mblab.wustl.edu/GTF2.html) input files and returns the results as an array of files
2. Outputs indices as [Directory](http://www.commonwl.org/v1.0/CommandLineTool.html#Directory) data type
3. Separates the *chrNameLength.txt* file from the Directory output
4. `bowtie-build` generates indices from the genome [FASTA](http://zhanglab.ccmb.med.umich.edu/FASTA/) file and returns the results as a group of main and secondary files
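
The two indexing commands named above can be sketched as argument lists. This is a minimal illustration, not the workflow's actual tool wrappers: the function names are hypothetical, the commands are only built (never executed), and the mapping from workflow inputs such as `threads` or `limit_genome_generate_ram` to the STAR options `--runThreadN` and `--limitGenomeGenerateRAM` is an assumption. The default directory name `star_indices` is taken from the step documentation.

```python
from typing import List, Optional


def star_genome_generate_cmd(
    fasta: str,
    gtf: str,
    genome_dir: str = "star_indices",
    threads: Optional[int] = None,
    sa_index_n_bases: Optional[int] = None,
    chr_bin_n_bits: Optional[int] = None,
    sa_sparse_d: Optional[int] = None,
    limit_ram: Optional[int] = None,
) -> List[str]:
    """Build (but do not run) a STAR genomeGenerate argument list.

    Hypothetical helper: the input-to-flag mapping is assumed, not read
    from star-genomegenerate.cwl.
    """
    cmd = [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
    ]
    optional = [
        ("--genomeSAindexNbases", sa_index_n_bases),
        ("--genomeChrBinNbits", chr_bin_n_bits),
        ("--genomeSAsparseD", sa_sparse_d),
        ("--limitGenomeGenerateRAM", limit_ram),
        ("--runThreadN", threads),
    ]
    # Append only the options that were explicitly set, so that STAR
    # falls back to its own defaults for everything else.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def bowtie_build_cmd(fasta: str, index_base: str) -> List[str]:
    """Build (but do not run) a bowtie-build argument list."""
    return ["bowtie-build", fasta, index_base]


if __name__ == "__main__":
    print(" ".join(star_genome_generate_cmd("genome.fa", "annotation.gtf", threads=4)))
    print(" ".join(bowtie_build_cmd("genome.fa", "genome")))
```

Returning a list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting concerns, should it ever be executed.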

This workflow is Open Source and may be reused according to the terms of: Apache License 2.0

Note that the tools invoked by the workflow may have separate licenses.

Inputs

ID	Type	Title	Doc
genome	String	Genome type	Genome type, such as mm10, hg19, hg38, etc
threads	Integer (Optional)	Number of threads to run tools	Number of threads for those steps that support multithreading
cytoband	File [TSV]	Compressed cytoBand file for IGV browser	Compressed tab-separated cytoBand file for IGV browser
genome_file	File [2bit]	Reference genome file (.2bit, .fasta, .fa, .fa.gz, .fasta.gz)	Reference genome file (.2bit, .fasta, .fa, .fa.gz, .fasta.gz). All chromosomes are included
genome_label	String (Optional)	Genome label
annotation_tab	File [TSV]	Compressed tsv.gz annotation file	Compressed tab-separated annotation file. Doesn't include chrM
genome_details	String (Optional)	Genome details
chromosome_list	String[] (Optional)	Chromosome list to be included into the reference genome FASTA file	Filter chromosomes while extracting FASTA from 2bit
fasta_ribosomal	File (Optional) [FASTA]	Ribosomal DNA file (.fasta, .fa)	Ribosomal DNA file (.fasta, .fa). Default: hg19
genome_description	String (Optional)	Genome description
genome_sa_sparse_d	Integer (Optional)	Suffix array sparsity for reference genome and mitochondrial DNA indices	Suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction
effective_genome_size	String	Effective genome size	MACS2 effective genome sizes: hs, mm, ce, dm or number, for example 2.7e9
genome_chr_bin_n_bits	Integer (Optional)	Number of bins allocated for each chromosome of reference genome	If you are using a genome with a large (>5,000) number of references (chromosomes/scaffolds), you may need to reduce --genomeChrBinNbits to reduce RAM consumption. For a genome with a large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]). Default: 18
genome_sa_index_n_bases	Integer (Optional)	Length of SA pre-indexing string for reference genome indices	Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome this is equal to 7. Default: 14
limit_genome_generate_ram	Long (Optional)	Limit maximum available RAM (bytes) for reference genome indices generation	Maximum available RAM (bytes) for genome generation. Default 31000000000
mitochondrial_annotation_tab	File [TSV]	Compressed tsv.gz mitochondrial DNA annotation file	Compressed mitochondrial DNA tab-separated annotation file. Includes only chrM
genome_sa_index_n_bases_mitochondrial	Integer (Optional)	Length of SA pre-indexing string for mitochondrial DNA indices	Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome this is equal to 7. Default: 14
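
The two scaling rules quoted in the table for `--genomeSAindexNbases` and `--genomeChrBinNbits` can be evaluated with a short helper. This is a sketch under stated assumptions: the function names are hypothetical, and rounding to the nearest integer is an assumption chosen because it reproduces the two worked examples given above (9 for a 1 megabase genome, 7 for a 100 kilobase genome); the table itself does not say how the result should be rounded.

```python
import math


def sa_index_n_bases(genome_length: int) -> int:
    """min(14, log2(GenomeLength)/2 - 1), rounded to the nearest integer.

    The rounding mode is an assumption; it matches the examples in the
    input documentation.
    """
    scaled = math.log2(genome_length) / 2 - 1
    return min(14, round(scaled))


def chr_bin_n_bits(genome_length: int, n_references: int, read_length: int) -> int:
    """min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))).

    Rounded to the nearest integer (same assumption as above).
    """
    per_reference = max(genome_length / n_references, read_length)
    return min(18, round(math.log2(per_reference)))


if __name__ == "__main__":
    print(sa_index_n_bases(1_000_000))  # 1 megabase genome
    print(sa_index_n_bases(100_000))    # 100 kilobase genome
    print(chr_bin_n_bits(3_000_000_000, 100_000, 100))
```

A value computed this way would be passed as `genome_sa_index_n_bases`, `genome_sa_index_n_bases_mitochondrial` or `genome_chr_bin_n_bits`; for a small reference such as mitochondrial DNA the result falls well below the default of 14.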

Steps

ID	Runs	Doc
index_fasta	../tools/samtools-faidx.cwl (CommandLineTool)	Generates FAI index file for input FASTA file. The output file has the same basename as the input file, but with the `.fai` extension. `samtools faidx` writes the output file alongside the input file. To prevent the tool from failing, `input_file` should be staged into the output directory using `writable: true`. Setting `writable: true` makes the cwl-runner copy the input file and mount it into the Docker container in `rw` mode as part of `--workdir` (if set to false, the file staged into the output directory is mounted into the Docker container separately in `ro` mode)
extract_fasta	../tools/ucsc-twobit-to-fa.cwl (CommandLineTool)	twoBitToFa - Convert all or part of .2bit file to fasta. Outputs only those chromosomes that are set in the chr_list input. The tool will fail if chr_list includes chromosomes that are absent from the 2bit file. If a gz file is provided, gunzip is used instead of twoBitToFa. If a FASTA file is provided, nothing is done
extract_cytoband	genome-indices.cwl#extract_cytoband/0897e575-8a23-4a6c-8928-1c29451a1d59 (CommandLineTool)
prepare_annotation	genome-indices.cwl#prepare_annotation/881c8fff-6b2f-468f-b297-f36ab9f71602 (CommandLineTool)
sort_annotation_bed	../tools/linux-sort.cwl (CommandLineTool)	Tool sorts data from `unsorted_file` by key. The `default_output_filename` function returns a file name identical to `unsorted_file` if `output_filename` is not provided.
star_generate_indices	../tools/star-genomegenerate.cwl (CommandLineTool)	Tool returns directory with indices generated by STAR. If genome_dir input is not provided, use default output directory name star_indices. Output chr_name_length should not be moved outside the indices folder.
bowtie_generate_indices	../tools/bowtie-build.cwl (CommandLineTool)	Tool runs bowtie-build. Not supported parameters: -c - reference sequences given on cmd line (as <seq_in>)
annotation_bed_to_bigbed	../tools/ucsc-bedtobigbed.cwl (CommandLineTool)	Tool converts bed file to bigBed. Before running `baseCommand`, the following files are created in the Docker working directory (using `InitialWorkDirRequirement`): `narrowpeak.as` - default BED file structure template for ENCODE narrowPeak format; `broadpeak.as` - default BED file structure template for ENCODE broadPeak format. The `default_output_filename` function returns the default output file name based on the `input_bed` basename with the `.bb` extension if `output_filename` is not provided. The `get_bed_type` function returns the default BED file type if `bed_type` is not provided. Depending on the `input_bed` file extension, the following values are returned: `.narrowpeak` --> bed6+4; `.broadpeak` --> bed6+3; else --> null (`bedToBigBed` will use its own default value). The `get_bed_template` function returns the default BED file template if `bed_template` is not provided. Depending on the `input_bed` file extension, the following values are returned: `.narrowpeak` --> narrowpeak.as (previously staged into the Docker working directory); `.broadpeak` --> broadpeak.as (previously staged into the Docker working directory); else --> null (`bedToBigBed` will use its own default value)
convert_annotation_to_bed	genome-indices.cwl#convert_annotation_to_bed/d890ffab-ac59-41f7-982e-eb92029f3673 (CommandLineTool)
ribosomal_generate_indices	../tools/bowtie-build.cwl (CommandLineTool)	Tool runs bowtie-build. Not supported parameters: -c - reference sequences given on cmd line (as <seq_in>)
extract_mitochondrial_fasta	../tools/ucsc-twobit-to-fa.cwl (CommandLineTool)	twoBitToFa - Convert all or part of .2bit file to fasta. Outputs only those chromosomes that are set in the chr_list input. The tool will fail if chr_list includes chromosomes that are absent from the 2bit file. If a gz file is provided, gunzip is used instead of twoBitToFa. If a FASTA file is provided, nothing is done
mitochondrial_generate_indices	../tools/star-genomegenerate.cwl (CommandLineTool)	Tool returns directory with indices generated by STAR. If genome_dir input is not provided, use default output directory name star_indices. Output chr_name_length should not be moved outside the indices folder.
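
The fallback rules described for `annotation_bed_to_bigbed` can be sketched in Python. The real functions are expressions embedded in `ucsc-bedtobigbed.cwl`, so this is an illustration of the documented behavior rather than that code. Two details are assumptions: the extension comparison is case-insensitive, and the `.bb` name is formed by replacing the last extension of the `input_bed` basename.

```python
import os
from typing import Optional

# Documented defaults keyed by lowercase file extension.
BED_TYPES = {".narrowpeak": "bed6+4", ".broadpeak": "bed6+3"}
BED_TEMPLATES = {".narrowpeak": "narrowpeak.as", ".broadpeak": "broadpeak.as"}


def _extension(path: str) -> str:
    """Return the last extension in lowercase (case folding is an assumption)."""
    return os.path.splitext(path)[1].lower()


def default_output_filename(input_bed: str, output_filename: Optional[str] = None) -> str:
    """Use output_filename if given, else the input basename with .bb."""
    if output_filename:
        return output_filename
    root, _ = os.path.splitext(os.path.basename(input_bed))
    return root + ".bb"


def get_bed_type(input_bed: str, bed_type: Optional[str] = None) -> Optional[str]:
    """Use bed_type if given, else the documented default, else None."""
    if bed_type:
        return bed_type
    return BED_TYPES.get(_extension(input_bed))


def get_bed_template(input_bed: str, bed_template: Optional[str] = None) -> Optional[str]:
    """Use bed_template if given, else the staged .as file, else None."""
    if bed_template:
        return bed_template
    return BED_TEMPLATES.get(_extension(input_bed))


if __name__ == "__main__":
    for name in ("peaks.narrowPeak", "peaks.broadPeak", "annotation.bed"):
        print(name, default_output_filename(name), get_bed_type(name), get_bed_template(name))
```

A return value of `None` corresponds to the documented `null` case, in which `bedToBigBed` uses its own default.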

Outputs

ID	Type	Label	Doc
annotation	File [TSV]	TSV annotation file	Tab-separated annotation file. Includes reference genome and mitochondrial DNA annotations
genome_size	String	Effective genome size	MACS2 effective genome sizes: hs, mm, ce, dm or number, for example 2.7e9
chrom_length	File [Textual format]	Genome chromosome length file	Genome chromosome length file
fasta_output	File [FASTA]	Reference genome FASTA file	Reference genome FASTA file. Includes only selected chromosomes
star_indices	Directory	STAR genome indices	STAR generated genome indices folder
annotation_bed	File [BED]	Sorted BED annotation file	Sorted BED annotation file
annotation_gtf	File [GTF]	GTF annotation file	GTF annotation file. Includes reference genome and mitochondrial DNA annotations
bowtie_indices	Directory	Bowtie genome indices	Bowtie generated genome indices folder
cytoband_output	File [TSV]	CytoBand file for IGV browser	Tab-separated cytoBand file for IGV browser
fasta_fai_output	File [TSV]	FAI index for genome FASTA file	Tab-separated FAI index file
ribosomal_indices	Directory	Bowtie ribosomal DNA indices	Bowtie generated ribosomal DNA indices folder
annotation_bed_tbi	File [bigBed]	Sorted bigBed annotation file	Sorted bigBed annotation file
mitochondrial_indices	Directory	STAR mitochondrial DNA indices	STAR generated mitochondrial DNA indices folder
star_indices_stderr_log	File	STAR stderr log for genome indices	STAR generated stderr log for genome indices
star_indices_stdout_log	File	STAR stdout log for genome indices	STAR generated stdout log for genome indices
bowtie_indices_stderr_log	File	Bowtie stderr log for genome indices	Bowtie generated stderr log for genome indices
bowtie_indices_stdout_log	File	Bowtie stdout log for genome indices	Bowtie generated stdout log for genome indices
ribosomal_indices_stderr_log	File	Bowtie stderr log for ribosomal DNA indices	Bowtie generated stderr log for ribosomal DNA indices
ribosomal_indices_stdout_log	File	Bowtie stdout log for ribosomal DNA indices	Bowtie generated stdout log for ribosomal DNA indices
mitochondrial_indices_stderr_log	File	STAR stderr log for mitochondrial DNA indices	STAR generated stderr log for mitochondrial DNA indices
mitochondrial_indices_stdout_log	File	STAR stdout log for mitochondrial DNA indices	STAR generated stdout log for mitochondrial DNA indices

Permalink: https://w3id.org/cwl/view/git/a8eaf61c809d76f55780b14f2febeb363cf6373f/workflows/genome-indices.cwl