Workflow: protein annotation

Fetched 2024-04-25 10:17:41 GMT

Proteins - predict, filter, cluster, identify, annotate


Inputs

ID            Type              Title  Doc
jobid         String
m5nrBDB       File
m5nrSCG       File
rnaSims       File
m5nrFull      File[]
sequences     File
rnaClustMap   File
protIdentity  Float (Optional)
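
A job order for this workflow has to supply every non-optional input in the table above. The following is a minimal sketch, in Python, of checking a job-order mapping against that table before submitting it. Only the input IDs and types come from the table; the helper names `check_job_order` and `cwl_file`, the job ID, and every file path are hypothetical placeholders.

```python
# Input IDs and types copied from the Inputs table, written in CWL shorthand:
# a trailing "?" marks the optional input, "[]" marks an array.
WORKFLOW_INPUTS = {
    "jobid": "string",
    "m5nrBDB": "File",
    "m5nrSCG": "File",
    "rnaSims": "File",
    "m5nrFull": "File[]",
    "sequences": "File",
    "rnaClustMap": "File",
    "protIdentity": "float?",
}


def check_job_order(job):
    """Return a sorted list of problems found in a job-order mapping.

    Hypothetical helper: it checks presence and list-ness only, not file contents.
    """
    problems = []
    for name, cwl_type in WORKFLOW_INPUTS.items():
        optional = cwl_type.endswith("?")
        if name not in job or job[name] is None:
            if not optional:
                problems.append(f"missing required input: {name}")
            continue
        if cwl_type.endswith("[]") and not isinstance(job[name], list):
            problems.append(f"input {name} must be a list")
    for name in job:
        if name not in WORKFLOW_INPUTS:
            problems.append(f"unknown input: {name}")
    return sorted(problems)


def cwl_file(path):
    """Build a CWL File object as it appears in a JSON or YAML job order."""
    return {"class": "File", "path": path}


# Example job order; every value below is a placeholder, not a real file name.
example_job = {
    "jobid": "job-0001",
    "m5nrBDB": cwl_file("m5nr.bdb"),
    "m5nrSCG": cwl_file("m5nr.scg"),
    "rnaSims": cwl_file("rna.sims"),
    "m5nrFull": [cwl_file("m5nr.part1"), cwl_file("m5nr.part2")],
    "sequences": cwl_file("sequences.fna"),
    "rnaClustMap": cwl_file("rna.mapping"),
    # protIdentity is optional and is left out here.
}

print(check_job_order(example_job))
```

Serialising `example_job` with `json.dumps` would give a file that a CWL runner can take as its job order, provided the placeholder paths are replaced with real ones.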

Steps

ID Runs Label Doc
catSims
../Tools/cat.tool.cwl (CommandLineTool)
GNU cat

Concatenate FILE(s) to standard output

sortProt
../Tools/seqUtil.tool.cwl (CommandLineTool)
seqUtil

Utility tool for various sequence file transformations.

sortSims
../Tools/sort.tool.cwl (CommandLineTool)
GNU sort

Sort a text file based on the given field(s)

superblat
../Tools/superblat.tool.cwl (CommandLineTool)
superBLAT

Multi-threaded, fast sequence search command-line tool, protein only.
>superblat -fastMap -prot -out blast8 <database> <query> <output>
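
The `-out blast8` option selects BLAST-style tabular output. As a rough illustration of what a downstream step reads, the sketch below parses one such line, assuming the conventional 12-column, tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score). The `Blast8Hit` type and `parse_blast8_line` function are illustrative names, not part of superBLAT or this workflow, and the sample record is made up.

```python
from typing import NamedTuple


class Blast8Hit(NamedTuple):
    """One row of BLAST tabular (blast8 / m8) output."""
    query: str
    subject: str
    identity: float    # percent identity
    length: int        # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_blast8_line(line):
    """Parse a single tab-separated blast8 line into a Blast8Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return Blast8Hit(
        fields[0],
        fields[1],
        float(fields[2]),
        *(int(f) for f in fields[3:10]),  # the seven integer columns
        float(fields[10]),
        float(fields[11]),
    )


# Made-up record for illustration only.
sample_line = "read1\tabc123\t95.00\t30\t1\t0\t1\t30\t5\t34\t1e-10\t60.5\n"
hit = parse_blast8_line(sample_line)
print(hit.query, hit.subject, hit.evalue)
```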

bleachSims
../Tools/bleachsims.tool.cwl (CommandLineTool)
bleachsims

Filter a similarity file by E-value and number of hits.
>bleachsims -s <input> -o <output> -m 20 -r 0 -c 3

protFilter
../Tools/filter_feature.tool.cwl (CommandLineTool)
filter features

Remove predicted genes that overlap with identified rRNAs.
>filter_feature.pl --seq <sequences> --sim <similarity> --clust <cluster> --output <output> --overlap <overlap> --memory <memory in MB> --tmp_dir <temp directory>

protCluster
../Tools/cdhit.tool.cwl (CommandLineTool)
CD-HIT

Cluster protein sequences, using the maximum available CPUs and memory.
>cdhit -n 5 -d 0 -T 0 -M 0 -c 0.9 -i <input> -o <output>
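
The command template above can be assembled programmatically. The sketch below builds the argument vector exactly as documented, with the identity threshold exposed as a parameter. That the workflow's optional `protIdentity` input is what feeds `-c` is an assumption, not something this page states, and the helper name `cdhit_argv` and the file names are hypothetical.

```python
def cdhit_argv(input_path, output_path, identity=0.9):
    """Build the CD-HIT argument vector from the documented template.

    `identity` defaults to the 0.9 shown in the template; wiring it to the
    workflow's protIdentity input is an assumption.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in (0, 1]")
    return [
        "cdhit",
        "-n", "5",            # word length
        "-d", "0",            # names in .clstr run up to the first whitespace
        "-T", "0",            # 0 = use all available CPUs
        "-M", "0",            # 0 = no memory limit
        "-c", str(identity),  # sequence identity threshold
        "-i", input_path,
        "-o", output_path,
    ]


# Placeholder file names, for illustration only.
argv = cdhit_argv("proteins.faa", "proteins.clustered.faa")
print(" ".join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` avoids shell quoting issues, assuming a `cdhit` executable is on the `PATH`.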

protFeature
../Tools/fraggenescan.tool.cwl (CommandLineTool)
FragGeneScan

Hidden Markov model for predicting prokaryotic coding regions.
>run_FragGeneScan.pl --genome <input> --out <output> --complete 0 --train 454_30

annotateSims
../Tools/sims_annotate.tool.cwl (CommandLineTool)
annotate sims

Create expanded, annotated sims files from an input md5 sim file and the m5nr database.
>sims_annotate.pl --verbose --in_sim <input> --in_scg <scgs> --ann_file <database> --format <seqFormat> --out_filter <outFilter> --out_expand <outExpand> -out_lca <outLca> --frag_num 5000

formatCluster
../Tools/format_cluster.tool.cwl (CommandLineTool)
cluster file reformat

Reformats a cd-hit .clstr file into an mg-rast .mapping file.
>format_cluster.pl --input <input> --output <output>
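
To make the input side of this step concrete, here is a small sketch that reads the CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines, with the representative marked by a trailing `*`) into an in-memory structure. It is not a reimplementation of `format_cluster.pl`: the mg-rast `.mapping` layout is not described on this page, so the sketch stops before writing any output. The function name `parse_clstr` and the sample records are made up.

```python
def parse_clstr(text):
    """Parse CD-HIT .clstr text into a list of clusters.

    Each cluster is a dict with the representative's name and the names of
    the other members, in file order.
    """
    clusters = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = {"representative": None, "members": []}
            clusters.append(current)
            continue
        if current is None:
            raise ValueError("member line found before any >Cluster header")
        # Member lines look like: "0<TAB>120aa, >seqA... *"
        start = line.index(", >") + 3
        end = line.rindex("...")
        name = line[start:end]
        if line.endswith("*"):
            current["representative"] = name
        else:
            current["members"].append(name)
    return clusters


# Made-up .clstr content for illustration only.
sample_clstr = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 95.00%\n"
    ">Cluster 1\n"
    "0\t90aa, >seqC... *\n"
)
clusters = parse_clstr(sample_clstr)
print(clusters)
```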

Outputs

ID                    Type  Label  Doc
protLCAOut            File
protSimsOut           File
protExpandOut         File
protFilterOut         File
protFeatureOut        File
protClustMapOut       File
protClustSeqOut       File
protFilterFeatureOut  File

Permalink: https://w3id.org/cwl/view/git/f5839797da8209a9d3e441023f88130219751020/CWL/Workflows/protein-filter-annotation.workflow.cwl