CWL Workflow: plant2human workflow

Fetched 2024-11-26 05:15:05 GMT

Verified with cwltool version 3.1.20230201224320

"Novel gene discovery workflow by comparing plant species and human based on structural similarity search."

This workflow is Open Source and may be reused according to the terms of: MIT License

Note that the tools invoked by the workflow may have separate licenses.

Inputs

ID	Type	Title	Doc
EVALUE	Double	e-value (foldseek easy-search)	e-value threshold for foldseek easy-search. workflow default: 0.1
THREADS	Integer	threads (foldseek easy-search)	threads for foldseek easy-search. default: 16
ROUTE_DATASET	String	route dataset (togoid convert)	route dataset for togoid convert. This operation selects the UniProt ID of the target species (human) for which cross-references exist (final destination is HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol
FOLDSEEK_INDEX	File	foldseek index file	foldseek index file for foldseek easy-search input. This index file can be retrieved by executing the `foldseek databases` command. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8`
INPUT_DIRECTORY	Directory	input structure file directory	query protein structure file (default: mmCIF) directory for foldseek easy-search input.
TAXONOMY_ID_LIST	String	taxonomy id list (foldseek easy-search)	comma-separated taxonomy id list. Be sure to include “9606” (human). default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon)
OUTPUT_FILE_NAME1	String [File name]	output file name (foldseek easy-search)	output file name for foldseek easy-search result. Currently, this workflow only supports TSV file output.
OUTPUT_FILE_NAME2	String [File name]	output file name (extract target species)	output file name for extract target species (default: human) python script.
OUTPUT_FILE_NAME3	String [File name]	output file name (togoid convert)	output file name for togoid convert python script. default: foldseek_hit_species_togoid_convert.tsv
OUT_NOTEBOOK_NAME	String [File name]	output notebook name (papermill)	output notebook name for papermill. Once the workflow has produced the notebook, it can be freely customized, for example by changing parameter values. default: plant2human_report.ipynb
FILE_MATCH_PATTERN	String	file match pattern	file match pattern for listing input files. default: *.cif
SPLIT_MEMORY_LIMIT	String	split memory limit (foldseek easy-search)	split memory limit for foldseek easy-search. default: 120G
QUERY_GENE_LIST_TSV	File [TSV]	query gene list tsv (papermill)	query gene list tsv file. Retrieve files in advance. default: rice random gene list
QUERY_IDMAPPING_TSV	File [TSV]	query idmapping tsv (papermill)	query idmapping tsv file. Retrieve files in advance. default: rice UniProt ID mapping file
OUTPUT_FILE_NAME_HIT_SPECIES	String [File name]	output file name (extract hit species column)	output file name for extract hit species column python script. default: foldseek_result_hit_species.txt
WF_COLUMN_NUMBER_HIT_SPECIES	Integer	column number of hit species	column number of hit species. default: 2 (UniProt ID list)
OUTPUT_FILE_NAME_QUERY_SPECIES	String [File name]	output file name (extract query species column)	output file name for extract query species column python script. default: foldseek_result_query_species.txt
WF_COLUMN_NUMBER_QUERY_SPECIES	Integer	column number of query species	column number of query species. default: 1 (UniProt ID list)
SW_INPUT_FASTA_FILE_HIT_SPECIES	File [FASTA]	input fasta file (for blastdbcmd)	input fasta file for blastdbcmd. Retrieve files in advance. default: human UniProt FASTA file
SW_INPUT_FASTA_FILE_QUERY_SPECIES	File [FASTA]	input fasta file (for blastdbcmd)	input fasta file for blastdbcmd. Retrieve files in advance. default: rice UniProt FASTA file
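
The inputs above can be collected into a CWL job file. The following is a minimal sketch, not part of the workflow repository: it gathers the defaults quoted in the table, lets the caller add the remaining inputs, and checks that the taxonomy list still contains 9606. The helper names (`build_job`, `cwl_file`, `cwl_directory`), every path in the usage section, the two output file names without a documented default, and the exact comma-separated form of `TAXONOMY_ID_LIST` are assumptions made for illustration.

```python
import json

# Defaults quoted from the Inputs table. Inputs with no stated default
# (for example OUTPUT_FILE_NAME1) must be supplied by the caller.
DOCUMENTED_DEFAULTS = {
    "EVALUE": 0.1,
    "THREADS": 16,
    "ROUTE_DATASET": "uniprot,ensembl_protein,ensembl_transcript,"
                     "ensembl_gene,hgnc,hgnc_symbol",
    # Assumed serialization of the documented default list.
    "TAXONOMY_ID_LIST": "9606,10090,3702,4577,4529",
    "OUTPUT_FILE_NAME3": "foldseek_hit_species_togoid_convert.tsv",
    "OUT_NOTEBOOK_NAME": "plant2human_report.ipynb",
    "FILE_MATCH_PATTERN": "*.cif",
    "SPLIT_MEMORY_LIMIT": "120G",
    "OUTPUT_FILE_NAME_HIT_SPECIES": "foldseek_result_hit_species.txt",
    "WF_COLUMN_NUMBER_HIT_SPECIES": 2,
    "OUTPUT_FILE_NAME_QUERY_SPECIES": "foldseek_result_query_species.txt",
    "WF_COLUMN_NUMBER_QUERY_SPECIES": 1,
}


def cwl_file(path):
    """Return a CWL File object for a job file."""
    return {"class": "File", "path": path}


def cwl_directory(path):
    """Return a CWL Directory object for a job file."""
    return {"class": "Directory", "path": path}


def build_job(overrides):
    """Merge caller-supplied inputs over the documented defaults."""
    job = dict(DOCUMENTED_DEFAULTS)
    job.update(overrides)
    taxids = [t.strip() for t in job["TAXONOMY_ID_LIST"].split(",")]
    if "9606" not in taxids:
        raise ValueError("TAXONOMY_ID_LIST must include 9606 (human)")
    return job


if __name__ == "__main__":
    # All paths and the two output names below are hypothetical placeholders.
    job = build_job({
        "FOLDSEEK_INDEX": cwl_file("index_swissprot/swissprot"),
        "INPUT_DIRECTORY": cwl_directory("query_structures"),
        "OUTPUT_FILE_NAME1": "foldseek_result.tsv",
        "OUTPUT_FILE_NAME2": "foldseek_result_human.tsv",
        "QUERY_GENE_LIST_TSV": cwl_file("query_gene_list.tsv"),
        "QUERY_IDMAPPING_TSV": cwl_file("query_idmapping.tsv"),
        "SW_INPUT_FASTA_FILE_HIT_SPECIES": cwl_file("human_uniprot.fasta"),
        "SW_INPUT_FASTA_FILE_QUERY_SPECIES": cwl_file("rice_uniprot.fasta"),
    })
    print(json.dumps(job, indent=2))
```

CWL runners accept job files in JSON as well as YAML, so the printed object can be saved and passed to the runner alongside the workflow file.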

Steps

ID	Runs	Label	Doc
papermill	../Tools/19_papermill.cwl (CommandLineTool)	papermill execution	papermill execution for plant2human notebook report
togoid_convert	../Tools/18_togoid_convert.cwl (CommandLineTool)	togoid convert	togoid convert using TOGO ID API. See article: doi:10.1093/bioinformatics/btac491
extract_target_species	../Tools/12_extract_target_species.cwl (CommandLineTool)	extract target species	extract target species from foldseek easy-search result using python script. python script: ../scripts/extract_target_species.py
extract_hit_species_column	../Tools/13_extract_id.cwl (CommandLineTool)	extract result	extract result from tsv file based on taxonomy id (9606). awk -> sort -> uniq -> redirect to uniprot_id.txt
extract_query_species_column	../Tools/13_extract_id.cwl (CommandLineTool)	extract result	extract result from tsv file based on taxonomy id (9606). awk -> sort -> uniq -> redirect to uniprot_id.txt
sub_workflow_foldseek_easy_search	10_foldseek_easy_search_wf.cwl (Workflow)	foldseek easy-search workflow	foldseek easy-search workflow listing files and foldseek easy-search process
sub_workflow_retrieve_sequence_query_species	11_retrieve_sequence_wf.cwl (Workflow)	foldseek easy-search sub-workflow	retrieve sequence from blastdbcmd result. makeblastdb: ../Tools/14_makeblastdb.cwl; blastdbcmd: ../Tools/15_blastdbcmd.cwl; seqretsplit: ../Tools/16_seqretsplit.cwl; needle (Global alignment): ../Tools/17_needle.cwl; water (Local alignment): ../Tools/17_water.cwl
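
The two column-extraction steps are described as an awk -> sort -> uniq pipeline driven by a column number (1 for the query species, 2 for the hit species by default). The sketch below reproduces that idea in Python so the behavior is easier to reason about. It is not the code in `13_extract_id.cwl`: the function name is hypothetical, and it assumes plain tab-separated rows with no header line.

```python
def extract_unique_column(lines, column_number):
    """Return the sorted unique values of one TSV column.

    column_number is 1-based, matching awk's $1, $2, ... convention.
    Rows that are too short or have an empty field in that column are skipped.
    """
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if column_number <= len(fields) and fields[column_number - 1]:
            values.add(fields[column_number - 1])
    return sorted(values)


if __name__ == "__main__":
    # Hypothetical rows: query ID, hit ID, taxonomy ID.
    rows = [
        "Q_B\tH_2\t9606\n",
        "Q_A\tH_1\t9606\n",
        "Q_B\tH_1\t9606\n",
    ]
    print(extract_unique_column(rows, 1))  # query species IDs
    print(extract_unique_column(rows, 2))  # hit species IDs
```

Using a set followed by `sorted` gives the same deduplicated, ordered list that `sort | uniq` would for simple identifiers, although locale-specific sort order in a shell may differ.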

Outputs

ID	Type	Label	Doc
DIR1	Directory	directory (seqretsplit query species)	directory for seqretsplit query species.
DIR2	Directory	directory (seqretsplit hit species)	directory for seqretsplit hit species.
DIR3	Directory	needle result directory	needle (global alignment) result directory.
DIR4	Directory	water result directory	water (local alignment) result directory.
IDLIST1	File	output file (extract query species column)	extract query species column UniProt ID list file.
IDLIST2	File	output file (extract hit species column)	extract hit species column UniProt ID list file.
LOGFILE1	File	logfile (blastdbcmd query species)	logfile for blastdbcmd query species.
LOGFILE2	File	logfile (blastdbcmd hit species)	logfile for blastdbcmd hit species.
TSVFILE1	File [TSV]	output file (foldseek easy-search)	output file for foldseek easy-search all hit result.
TSVFILE2	File [TSV]	output file (extract target species)	extract target species foldseek result file. (in this workflow, human result only)
TSVFILE3	File [TSV]	output file (togoid convert)	output file for togoid convert.
INDEX_DIR1	Directory	index directory (query species)	index directory for query species.
INDEX_DIR2	Directory	index directory (hit species)	index directory for hit species.
FASTA_FILES1	File[] [FASTA]	split fasta files (seqretsplit query species)	split fasta files using seqretsplit for pairwise sequence alignment.
FASTA_FILES2	File[] [FASTA]	split fasta files (seqretsplit hit species)	split fasta files using seqretsplit for pairwise sequence alignment.
INDEX_FILES1	File	index files (query species)	index files for query species.
INDEX_FILES2	File	index files (hit species)	index files for hit species.
REPORT_NOTEBOOK	File	output notebook (papermill)	output notebook using papermill. notebook name is `plant2human_report.ipynb`.
WATER_RESULT_FILE	File[]	water result file (.water)	water (local alignment) result files. suffix is .water.
BLASTDBCMD_RESULT1	File [FASTA]	blastdbcmd result (query species)	blastdbcmd result file for query species.
BLASTDBCMD_RESULT2	File [FASTA]	blastdbcmd result (hit species)	blastdbcmd result file for hit species.
NEEDLE_RESULT_FILE	File[]	needle result file (.needle)	needle (global alignment) result files. suffix is .needle.
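
A CWL runner such as cwltool reports the final output object as JSON, keyed by the output IDs listed above. The following sketch shows one way to turn that object into a flat map from output ID to file or directory paths. The function names and the sample values are hypothetical, and the sketch assumes each File or Directory entry carries a `path` or `location` field.

```python
import json


def collect_paths(value):
    """Collect paths from a File, Directory, or (nested) list of them."""
    if isinstance(value, list):
        paths = []
        for item in value:
            paths.extend(collect_paths(item))
        return paths
    if isinstance(value, dict) and value.get("class") in ("File", "Directory"):
        path = value.get("path") or value.get("location")
        return [path] if path else []
    return []


def summarize_outputs(output_json):
    """Map each output ID to the list of paths found under it."""
    outputs = json.loads(output_json)
    return {output_id: collect_paths(v) for output_id, v in outputs.items()}


if __name__ == "__main__":
    # Hypothetical excerpt of an output object, for illustration only.
    sample = json.dumps({
        "TSVFILE3": {"class": "File",
                     "path": "/out/foldseek_hit_species_togoid_convert.tsv"},
        "NEEDLE_RESULT_FILE": [
            {"class": "File", "path": "/out/a.needle"},
            {"class": "File", "path": "/out/b.needle"},
        ],
        "DIR3": {"class": "Directory", "location": "file:///out/needle"},
    })
    for output_id, paths in summarize_outputs(sample).items():
        print(output_id, paths)
```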

Permalink: https://w3id.org/cwl/view/git/4fc824b36b986af931cc136be7a91355b772b39b/Workflow/plant2human.cwl