CWL Workflow: plant2human main workflow

Fetched 2025-09-05 11:58:37 GMT

Verified with cwltool version 3.1.20230201224320

plant2human main workflow: compare structural similarity and sequence similarity. Compare distantly related species, such as plants and humans, using measures of structural similarity and sequence similarity. This workflow contributes to the discovery of protein-coding genes that are “sequence dissimilar but structurally similar”.

This workflow is Open Source and may be reused according to the terms of: MIT License

Note that the tools invoked by the workflow may have separate licenses.

Inputs

ID	Type	Title	Doc
EVALUE	Double	e-value (foldseek easy-search)	e-value threshold for foldseek easy-search. workflow default: 0.1
THREADS	Integer	threads (foldseek easy-search)	threads for foldseek easy-search. default: 16
ROUTE_DATASET	String	route dataset (ID conversion using togoID)	route dataset for ID conversion. This operation selects the UniProt ID of the target species (human) for which cross-references exist (final destination is HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol
ALIGNMENT_TYPE	Integer	alignment type (foldseek easy-search)	alignment type for foldseek easy-search. default: 2 (3Di + AA: local alignment). For detailed information, see the foldseek GitHub repository.
FOLDSEEK_INDEX	File	foldseek index files	foldseek index files for foldseek easy-search input. default: ../index/index_swissprot/swissprot. Note: As of 2025/02/02, downloading and indexing the database has not been incorporated into the workflow, so run the following command in advance. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8`
INPUT_DIRECTORY	Directory	input protein structure file directory	query protein structure file (default: mmCIF) directory for foldseek easy-search input.
TAXONOMY_ID_LIST	String	taxonomy id list (foldseek easy-search)	comma-separated taxonomy id list. Be sure to include “9606”. default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon)
OUTPUT_FILE_NAME1	String [File name]	output file name (foldseek easy-search)	output file name for foldseek easy-search result. Currently, this workflow only supports TSV file output.
OUTPUT_FILE_NAME2	String [File name]	output file name (extract target species)	output file name for extract target species (default: human) python script.
OUTPUT_FILE_NAME3	String [File name]	output file name (ID conversion using togoID)	output file name for ID conversion. default: foldseek_hit_species_togoid_convert.tsv
OUT_NOTEBOOK_NAME	String [File name]	output notebook name (papermill process)	output notebook name for papermill. After the workflow produces the notebook, it can be freely customized, for example by changing parameter values. default: plant2human_report.ipynb
FILE_MATCH_PATTERN	String	file match pattern	file match pattern for listing input files. default: *.cif
SPLIT_MEMORY_LIMIT	String	split memory limit (foldseek easy-search)	split memory limit for foldseek easy-search. default: 120G
QUERY_GENE_LIST_TSV	File [TSV]	query gene list tsv (papermill process)	query gene list tsv file. Retrieve files in advance. default: rice random gene list
QUERY_IDMAPPING_TSV	File [TSV]	query idmapping tsv (papermill process)	query idmapping tsv file. Retrieve files in advance. default: rice UniProt ID mapping file
OUTPUT_FILE_NAME_HIT_SPECIES	String [File name]	output file name (extract hit species column)	output file name for extract hit species column python script. default: foldseek_result_hit_species.txt
WF_COLUMN_NUMBER_HIT_SPECIES	Integer	column number of hit species	column number of hit species. default: 2 (UniProt ID list)
OUTPUT_FILE_NAME_QUERY_SPECIES	String [File name]	output file name (extract query species column)	output file name for extract query species column python script. default: foldseek_result_query_species.txt
WF_COLUMN_NUMBER_QUERY_SPECIES	Integer	column number of query species	column number of query species. default: 1 (UniProt ID list)
SW_INPUT_FASTA_FILE_HIT_SPECIES	File [FASTA]	input fasta file (blastdbcmd process)	input fasta file for blastdbcmd. Retrieve files in advance. default: human UniProt FASTA file
SW_INPUT_FASTA_FILE_QUERY_SPECIES	File [FASTA]	input fasta file (blastdbcmd process)	input fasta file for blastdbcmd. Retrieve files in advance. default: rice UniProt FASTA file
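
The inputs above can be collected into a job file for a CWL runner. The following is a minimal sketch, not the authors' own job file: it fills in only the defaults stated in the table, and every path, the two output names without a stated default (OUTPUT_FILE_NAME1, OUTPUT_FILE_NAME2), and the helper name `build_job` are placeholders chosen for illustration. It relies on the fact that CWL job files may be written as JSON and that File and Directory inputs are objects with `class` and `path` fields.

```python
import json

# Defaults stated in the Inputs table above.
DEFAULT_TAXONOMY_IDS = ("9606", "10090", "3702", "4577", "4529")
DEFAULT_ROUTE = "uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol"


def build_job(input_dir, foldseek_index, gene_list_tsv, idmapping_tsv,
              query_fasta, hit_fasta, taxonomy_ids=DEFAULT_TAXONOMY_IDS):
    """Return a job-parameter dict for the plant2human workflow (hypothetical helper)."""
    ids = [str(t).strip() for t in taxonomy_ids]
    # The table asks that the human taxonomy ID always be present.
    if "9606" not in ids:
        raise ValueError("TAXONOMY_ID_LIST must include 9606 (human)")

    def cwl_file(path):
        return {"class": "File", "path": path}

    return {
        "EVALUE": 0.1,
        "THREADS": 16,
        "ROUTE_DATASET": DEFAULT_ROUTE,
        "ALIGNMENT_TYPE": 2,
        "FOLDSEEK_INDEX": cwl_file(foldseek_index),
        "INPUT_DIRECTORY": {"class": "Directory", "path": input_dir},
        "TAXONOMY_ID_LIST": ",".join(ids),
        # No default is stated for the next two names; these are placeholders.
        "OUTPUT_FILE_NAME1": "foldseek_result.tsv",
        "OUTPUT_FILE_NAME2": "foldseek_result_target_species.tsv",
        "OUTPUT_FILE_NAME3": "foldseek_hit_species_togoid_convert.tsv",
        "OUT_NOTEBOOK_NAME": "plant2human_report.ipynb",
        "FILE_MATCH_PATTERN": "*.cif",
        "SPLIT_MEMORY_LIMIT": "120G",
        "QUERY_GENE_LIST_TSV": cwl_file(gene_list_tsv),
        "QUERY_IDMAPPING_TSV": cwl_file(idmapping_tsv),
        "OUTPUT_FILE_NAME_HIT_SPECIES": "foldseek_result_hit_species.txt",
        "WF_COLUMN_NUMBER_HIT_SPECIES": 2,
        "OUTPUT_FILE_NAME_QUERY_SPECIES": "foldseek_result_query_species.txt",
        "WF_COLUMN_NUMBER_QUERY_SPECIES": 1,
        "SW_INPUT_FASTA_FILE_HIT_SPECIES": cwl_file(hit_fasta),
        "SW_INPUT_FASTA_FILE_QUERY_SPECIES": cwl_file(query_fasta),
    }


if __name__ == "__main__":
    # All paths here are placeholders; replace them with files retrieved in advance.
    job = build_job(
        input_dir="path/to/query_structures",
        foldseek_index="../index/index_swissprot/swissprot",
        gene_list_tsv="path/to/query_gene_list.tsv",
        idmapping_tsv="path/to/query_idmapping.tsv",
        query_fasta="path/to/query_species.fasta",
        hit_fasta="path/to/hit_species.fasta",
    )
    print(json.dumps(job, indent=2))
```

Saving the printed JSON to a file and passing it to the runner after the workflow path (for example `cwltool plant2human.cwl job.json`, file names assumed) is the usual way to supply these values.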

Steps

ID	Runs	Label	Doc
papermill	../Tools/19_papermill.cwl (CommandLineTool)	papermill execution	papermill execution for plant2human notebook report. This notebook includes a scatterplot of structural similarity vs. sequence similarity, etc. It can be customized according to the user's needs.
togoid_convert	../Tools/18_togoid_convert.cwl (CommandLineTool)	ID conversion using TOGO ID API	ID conversion using TOGO ID API. Process for selecting hits in UniProt entries of target species (human in this workflow) hit by Foldseek for which cross-referencing to HGNC is maintained. This process can be combined to make it easier to interpret the results of Foldseek. [TogoID Article] doi:10.1093/bioinformatics/btac491 [New TogoID Article] doi:10.1186/s13326-024-00322-1 [TogoID Web Application] (2025/02/02 checked) https://togoid.dbcls.jp/
extract_target_species	../Tools/12_extract_target_species.cwl (CommandLineTool)	extract target species	extract target species (in this workflow, human is used as target species) from foldseek easy-search result using python script: ../scripts/extract_target_species.py
extract_hit_species_column	../Tools/13_extract_id.cwl (CommandLineTool)	extract result	extract result from tsv file based on taxonomy id (9606) process: awk -> sort -> uniq -> redirect to uniprot_id.txt
extract_query_species_column	../Tools/13_extract_id.cwl (CommandLineTool)	extract result	extract result from tsv file based on taxonomy id (9606) process: awk -> sort -> uniq -> redirect to uniprot_id.txt
sub_workflow_foldseek_easy_search	10_foldseek_easy_search_wf.cwl (Workflow)	foldseek easy-search workflow	foldseek easy-search sub-workflow for plant2human workflow. Step 1: listing files. Step 2: foldseek easy-search process.
sub_workflow_retrieve_sequence_query_species	11_retrieve_sequence_wf.cwl (Workflow)	retrieve sequence and perform pairwise alignment (sub-workflow process)	Perform pairwise alignment of protein sequences for pairs identified by structural similarity search. Step 1: retrieve sequence from blastdbcmd result. Step 2: makeblastdb: ../Tools/14_makeblastdb.cwl. Step 3: blastdbcmd: ../Tools/15_blastdbcmd.cwl. Step 4: seqretsplit: ../Tools/16_seqretsplit.cwl. Step 5: needle (Global alignment): ../Tools/17_needle.cwl. Step 6: water (Local alignment): ../Tools/17_water.cwl.
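
The two column-extraction steps are described as an `awk -> sort -> uniq` pipeline over a TSV file. The sketch below shows the same idea in Python and is an illustration only, not the contents of `13_extract_id.cwl`: the function name `extract_unique_column` is hypothetical, fields are assumed to be tab-separated, and the sample rows are made-up placeholders rather than real foldseek output. The column number is 1-based, matching the WF_COLUMN_NUMBER_QUERY_SPECIES (1) and WF_COLUMN_NUMBER_HIT_SPECIES (2) inputs.

```python
def extract_unique_column(tsv_text, column_number):
    """Return the sorted unique values of a 1-based column in tab-separated text.

    Hypothetical stand-in for the awk -> sort -> uniq step; rows that are
    blank or too short to contain the requested column are skipped.
    """
    if column_number < 1:
        raise ValueError("column_number is 1-based and must be >= 1")
    values = set()
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if column_number <= len(fields):
            values.add(fields[column_number - 1])
    return sorted(values)


if __name__ == "__main__":
    # Placeholder rows: column 1 = query ID, column 2 = hit ID (assumed layout).
    sample = "queryA\thitX\nqueryA\thitY\nqueryB\thitX\n"
    print(extract_unique_column(sample, 1))  # unique query-side IDs
    print(extract_unique_column(sample, 2))  # unique hit-side IDs
```

Using a set followed by `sorted` gives the same result as sorting and then dropping adjacent duplicates, which is why one helper can serve both the query and the hit column.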

Outputs

ID	Type	Label	Doc
DIR1	Directory	directory (seqretsplit query species)	directory for seqretsplit query species.
DIR2	Directory	directory (seqretsplit hit species)	directory for seqretsplit hit species.
DIR3	Directory	needle result directory	needle (global alignment) result directory.
DIR4	Directory	water result directory	water (local alignment) result directory.
IDLIST1	File	output file (extract query species column)	extract query species column UniProt ID list file.
IDLIST2	File	output file (extract hit species column)	extract hit species column UniProt ID list file.
LOGFILE1	File	logfile (blastdbcmd query species)	logfile for blastdbcmd query species.
LOGFILE2	File	logfile (blastdbcmd hit species)	logfile for blastdbcmd hit species.
TSVFILE1	File [TSV]	output file (foldseek easy-search result)	output file for foldseek easy-search all hit result.
TSVFILE2	File [TSV]	output file (extract target species)	extract target species foldseek result file. (in this workflow, human result only)
TSVFILE3	File [TSV]	output file (togoid convert)	output file for togoid convert.
INDEX_DIR1	Directory	index directory (query species)	index directory for query species.
INDEX_DIR2	Directory	index directory (hit species)	index directory for hit species.
FASTA_FILES1	File[] [FASTA]	split fasta files (seqretsplit query species)	split fasta files using seqretsplit for pairwise sequence alignment.
FASTA_FILES2	File[] [FASTA]	split fasta files (seqretsplit hit species)	split fasta files using seqretsplit for pairwise sequence alignment.
INDEX_FILES1	File	index files (query species)	index files for query species.
INDEX_FILES2	File	index files (hit species)	index files for hit species.
REPORT_NOTEBOOK	File	output notebook (papermill)	output notebook using papermill. notebook name is `plant2human_report.ipynb`.
WATER_RESULT_FILE	File[]	water result file (.water)	water (local alignment) result files. suffix is .water.
BLASTDBCMD_RESULT1	File [FASTA]	blastdbcmd result (query species)	blastdbcmd result file for query species.
BLASTDBCMD_RESULT2	File [FASTA]	blastdbcmd result (hit species)	blastdbcmd result file for hit species.
NEEDLE_RESULT_FILE	File[]	needle result file (.needle)	needle (global alignment) result files. suffix is .needle.
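
After a run, a CWL runner reports an output object keyed by the output IDs listed above. As a rough sketch, assuming that object has been saved as JSON, the check below reports which of the 22 IDs are absent or null. The helper name `missing_outputs` and the file name `outputs.json` are assumptions for illustration and are not part of the workflow.

```python
import json

# Output IDs copied from the Outputs table above.
EXPECTED_OUTPUTS = [
    "DIR1", "DIR2", "DIR3", "DIR4",
    "IDLIST1", "IDLIST2",
    "LOGFILE1", "LOGFILE2",
    "TSVFILE1", "TSVFILE2", "TSVFILE3",
    "INDEX_DIR1", "INDEX_DIR2",
    "FASTA_FILES1", "FASTA_FILES2",
    "INDEX_FILES1", "INDEX_FILES2",
    "REPORT_NOTEBOOK",
    "WATER_RESULT_FILE",
    "BLASTDBCMD_RESULT1", "BLASTDBCMD_RESULT2",
    "NEEDLE_RESULT_FILE",
]


def missing_outputs(output_object):
    """Return the expected output IDs that are absent or null (hypothetical helper)."""
    return [key for key in EXPECTED_OUTPUTS if output_object.get(key) is None]


if __name__ == "__main__":
    # "outputs.json" is a placeholder for wherever the runner's output object was saved.
    with open("outputs.json") as handle:
        result = json.load(handle)
    absent = missing_outputs(result)
    print("all outputs present" if not absent else f"missing: {absent}")
```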

Permalink: https://w3id.org/cwl/view/git/ad71cdbde9ec1af0f73c8dcee0bb16db8bc09584/Workflow/plant2human.cwl