Workflow: plant2human main workflow
plant2human main workflow: compare structural similarity and sequence similarity. This workflow compares distantly related species, such as plants and humans, using measures of structural similarity and sequence similarity. It contributes to the discovery of protein-coding genes that are “sequence dissimilar but structurally similar”.
Inputs
ID | Type | Title | Doc |
---|---|---|---|
EVALUE | Double | e-value (foldseek easy-search) | e-value threshold for foldseek easy-search. workflow default: 0.1 |
THREADS | Integer | threads (foldseek easy-search) | number of threads for foldseek easy-search. default: 16 |
ROUTE_DATASET | String | route dataset (ID conversion using TogoID) | route of datasets for ID conversion. This operation selects the UniProt IDs of the target species (human) for which cross-references exist (the final destination is the HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol |
ALIGNMENT_TYPE | Integer | alignment type (foldseek easy-search) | alignment type for foldseek easy-search. default: 2 (3Di + AA, local alignment). For details, see the foldseek GitHub repository. |
FOLDSEEK_INDEX | File | foldseek index files | foldseek index files for foldseek easy-search input. default: ../index/index_swissprot/swissprot. Note: as of 2025/02/02, downloading and indexing the database has not been incorporated into the workflow, so please run the following command in advance. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8` |
INPUT_DIRECTORY | Directory | input protein structure file directory | directory of query protein structure files (default format: mmCIF) for foldseek easy-search input. |
TAXONOMY_ID_LIST | String | taxonomy id list (foldseek easy-search) | comma-separated taxonomy id list. Be sure to include “9606”. default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon) |
OUTPUT_FILE_NAME1 | String [File name] | output file name (foldseek easy-search) | output file name for the foldseek easy-search result. Currently, this workflow only supports TSV file output. |
OUTPUT_FILE_NAME2 | String [File name] | output file name (extract target species) | output file name for the extract target species Python script (default target species: human). |
OUTPUT_FILE_NAME3 | String [File name] | output file name (ID conversion using TogoID) | output file name for ID conversion. default: foldseek_hit_species_togoid_convert.tsv |
OUT_NOTEBOOK_NAME | String [File name] | output notebook name (papermill process) | output notebook name for papermill. Once the workflow has produced the notebook, it can be freely customized, for example by changing parameter values. default: plant2human_report.ipynb |
FILE_MATCH_PATTERN | String | file match pattern | file match pattern for listing input files. default: *.cif |
SPLIT_MEMORY_LIMIT | String | split memory limit (foldseek easy-search) | split memory limit for foldseek easy-search. default: 120G |
QUERY_GENE_LIST_TSV | File [TSV] | query gene list tsv (papermill process) | query gene list tsv file. Retrieve the file in advance. default: rice random gene list |
QUERY_IDMAPPING_TSV | File [TSV] | query idmapping tsv (papermill process) | query idmapping tsv file. Retrieve the file in advance. default: rice UniProt ID mapping file |
OUTPUT_FILE_NAME_HIT_SPECIES | String [File name] | output file name (extract hit species column) | output file name for the extract hit species column Python script. default: foldseek_result_hit_species.txt |
WF_COLUMN_NUMBER_HIT_SPECIES | Integer | column number of hit species | column number of hit species. default: 2 (UniProt ID list) |
OUTPUT_FILE_NAME_QUERY_SPECIES | String [File name] | output file name (extract query species column) | output file name for the extract query species column Python script. default: foldseek_result_query_species.txt |
WF_COLUMN_NUMBER_QUERY_SPECIES | Integer | column number of query species | column number of query species. default: 1 (UniProt ID list) |
SW_INPUT_FASTA_FILE_HIT_SPECIES | File [FASTA] | input fasta file (blastdbcmd process) | input fasta file for blastdbcmd (hit species). Retrieve the file in advance. default: human UniProt FASTA file |
SW_INPUT_FASTA_FILE_QUERY_SPECIES | File [FASTA] | input fasta file (blastdbcmd process) | input fasta file for blastdbcmd (query species). Retrieve the file in advance. default: rice UniProt FASTA file |
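The inputs above are typically supplied to a CWL runner as a job file. The following is a minimal Python sketch, not part of the workflow itself, that assembles a partial job object from the documented defaults and checks the one hard requirement stated in the table (that `TAXONOMY_ID_LIST` includes 9606). The `check_job` helper and the `./query_structures` path are hypothetical, and the remaining inputs (output file names, TSV and FASTA files) are omitted for brevity.

```python
import json

# Partial job parameters using the defaults documented in the Inputs table.
# The INPUT_DIRECTORY path is a hypothetical placeholder.
job = {
    "EVALUE": 0.1,
    "THREADS": 16,
    "ALIGNMENT_TYPE": 2,
    "SPLIT_MEMORY_LIMIT": "120G",
    "TAXONOMY_ID_LIST": "9606,10090,3702,4577,4529",
    "FILE_MATCH_PATTERN": "*.cif",
    "ROUTE_DATASET": "uniprot,ensembl_protein,ensembl_transcript,"
                     "ensembl_gene,hgnc,hgnc_symbol",
    "WF_COLUMN_NUMBER_QUERY_SPECIES": 1,
    "WF_COLUMN_NUMBER_HIT_SPECIES": 2,
    "OUT_NOTEBOOK_NAME": "plant2human_report.ipynb",
    # CWL File and Directory inputs are objects with a "class" field.
    "FOLDSEEK_INDEX": {"class": "File",
                       "path": "../index/index_swissprot/swissprot"},
    "INPUT_DIRECTORY": {"class": "Directory", "path": "./query_structures"},
}


def check_job(params):
    """Return a list of problems found in the job parameters (hypothetical helper)."""
    problems = []
    taxa = [t.strip() for t in params["TAXONOMY_ID_LIST"].split(",")]
    if "9606" not in taxa:
        problems.append("TAXONOMY_ID_LIST must include 9606 (human)")
    if params["EVALUE"] <= 0:
        problems.append("EVALUE must be positive")
    if params["THREADS"] < 1:
        problems.append("THREADS must be at least 1")
    return problems


problems = check_job(job)
job_text = json.dumps(job, indent=2)  # CWL runners accept JSON job files
print(job_text if not problems else "\n".join(problems))
```

Checking the taxonomy list before launching avoids discovering the omission only after the foldseek search has finished.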
Steps
ID | Runs | Label | Doc |
---|---|---|---|
papermill | ../Tools/19_papermill.cwl (CommandLineTool) | papermill execution | papermill execution for the plant2human notebook report. The notebook includes, among other things, a scatterplot of structural similarity vs. sequence similarity. It can be customized according to the user's needs. |
togoid_convert | ../Tools/18_togoid_convert.cwl (CommandLineTool) | ID conversion using TogoID API | ID conversion using the TogoID API. Among the UniProt entries of the target species (human in this workflow) hit by Foldseek, this process selects those for which cross-references to HGNC are maintained. Including this process makes the Foldseek results easier to interpret. |
extract_target_species | ../Tools/12_extract_target_species.cwl (CommandLineTool) | extract target species | extract the target species (human in this workflow) from the foldseek easy-search result using the Python script ../scripts/extract_target_species.py |
extract_hit_species_column | ../Tools/13_extract_id.cwl (CommandLineTool) | extract result | extract result from tsv file based on taxonomy id (9606). process: awk -> sort -> uniq -> redirect to uniprot_id.txt |
extract_query_species_column | ../Tools/13_extract_id.cwl (CommandLineTool) | extract result | extract result from tsv file based on taxonomy id (9606). process: awk -> sort -> uniq -> redirect to uniprot_id.txt |
sub_workflow_foldseek_easy_search | 10_foldseek_easy_search_wf.cwl (Workflow) | foldseek easy-search workflow | foldseek easy-search sub-workflow for the plant2human workflow. Step 1: listing files; Step 2: foldseek easy-search process |
sub_workflow_retrieve_sequence_query_species | 11_retrieve_sequence_wf.cwl (Workflow) | retrieve sequence and perform pairwise alignment (sub-workflow process) | Perform pairwise alignment of protein sequences for pairs identified by structural similarity search. Step 1: retrieve sequence from blastdbcmd result; Step 2: makeblastdb: ../Tools/14_makeblastdb.cwl; Step 3: blastdbcmd: ../Tools/15_blastdbcmd.cwl; Step 4: seqretsplit: ../Tools/16_seqretsplit.cwl; Step 5: needle (global alignment): ../Tools/17_needle.cwl; Step 6: water (local alignment): ../Tools/17_water.cwl |
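The two column-extraction steps are described only as "awk -> sort -> uniq". The Python sketch below shows what such a pipeline amounts to: take one 1-based column of a TSV, drop duplicates, and sort. It is an illustration under that reading, not the code in `13_extract_id.cwl`; the function name `extract_unique_column` and the example rows are made up.

```python
def extract_unique_column(tsv_text, column_number):
    """Return the sorted, de-duplicated values of a 1-based TSV column."""
    values = set()
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        # Skip rows that are too short or have an empty value in this column.
        if len(fields) >= column_number and fields[column_number - 1]:
            values.add(fields[column_number - 1])
    return sorted(values)


# Made-up rows for illustration only: query ID, hit ID, score.
example_tsv = "Q1\tP1\t0.9\nQ1\tP2\t0.8\nQ2\tP1\t0.7\n"

# Column numbers follow the documented defaults:
# WF_COLUMN_NUMBER_QUERY_SPECIES = 1, WF_COLUMN_NUMBER_HIT_SPECIES = 2.
query_ids = extract_unique_column(example_tsv, 1)
hit_ids = extract_unique_column(example_tsv, 2)
print("\n".join(hit_ids))
```

The resulting ID lists are what the sequence-retrieval sub-workflow needs: each unique UniProt ID is looked up once, however many hit pairs it appears in.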
Outputs
ID | Type | Label | Doc |
---|---|---|---|
DIR1 | Directory | directory (seqretsplit query species) | directory for seqretsplit query species. |
DIR2 | Directory | directory (seqretsplit hit species) | directory for seqretsplit hit species. |
DIR3 | Directory | needle result directory | needle (global alignment) result directory. |
DIR4 | Directory | water result directory | water (local alignment) result directory. |
IDLIST1 | File | output file (extract query species column) | extract query species column UniProt ID list file. |
IDLIST2 | File | output file (extract hit species column) | extract hit species column UniProt ID list file. |
LOGFILE1 | File | logfile (blastdbcmd query species) | logfile for blastdbcmd query species. |
LOGFILE2 | File | logfile (blastdbcmd hit species) | logfile for blastdbcmd hit species. |
TSVFILE1 | File [TSV] | output file (foldseek easy-search result) | output file for foldseek easy-search all hit result. |
TSVFILE2 | File [TSV] | output file (extract target species) | extract target species foldseek result file. (in this workflow, human result only) |
TSVFILE3 | File [TSV] | output file (togoid convert) | output file for togoid convert. |
INDEX_DIR1 | Directory | index directory (query species) | index directory for query species. |
INDEX_DIR2 | Directory | index directory (hit species) | index directory for hit species. |
FASTA_FILES1 | File[] [FASTA] | split fasta files (seqretsplit query species) | split fasta files using seqretsplit for pairwise sequence alignment. |
FASTA_FILES2 | File[] [FASTA] | split fasta files (seqretsplit hit species) | split fasta files using seqretsplit for pairwise sequence alignment. |
INDEX_FILES1 | File | index files (query species) | index files for query species. |
INDEX_FILES2 | File | index files (hit species) | index files for hit species. |
REPORT_NOTEBOOK | File | output notebook (papermill) | output notebook using papermill. notebook name is `plant2human_report.ipynb`. |
WATER_RESULT_FILE | File[] | water result file (.water) | water (local alignment) result files. suffix is .water. |
BLASTDBCMD_RESULT1 | File [FASTA] | blastdbcmd result (query species) | blastdbcmd result file for query species. |
BLASTDBCMD_RESULT2 | File [FASTA] | blastdbcmd result (hit species) | blastdbcmd result file for hit species. |
NEEDLE_RESULT_FILE | File[] | needle result file (.needle) | needle (global alignment) result files. suffix is .needle. |
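To inspect the `.needle` and `.water` outputs outside the notebook, the sequence-similarity figures can be read from the report header. The sketch below assumes the files use the default EMBOSS pairwise report layout, in which header lines look like `# Identity: 45/120 (37.5%)`; the page does not state which output format the workflow requests, so treat that as an assumption. The function name `parse_alignment_stats` and the example header are made up for illustration.

```python
import re

# Matches header lines such as "# Identity:      45/120 (37.5%)".
# Assumes the default EMBOSS pairwise report layout.
_STAT_LINE = re.compile(
    r"^#\s*(Identity|Similarity|Gaps):\s*(\d+)/(\d+)\s*\(\s*([\d.]+)%\)",
    re.MULTILINE,
)


def parse_alignment_stats(report_text):
    """Return identity, similarity, and gap statistics from a report header."""
    stats = {}
    for name, count, length, percent in _STAT_LINE.findall(report_text):
        stats[name.lower()] = {
            "count": int(count),
            "length": int(length),
            "percent": float(percent),
        }
    return stats


# Made-up header for illustration only; not real workflow output.
example_report = """\
# Length: 120
# Identity:      45/120 (37.5%)
# Similarity:    60/120 (50.0%)
# Gaps:          12/120 (10.0%)
# Score: 150.5
"""

stats = parse_alignment_stats(example_report)
print(stats["identity"]["percent"], stats["similarity"]["percent"])
```

Applied to every file in `NEEDLE_RESULT_FILE` or `WATER_RESULT_FILE`, this yields one identity value per query-hit pair, which can then be set against the structural similarity scores in `TSVFILE2`.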
https://w3id.org/cwl/view/git/ad71cdbde9ec1af0f73c8dcee0bb16db8bc09584/Workflow/plant2human.cwl