Workflow: plant2human workflow
"Novel gene discovery workflow by comparing plant species and human based on structural similarity search."
Inputs
ID | Type | Title | Doc |
---|---|---|---|
EVALUE | Double | e-value (foldseek easy-search) | e-value threshold for foldseek easy-search. workflow default: 0.1 |
THREADS | Integer | threads (foldseek easy-search) | threads for foldseek easy-search. default: 16 |
ROUTE_DATASET | String | route dataset (togoid convert) | route dataset for togoid convert. This operation selects the UniProt IDs of the target species (human) for which cross-references exist (the final destination is the HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol |
FOLDSEEK_INDEX | File | foldseek index file | foldseek index file for foldseek easy-search input. This index file can be retrieved by executing the `foldseek databases` command. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8` |
INPUT_DIRECTORY | Directory | input structure file directory | directory of query protein structure files (default format: mmCIF) for foldseek easy-search input. |
TAXONOMY_ID_LIST | String | taxonomy id list (foldseek easy-search) | taxonomy id list, separated by commas. Be sure to include 9606. default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon) |
OUTPUT_FILE_NAME1 | String [File name] | output file name (foldseek easy-search) | output file name for the foldseek easy-search result. Currently, this workflow only supports TSV file output. |
OUTPUT_FILE_NAME2 | String [File name] | output file name (extract target species) | output file name for the extract target species python script (default target species: human). |
OUTPUT_FILE_NAME3 | String [File name] | output file name (togoid convert) | output file name for the togoid convert python script. default: foldseek_hit_species_togoid_convert.tsv |
OUT_NOTEBOOK_NAME | String [File name] | output notebook name (papermill) | output notebook name for papermill. Once the workflow has produced the notebook, it can be freely customized, for example by changing parameter values. default: plant2human_report.ipynb |
FILE_MATCH_PATTERN | String | file match pattern | file match pattern for listing input files. default: `*.cif` |
SPLIT_MEMORY_LIMIT | String | split memory limit (foldseek easy-search) | split memory limit for foldseek easy-search. default: 120G |
QUERY_GENE_LIST_TSV | File [TSV] | query gene list tsv (papermill) | query gene list tsv file. Retrieve files in advance. default: rice random gene list |
QUERY_IDMAPPING_TSV | File [TSV] | query idmapping tsv (papermill) | query idmapping tsv file. Retrieve files in advance. default: rice UniProt ID mapping file |
OUTPUT_FILE_NAME_HIT_SPECIES | String [File name] | output file name (extract hit species column) | output file name for the extract hit species column python script. default: foldseek_result_hit_species.txt |
WF_COLUMN_NUMBER_HIT_SPECIES | Integer | column number of hit species | column number of hit species. default: 2 (UniProt ID list) |
OUTPUT_FILE_NAME_QUERY_SPECIES | String [File name] | output file name (extract query species column) | output file name for the extract query species column python script. default: foldseek_result_query_species.txt |
WF_COLUMN_NUMBER_QUERY_SPECIES | Integer | column number of query species | column number of query species. default: 1 (UniProt ID list) |
SW_INPUT_FASTA_FILE_HIT_SPECIES | File [FASTA] | input fasta file (for blastdbcmd) | input fasta file for blastdbcmd. Retrieve files in advance. default: human UniProt FASTA file |
SW_INPUT_FASTA_FILE_QUERY_SPECIES | File [FASTA] | input fasta file (for blastdbcmd) | input fasta file for blastdbcmd. Retrieve files in advance. default: rice UniProt FASTA file |
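The inputs above are normally supplied to a CWL runner as a job file, which may be written as JSON. The following is a minimal sketch of how such a file could be assembled, using only the IDs and defaults listed in the table. The file paths are hypothetical placeholders, the comma-separated taxonomy string without spaces is an assumption about the expected format, and the helper `parse_taxonomy_ids` is introduced here purely for illustration. It is not a complete job file: inputs not shown would either rely on workflow defaults or need to be added.

```python
import json


def parse_taxonomy_ids(value):
    """Split a comma-separated taxonomy ID string and require 9606 (human)."""
    ids = [item.strip() for item in value.split(",") if item.strip()]
    if "9606" not in ids:
        raise ValueError("TAXONOMY_ID_LIST must include 9606 (human)")
    return ids


# Scalar values are the defaults documented in the Inputs table.
job = {
    "EVALUE": 0.1,
    "THREADS": 16,
    "ROUTE_DATASET": "uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol",
    "TAXONOMY_ID_LIST": "9606,10090,3702,4577,4529",  # assumed format: no spaces
    "FILE_MATCH_PATTERN": "*.cif",
    "SPLIT_MEMORY_LIMIT": "120G",
    "WF_COLUMN_NUMBER_QUERY_SPECIES": 1,
    "WF_COLUMN_NUMBER_HIT_SPECIES": 2,
    "OUT_NOTEBOOK_NAME": "plant2human_report.ipynb",
    # File and Directory inputs use CWL object syntax; these paths are hypothetical.
    "FOLDSEEK_INDEX": {"class": "File", "path": "index_swissprot/swissprot"},
    "INPUT_DIRECTORY": {"class": "Directory", "path": "input_structures"},
}

# Fail early if the human taxonomy ID is missing.
taxonomy_ids = parse_taxonomy_ids(job["TAXONOMY_ID_LIST"])

# Serialize the job; the resulting text could be saved as, e.g., job.json.
job_json = json.dumps(job, indent=2)

if __name__ == "__main__":
    print(job_json)
```

Validating `TAXONOMY_ID_LIST` before launching the run is a small design choice: the table states that 9606 must be set, and catching the omission up front is cheaper than discovering it after a long foldseek search.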
Steps
ID | Runs | Label | Doc |
---|---|---|---|
papermill | ../Tools/19_papermill.cwl (CommandLineTool) | papermill execution | papermill execution for plant2human notebook report |
togoid_convert | ../Tools/18_togoid_convert.cwl (CommandLineTool) | togoid convert | togoid convert using the TogoID API. see article: doi:10.1093/bioinformatics/btac491 |
extract_target_species | ../Tools/12_extract_target_species.cwl (CommandLineTool) | extract target species | extract target species from the foldseek easy-search result using a python script. python script: ../scripts/extract_target_species.py |
extract_hit_species_column | ../Tools/13_extract_id.cwl (CommandLineTool) | extract result | extract result from tsv file based on taxonomy id (9606). awk -> sort -> uniq -> redirect to uniprot_id.txt |
extract_query_species_column | ../Tools/13_extract_id.cwl (CommandLineTool) | extract result | extract result from tsv file based on taxonomy id (9606). awk -> sort -> uniq -> redirect to uniprot_id.txt |
sub_workflow_foldseek_easy_search | 10_foldseek_easy_search_wf.cwl (Workflow) | foldseek easy-search workflow | foldseek easy-search workflow: listing files and the foldseek easy-search process |
sub_workflow_retrieve_sequence_query_species | 11_retrieve_sequence_wf.cwl (Workflow) | foldseek easy-search sub-workflow | retrieve sequence from blastdbcmd result. makeblastdb: ../Tools/14_makeblastdb.cwl; blastdbcmd: ../Tools/15_blastdbcmd.cwl; seqretsplit: ../Tools/16_seqretsplit.cwl; needle (global alignment): ../Tools/17_needle.cwl; water (local alignment): ../Tools/17_water.cwl |
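The two column-extraction steps are described as an `awk -> sort -> uniq` pipeline over a TSV file. As a rough illustration of that idea (not the tool's actual implementation), the sketch below pulls one 1-based column out of tab-separated text and returns its sorted unique values. The function name `extract_unique_ids`, the sample rows, and the assumption that the file has no header line are all hypothetical.

```python
def extract_unique_ids(tsv_text, column_number):
    """Return the sorted unique values found in a 1-based column of TSV text."""
    ids = set()
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        # Skip lines that are too short or have an empty value in that column.
        if len(fields) >= column_number and fields[column_number - 1]:
            ids.add(fields[column_number - 1])
    return sorted(ids)


# Hypothetical two-column sample: query ID, then hit ID (placeholder names).
sample_tsv = "QUERY_A\tHIT_X\nQUERY_A\tHIT_Y\nQUERY_B\tHIT_X\n"

# Column numbers follow the documented defaults: 1 for query, 2 for hit.
query_ids = extract_unique_ids(sample_tsv, 1)
hit_ids = extract_unique_ids(sample_tsv, 2)

if __name__ == "__main__":
    print(query_ids)  # ['QUERY_A', 'QUERY_B']
    print(hit_ids)    # ['HIT_X', 'HIT_Y']
```

Note that Python sorts strings by code point, whereas the shell `sort` command depends on the locale, so the ordering can differ for mixed-case or non-ASCII identifiers.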
Outputs
ID | Type | Label | Doc |
---|---|---|---|
DIR1 | Directory | directory (seqretsplit query species) | directory for seqretsplit query species. |
DIR2 | Directory | directory (seqretsplit hit species) | directory for seqretsplit hit species. |
DIR3 | Directory | needle result directory | needle (global alignment) result directory. |
DIR4 | Directory | water result directory | water (local alignment) result directory. |
IDLIST1 | File | output file (extract query species column) | extract query species column UniProt ID list file. |
IDLIST2 | File | output file (extract hit species column) | extract hit species column UniProt ID list file. |
LOGFILE1 | File | logfile (blastdbcmd query species) | logfile for blastdbcmd query species. |
LOGFILE2 | File | logfile (blastdbcmd hit species) | logfile for blastdbcmd hit species. |
TSVFILE1 | File [TSV] | output file (foldseek easy-search) | output file for foldseek easy-search all hit result. |
TSVFILE2 | File [TSV] | output file (extract target species) | extract target species foldseek result file. (in this workflow, human result only) |
TSVFILE3 | File [TSV] | output file (togoid convert) | output file for togoid convert. |
INDEX_DIR1 | Directory | index directory (query species) | index directory for query species. |
INDEX_DIR2 | Directory | index directory (hit species) | index directory for hit species. |
FASTA_FILES1 | File[] [FASTA] | split fasta files (seqretsplit query species) | split fasta files using seqretsplit for pairwise sequence alignment. |
FASTA_FILES2 | File[] [FASTA] | split fasta files (seqretsplit hit species) | split fasta files using seqretsplit for pairwise sequence alignment. |
INDEX_FILES1 | File | index files (query species) | index files for query species. |
INDEX_FILES2 | File | index files (hit species) | index files for hit species. |
REPORT_NOTEBOOK | File | output notebook (papermill) | output notebook using papermill. notebook name is `plant2human_report.ipynb`. |
WATER_RESULT_FILE | File[] | water result file (.water) | water (local alignment) result files. suffix is .water. |
BLASTDBCMD_RESULT1 | File [FASTA] | blastdbcmd result (query species) | blastdbcmd result file for query species. |
BLASTDBCMD_RESULT2 | File [FASTA] | blastdbcmd result (hit species) | blastdbcmd result file for hit species. |
NEEDLE_RESULT_FILE | File[] | needle result file (.needle) | needle (global alignment) result files. suffix is .needle. |
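CWL runners such as cwltool usually report results as a JSON output object keyed by output ID, where each `File` or `Directory` value is an object with a `class` and a `path` or `location`, and `File[]` outputs are lists of such objects. Assuming that shape, the sketch below gathers the paths for each output ID in the table. The function names and the sample paths are hypothetical and only illustrate how the outputs might be inspected after a run.

```python
def collect_paths(value):
    """Recursively gather path (or location) strings from CWL File/Directory objects."""
    if isinstance(value, dict) and value.get("class") in ("File", "Directory"):
        found = value.get("path") or value.get("location")
        return [found] if found else []
    if isinstance(value, list):
        paths = []
        for item in value:
            paths.extend(collect_paths(item))
        return paths
    return []


def summarize_outputs(output_object):
    """Map each output ID to the list of paths it refers to."""
    return {
        output_id: collect_paths(value)
        for output_id, value in output_object.items()
    }


# Hypothetical excerpt of an output object; all paths are made up.
example_outputs = {
    "TSVFILE1": {"class": "File", "path": "/out/foldseek_result.tsv"},
    "REPORT_NOTEBOOK": {"class": "File", "path": "/out/plant2human_report.ipynb"},
    "WATER_RESULT_FILE": [
        {"class": "File", "path": "/out/pair_1.water"},
        {"class": "File", "path": "/out/pair_2.water"},
    ],
    "DIR3": {"class": "Directory", "location": "file:///out/needle_results"},
}

summary = summarize_outputs(example_outputs)

if __name__ == "__main__":
    for output_id, paths in summary.items():
        print(output_id, paths)
```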
https://w3id.org/cwl/view/git/4fc824b36b986af931cc136be7a91355b772b39b/Workflow/plant2human.cwl