Workflow: plant2human workflow

Fetched 2024-11-26 05:15:05 GMT

"Novel gene discovery workflow by comparing plant species and human based on structural similarity search."


Inputs

ID Type Title Doc
EVALUE Double e-value (foldseek easy-search)

e-value threshold for foldseek easy-search. workflow default: 0.1

THREADS Integer threads (foldseek easy-search)

threads for foldseek easy-search. default: 16

ROUTE_DATASET String route dataset (togoid convert)

route dataset for togoid convert. This operation selects the UniProt ID of the target species (human) for which cross-references exist (final destination is HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol

FOLDSEEK_INDEX File foldseek index file

foldseek index file for foldseek easy-search input. This index file can be retrieved by executing the `foldseek databases` command. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8`

INPUT_DIRECTORY Directory input structure file directory

directory of query protein structure files (default format: mmCIF) used as foldseek easy-search input.

TAXONOMY_ID_LIST String taxonomy id list (foldseek easy-search)

comma-separated list of taxonomy ids. Be sure to include “9606” (human). default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon)

OUTPUT_FILE_NAME1 String [File name] output file name (foldseek easy-search)

output file name for foldseek easy-search result. Currently, this workflow only supports TSV file output.

OUTPUT_FILE_NAME2 String [File name] output file name (extract target species)

output file name for extract target species (default: human) python script.

OUTPUT_FILE_NAME3 String [File name] output file name (togoid convert)

output file name for togoid convert python script. default: foldseek_hit_species_togoid_convert.tsv

OUT_NOTEBOOK_NAME String [File name] output notebook name (papermill)

output notebook name for papermill. Once the workflow has produced the notebook, it can be freely customized, for example by changing parameter values. default: plant2human_report.ipynb

FILE_MATCH_PATTERN String file match pattern

file match pattern for listing input files. default: *.cif

SPLIT_MEMORY_LIMIT String split memory limit (foldseek easy-search)

split memory limit for foldseek easy-search. default: 120G

QUERY_GENE_LIST_TSV File [TSV] query gene list tsv (papermill)

query gene list tsv file. Retrieve files in advance. default: rice random gene list

QUERY_IDMAPPING_TSV File [TSV] query idmapping tsv (papermill)

query idmapping tsv file. Retrieve files in advance. default: rice UniProt ID mapping file

OUTPUT_FILE_NAME_HIT_SPECIES String [File name] output file name (extract hit species column)

output file name for extract hit species column python script. default: foldseek_result_hit_species.txt

WF_COLUMN_NUMBER_HIT_SPECIES Integer column number of hit species

column number of hit species. default: 2 (UniProt ID list)

OUTPUT_FILE_NAME_QUERY_SPECIES String [File name] output file name (extract query species column)

output file name for extract query species column python script. default: foldseek_result_query_species.txt

WF_COLUMN_NUMBER_QUERY_SPECIES Integer column number of query species

column number of query species. default: 1 (UniProt ID list)

SW_INPUT_FASTA_FILE_HIT_SPECIES File [FASTA] input fasta file (for blastdbcmd)

input fasta file for blastdbcmd. Retrieve files in advance. default: human UniProt FASTA file

SW_INPUT_FASTA_FILE_QUERY_SPECIES File [FASTA] input fasta file (for blastdbcmd)

input fasta file for blastdbcmd. Retrieve files in advance. default: rice UniProt FASTA file
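The inputs listed above can be collected into a CWL job parameter file. The following is a minimal sketch, not the authors' own job file: scalar values are the defaults documented above where one is given, while every file and directory path, and the two output names flagged in the comments, are hypothetical placeholders. File and Directory inputs use the standard CWL object form with `class` and `path`.

```python
import json


def cwl_file(path):
    """Return a CWL File input object for the given path."""
    return {"class": "File", "path": path}


def cwl_dir(path):
    """Return a CWL Directory input object for the given path."""
    return {"class": "Directory", "path": path}


job = {
    # Documented defaults from the Inputs table.
    "EVALUE": 0.1,
    "THREADS": 16,
    "ROUTE_DATASET": "uniprot,ensembl_protein,ensembl_transcript,"
                     "ensembl_gene,hgnc,hgnc_symbol",
    "TAXONOMY_ID_LIST": "9606,10090,3702,4577,4529",
    "OUTPUT_FILE_NAME3": "foldseek_hit_species_togoid_convert.tsv",
    "OUT_NOTEBOOK_NAME": "plant2human_report.ipynb",
    "FILE_MATCH_PATTERN": "*.cif",
    "SPLIT_MEMORY_LIMIT": "120G",
    "OUTPUT_FILE_NAME_HIT_SPECIES": "foldseek_result_hit_species.txt",
    "WF_COLUMN_NUMBER_HIT_SPECIES": 2,
    "OUTPUT_FILE_NAME_QUERY_SPECIES": "foldseek_result_query_species.txt",
    "WF_COLUMN_NUMBER_QUERY_SPECIES": 1,
    # Hypothetical names: no default is documented for these two inputs.
    "OUTPUT_FILE_NAME1": "foldseek_output.tsv",
    "OUTPUT_FILE_NAME2": "foldseek_output_human.tsv",
    # Hypothetical paths: replace with files retrieved in advance.
    "FOLDSEEK_INDEX": cwl_file("index_swissprot/swissprot"),
    "INPUT_DIRECTORY": cwl_dir("query_structures"),
    "QUERY_GENE_LIST_TSV": cwl_file("rice_random_gene_list.tsv"),
    "QUERY_IDMAPPING_TSV": cwl_file("rice_uniprot_idmapping.tsv"),
    "SW_INPUT_FASTA_FILE_HIT_SPECIES": cwl_file("human_uniprot.fasta"),
    "SW_INPUT_FASTA_FILE_QUERY_SPECIES": cwl_file("rice_uniprot.fasta"),
}

# Serialize to JSON; CWL runners accept job files in JSON as well as YAML.
job_json = json.dumps(job, indent=2)
```

Saving `job_json` to a file such as `job.json` (a hypothetical name) would allow a CWL runner to be invoked with the workflow document and that file, for example `cwltool Workflow/plant2human.cwl job.json`.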

Steps

ID Runs Label Doc
papermill
../Tools/19_papermill.cwl (CommandLineTool)
papermill execution

papermill execution for plant2human notebook report

togoid_convert
../Tools/18_togoid_convert.cwl (CommandLineTool)
togoid convert

togoid convert using TOGO ID API. see article: doi:10.1093/bioinformatics/btac491

extract_target_species
../Tools/12_extract_target_species.cwl (CommandLineTool)
extract target species

extract target species from foldseek easy-search result using python script. python script: ../scripts/extract_target_species.py

extract_hit_species_column
../Tools/13_extract_id.cwl (CommandLineTool)
extract result

extract result from tsv file based on taxonomy id (9606): awk -> sort -> uniq -> redirect to uniprot_id.txt

extract_query_species_column
../Tools/13_extract_id.cwl (CommandLineTool)
extract result

extract result from tsv file based on taxonomy id (9606): awk -> sort -> uniq -> redirect to uniprot_id.txt

sub_workflow_foldseek_easy_search foldseek easy-search workflow

foldseek easy-search workflow: listing files and foldseek easy-search process
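The two parts of this sub-workflow, listing input files and running foldseek easy-search, can be sketched as below. This is an illustration under assumptions, not the tool definition used by the workflow: the helper names are hypothetical, the actual command line of the underlying CWL tool is not shown on this page, and the query is passed as a directory for simplicity. The options `-e`, `--threads`, `--split-memory-limit` and `--taxon-list` are foldseek options, filled here with the documented input defaults.

```python
import fnmatch


def list_structure_files(names, pattern="*.cif"):
    """Return the file names matching FILE_MATCH_PATTERN, sorted."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


def build_easy_search_command(input_dir, index_prefix, output_tsv,
                              tmp_dir="tmp", evalue=0.1, threads=16,
                              split_memory_limit="120G",
                              taxon_list="9606,10090,3702,4577,4529"):
    """Assemble a foldseek easy-search argument list (not executed here)."""
    return [
        "foldseek", "easy-search",
        input_dir,       # query structures (INPUT_DIRECTORY)
        index_prefix,    # target database (FOLDSEEK_INDEX)
        output_tsv,      # result file (OUTPUT_FILE_NAME1)
        tmp_dir,         # temporary working directory
        "-e", str(evalue),
        "--threads", str(threads),
        "--split-memory-limit", split_memory_limit,
        "--taxon-list", taxon_list,
    ]


# Hypothetical file names, used only to show the pattern filter.
files = list_structure_files(["b.cif", "notes.txt", "a.cif"])
command = build_easy_search_command(
    "query_structures", "index_swissprot/swissprot", "foldseek_output.tsv"
)
```

The argument list could be handed to `subprocess.run(command, check=True)` on a machine where foldseek and the index are available.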

sub_workflow_retrieve_sequence_query_species foldseek easy-search sub-workflow

retrieve sequence from blastdbcmd result
makeblastdb: ../Tools/14_makeblastdb.cwl
blastdbcmd: ../Tools/15_blastdbcmd.cwl
seqretsplit: ../Tools/16_seqretsplit.cwl
needle (Global alignment): ../Tools/17_needle.cwl
water (Local alignment): ../Tools/17_water.cwl
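The extract_hit_species_column and extract_query_species_column steps above are described as an awk -> sort -> uniq pipeline over one column of the TSV result. A rough Python equivalent is sketched below, assuming a tab-separated file and a 1-based column number as in WF_COLUMN_NUMBER_HIT_SPECIES and WF_COLUMN_NUMBER_QUERY_SPECIES. The function name and the sample rows are hypothetical, and header handling and any taxonomy filtering done by the real tool are not shown on this page, so they are left out.

```python
def extract_unique_ids(tsv_lines, column_number):
    """Collect the unique values of a 1-based TSV column, sorted."""
    ids = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip short or empty rows instead of raising an IndexError.
        if len(fields) >= column_number and fields[column_number - 1]:
            ids.add(fields[column_number - 1])
    # sorted() orders by code point, which can differ from a
    # locale-aware `sort` in the shell.
    return sorted(ids)


# Made-up rows: column 1 is the query ID, column 2 is the hit ID.
rows = [
    "QUERY_B\tHIT_Y\t9606\n",
    "QUERY_A\tHIT_X\t9606\n",
    "QUERY_A\tHIT_Y\t9606\n",
]
query_ids = extract_unique_ids(rows, 1)
hit_ids = extract_unique_ids(rows, 2)
```

Writing one ID per line to the file named by OUTPUT_FILE_NAME_QUERY_SPECIES or OUTPUT_FILE_NAME_HIT_SPECIES would correspond to the final redirect in the pipeline.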

Outputs

ID Type Label Doc
DIR1 Directory directory (seqretsplit query species)

directory for seqretsplit query species.

DIR2 Directory directory (seqretsplit hit species)

directory for seqretsplit hit species.

DIR3 Directory needle result directory

needle (global alignment) result directory.

DIR4 Directory water result directory

water (local alignment) result directory.

IDLIST1 File output file (extract query species column)

extract query species column UniProt ID list file.

IDLIST2 File output file (extract hit species column)

extract hit species column UniProt ID list file.

LOGFILE1 File logfile (blastdbcmd query species)

logfile for blastdbcmd query species.

LOGFILE2 File logfile (blastdbcmd hit species)

logfile for blastdbcmd hit species.

TSVFILE1 File [TSV] output file (foldseek easy-search)

output file for foldseek easy-search all hit result.

TSVFILE2 File [TSV] output file (extract target species)

extract target species foldseek result file. (in this workflow, human result only)

TSVFILE3 File [TSV] output file (togoid convert)

output file for togoid convert.

INDEX_DIR1 Directory index directory (query species)

index directory for query species.

INDEX_DIR2 Directory index directory (hit species)

index directory for hit species.

FASTA_FILES1 File[] [FASTA] split fasta files (seqretsplit query species)

split fasta files using seqretsplit for pairwise sequence alignment.

FASTA_FILES2 File[] [FASTA] split fasta files (seqretsplit hit species)

split fasta files using seqretsplit for pairwise sequence alignment.

INDEX_FILES1 File index files (query species)

index files for query species.

INDEX_FILES2 File index files (hit species)

index files for hit species.

REPORT_NOTEBOOK File output notebook (papermill)

output notebook using papermill. notebook name is `plant2human_report.ipynb`.

WATER_RESULT_FILE File[] water result file (.water)

water (local alignment) result files. suffix is .water.

BLASTDBCMD_RESULT1 File [FASTA] blastdbcmd result (query species)

blastdbcmd result file for query species.

BLASTDBCMD_RESULT2 File [FASTA] blastdbcmd result (hit species)

blastdbcmd result file for hit species.

NEEDLE_RESULT_FILE File[] needle result file (.needle)

needle (global alignment) result files. suffix is .needle.

Permalink: https://w3id.org/cwl/view/git/4fc824b36b986af931cc136be7a91355b772b39b/Workflow/plant2human.cwl