
ebi-metagenomics / pipeline-v5



Home Page: https://www.ebi.ac.uk/metagenomics/

License: Apache License 2.0

Languages: Common Workflow Language 19.51%, Shell 0.88%, Dockerfile 0.55%, Python 28.53%, HTML 47.61%, Perl 2.71%, Roff 0.21%
Topics: common-workflow-language, cwl, workflow, cwl-descriptions, cwl-workflow, mgnify, metagenomic-pipeline, metagenomic-analysis, metagenomic-classification, metagenomic-data

pipeline-v5's Introduction


pipeline-v5

This repository contains all CWL descriptions of the MGnify pipeline version 5.0.

Download the necessary databases

# ---------------- common files:
mkdir ref-dbs && cd ref-dbs
# download silva dbs
mkdir silva_ssu silva_lsu
wget \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/silva_ssu-20200130.tar.gz \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/silva_lsu-20200130.tar.gz 
tar --extract --gzip --file=silva_ssu-20200130.tar.gz --directory=silva_ssu
tar --extract --gzip --file=silva_lsu-20200130.tar.gz --directory=silva_lsu
# download Rfam ribosomal models
mkdir ribosomal
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.cm \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.claninfo \
  -P ribosomal 
  
  
# ----------------- AMPLICON -----------------
mkdir UNITE
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/UNITE-20200214.tar.gz
tar --extract --gzip --file=UNITE-20200214.tar.gz --directory=UNITE

mkdir ITSonedb
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/ITSoneDB-20200214.tar.gz
tar --extract --gzip --file=ITSoneDB-20200214.tar.gz --directory=ITSonedb


# ----------------- WGS -----------------
# rRNA.claninfo
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rRNA.claninfo
# other Rfam models
mkdir other
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/other_models/*.cm" \
  -P other 
# kofam db  
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/db_kofam.hmm.h3?.gz"
# InterProScan
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.36-75.0/interproscan-5.36-75.0.tar.gz
tar --extract --gzip --file=interproscan-5.36-75.0.tar.gz


# ----------------- ASSEMBLY -----------------
# rRNA.claninfo
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rRNA.claninfo
# other Rfam models
mkdir other
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/other_models/*.cm" \
  -P other 
# kofam db  
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/db_kofam.hmm.h3?.gz"
# InterProScan
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.36-75.0/interproscan-5.36-75.0.tar.gz
tar --extract --gzip --file=interproscan-5.36-75.0.tar.gz
# Diamond
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/db_uniref90_result.txt.gz \
    ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/uniref90_v2019_08_diamond-v0.9.25.dmnd.gz
gunzip db_uniref90_result.txt.gz uniref90_v2019_08_diamond-v0.9.25.dmnd.gz
# KEGG pathways
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/graphs.pkl.gz \
   ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/all_pathways_class.txt.gz \
   ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/all_pathways_names.txt.gz
gunzip graphs.pkl.gz all_pathways_class.txt.gz all_pathways_names.txt.gz
# antismash summary
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/antismash_glossary.tsv.gz
gunzip antismash_glossary.tsv.gz
# EggNOG ??
#eggnog-mapper/data/eggnog.db, eggnog-mapper/data/eggnog_proteins.dmnd
# Genome Properties ??
# flatfiles?
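
Once the databases are in place, the workflows can be launched with a CWL runner. A minimal sketch, not an official invocation: the job file name below is hypothetical, while the workflow path is the one referenced elsewhere in this repository (see workflows/ymls/ for the shipped job-file templates).

# sketch: run the raw-reads workflow with cwltool
# my-raw-reads-job.yml is a hypothetical job file you would write yourself
cwltool --outdir results/ workflows/raw-reads-wf--v.5-cond.cwl my-raw-reads-job.yml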

pipeline-v5's People

Contributors

caballero, katesakharova, mb1069, mberacochea, mr-c, mscheremetjew, vkale1


pipeline-v5's Issues

Infernal issue

Greetings,
I used the models that you provide to run cmsearch against my metagenomic data, but I got a lot of ncRNA hits in my sample with each cmsearch model file. I expect my file will lose a lot of sequence information after coordinate masking, and I do not really know what I should do.
If you can give me advice it would really help.
Many thanks,
Kind regards

Update envs

1) Package libtiff conflicts for: libtiff=4.1.0
antismash=4.2.0 -> perl-bioperl -> perl-bioperl-core -> perl-gd -> libgd[version='>=2.2.5,<2.3.0a0'] -> libwebp[version='>=1.0.0,<1.1.0a0'] -> libtiff
....
2) Package libffi conflicts for: libffi=3.3
subprocess32=3.5.4 -> python[version='>=2.7,<2.8.0a0'] -> libffi[version='3.2.*|>=3.2.1,<3.3.0a0|>=3.2.1,<3.3a0|>=3.3,<3.4.0a0']
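
One possible workaround, an untested sketch rather than a confirmed fix, is to solve the most heavily pinned tools in separate conda environments so the libtiff/libffi pins never have to coexist:

# untested sketch: isolate the conflicting pins in their own environments
conda create -n antismash-4.2 -c bioconda -c conda-forge antismash=4.2.0
conda create -n py2-tools -c conda-forge python=2.7 subprocess32=3.5.4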

ribosomal_models not available

It seems that the ribosomal_models are not available from the provided link anymore:

wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/RF*.cm \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.claninfo \
  -P ribosomal 

Error:

--2023-04-03 14:57:04--  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/RF*.cm
           => ‘.listing’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models ... 
No such directory ‘pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models’.

Could you please update the link? We are trying to run the pipeline locally.
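
For reference, the download section above now fetches a consolidated ribo.cm rather than per-family RF*.cm files. If the .../ribosomal_models/ directory has genuinely been removed server-side the command below (taken from the README above) will fail the same way, but it is the first thing to retry:

mkdir ribosomal
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.cm \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.claninfo \
  -P ribosomal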

Bug with chunking run_result_file_chunker.py

Hi,

The following line of code:
pipeline-v5/utils/result-file-chunker/run_result_file_chunker.py:78 --> tool_path = 'gt'

is causing a FileNotFoundError: [Errno 2] No such file or directory: 'gt': 'gt'

Can you suggest a fix for it?

Thanks in advance,
A
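
A possible fix, sketched here with additions of my own (the GT_PATH environment variable and the shutil.which fallback are not part of the repository), is to resolve the GenomeTools binary from the environment instead of hard-coding the bare name:

import os
import shutil

# resolve the GenomeTools 'gt' binary: explicit override first, then a PATH lookup;
# GT_PATH is a hypothetical environment variable introduced by this sketch
tool_path = os.environ.get("GT_PATH") or shutil.which("gt")
if tool_path is None:
    raise FileNotFoundError("GenomeTools 'gt' not found; install it or set GT_PATH")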

Docker and CWL databases type "string"

Running the pipeline using Docker is currently blocked by the database paths being declared as strings instead of Directory|File.

The change of the paths to strings was required due to some issues with Toil and the EBI cluster.

We are currently working to fix this problem and make the pipeline fully compatible with Docker. The current hacky workaround is to patch the docker.py file in cwltool and hardcode the databases' folder path.

The patch only works for Toil[CWL] and cwltool.

Patch for docker.py:

self.append_volume(runtime, "<PATH to DBS>", "/data/databases", writable=False)
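
For example, if the reference databases from the download section live in /data/ref-dbs on the host (an illustrative path, not a fixed location), the patched line would read:

# /data/ref-dbs is an illustrative host path; substitute your own databases folder.
# This mounts it read-only at /data/databases inside the containers.
self.append_volume(runtime, "/data/ref-dbs", "/data/databases", writable=False)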

typo in awk-tool?

Hey,

I am currently working on porting the amplicon workflow to Galaxy.
While building the wrappers for the different tools, I noticed that in pipeline-v5/tools/RNA_prediction/extract-coords/awk_tool there is a 'q' in line 9: awk '{print $1"-"$3"/q"$8"-"$9" "$8" "$9" "$1}' ${INPUT} > ${NAME}".matched_seqs_with_coords"
which does not actually appear in the expected output. Is that just a typo?

Best,
Rand
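
Assuming the 'q' is indeed a stray character (this has not been verified against the pipeline's expected outputs), the corrected line would simply drop it:

awk '{print $1"-"$3"/"$8"-"$9" "$8" "$9" "$1}' ${INPUT} > ${NAME}".matched_seqs_with_coords"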

A number of typos/bugs with suggestions for some

Dear EBI developers,

First, thank you for the great work you're doing on this.

I am using the development branch and trying to run the wgs-single-reads pipeline. The sample input_example file somewhat finishes successfully (it skips some steps), but with my input file I ran into the following issues (my main stumbling block, however, is the last one below):

  1. (a non-issue, really) The input_examples/wgs-single-ERR1995312_small.fastq.gz file is not actually a gzip file.

  2. The following scripts have a wrong shebang (#!/usr/bin/env /hps/nobackup2/production/...) at the top, preventing a successful run:

docker/scripts_python3/count_lines.py
docker/scripts_python3/its-length-new.py

  3. The following has remnants of text from a git merge, preventing a successful docker image build:
    tools/chunks/dna_chunker/Dockerfile

  4. The following seem to use the wrong baseCommand for the Alpine docker image. They refer to bash; changing it to sh works:

 utils/count_lines/count_fastq_exp.cwl
 utils/count_number_lines.cwl

  5. The rfam_models and the (ssu/lsu)_(db/tax/otus) inputs are declared as strings in the YML and subsequent CWL files. Should they be of type File (as in the sample workflows/ymls/amplicon-wf--v.5-cond.yml file) instead? With type string, cwltool runs the Docker image with the absolute path of the DB file, which resides outside the Docker image, passed as a plain string. I changed the type to File and the pipeline then runs successfully through Infernal's cmsearch and the other steps that use these databases.
    I changed the following files:
modified:   tools/RNA_prediction/cmsearch-deoverlap/cmsearch-deoverlap-v0.02.cwl
modified:   tools/RNA_prediction/cmsearch/infernal-cmsearch-v1.1.2.cwl
          (this file further needs its glob changed:)
          - glob: $(inputs.query_sequences.basename).$(inputs.covariance_model_database.basename).cmsearch_matches.tbl
          + glob: $(inputs.query_sequences.basename).$(inputs.covariance_model_database.split('/').slice(-1)[0]).cmsearch_matches.tbl

modified:   tools/RNA_prediction/mapseq/mapseq.cwl
modified:   tools/RNA_prediction/mapseq2biom/mapseq2biom.cwl
modified:   workflows/conditionals/raw-reads/raw-reads-2.cwl
modified:   workflows/raw-reads-wf--v.5-cond.cwl
modified:   workflows/subworkflows/classify-otu-visualise.cwl
modified:   workflows/subworkflows/cmsearch-multimodel-wf.cwl
modified:   workflows/subworkflows/rna_prediction-sub-wf.cwl
  6. My main issue is the input of type Directory at lines 252-260 in the workflows/conditionals/raw-reads/raw-reads-2.cwl file. The pipeline fails with the following error message for this step. At the moment, I don't really have much of a clue how to rectify this.
INFO [step return_tax_dir] start
ERROR Exception on step 'return_tax_dir'
ERROR [step return_tax_dir] Cannot make job: Invalid job input record:
the `dir_list` field is not valid because
  tried array of <Directory> but
    item is invalid because
      is not a dict

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 852, in job
    for newjob in step.iterable:
  File "/.../.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 771, in try_make_job
    for j in jobs:
  File "/home/kibriyam/.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 78, in job
    for j in self.step.job(joborder, output_callback, runtimeContext):
  File "/home/kibriyam/.local/lib/python3.7/site-packages/cwltool/workflow.py", line 443, in job
    runtimeContext,
  File "/.../.local/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 166, in job
    builder = self._init_job(job_order, runtimeContext)
  File "/.../.local/lib/python3.7/site-packages/cwltool/process.py", line 819, in _init_job
    raise WorkflowException("Invalid job input record:\n" + str(err)) from err
cwltool.errors.WorkflowException: Invalid job input record:
the `dir_list` field is not valid because
  tried array of <Directory> but
    item is invalid because
      is not a dict
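
For what it's worth, the error says each item of dir_list "is not a dict": in a CWL job file a Directory input must be written as a map with a class field, not as a plain string path. A minimal sketch, with an illustrative path:

dir_list:
  - class: Directory
    path: /path/to/taxonomy-summary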

Let me know if I can provide something more to help you fix the above.

Thanks,
Ashraf

Go Slim Banding IDs

Hi,

I am trying to implement only the GO-slim part of the pipeline. However, I am a bit confused about it.

Should the metagenomics_go_slim_banding.txt and metagenomics_go_slim_ids.txt files, which are in the tools/Go-Slim directory, be updated? If so, how could I update them, and based on what? Or how could a new one be created? I guess it is input-specific, isn't it? Or should an annotation file (.gaf) be manually created for the gene set of interest?

Thank you.
Best,
Ugur

Taxonomic summary output

A question rather than an issue:

To what classification system do the "#OTU ID" values in column one of the *.merged_SSU.fasta.mseq.tsv taxonomic summaries of the MAPseq output refer? (They are not NCBI TaxIDs, nor are they LTP/SILVA IDs, and the MAPseq documentation was no help...)

Thanks...

Refseq db

Please can I know the version of the SILVA database that was used (e.g. SILVA_138.1_SSURef_NR99.fasta.gz)? Also, what is the source of uniref90_diamond-v0.9.21.dmnd.gz? Is this database generated from NCBI nr for DIAMOND?

Failed Validation

Error.txt

I am following the readme with a basic single-sequence FASTA file and I encounter this error. If you are unable to recreate the error I am happy to provide the files I use.

Due to a CWL version error I did have to change the version from v1.2-dev... to v1.0.

Thank you in advance
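
An alternative to downgrading the descriptors to v1.0, sketched from the flag used in the next issue rather than a verified fix for this particular error, is to keep cwlVersion v1.2-dev and let cwltool accept development versions:

# keep the original cwlVersion and enable development features instead
pip install --upgrade cwltool
cwltool --enable-dev workflows/raw-reads-wf--v.5-cond.cwl my-job.yml   # my-job.yml is hypothetical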

Missing input parameter

Error.txt

After updating my cwltool to v1.2.0-dev2 and enabling development features via cwltool --enable-dev ...

I ran into the attached error. If you are able to advise, that would be great!
