
ebi-metagenomics / pipeline-v5



Home Page: https://www.ebi.ac.uk/metagenomics/

License: Apache License 2.0

Languages: Common Workflow Language 19.51%, Shell 0.88%, Dockerfile 0.55%, Python 28.53%, HTML 47.61%, Perl 2.71%, Roff 0.21%
Topics: common-workflow-language, cwl, workflow, cwl-descriptions, cwl-workflow, mgnify, metagenomic-pipeline, metagenomic-analysis, metagenomic-classification, metagenomic-data

pipeline-v5's Introduction


pipeline-v5

This repository contains all CWL descriptions of the MGnify pipeline version 5.0.

Download the necessary databases

# ---------------- common files:
mkdir ref-dbs && cd ref-dbs
# download silva dbs
mkdir silva_ssu silva_lsu
wget \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/silva_ssu-20200130.tar.gz \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/silva_lsu-20200130.tar.gz 
tar --extract --gzip --file=silva_ssu-20200130.tar.gz --directory=silva_ssu
tar --extract --gzip --file=silva_lsu-20200130.tar.gz --directory=silva_lsu
# download Rfam ribosomal models
mkdir ribosomal
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.cm \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.claninfo \
  -P ribosomal 
  
  
# ----------------- AMPLICON -----------------
mkdir UNITE
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/UNITE-20200214.tar.gz
tar --extract --gzip --file=UNITE-20200214.tar.gz --directory=UNITE

mkdir ITSonedb
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/ITSoneDB-20200214.tar.gz
tar --extract --gzip --file=ITSoneDB-20200214.tar.gz --directory=ITSonedb


# ----------------- WGS -----------------
# rRNA.claninfo
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rRNA.claninfo
# other Rfam models
mkdir other
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/other_models/*.cm" \
  -P other 
# kofam db  
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/db_kofam.hmm.h3?.gz"
# InterProScan
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.36-75.0/interproscan-5.36-75.0.tar.gz
tar --extract --gzip --file=interproscan-5.36-75.0.tar.gz


# ----------------- ASSEMBLY -----------------
# rRNA.claninfo
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rRNA.claninfo
# other Rfam models
mkdir other
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/other_models/*.cm" \
  -P other 
# kofam db  
wget "ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/db_kofam.hmm.h3?.gz"
# InterProScan
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.36-75.0/interproscan-5.36-75.0.tar.gz
tar --extract --gzip --file=interproscan-5.36-75.0.tar.gz
# Diamond
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/db_uniref90_result.txt.gz \
    ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/uniref90_v2019_08_diamond-v0.9.25.dmnd.gz
gunzip db_uniref90_result.txt.gz uniref90_v2019_08_diamond-v0.9.25.dmnd.gz
# KEGG pathways
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/graphs.pkl.gz \
   ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/all_pathways_class.txt.gz \
   ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/all_pathways_names.txt.gz
gunzip graphs.pkl.gz all_pathways_class.txt.gz all_pathways_names.txt.gz
# antismash summary
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/antismash_glossary.tsv.gz
gunzip antismash_glossary.tsv.gz
# EggNOG ??
#eggnog-mapper/data/eggnog.db, eggnog-mapper/data/eggnog_proteins.dmnd
# Genome Properties ??
# flatfiles?
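
Once the databases are in place, the workflows can be launched with a CWL runner. A minimal sketch, not an official invocation: the job file name below is hypothetical, while the workflow path is the one referenced elsewhere in this repository (see workflows/ymls/ for the shipped job-file templates).

# sketch: run the raw-reads workflow with cwltool
# my-raw-reads-job.yml is a hypothetical job file you would write yourself
cwltool --outdir results/ workflows/raw-reads-wf--v.5-cond.cwl my-raw-reads-job.yml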

pipeline-v5's People

Contributors

caballero, katesakharova, mb1069, mberacochea, mr-c, mscheremetjew, vkale1


pipeline-v5's Issues

Infernal issue

Greetings,
I used the models that you provide to run cmsearch against my metagenomic data, but I got a lot of ncRNA hits in my sample with each cmsearch model file. I expect my file will lose a lot of sequence information after coordinate masking, and I do not really know what I should do.
If you can give me advice it would really help.
Many thanks,
Kind regards

Update envs

1) Package libtiff conflicts for: libtiff=4.1.0
antismash=4.2.0 -> perl-bioperl -> perl-bioperl-core -> perl-gd -> libgd[version='>=2.2.5,<2.3.0a0'] -> libwebp[version='>=1.0.0,<1.1.0a0'] -> libtiff
....
2) Package libffi conflicts for: libffi=3.3
subprocess32=3.5.4 -> python[version='>=2.7,<2.8.0a0'] -> libffi[version='3.2.*|>=3.2.1,<3.3.0a0|>=3.2.1,<3.3a0|>=3.3,<3.4.0a0']
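
One possible workaround, an untested sketch rather than a confirmed fix, is to solve the most heavily pinned tools in separate conda environments so the libtiff/libffi pins never have to coexist:

# untested sketch: isolate the conflicting pins in their own environments
conda create -n antismash-4.2 -c bioconda -c conda-forge antismash=4.2.0
conda create -n py2-tools -c conda-forge python=2.7 subprocess32=3.5.4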

ribosomal_models not available

It seems that the ribosomal_models are not available from the provided link anymore:

wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/RF*.cm \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.claninfo \
  -P ribosomal 

Error:

--2023-04-03 14:57:04--  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/RF*.cm
           => ‘.listing’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models ... 
No such directory ‘pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models’.

Could you please update the link? We are trying to run the pipeline locally.
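
For reference, the download section above now fetches a consolidated ribo.cm rather than per-family RF*.cm files. If the .../ribosomal_models/ directory has genuinely been removed server-side the command below (taken from the README above) will fail the same way, but it is the first thing to retry:

mkdir ribosomal
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.cm \
  ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/rfam_models/ribosomal_models/ribo.claninfo \
  -P ribosomal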

Bug with chunking run_result_file_chunker.py

Hi,

The following line of code:
pipeline-v5/utils/result-file-chunker/run_result_file_chunker.py:78 --> tool_path = 'gt'

is causing a FileNotFoundError: [Errno 2] No such file or directory: 'gt': 'gt'

Can you suggest a fix for it?

Thanks in advance,
A
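
A possible fix, sketched here with additions of my own (the GT_PATH environment variable and the shutil.which fallback are not part of the repository), is to resolve the GenomeTools binary from the environment instead of hard-coding the bare name:

import os
import shutil

# resolve the GenomeTools 'gt' binary: explicit override first, then a PATH lookup;
# GT_PATH is a hypothetical environment variable introduced by this sketch
tool_path = os.environ.get("GT_PATH") or shutil.which("gt")
if tool_path is None:
    raise FileNotFoundError("GenomeTools 'gt' not found; install it or set GT_PATH")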

Docker and CWL databases type "string"

Running the pipeline using Docker is currently blocked by the database paths being declared as strings instead of Directory|File.

The change of the paths to strings was required due to some issues with Toil and the EBI cluster.

We are currently working to fix this problem and make the pipeline fully compatible with Docker. The current hacky workaround is to patch the docker.py file in cwltool and hardcode the databases' folder path.

The patch only works for Toil[CWL] and cwltool.

Patch for docker.py:

self.append_volume(runtime, "<PATH to DBS>", "/data/databases", writable=False)
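
For example, if the reference databases from the download section live in /data/ref-dbs on the host (an illustrative path, not a fixed location), the patched line would read:

# /data/ref-dbs is an illustrative host path; substitute your own databases folder.
# This mounts it read-only at /data/databases inside the containers.
self.append_volume(runtime, "/data/ref-dbs", "/data/databases", writable=False)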

typo in awk-tool?

Hey,

I am currently working on porting the amplicon workflow to Galaxy.
While building the wrappers for the different tools, I noticed that in pipeline-v5/tools/RNA_prediction/extract-coords/awk_tool there is a 'q' in line 9: awk '{print $1"-"$3"/q"$8"-"$9" "$8" "$9" "$1}' ${INPUT} > ${NAME}".matched_seqs_with_coords"
which does not actually appear in the expected output. Is that just a typo?

Best,
Rand
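
Assuming the 'q' is indeed a stray character (this has not been verified against the pipeline's expected outputs), the corrected line would simply drop it:

awk '{print $1"-"$3"/"$8"-"$9" "$8" "$9" "$1}' ${INPUT} > ${NAME}".matched_seqs_with_coords"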

A number of typos/bugs with suggestions for some

Dear EBI developers,

First, thank you for the great work you're doing on this.

I am using the development branch and trying to run the wgs-single-reads pipeline. The sample input_example file somewhat finishes successfully (it skips some steps), but with my input file I ran into the following issues (my main stumbling block, however, is the last one below):

  1. (a non-issue, really) The input_examples/wgs-single-ERR1995312_small.fastq.gz file is not actually a gzip file.

  2. The following scripts have a wrong shebang (#!/usr/bin/env /hps/nobackup2/production/...) at the top, preventing a successful run:

docker/scripts_python3/count_lines.py
docker/scripts_python3/its-length-new.py

  3. The following has remnants of text from a git merge, preventing a successful docker image build:
    tools/chunks/dna_chunker/Dockerfile

  4. The following seem to use the wrong baseCommand for the Alpine docker image. They refer to bash; changing it to sh works:

 utils/count_lines/count_fastq_exp.cwl
 utils/count_number_lines.cwl

  5. The rfam_models and the (ssu/lsu)_(db/tax/otus) inputs are declared as strings in the YML and subsequent CWL files. Should they be of type File (as in the sample workflows/ymls/amplicon-wf--v.5-cond.yml file) instead? With type string, cwltool runs the Docker image with the absolute path of the DB file, which resides outside the Docker image, passed as a plain string. I changed the type to File and the pipeline then runs successfully through Infernal's cmsearch and the other steps that use these databases.
    I changed the following files:
modified:   tools/RNA_prediction/cmsearch-deoverlap/cmsearch-deoverlap-v0.02.cwl
modified:   tools/RNA_prediction/cmsearch/infernal-cmsearch-v1.1.2.cwl
          (this file further needs its glob changed:)
          - glob: $(inputs.query_sequences.basename).$(inputs.covariance_model_database.basename).cmsearch_matches.tbl
          + glob: $(inputs.query_sequences.basename).$(inputs.covariance_model_database.split('/').slice(-1)[0]).cmsearch_matches.tbl

modified:   tools/RNA_prediction/mapseq/mapseq.cwl
modified:   tools/RNA_prediction/mapseq2biom/mapseq2biom.cwl
modified:   workflows/conditionals/raw-reads/raw-reads-2.cwl
modified:   workflows/raw-reads-wf--v.5-cond.cwl
modified:   workflows/subworkflows/classify-otu-visualise.cwl
modified:   workflows/subworkflows/cmsearch-multimodel-wf.cwl
modified:   workflows/subworkflows/rna_prediction-sub-wf.cwl
  6. My main issue is the input of type Directory at lines 252-260 in the workflows/conditionals/raw-reads/raw-reads-2.cwl file. The pipeline fails with the following error message for this step. At the moment, I don't really have much of a clue how to rectify this.
INFO [step return_tax_dir] start
ERROR Exception on step 'return_tax_dir'
ERROR [step return_tax_dir] Cannot make job: Invalid job input record:
the `dir_list` field is not valid because
  tried array of <Directory> but
    item is invalid because
      is not a dict

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 852, in job
    for newjob in step.iterable:
  File "/.../.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 771, in try_make_job
    for j in jobs:
  File "/home/kibriyam/.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 78, in job
    for j in self.step.job(joborder, output_callback, runtimeContext):
  File "/home/kibriyam/.local/lib/python3.7/site-packages/cwltool/workflow.py", line 443, in job
    runtimeContext,
  File "/.../.local/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 166, in job
    builder = self._init_job(job_order, runtimeContext)
  File "/.../.local/lib/python3.7/site-packages/cwltool/process.py", line 819, in _init_job
    raise WorkflowException("Invalid job input record:\n" + str(err)) from err
cwltool.errors.WorkflowException: Invalid job input record:
the `dir_list` field is not valid because
  tried array of <Directory> but
    item is invalid because
      is not a dict
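
For what it's worth, the error says each item of dir_list "is not a dict": in a CWL job file a Directory input must be written as a map with a class field, not as a plain string path. A minimal sketch, with an illustrative path:

dir_list:
  - class: Directory
    path: /path/to/taxonomy-summary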

Let me know if I can provide something more to help you fix the above.

Thanks,
Ashraf

Go Slim Banding IDs

Hi,

I am trying to implement only the GO-slim part of the pipeline. However, I am a bit confused about it.

Should the metagenomics_go_slim_banding.txt and metagenomics_go_slim_ids.txt files, which are in the tools/Go-Slim directory, be updated? If so, how could I update them, and based on what? Or how could a new one be created? I guess it is input-specific, isn't it? Or should an annotation file (.gaf) be manually created for the gene set of interest?

Thank you.
Best,
Ugur

Taxonomic summary output

A question rather than an issue:

To what classification system do the "#OTU ID" values in column one of the *.merged_SSU.fasta.mseq.tsv taxonomic summaries of the MAPseq output refer? (They are not NCBI TaxIDs, nor are they LTP/SILVA IDs, and the MAPseq documentation was no help...)

Thanks...

Refseq db

Please can I know the version of the SILVA database that was used (e.g. SILVA_138.1_SSURef_NR99.fasta.gz)? Also, what is the source of uniref90_diamond-v0.9.21.dmnd.gz? Is this database generated from NCBI nr for DIAMOND?

Failed Validation

Error.txt

I am following the readme with a basic single-sequence FASTA file and I encounter this error. If you are unable to recreate the error I am happy to provide the files I use.

Due to a CWL version error I did have to change the version from v1.2-dev... to v1.0.

Thank you in advance
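
An alternative to downgrading the descriptors to v1.0, sketched from the flag used in the next issue rather than a verified fix for this particular error, is to keep cwlVersion v1.2-dev and let cwltool accept development versions:

# keep the original cwlVersion and enable development features instead
pip install --upgrade cwltool
cwltool --enable-dev workflows/raw-reads-wf--v.5-cond.cwl my-job.yml   # my-job.yml is hypothetical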

Missing input parameter

Error.txt

After updating my cwltool to v1.2.0-dev2 and enabling development features via cwltool --enable-dev ...

I ran into the attached error. If you are able to advise, that would be great!
