hoelzer-lab / ribap

A comprehensive bacterial core gene-set annotation pipeline based on Roary and pairwise ILPs

License: GNU General Public License v3.0

Python 3.88% Shell 0.57% Nextflow 0.99% HTML 94.56%

ribap's People

Contributors

hoelzer, klamkiew, lisa-mariebarf, marielataretu

ribap's Issues

Prokka fails with input genomes containing a dot ('.') in their file names

Chlamydia_muridarum_strain_Nigg3_full_genome.fna                             Chlamydia_muridarum_str._Nigg3_CMUT3-5_strain_Nigg3_CMUT3-5_full_genome.fna  Chlamydia_muridarum_str._Nigg_strain_Nigg_full_genome.fna
Chlamydia_muridarum_str._Nigg_2_MCR_strain_Nigg_2_MCR_full_genome.fna        Chlamydia_muridarum_str._Nigg_CM972_strain_CM972_full_genome.fna

Renaming the files and FASTA records works fine. However, the Prokka output directory name is cut off at 'str.', the dot is removed, and we end up with Chlamydia_muridarum_str_RENAMED/ several times, which leads to the following error:

Error executing process > 'roary (5)'

Caused by:
  Process `roary` input file name collision -- There are multiple input files for each of the following file names: Chlamydia_muridarum_str_RENAMED.gff


Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
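
One possible workaround, sketched in Python under the assumption that the collision arises because everything after the first dot in the name is cut off: replace the dots in the file stem before the genomes reach Prokka. The helper name safe_prefix is hypothetical and not part of RIBAP.

```python
from pathlib import Path


def safe_prefix(filename: str) -> str:
    """Return the file stem with every '.' replaced by '_' (hypothetical helper)."""
    stem = Path(filename).stem  # drops only the final extension, e.g. '.fna'
    return stem.replace(".", "_")


genomes = [
    "Chlamydia_muridarum_strain_Nigg3_full_genome.fna",
    "Chlamydia_muridarum_str._Nigg3_CMUT3-5_strain_Nigg3_CMUT3-5_full_genome.fna",
    "Chlamydia_muridarum_str._Nigg_strain_Nigg_full_genome.fna",
    "Chlamydia_muridarum_str._Nigg_2_MCR_strain_Nigg_2_MCR_full_genome.fna",
    "Chlamydia_muridarum_str._Nigg_CM972_strain_CM972_full_genome.fna",
]
prefixes = [safe_prefix(g) for g in genomes]

# All prefixes must stay distinct, otherwise Roary would again see colliding names.
assert len(set(prefixes)) == len(genomes)
```

With this renaming, the five genomes above keep five distinct prefixes that no longer contain a dot.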

IQ-Tree update

There is a newer and faster version of RAxML:
https://github.com/amkozlov/raxml-ng

https://doi.org/10.1093/bioinformatics/btz305

I replaced

raxmlHPC-PTHREADS-SSE3 -T ${task.cpus} -f a -x 1234 -p 1234 -s coreGenome_mafft.aln -n aa -m PROTGAMMAWAG -N 100

with

raxml-ng --all --threads ${task.cpus} --msa coreGenome_mafft.aln --prefix coreGenome_mafft --msa-format FASTA --model PROTGTR+G --bs-trees 100 

in the raxml.nf Nextflow module using

conda install -c bioconda raxml-ng=0.9.0

The output looks like this (the .support file is the tree that also holds the bootstrap values):

coreGenome_mafft.raxml.bestModel
coreGenome_mafft.raxml.bestTree
coreGenome_mafft.raxml.bootstraps
coreGenome_mafft.raxml.log
coreGenome_mafft.raxml.mlTrees
coreGenome_mafft.raxml.rba
coreGenome_mafft.raxml.reduced.phy
coreGenome_mafft.raxml.startTree
coreGenome_mafft.raxml.support

@klamkiew I think it's fairly easy to also update this in the shell script for a next release

Start ribap from own annotation

If you already have a high-quality annotation of your genomes, it should be possible to skip the Prokka annotation and start the pipeline directly from your own annotations.

filter mmseqs2 output?

Currently, it seems that the mmseqs2 output is not filtered the way the blastp output was before.

HTML table searchable

Problem: non-visible elements in the HTML table cannot be searched.

Asked some web designer guy:

Hi,
I haven't done much with DataTables myself yet. If the JS framework can't do this out of the box, it will probably be hard to add it yourself.
At least in their own docs it doesn't work either: https://datatables.net/examples/api/row_details.html
If you expand the first row there and search for the number, it finds nothing.

But maybe give this a try:
https://www.datatables.net/examples/basic_init/hidden_columns.html

You could pass this when initializing the table, maybe it helps:

"columnDefs": [
            {
                "searchable": true
            }
        ]

Otherwise, ask in their forum, maybe they have an idea: https://datatables.net/forums/

Bash Install Script

I was able to run the INSTALL.sh script on my local Linux system with the following adjustments:

I manually soft-linked pip to pip3 because the install script is checking for pip3 explicitly

I installed prokka via conda because the install script was failing (maybe due to the tbl2asn binary)

I manually ran

cpanm  Array::Utils Bio::Perl Exception::Class File::Basename File::Copy File::Find::Rule File::Grep File::Path File::Slurper File::Spec File::Temp File::Which FindBin Getopt::Long Graph Graph::Writer::Dot List::Util Log::Log4perl Moose Moose::Role Text::CSV PerlIO::utf8_strict Devel::OverloadInfo Digest::MD5::File 

after the Roary install failed.

Then everything was working.

Untangle the Nextflow DAG

The Channel handling is currently not optimal and can be improved by relying on IDs for strains and all alignments.

add UpSet visualization as a final step

  • first parse the output of the pipeline according to
    /data/prostlocal2/projects/chlamydia_comparative_study/flamingo_ribap/getsubsets_ribap.py

  • then add UpSetR code

Estimate model per MSA and then design phylogeny run

It makes more sense to estimate a substitution model per single RIBAP group (thus MSA) instead of a single model for the concatenated MSA or just using some default one.

This can be done via IQtree:

iqtree -s $ALN --mem 1G -T AUTO --threads-max 4 -nt 4 -m TEST -pre ${BN}-modeltest

Then, a model is estimated for each single MSA. The model-MSA combinations can then be defined as input for IQ-TREE without the need to concatenate all MSAs into one big MSA.

http://www.iqtree.org/doc/Advanced-Tutorial
check the section "Partitioned analysis with mixed data"

So you write a nexus file with the partition scheme and the alignment files can be specified in there too.
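
A minimal sketch of how such a partition file could be generated, assuming the charset/charpartition layout shown in the IQ-TREE tutorial section mentioned above. The function name, the alignment file names and the models are hypothetical; in practice the models would be taken from the per-MSA model test.

```python
from typing import Dict


def partition_nexus(models: Dict[str, str]) -> str:
    """Build a partition file from {alignment_file: model} (hypothetical helper)."""
    charsets, scheme = [], []
    for i, (aln, model) in enumerate(models.items(), start=1):
        name = f"part{i}"
        # '*' selects all sites of the given alignment file
        charsets.append(f"    charset {name} = {aln}: *;")
        scheme.append(f"{model}:{name}")
    lines = [
        "#nexus",
        "begin sets;",
        *charsets,
        f"    charpartition mine = {', '.join(scheme)};",
        "end;",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: two RIBAP groups with their own models
example = partition_nexus({"group1_mafft.aln": "LG+G4", "group2_mafft.aln": "WAG+I"})
print(example)
```

The resulting text would be written to a .nex file and passed to IQ-TREE in place of a concatenated alignment.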

Disk space

Disk space is a serious issue in the pipeline due to the many ILPs. E.g. I just ran 36 Listeria monocytogenes genomes and ran out of disk space after solving only ~150 of 750 ILPs: the work dir holds ~200 GB.

What can we do about this?

a) Big disclaimer in README and when starting the pipeline, "Attention, RIBAP needs a lot of disk space for the ILPs."
b) Delete ILPs directly after they are solved? Maybe as a default and implementing a switch to keep them for experienced users
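
Option b) could look roughly like the following Python sketch. It is only an illustration: solve stands in for the real solver call, and keep_ilps is a hypothetical switch, not an existing RIBAP parameter.

```python
from pathlib import Path
from typing import Callable, Iterable


def solve_and_clean(ilps: Iterable[Path], solve: Callable[[Path], bool],
                    keep_ilps: bool = False) -> int:
    """Solve each ILP and delete its file afterwards unless keep_ilps is set.

    `solve` is a placeholder for the real solver call and must return True
    on success. Returns the number of ILP files that were removed.
    """
    removed = 0
    for ilp in ilps:
        if solve(ilp) and not keep_ilps:
            ilp.unlink()  # free disk space as soon as the solution exists
            removed += 1
    return removed
```

ILPs whose solver run fails are kept in this sketch, so they remain available for debugging.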

GLPSOL: Multiple use of variable 'x_A1h_A1t' not allowed

Test data: 36 Mycoplasma bovis assemblies

Run on an HPC with LSF (using Docker)

Command:

glpsol --lp ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp --mipgap 0.01 --pcost --cuts --memlim 16834 --tmlim 240 -o solved/DHBIBIFB-vs-FNBNKPCI_0.sol

Error:

Command output:
  GLPSOL: GLPK LP/MIP Solver, v4.65
  Parameter(s) specified in the command line:
   --lp ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp --mipgap 0.01 --pcost --cuts --memlim
   16834 --tmlim 240 -o solved/DHBIBIFB-vs-FNBNKPCI_0.sol
  Reading problem data from 'ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp'...
  ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp:911: multiple use of variable 'x_A1h_A1t' not allowed
  CPLEX LP file processing error

Looking for x_A1h_A1t in the .ilp file:

grep x_A1h_A1t /hps/nobackup2/metagenomics/mhoelzer/nextflow-work-mhoelzer/ribap/8b/9dcf15beab58ae5c3644c9fb48c93c/ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp 
c2849: x_A1t_B849t + x_A1h_A1t = 1
c2850: x_A1h_B849h + x_A1h_A1t = 1
c3719: x_A1h_A1t + x_B849h_B849t <= 1
c14378: y_A1h - y_A1t  + 1764 x_A1h_A1t <= 1764
c14379: y_A1t - y_A1h  + 1764 x_A1h_A1t <= 1764
c16131: x_A1h_A1t - x_A1h_A1t - b_A1 <= 0
x_A1h_A1t

I can't tell if this is fine or not.
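
A quick way to check whether a single constraint lists the same variable twice, which appears to be what the glpsol message complains about. This is a sketch that assumes every constraint sits on one line of the form 'name: expression'; the function name is hypothetical.

```python
import re
from collections import Counter

# Identifiers such as x_A1h_A1t; plain numbers like 1764 do not match.
VARIABLE = re.compile(r"\b[A-Za-z_]\w*")


def repeated_variables(constraint: str) -> list:
    """Return the variables that occur more than once in one 'name: expr' line."""
    _, _, expression = constraint.partition(":")
    counts = Counter(VARIABLE.findall(expression))
    return sorted(name for name, n in counts.items() if n > 1)


lines = [
    "c2849: x_A1t_B849t + x_A1h_A1t = 1",
    "c2850: x_A1h_B849h + x_A1h_A1t = 1",
    "c3719: x_A1h_A1t + x_B849h_B849t <= 1",
    "c14378: y_A1h - y_A1t  + 1764 x_A1h_A1t <= 1764",
    "c14379: y_A1t - y_A1h  + 1764 x_A1h_A1t <= 1764",
    "c16131: x_A1h_A1t - x_A1h_A1t - b_A1 <= 0",
]
for line in lines:
    duplicates = repeated_variables(line)
    if duplicates:
        print(line.split(":")[0], duplicates)
# prints: c16131 ['x_A1h_A1t']
```

Of the grep hits above, only c16131 uses x_A1h_A1t twice within the same constraint.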

IQTree resources

  iqtree -spp core_genome.nex -bb 1000 --threads-max 24 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Command exit status:
  137

I think this is an out-of-RAM issue. We should automatically increase the RAM when the process fails. Otherwise, this is especially annoying when the ILPs are deleted automatically, because -resume will then recalculate everything from the ILP step.
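
The retry rule could follow the usual pattern of scaling the memory with the attempt number. The Python sketch below only models that rule and is not pipeline code; the function name, the base memory and the maximum number of attempts are assumptions.

```python
from typing import Optional

# 137 = 128 + 9 (SIGKILL), typical for a process killed for exceeding its memory
OUT_OF_MEMORY_EXIT = 137


def next_memory_gb(base_gb: int, attempt: int, exit_status: int,
                   max_attempts: int = 3) -> Optional[int]:
    """Return the memory for the next attempt, or None if no retry should happen.

    Memory grows linearly with the attempt number: base_gb * (attempt + 1).
    """
    if exit_status != OUT_OF_MEMORY_EXIT or attempt >= max_attempts:
        return None
    return base_gb * (attempt + 1)


print(next_memory_gb(16, attempt=1, exit_status=137))  # 32
print(next_memory_gb(16, attempt=1, exit_status=134))  # None, not a memory kill
```

Other exit codes, such as an assertion failure, are deliberately not retried here, because more memory would not help.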

Output folder names

atm we have

07-msa 08-mafft 08-msa

  • why two msa folders? can we give them more descriptive names?
  • what's the difference between these three?

Chunk size

We have a new --chunks parameter to split the ILP corpus for faster parallel computing.

However, when the chunk size is too large relative to the number of input genomes, RIBAP crashes. E.g., I tried --chunks 80 for eight input genomes: crash.

We could add a check and a warning. Or even better: we automatically adjust the chunk size when the user defines a value that is too high for the number of input genomes (not sure what would be a good formula here... e.g. --chunks 200 for 167 Klebsiella was fine, ...)
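
One candidate formula, sketched in Python: cap the chunk size at the number of genome pairs, n * (n - 1) / 2. This rests on the assumption that the crash happens when the chunk size exceeds the number of pairwise comparisons; the real number of ILP files may differ, and the function names are hypothetical.

```python
def pairwise_count(n_genomes: int) -> int:
    """Number of unordered genome pairs, n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def adjust_chunks(requested: int, n_genomes: int) -> int:
    """Clamp the requested chunk size to the number of genome pairs (at least 1)."""
    return max(1, min(requested, pairwise_count(n_genomes)))


print(adjust_chunks(80, 8))     # 8 genomes give 28 pairs, so 80 is reduced to 28
print(adjust_chunks(200, 167))  # 167 genomes give 13861 pairs, so 200 is kept
```

Both observations above are consistent with this cap: 80 exceeds the 28 pairs of eight genomes, while 200 is far below the 13861 pairs of 167 genomes.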

non-unique prokka IDs if input sequences are identical

Command error:
  Traceback (most recent call last):
    File "/home/co68mol/ribap/bin/combine_roary_ilp.py", line 435, in <module>
      read_roary_table(sys.argv[2])
    File "/home/co68mol/ribap/bin/combine_roary_ilp.py", line 65, in read_roary_table
      strain = formattedArray[column].replace('"','').strip()
  IndexError: list index out of range

*sigh* To quote a famous German musician: "Es koennt alles so einfach sein..." (It could all be so simple...)

IQ-TREE failing for test data set

Hey, I ran

nextflow run hoelzer-lab/ribap -r dev --fasta '../2022-06-05-cpsittaci-icarus/2022-07-cleaned-data-from-kevin/reassembled-n38/cps_02*.fasta' --cores 4 --max_cores 8 -profile local,docker -w work --output ribap-results --tree --bootstrap 1010 -resume

which worked fine until the IQ-TREE step:

Error executing process > 'iqtree'

Caused by:
  Process `iqtree` terminated with an error exit status (134)

Command executed:

  iqtree -spp core_genome.nex -bb 1010 --threads-max 4 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Command exit status:
  134

Command error:
  OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  ERROR: phylotesting.cpp:593: std::string computeFastMLTree(Params&, Alignment*, ModelCheckpoint&, ModelsBlock*, int&, int, std::string): Assertion `subst_names.size() == rate_names.size()' failed.
  ERROR: STACK TRACE FOR DEBUGGING:
  ERROR: 1   funcAbort()
  ERROR: 2   ()
  ERROR: 3   gsignal()
  ERROR: 4   abort()
  ERROR: 5   ()
  ERROR: 6   computeFastMLTree(Params&, Alignment*, ModelCheckpoint&, ModelsBlock*, int&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  ERROR: 7   runModelFinder(Params&, IQTree&, ModelCheckpoint&)
  ERROR: 8   runPhyloAnalysis(Params&, Checkpoint*, IQTree*&, Alignment*&)
  ERROR: 9   runPhyloAnalysis(Params&, Checkpoint*)
  ERROR: 10   main()
  ERROR: 11   __libc_start_main()
  ERROR: 12   ()
  ERROR: 
  ERROR: *** IQ-TREE CRASHES WITH SIGNAL ABORTED
  ERROR: *** For bug report please send to developers:
  ERROR: ***    Log file: core_genome-modeltest.log
  ERROR: ***    Alignment files (if possible)
  .command.sh: line 2:     9 Aborted                 (core dumped) iqtree -spp core_genome.nex -bb 1010 --threads-max 4 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Could it be that the input genomes are too similar? :D The core gene set comprises 980 MSAs.

I also attach the input FASTAs:
cps-test-fastas.tar.gz

Cluster small processes such as mafft, nw_display, ...

We run a lot of small processes (mafft, fasttree, nw_display) that do not necessarily each need their own job submission when the pipeline is executed on an HPC or in the cloud.

Solution:
Cluster processes together in chunks of e.g. 20 or 50 files and then submit jobs. So instead of submitting 1000 single mafft jobs, submit 20 jobs that each process 50 files.

This should help with latency problems on HPCs.
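
The grouping itself is simple, as this Python sketch shows. The file names are hypothetical, and in the pipeline the batching would be done on the Nextflow channel rather than in a script.

```python
from typing import Iterator, List, Sequence


def batches(files: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` files."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield list(files[start:start + size])


alignments = [f"group{i}.faa" for i in range(1000)]  # hypothetical file names
jobs = list(batches(alignments, 50))
print(len(jobs), len(jobs[0]))  # 20 jobs with 50 files each
```

The last batch may be smaller than the others when the number of files is not a multiple of the batch size.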

Check output combine/ribap_roaryxx_summary.txt file

Hi, I ran ribap via Nextflow the other day. Everything worked well (after I figured out that I have to switch to prost because of Docker :D).
A minor bug I noticed: in the combine/ribap_roaryxx_summary.txt files there is still a dict printed above the actual summary.
