hoelzer-lab / ribap


A comprehensive bacterial core gene-set annotation pipeline based on Roary and pairwise ILPs

License: GNU General Public License v3.0

Python 3.88% Shell 0.57% Nextflow 0.99% HTML 94.56%


ribap's Issues

Deal with fragmented gene annotations from PROKKA

see tseemann/prokka#502 (comment)

Link ILP output instead of copy

The ILPs are getting large, so linking them instead of copying them during the Nextflow publish step makes more sense.

Automatically adjust the height/width of the UpSetR plot

Currently, the pipeline can be resumed (-resume) to adjust the UpSetR plot parameters. It would be nicer to estimate good values from the input data to avoid badly scaled plots.

Run Prokka with reference annotation

Add the possibility to run Prokka with good quality reference genomes to ensure gene naming is consistent.
prokka --proteins *.gbk

Prokka fails with input genomes containing a dot ('.') in their file names

Chlamydia_muridarum_strain_Nigg3_full_genome.fna
Chlamydia_muridarum_str._Nigg_2_MCR_strain_Nigg_2_MCR_full_genome.fna
Chlamydia_muridarum_str._Nigg3_CMUT3-5_strain_Nigg3_CMUT3-5_full_genome.fna
Chlamydia_muridarum_str._Nigg_CM972_strain_CM972_full_genome.fna
Chlamydia_muridarum_str._Nigg_strain_Nigg_full_genome.fna

Renaming the files and FASTA records works fine. However, the Prokka output directory name is cut off at "str.", the dot is removed, and we end up with Chlamydia_muridarum_str_RENAMED/ several times, which leads to the following error:

Error executing process > 'roary (5)'

Caused by:
  Process `roary` input file name collision -- There are multiple input files for each of the following file names: Chlamydia_muridarum_str_RENAMED.gff


Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
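
One way to avoid the collision is to sanitize the prefix before it is handed to Prokka and to fail early if two inputs still map to the same name. The following is only a sketch: safe_prefix and unique_prefixes are hypothetical helper names, and the rule of replacing every run of special characters with a single underscore is an assumption, not what the pipeline currently does.

```python
import re
from pathlib import Path


def safe_prefix(fasta_name: str) -> str:
    """Return a prefix that contains no dots or other special characters."""
    stem = Path(fasta_name).stem  # drops only the final extension, e.g. ".fna"
    # Replace every run of characters other than letters, digits and "-" with "_".
    return re.sub(r"[^A-Za-z0-9-]+", "_", stem).strip("_")


def unique_prefixes(fasta_names):
    """Map each file name to a sanitized prefix; raise if two names collide."""
    seen = {}
    mapping = {}
    for name in fasta_names:
        prefix = safe_prefix(name)
        if prefix in seen:
            raise ValueError(f"{name} and {seen[prefix]} both map to {prefix}")
        seen[prefix] = name
        mapping[name] = prefix
    return mapping
```

With the five file names listed above, every genome keeps a distinct prefix, so no two GFF files would share a name.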

Wrongly assigned orthologous gene

Test data: the 'flamingo' data set

Clearly, mxiA of C. avium should belong to the large group instead of the hypothetical gene.

Align a lot of genomes

https://www.nature.com/articles/s41586-020-2871-y

IQ-Tree update

There is a newer and faster version of RAxML:
https://github.com/amkozlov/raxml-ng

https://doi.org/10.1093/bioinformatics/btz305

I replaced

raxmlHPC-PTHREADS-SSE3 -T ${task.cpus} -f a -x 1234 -p 1234 -s coreGenome_mafft.aln -n aa -m PROTGAMMAWAG -N 100

with

raxml-ng --all --threads ${task.cpus} --msa coreGenome_mafft.aln --prefix coreGenome_mafft --msa-format FASTA --model PROTGTR+G --bs-trees 100

in the raxml.nf Nextflow module using

conda install -c bioconda raxml-ng=0.9.0

The output looks like this (coreGenome_mafft.raxml.support is the tree that also holds the bootstrap values):

coreGenome_mafft.raxml.bestModel
coreGenome_mafft.raxml.bestTree
coreGenome_mafft.raxml.bootstraps
coreGenome_mafft.raxml.log
coreGenome_mafft.raxml.mlTrees
coreGenome_mafft.raxml.rba
coreGenome_mafft.raxml.reduced.phy
coreGenome_mafft.raxml.startTree
coreGenome_mafft.raxml.support

@klamkiew I think it's fairly easy to also update this in the shell script for a next release

Interactive UpSet

https://vdl.sci.utah.edu/upset2/

RIBAP could automatically prepare input data for an interactive UpSet plot.

Start ribap from own annotation

If you already have high-quality annotations for your genomes, it should be possible to skip the Prokka annotation and start the pipeline directly from your own annotations.

filter mmseqs2 output?

Currently, it seems that the mmseqs2 output is not filtered like the blastp output before.

HTML table searchable

Problem: non-visible elements in the HTML table cannot be searched.

I asked a web designer, who replied:

Hi,
I haven't done much with DataTables myself yet. If the JS framework can't do this out of the box, it will probably be hard to add it on top.
At least in their own documentation they can't do it either: https://datatables.net/examples/api/row_details.html
If you expand the first row there and search for the number, it doesn't find anything either.

But maybe give this a try:
https://www.datatables.net/examples/basic_init/hidden_columns.html

You could specify this when initializing the table, maybe it helps:

"columnDefs": [
            {
                "searchable": true
            }
        ]

Otherwise, ask in their forum, maybe they have an idea: https://datatables.net/forums/

Bash Install Script

I was able to run the INSTALL.sh script on my local Linux system with the following adjustments:

I manually soft-linked pip to pip3 because the install script is checking for pip3 explicitly

I installed prokka via conda because the install script was failing (maybe due to the tbl2asn binary)

I manually ran

cpanm  Array::Utils Bio::Perl Exception::Class File::Basename File::Copy File::Find::Rule File::Grep File::Path File::Slurper File::Spec File::Temp File::Which FindBin Getopt::Long Graph Graph::Writer::Dot List::Util Log::Log4perl Moose Moose::Role Text::CSV PerlIO::utf8_strict Devel::OverloadInfo Digest::MD5::File

after the Roary installation failed.

Then everything was working.

Untangle the Nextflow DAG

The Channel handling is currently not optimal and can be improved by relying on IDs for strains and all alignments.

Additional annotation of hypothetical proteins

Add a module for https://github.com/hoelzer-lab/hypro

conda environment of roary is not resolving

see title, I am working on it ;)

add UpSet visualization as a final step

First, parse the output of the pipeline according to
/data/prostlocal2/projects/chlamydia_comparative_study/flamingo_ribap/getsubsets_ribap.py
Then add the UpSetR code.

Estimate model per MSA and then design phylogeny run

It makes more sense to estimate a substitution model per single RIBAP group (thus MSA) instead of a single model for the concatenated MSA or just using some default one.

This can be done via IQtree:

iqtree -s $ALN --mem 1G -T AUTO --threads-max 4 -nt 4 -m TEST -pre ${BN}-modeltest

Then, a model is estimated for each single MSA. Now, the model-MSA combinations can be defined as input for IQ-TREE without the need to concatenate all MSAs into one big MSA.

http://www.iqtree.org/doc/Advanced-Tutorial
check the section "Partitioned analysis with mixed data"

So you write a nexus file with the partition scheme and the alignment files can be specified in there too.
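
A minimal sketch of how such a partition file could be written, assuming the best model per MSA has already been collected into a dictionary. The function name write_partition_nexus, the partition name ribap, and the alignment file names in the usage note are hypothetical. The layout follows the charset/charpartition syntax from the tutorial section mentioned above, where "*" selects all sites of an alignment file.

```python
def write_partition_nexus(models, out_path):
    """Write an IQ-TREE partition file from a {alignment_file: model} mapping."""
    lines = ["#nexus", "begin sets;"]
    assignments = []
    for i, (aln, model) in enumerate(sorted(models.items()), start=1):
        part = f"part{i}"
        # One charset per MSA file; "*" means "use all sites of this file".
        lines.append(f"    charset {part} = {aln}: *;")
        assignments.append(f"{model}:{part}")
    # Assign the estimated model to each partition.
    lines.append(f"    charpartition ribap = {', '.join(assignments)};")
    lines.append("end;")
    text = "\n".join(lines) + "\n"
    with open(out_path, "w") as handle:
        handle.write(text)
    return text
```

For example, write_partition_nexus({"group1.aln": "LG+G4", "group2.aln": "WAG+I"}, "core_genome.nex") produces a file that could then be passed to iqtree via -spp.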

Disk space

Disk space is a serious issue in the pipeline due to the many ILPs. For example, I just ran 36 Listeria monocytogenes genomes and ran out of disk space after solving only ~150 of 750 ILPs: the work dir already holds ~200 GB.

What can we do about this?

a) Big disclaimer in README and when starting the pipeline, "Attention, RIBAP needs a lot of disk space for the ILPs."
b) Delete ILPs directly after they are solved? Maybe as a default and implementing a switch to keep them for experienced users
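
To make the warning in a) concrete, the numbers above can be extrapolated: ~200 GB after ~150 of 750 ILPs suggests roughly 1 TB for the full run, assuming disk usage grows linearly with the number of ILPs (an assumption, since ILP sizes vary). The sketch below shows this estimate together with a free-space check. Both function names are hypothetical and nothing like this exists in the pipeline yet.

```python
import shutil


def estimate_total_gb(used_gb, solved, total):
    """Linearly extrapolate the final disk usage from the ILPs handled so far."""
    if solved <= 0:
        raise ValueError("need at least one processed ILP to extrapolate")
    return used_gb * total / solved


def enough_space(work_dir, needed_gb):
    """Return True if the file system holding work_dir has needed_gb free."""
    free_gb = shutil.disk_usage(work_dir).free / 1e9
    return free_gb >= needed_gb


# Numbers from the Listeria run described above.
needed = estimate_total_gb(used_gb=200, solved=150, total=750)
if not enough_space(".", needed):
    print(f"Attention: RIBAP may need about {needed:.0f} GB for the ILPs.")
```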

GLPSOL: Multiple use of variable 'x_A1h_A1t' not allowed

Test data: 36 Mycoplasma bovis assemblies

Run on an HPC with LSF (using Docker)

Command:

glpsol --lp ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp --mipgap 0.01 --pcost --cuts --memlim 16834 --tmlim 240 -o solved/DHBIBIFB-vs-FNBNKPCI_0.sol

Error:

Command output:
  GLPSOL: GLPK LP/MIP Solver, v4.65
  Parameter(s) specified in the command line:
   --lp ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp --mipgap 0.01 --pcost --cuts --memlim
   16834 --tmlim 240 -o solved/DHBIBIFB-vs-FNBNKPCI_0.sol
  Reading problem data from 'ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp'...
  ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp:911: multiple use of variable 'x_A1h_A1t' not allowed
  CPLEX LP file processing error

Looking for x_A1h_A1t in the .ilp file:

grep x_A1h_A1t /hps/nobackup2/metagenomics/mhoelzer/nextflow-work-mhoelzer/ribap/8b/9dcf15beab58ae5c3644c9fb48c93c/ilp/DHBIBIFB-vs-FNBNKPCI_0.ilp 
c2849: x_A1t_B849t + x_A1h_A1t = 1
c2850: x_A1h_B849h + x_A1h_A1t = 1
c3719: x_A1h_A1t + x_B849h_B849t <= 1
c14378: y_A1h - y_A1t  + 1764 x_A1h_A1t <= 1764
c14379: y_A1t - y_A1h  + 1764 x_A1h_A1t <= 1764
c16131: x_A1h_A1t - x_A1h_A1t - b_A1 <= 0
x_A1h_A1t

I can't tell if this is fine or not.
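
Among the rows returned by grep, c16131 is the only one in which x_A1h_A1t occurs twice, which makes it a plausible candidate for what the LP reader rejects. Whether that row is the line 911 reported by glpsol has not been verified here. A small check like the following could flag such rows before the solver is called. It is a sketch that assumes one "name: expression" constraint per line, as in the grep output, and repeated_variables is a hypothetical helper.

```python
import re
from collections import Counter

# Variable names as they appear in the grep output, e.g. x_A1h_A1t, y_A1h, b_A1.
# Pure numbers such as 1764 do not match because a name must not start with a digit.
VARIABLE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def repeated_variables(constraint_line):
    """Return the variables that occur more than once in one constraint row."""
    _name, sep, expr = constraint_line.partition(":")
    if not sep:
        return []  # not a "name: expression" row, e.g. a variable declaration
    counts = Counter(VARIABLE.findall(expr))
    return sorted(var for var, n in counts.items() if n > 1)
```

Applied to the rows above, only c16131 is reported, with x_A1h_A1t as the repeated variable.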

IQTree resources

  iqtree -spp core_genome.nex -bb 1000 --threads-max 24 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Command exit status:
  137

I think this is an out-of-memory issue. We should automatically increase the RAM when the process fails. Otherwise, a failure is especially annoying when the ILPs are deleted automatically, because -resume will then recompute everything from the ILP step onwards.

supervenn viz

A potential additional viz for core gene sets:
https://github.com/gecko984/supervenn

Output folder names

At the moment we have

07-msa 08-mafft 08-msa

Why are there two msa folders? Can we give them more descriptive names?
What is the difference between these three?

combine ILP results only with 95-roary call

Currently, all Roary results are compared with the ILPs. However, we only use the 95-roary-ilp combination; the others are just "there" without being used at all.

Bakki Genome visualization

This looks useful:
https://github.com/mjsull/chromatiblock#chromatiblock

Chunk size

We have a new --chunks parameter to split the ILP corpus for faster parallel computing.

However, when the chunk size is too large relative to the number of input genomes, RIBAP crashes. For example, I tried --chunks 80 for eight input genomes: crash.

We could add a check and a warning. Or even better: we automatically adjust the chunk size when the user defines something too high in comparison to the number of input genomes (not sure what a good formula would be here; for example, --chunks 200 for 167 Klebsiella genomes was fine).
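
One candidate formula, offered only as a sketch: cap the value at the number of pairwise genome comparisons, n(n-1)/2. This is an assumption about the cause of the crash, not something established in this issue, but it is consistent with both observations: eight genomes give 28 pairs (fewer than 80), while 167 genomes give 13861 pairs (far more than 200). The function names are hypothetical.

```python
def pairwise_comparisons(n_genomes):
    """Number of unordered genome pairs, i.e. n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def adjust_chunks(requested, n_genomes):
    """Clamp the requested --chunks value to the number of pairwise comparisons."""
    pairs = pairwise_comparisons(n_genomes)
    if pairs < 1:
        raise ValueError("at least two input genomes are required")
    adjusted = max(1, min(requested, pairs))
    if adjusted != requested:
        print(f"Warning: --chunks {requested} adjusted to {adjusted} "
              f"for {n_genomes} input genomes.")
    return adjusted
```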

non-unique prokka IDs if input sequences are identical

Command error:
  Traceback (most recent call last):
    File "/home/co68mol/ribap/bin/combine_roary_ilp.py", line 435, in <module>
      read_roary_table(sys.argv[2])
    File "/home/co68mol/ribap/bin/combine_roary_ilp.py", line 65, in read_roary_table
      strain = formattedArray[column].replace('"','').strip()
  IndexError: list index out of range
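
A pre-flight check could flag identical inputs before the annotation step instead of failing later in combine_roary_ilp.py. The sketch below hashes only the sequence content of each input, so files that differ in headers, case or line wrapping are still recognized as identical. Note that contig boundaries are ignored by this simplification. The function names and the idea of running such a check are assumptions, not part of the current pipeline.

```python
import hashlib
from collections import defaultdict


def sequence_digest(fasta_text):
    """Hash the sequence content only, ignoring headers, case and line wrapping."""
    seq = "".join(
        line.strip().upper()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )
    return hashlib.sha256(seq.encode()).hexdigest()


def identical_inputs(fastas):
    """Group the names of inputs ({name: FASTA text}) with identical sequences."""
    groups = defaultdict(list)
    for name, text in fastas.items():
        groups[sequence_digest(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Any non-empty result lists the input files that share the same sequences and would need to be deduplicated or reported to the user.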

Sigh. To quote a famous German musician: "It could all be so simple..."

IQ-TREE failing for test data set

Hey, I ran

nextflow run hoelzer-lab/ribap -r dev --fasta '../2022-06-05-cpsittaci-icarus/2022-07-cleaned-data-from-kevin/reassembled-n38/cps_02*.fasta' --cores 4 --max_cores 8 -profile local,docker -w work --output ribap-results --tree --bootstrap 1010 -resume

which worked fine until the IQ-TREE step:

Error executing process > 'iqtree'

Caused by:
  Process `iqtree` terminated with an error exit status (134)

Command executed:

  iqtree -spp core_genome.nex -bb 1010 --threads-max 4 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Command exit status:
  134

Command error:
  OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  ERROR: phylotesting.cpp:593: std::string computeFastMLTree(Params&, Alignment*, ModelCheckpoint&, ModelsBlock*, int&, int, std::string): Assertion `subst_names.size() == rate_names.size()' failed.
  ERROR: STACK TRACE FOR DEBUGGING:
  ERROR: 1   funcAbort()
  ERROR: 2   ()
  ERROR: 3   gsignal()
  ERROR: 4   abort()
  ERROR: 5   ()
  ERROR: 6   computeFastMLTree(Params&, Alignment*, ModelCheckpoint&, ModelsBlock*, int&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  ERROR: 7   runModelFinder(Params&, IQTree&, ModelCheckpoint&)
  ERROR: 8   runPhyloAnalysis(Params&, Checkpoint*, IQTree*&, Alignment*&)
  ERROR: 9   runPhyloAnalysis(Params&, Checkpoint*)
  ERROR: 10   main()
  ERROR: 11   __libc_start_main()
  ERROR: 12   ()
  ERROR: 
  ERROR: *** IQ-TREE CRASHES WITH SIGNAL ABORTED
  ERROR: *** For bug report please send to developers:
  ERROR: ***    Log file: core_genome-modeltest.log
  ERROR: ***    Alignment files (if possible)
  .command.sh: line 2:     9 Aborted                 (core dumped) iqtree -spp core_genome.nex -bb 1010 --threads-max 4 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Could it be that the input genomes are too similar? :D The core gene set comprises 980 MSAs.

I also attach the input FASTAs:
cps-test-fastas.tar.gz

This should help with latency problems on HPCs.

Check output combine/ribap_roaryxx_summary.txt file

Hi, I started RIBAP via Nextflow the other day. Everything worked well (after I figured out that I have to run it on prost because of Docker :D).
A minor bug I noticed: in the combine/ribap_roaryxx_summary.txt files, a dict is still printed above the actual summary.