
Problems with multithreading (panaroo issue, closed, 3 comments)

thorellk commented on September 8, 2024
Problems with multithreading

from panaroo.

Comments (3)

nzmacalasdair commented on September 8, 2024

Hi Kaisa,

This seems likely to be normal behaviour - panaroo generally tries to make good use of resources provided to it, but some stages of the method (particularly those involving the draft pangenome network) are iterative, and therefore run single-threaded at the moment.

Total runtime depends primarily on the complexity of the draft pangenome graph, which is affected by genome size and number of isolates, as well as other factors. Datasets of > 10K isolates can take a considerable amount of time to run in a single panaroo run.

There are a number of ways to speed this up/make sure the total runtime falls below HPC limits:

  1. The easiest thing to do for large datasets (if you are interested in a core or pan genome alignment) is to separate the pangenome inference step (panaroo) from the gene alignment step: first run panaroo without any gene alignment (i.e. without the -a flag), then run panaroo-msa separately.
  2. Splitting up large datasets into smaller sets, running panaroo on these smaller datasets, and then running panaroo-merge to combine their output should be quicker than running the entire dataset with panaroo. The smaller datasets can be informed by descent from a common ancestor (i.e. clustering), which may make them more interesting to analyse on their own or to compare, but they can also be random. If you are interested in core or pan genome alignments, you can always run panaroo-msa using the output folder from panaroo-merge.
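A minimal Python sketch of the two suggestions above: it randomly partitions the input annotation files into subsets and builds the command lines for the per-subset panaroo runs, the merge, and a later alignment. The function names are hypothetical, and the `-i`, `-o`, `-d`, `-t` and `--aligner` options are assumed from the panaroo documentation rather than taken from this thread, so check them against `--help` for your installed version before running anything.

```python
import random


def split_into_subsets(gff_files, n_subsets, seed=0):
    """Randomly partition the input files into n_subsets groups of near-equal size."""
    files = sorted(gff_files)
    random.Random(seed).shuffle(files)
    return [files[i::n_subsets] for i in range(n_subsets)]


def panaroo_command(gff_files, out_dir, clean_mode="strict", threads=8):
    # No -a flag here: gene alignment is deferred to a separate panaroo-msa run.
    # -i, -o and -t are assumptions; confirm with `panaroo --help`.
    return ["panaroo", "-i", *map(str, gff_files), "-o", str(out_dir),
            "--clean-mode", clean_mode, "-t", str(threads)]


def merge_command(subset_dirs, out_dir, threads=8):
    # -d (list of panaroo output directories) is an assumption;
    # confirm with `panaroo-merge --help`.
    return ["panaroo-merge", "-d", *map(str, subset_dirs),
            "-o", str(out_dir), "-t", str(threads)]


def msa_command(panaroo_out_dir, alignment="core", aligner="mafft", threads=8):
    # Assumes panaroo-msa takes an existing panaroo (or panaroo-merge)
    # output directory via -o; confirm with `panaroo-msa --help`.
    return ["panaroo-msa", "-o", str(panaroo_out_dir), "-a", alignment,
            "--aligner", aligner, "-t", str(threads)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` or written into one SLURM job script per subset, so that the subsets run as independent jobs. Replace the random split with your own clustering if you want lineage-informed subsets.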

Hopefully this helps! Let us know if anything is unclear or if you run into any problems.


thorellk commented on September 8, 2024

Hi!

Thank you for your swift reply! I suspected this was the case and I understand that some steps are inherently hard/impossible to parallelize. Then I guess I should just stay patient and kindly ask the HPC support to extend the duration of my SLURM job :)

Thank you for the tips on how to combine different features of panaroo. I think especially the latter would make sense for me, since I am currently "paying" for quite a lot of unused core hours at the cluster. Are there any drawbacks to using this approach? What if, for example, two homologous genes get clustered in one of the subsets but split in the other? How will panaroo-merge deal with that?


nzmacalasdair commented on September 8, 2024

Depending on cost/computational limits, you may want to consider cancelling the SLURM job and running subsets instead, since datasets with high draft pangenome graph complexity can take weeks to finish in a single run. It may be worth examining the complexity of the pre_filt_graph.gml file if you'd like to have some rough idea of how long it might take.
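One rough way to look at the size of pre_filt_graph.gml is to count its nodes and edges. The sketch below does this with the Python standard library only. It assumes the file uses the layout that networkx writes, with each `node [` and `edge [` opener on its own line. The function name is hypothetical, and treating node count, edge count and mean degree as a proxy for complexity is an assumption rather than a measure the panaroo authors specify.

```python
def gml_size(path):
    """Count node and edge blocks in a GML file without loading the whole graph."""
    nodes = edges = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            token = line.strip()
            if token == "node [":
                nodes += 1
            elif token == "edge [":
                edges += 1
    # Mean degree of an undirected graph: each edge contributes to two nodes.
    mean_degree = 2 * edges / nodes if nodes else 0.0
    return {"nodes": nodes, "edges": edges, "mean_degree": mean_degree}
```

Comparing these numbers between the full dataset and a smaller subset that has already finished gives only a crude indication of relative runtime, not a prediction.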

Running panaroo-merge should produce a very similar graph to the output of panaroo. The speed increase is primarily due to 'parallelising' the initial error-correcting steps on the draft pangenome graph, by running them on multiple smaller graphs instead of a single large, complex graph. The merge process then uses the final graphs from each of the initial runs as starting input, and applies clustering methods similar to those of a normal panaroo run to infer the combined pangenome; it is not just a simple merging of the networks themselves. It's not easy to comment on specific examples, but the process has similar user options to control clustering as panaroo.

As for drawbacks, the most significant is probably the additional user input required to create the data subsets and run multiple panaroo runs. The merge command itself can take some time as well, though typically much less time than running the entire dataset through panaroo. Finally, you should see bigger benefits from using panaroo-merge if you are running panaroo with --clean-mode strict, as that will lead to the biggest differences between the draft and final pangenome networks for each subset.
