
Comments (6)

mrvollger commented on May 25, 2024

Thanks for the info!

This might not be helpful, but I have found that setting this option with up to 8-16 threads can really speed things up!

use rust_htslib::bam;

// stuff reading in a bam file and a header from that bam
// ...
let threads = 16;
let mut out = bam::Writer::from_path(out, &header, bam::Format::Bam).unwrap();
// enable htslib's multi-threaded writing for this BAM writer
out.set_threads(threads).unwrap();

This of course assumes you use Rust, rust-htslib, etc.

But when I use this I can write >10,000 PacBio reads per second.

from hiphase.

holtjma commented on May 25, 2024

I'm not entirely sure what I'm looking at on that top readout. Is the rg command providing sequential timepoints?

Regardless, there is likely some optimization of threads that can happen around all forms of I/O and parallelization. Most internal tests so far have been on 16 threads, and we have not revisited parallelization components probably since proof-of-concept. Historically, they were not the bottlenecks, but we may need to revisit that if further speed improvements get prioritized.


mrvollger commented on May 25, 2024

Ahh sorry. rg is just a grep alternative I like, and it's just searching top for updates with hiphase over a minute or so.

But I was able to remove the need for the BAM with the new haplotag file you made for me, and I am happy with that speed. So feel free to close if you want, or leave it open to bookmark potential future improvements.


holtjma commented on May 25, 2024

v0.10.0 leverages the thread pools provided by htslib. This was the lowest hanging fruit in the short term for optimizing I/O. Internally, we saw about a 40% speedup while haplotagging, although mileage will vary there across systems and depending on contention.


holtjma commented on May 25, 2024

Yeah, this is a bottleneck we're aware of that's specifically related to writing haplotagged files. The phasing itself is parallelized well, but the writing of files is still handled in a single-threaded manner. If you are not writing BAM files, this isn't really an issue because the file sizes are small, but once you start haplotagging, the tool quickly becomes thread and/or I/O bound. Improving this is on our longer-term TODO list.
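As a rough illustration of why a single-threaded write step can cap the benefit of extra threads, here is a minimal sketch based on Amdahl's law. The function name `amdahl_speedup` and the fractions used below are hypothetical, chosen only for illustration; they are not measurements from hiphase.

```rust
/// Upper bound on overall speedup according to Amdahl's law.
///
/// `parallel_fraction` is the share of the single-threaded runtime that can
/// be spread across threads (a hypothetical input, not a hiphase measurement).
/// The remainder (for example, a serial file-writing step) does not speed up.
fn amdahl_speedup(parallel_fraction: f64, threads: u32) -> f64 {
    assert!(
        (0.0..=1.0).contains(&parallel_fraction),
        "parallel_fraction must be between 0 and 1"
    );
    assert!(threads >= 1, "need at least one thread");

    let serial_fraction = 1.0 - parallel_fraction;
    // The serial part keeps its full cost; the parallel part is divided
    // evenly across the available threads.
    1.0 / (serial_fraction + parallel_fraction / f64::from(threads))
}
```

Under these assumptions, if 90% of the runtime parallelizes and 10% is serial writing, `amdahl_speedup(0.9, 16)` gives 6.4 and `amdahl_speedup(0.9, 32)` gives roughly 7.8, so doubling the thread count helps very little until the serial part shrinks.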


mrvollger commented on May 25, 2024

Can confirm that it is much faster without the BAM output file. But FYI, I am still not seeing great utilization across all 32 threads.
image

