
Comments (4)

CodingKaiser commented on September 26, 2024

I think we have found the issue. The hang traces back to the loop on lines 509-520: it never checks whether the barcodes have run out, so once they do, the program searches the same empty barcode list for an available barcode forever. The number of barcodes in my parameters did not exceed the 4.7M limit, however, so it was surprising that we were running out. The culprit turned out to be the size of the input genome, which was ~160M bp long, combined with my request to simulate 400M reads.

This is a problem because LRSim appears to use only as many of the simulated DWGSIM reads as there are positions in the genome, storing a mapping of genomic locations to reads in the .fp file (see readToPos and writeToPos). In the whole-genome case this is not an issue: the genome is about 3B bp in size and most whole-genome simulations at 30X coverage require only about 600M reads, so there are enough reads to go around. In my case, however, I had subsampled the human genome to a total of ~160M bp, so there weren't enough reads to meet the ~400M I was passing to the -x parameter.
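The arithmetic can be written as a small pre-flight check. This is a Python sketch rather than LRSim code, and it assumes the behavior described above (at most one usable read per genomic position); the function name `reads_shortfall` is made up for illustration.

```python
def reads_shortfall(genome_bp: int, requested_reads: int) -> int:
    """Reads that cannot be placed, ASSUMING at most one usable read
    per genomic position (the behavior inferred in this comment)."""
    return max(0, requested_reads - genome_bp)


# Whole human genome (~3B bp) with ~600M reads: nothing is left over.
print(reads_shortfall(3_000_000_000, 600_000_000))  # 0

# Subsampled genome (~160M bp) with ~400M reads requested.
print(reads_shortfall(160_000_000, 400_000_000))  # 240000000
```

Under that assumption, roughly 240M of the requested reads could never be assigned in my run.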

Here is how we think this leads to the infinite loop I mentioned earlier. The first thing the loop on line 553 of the main program does is look for a read position at a given genomic location. If all of the reads in the immediate vicinity of that location have already been used, we skip this read and go to the next one (lines 555-556). Since virtually all of the available reads are exhausted once my simulation gets close to 160M reads, almost every iteration of this loop from then on erases a barcode from the available barcodes list without decrementing the read counter, which quickly exhausts the remaining barcodes without assigning any reads to them. Once all barcodes are used up in this manner, the loop on lines 509-520 runs forever.
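To make the mechanism concrete, here is a deliberately simplified Python model of it. It is not the LRSim code: it assumes one read per barcode and one read per position, and `simulate` and its parameters are hypothetical names. The numbers are my case scaled down (160 positions, 470 barcodes, 400 reads requested).

```python
def simulate(num_positions: int, num_barcodes: int, reads_requested: int) -> dict:
    """Toy model: every attempt consumes a barcode, but the read counter
    only goes down when an unused read position is actually found."""
    free_positions = num_positions  # ASSUMPTION: one usable read per position
    barcodes_left = num_barcodes
    reads_left = reads_requested
    wasted = 0
    while reads_left > 0:
        if barcodes_left == 0:
            # The original code has no such check and would spin here forever.
            return {"status": "out_of_barcodes",
                    "reads_placed": reads_requested - reads_left,
                    "barcodes_wasted": wasted}
        barcodes_left -= 1          # barcode is erased from the list up front
        if free_positions == 0:
            wasted += 1             # no read found: counter is NOT decremented
            continue
        free_positions -= 1
        reads_left -= 1
    return {"status": "ok",
            "reads_placed": reads_requested,
            "barcodes_wasted": wasted}


print(simulate(num_positions=160, num_barcodes=470, reads_requested=400))
print(simulate(num_positions=3000, num_barcodes=470, reads_requested=400))
```

In the first call the model places 160 reads, then burns the remaining 310 barcodes without placing anything and reports `out_of_barcodes`; in the second call there are enough positions, so all 400 reads are placed and no barcode is wasted.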

Introducing a check such as the following in lines 504-521 at least lets the user know that we have run out of barcodes and exits rather than looping infinitely:

#Pick a barcode
my $selectedBarcode;
{
  my $idx = int(rand($numBarcodes));
  lock($barcodesMutexLock);
  my $wentToZero = 0;  # set once the scan has wrapped around to index 0
  while(1)
  {
    if($barcodes[$idx] eq "")
    {
      # Slot already used: advance, wrapping around at most once
      ++$idx;
      if($idx == $numBarcodes) {
        # Reaching the end a second time means the wrapped scan found nothing free
        &LogAndDie("Reached end of barcodes list. No more barcodes. Last read processed: $readsCountDown. Exiting.") if $wentToZero;
        $idx = 0;
        $wentToZero = 1;
      }
      next;
    }
    $selectedBarcode = $barcodes[$idx];
    $barcodes[$idx] = "";  # mark the barcode as taken
    last;
  }
}

It would also be great if the number of reads wasn't limited to the size of the input genome, but that would probably require a more extensive rewrite.

from lrsim.

aquaskyline commented on September 26, 2024

@CodingKaiser thanks for the detailed analysis; it will help other users a lot. For the fix, would you like me to integrate it, or would you prefer to open a pull request? The latter records your contribution to the code better.


CodingKaiser commented on September 26, 2024

Thanks for getting back to me. I can submit a pull request shortly. Is there any chance that the rest of the code will be rewritten to remove the limitation of 1 read per genomic location?


aquaskyline commented on September 26, 2024

That would require an algorithmic revamp. Would it be possible for you to superimpose two simulated datasets to imitate read duplication?
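One possible reading of "superimpose" is to concatenate the FASTQ output of two independently simulated runs, giving the read names of each run a distinct prefix so they do not collide. The Python sketch below illustrates only that idea. It assumes plain 4-line FASTQ records, and `tag_fastq`, `superimpose`, and the `runA_`/`runB_` prefixes are hypothetical names, not part of LRSim.

```python
from typing import Iterable, Iterator, List


def tag_fastq(lines: Iterable[str], prefix: str) -> Iterator[str]:
    """Yield FASTQ lines with `prefix` inserted after the '@' of each header.
    ASSUMPTION: strict 4-line records (no wrapped sequence lines)."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"expected a FASTQ header, got: {line!r}")
            yield "@" + prefix + line[1:]
        else:
            yield line


def superimpose(run_a: Iterable[str], run_b: Iterable[str]) -> List[str]:
    """Concatenate two runs, keeping read names unique across runs."""
    return list(tag_fastq(run_a, "runA_")) + list(tag_fastq(run_b, "runB_"))


run1 = ["@read1/1", "ACGT", "+", "IIII"]
run2 = ["@read1/1", "TTGA", "+", "IIII"]
for out_line in superimpose(run1, run2):
    print(out_line)
```

The prefix goes at the start of the name so that any `/1` or `/2` mate suffix stays at the end. The same prefix would have to be applied to both mate files of a run to keep pairs matched.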

