Comments (4)
I think we've worked out what the issue is. The hang can be traced back to the loop on lines 509-520: it never checks whether any barcodes are left, so once they run out the program searches the same empty barcodes list forever. The number of barcodes in my parameter list did not exceed the 4.7M limit, however, so it was surprising that we were running out. The culprit ended up being the size of the input genome, which was ~160M bp in length, combined with the fact that I was asking for 400M reads to be simulated.
This was a problem because it would appear that LRSim only uses as many of the simulated DWGSIM reads as there are positions in the genome, storing a mapping of genomic locations to reads in the .fp file (see readToPos and writeToPos). In the whole-genome case this is not an issue, since the genome is about 3B bp in size and most whole-genome simulations at 30X coverage require only about 600M reads, so there are enough reads to go around. In my case, however, since I subsampled the human genome to a total of ~160M bp, there weren't enough reads to meet the ~400M number I was feeding to the -x parameter.
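To make the mismatch concrete, here is a small pre-flight check one could run before starting a simulation. It is only a sketch: `max_usable_reads` and `check_read_budget` are hypothetical helpers, not part of LRSim, and the one-read-per-position ceiling is the behaviour inferred above rather than a documented limit.

```perl
use strict;
use warnings;

# Hypothetical helper: ASSUMING at most one read can be drawn per genomic
# position, the usable read count is capped by the genome size.
sub max_usable_reads {
    my ($genome_bp, $requested_reads) = @_;
    return $requested_reads < $genome_bp ? $requested_reads : $genome_bp;
}

# Hypothetical helper: returns a warning string if the request cannot be
# met under that assumption, or an empty string if it fits.
sub check_read_budget {
    my ($genome_bp, $requested_reads) = @_;
    my $usable = max_usable_reads($genome_bp, $requested_reads);
    return "" if $usable == $requested_reads;
    my $shortfall = $requested_reads - $usable;
    return "Requested $requested_reads reads but only $usable positions are available ($shortfall short)";
}

# Numbers from this thread: ~160M bp subsampled genome, ~400M reads requested.
my $msg = check_read_budget(160_000_000, 400_000_000);
print "$msg\n" if $msg;

# Whole-genome case: ~3B bp and ~600M reads, which fits.
print "whole genome ok\n" if check_read_budget(3_000_000_000, 600_000_000) eq "";
```

With the numbers from my run, the request is 240M reads over what the subsampled genome could supply under this assumption.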
Here is how we think this results in the infinite loop I mentioned earlier. The first thing the loop on line 553 of the main program does is look for a read position at a given genomic location. If all of the reads in the immediate vicinity of that location have already been used, we skip this read and go to the next one (lines 555-556). Since virtually all of the available reads are exhausted once my simulation gets close to 160M reads, almost every iteration of this loop from then on erases a barcode from the available barcodes list without decrementing the read counter, quickly exhausting all remaining barcodes without actually assigning any reads to them. Once all barcodes are used up in this manner, the loop on lines 509-520 runs forever.
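The drain described above can be reproduced with a deliberately simplified toy model. This is not LRSim's code: `simulate_drain` is a hypothetical function, it treats every attempt as consuming one barcode, and it stops when barcodes run out instead of hanging. The read and position counts are scaled down in the same 160 : 400 ratio as my run; the barcode count of 300 is arbitrary.

```perl
use strict;
use warnings;

# Toy model (not LRSim code): each position yields at most one read, and
# every attempt consumes a barcode whether or not a read was assigned.
sub simulate_drain {
    my ($num_positions, $num_barcodes, $reads_requested) = @_;
    my $reads_left     = $reads_requested;
    my $positions_left = $num_positions;
    my $barcodes_left  = $num_barcodes;
    my $wasted         = 0;    # barcodes erased without assigning a read

    while ($reads_left > 0 && $barcodes_left > 0) {
        --$barcodes_left;          # the barcode is erased up front
        if ($positions_left > 0) {
            --$positions_left;     # an unused read was found
            --$reads_left;         # only now does the read counter move
        } else {
            ++$wasted;             # skipped: no unused read available
        }
    }
    return ($reads_left, $barcodes_left, $wasted);
}

my ($reads_left, $barcodes_left, $wasted) = simulate_drain(160, 300, 400);
print "reads still owed: $reads_left, barcodes left: $barcodes_left, wasted: $wasted\n";
```

In this toy run the read counter stalls at 240 while the remaining 140 barcodes are burned without assigning anything, which is the state in which the original barcode-picking loop never terminates.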
Introducing a check such as the following in lines 504-521 at least lets the user know that we have run out of barcodes and exits rather than looping infinitely:
    # Pick a barcode
    my $selectedBarcode;
    {
        my $idx = int(rand($numBarcodes));
        lock($barcodesMutexLock);
        my $wentToZero = 0;    # set once the scan has wrapped around to index 0
        while (1)
        {
            if ($barcodes[$idx] eq "")
            {
                ++$idx;
                if ($idx == $numBarcodes && !$wentToZero) {
                    # Reached the end for the first time: wrap around
                    $idx = 0;
                    $wentToZero = 1;
                } elsif ($idx == $numBarcodes && $wentToZero) {
                    # Reached the end a second time: no barcodes are left
                    &LogAndDie("Reached end of barcodes list. No more barcodes. Last read processed: $readsCountDown. Exiting.");
                }
                next;
            }
            $selectedBarcode = $barcodes[$idx];
            $barcodes[$idx] = "";
            last;
        }
    }
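For comparison, the same idea can be written as a bounded scan that visits each slot at most once. This is a stand-alone sketch under my own assumptions: `pick_barcode` is a hypothetical helper, not a function in LRSim, and it leaves locking to the caller, so in the threaded program it would still need to run while holding `$barcodesMutexLock`.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper (not LRSim's API): scan at most
# scalar(@$barcodes) slots, starting at $start and wrapping with modulo.
# Returns the chosen barcode and blanks its slot, or undef if none remain.
sub pick_barcode {
    my ($barcodes, $start) = @_;
    my $n = scalar @$barcodes;
    for my $offset (0 .. $n - 1) {
        my $idx = ($start + $offset) % $n;
        next if $barcodes->[$idx] eq "";
        my $selected = $barcodes->[$idx];
        $barcodes->[$idx] = "";    # mark as used, same convention as above
        return $selected;
    }
    return;                        # every slot is empty
}

my @barcodes = ("AAAC", "", "GGTT");
my $first  = pick_barcode(\@barcodes, 1);    # slot 1 is empty, so slot 2 is taken
my $second = pick_barcode(\@barcodes, 1);    # wraps around to slot 0
my $third  = pick_barcode(\@barcodes, 1);    # nothing left
print defined $third ? "got $third\n" : "out of barcodes after $first and $second\n";
```

The caller can then decide what to do on `undef`, for example calling `LogAndDie` as in the patch above.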
It would also be great if the number of reads wasn't limited to the size of the input genome, but that would probably require a more extensive rewrite.
from lrsim.
@CodingKaiser thanks for the detailed analysis, that helps other users a lot. For the fix, would you like me to integrate it, or would you prefer to make a pull request? The latter better records your contribution to the code.
Thanks for getting back to me. I can submit a pull request shortly. Is there any chance that the rest of the code will be rewritten to remove the limitation of 1 read per genomic location?
That would require an algorithmic revamp. Would it be possible for you to superimpose two simulated datasets to imitate read duplication?