count_control error: estimated false positive rate is 0.385 (FPR too high, bailing out!!! about kevlar HOT 5 OPEN

moldach commented on September 8, 2024

count_control error: estimated false positive rate is 0.385 (FPR too high, bailing out!!!

Comments (5)

standage commented on September 8, 2024

Hi @moldach! The workflow is failing at a k-mer counting step due to insufficient memory. There are a few ways to address this.

1) If memory is abundant on your machine, you could simply increase the amount of memory allocated to counting k-mers in each case/proband and control/parent sample.
2) Alternatively, you could use a tool like Lighter to do error correction* on the reads before running Kevlar. The amount of memory required for counting k-mers accurately depends on the number of distinct k-mers in a data set. Sequencing errors often account for the majority of k-mers in a sequencing run, so eliminating those errors will bring the false positive rate down significantly.
3) Another alternative is to increase the tolerance for error (max_fpr) in some samples. I'd recommend limiting the parents' FPRs to the default 0.05, but I've had decent success while relaxing max_fpr to >0.3 for case/proband samples.

None of these solutions is mutually exclusive: you could increase memory AND do error correction AND increase the max_fpr for the controls. 1) and 3) would be the quickest to try, but only if you have access to a machine with sufficient memory. Note that at some steps of the workflow, all case + control + reference k-mer counts are loaded into memory simultaneously. With your current setup, that looks like 16 + (16 + 16) + 12 GB of memory.
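
As a quick sanity check on that arithmetic, here is a minimal Python sketch of the peak-memory estimate. The function name is made up for illustration and is not part of Kevlar; the numbers are the ones mentioned in this thread (16 GB per sample, 12 GB for the reference, and the 80G setting discussed below).

```python
def peak_memory_gb(case_gb, control_gbs, reference_gb):
    """Memory needed when the case, control, and reference k-mer counts
    are all loaded at the same time (hypothetical helper, not a Kevlar API)."""
    return case_gb + sum(control_gbs) + reference_gb


# Setup described above: 16 GB per sample, 12 GB for the reference.
current = peak_memory_gb(16, [16, 16], 12)

# Setting every table to 80 GB, reference included.
everything_80 = peak_memory_gb(80, [80, 80], 80)

print(f"current setup: {current} GB")             # 60 GB
print(f"80G for every table: {everything_80} GB")  # 320 GB
```

The total grows with every table that is enlarged, so it is worth checking the sum against the RAM available on the machine before raising any single value.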

*Error correction for low-coverage reads is challenging, and there were a few instances in which Lighter erroneously "corrected" reads that contained an actual (low coverage) variant rather than a sequencing error. But depending on the constraints of the system to which one has access, missing 1 or 2 variants out of 90-100 is worth the reduction in memory required for k-mer counting.
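
To illustrate why the number of distinct k-mers drives the memory requirement, here is a rough Python sketch using the textbook approximation for a Count-Min/Bloom-style sketch with several equally sized tables. This is not Kevlar's or khmer's actual code: the default of 4 tables and 1 byte per counter, and the example k-mer counts, are assumptions chosen only for illustration.

```python
import math


def estimated_fpr(distinct_kmers, memory_bytes, num_tables=4, bytes_per_counter=1):
    """Approximate false positive rate of a multi-table k-mer sketch.

    Assumes memory is split evenly across `num_tables` tables and that
    k-mers hash uniformly; both are simplifying assumptions.
    """
    bins_per_table = memory_bytes / (num_tables * bytes_per_counter)
    # Expected fraction of occupied bins in one table.
    occupancy = 1.0 - math.exp(-distinct_kmers / bins_per_table)
    # A false positive requires a collision in every table.
    return occupancy ** num_tables


def memory_for_fpr(distinct_kmers, max_fpr, num_tables=4, bytes_per_counter=1):
    """Invert the approximation: bytes needed to stay at or below max_fpr."""
    occupancy = max_fpr ** (1.0 / num_tables)
    bins_per_table = -distinct_kmers / math.log(1.0 - occupancy)
    return bins_per_table * num_tables * bytes_per_counter


if __name__ == "__main__":
    raw, corrected = 6e9, 3e9  # hypothetical distinct k-mer counts
    for label, n in (("raw reads", raw), ("error-corrected reads", corrected)):
        print(f"{label}: FPR at 16 GB = {estimated_fpr(n, 16e9):.3f}, "
              f"memory for FPR 0.05 = {memory_for_fpr(n, 0.05) / 1e9:.1f} GB")
```

Under this approximation, the memory needed for a given FPR scales linearly with the number of distinct k-mers, which is why removing error k-mers or relaxing max_fpr can both rescue a run that would otherwise bail out.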

moldach commented on September 8, 2024

Hi @standage thank you for getting back to me.

I do have abundant memory, so I set every "memory" entry in the config.json file to 80G, as this seems to be how rules are given memory in the Snakefile.

This time the pipeline ran through 45 of 47 steps before failing with the following error:

[Sun Nov  1 17:19:16 2020]
Finished job 11.
45 of 47 steps (96%) done

[Sun Nov  1 17:19:16 2020]
Job 1: Filter calls, compute likelihood scores, and sort calls by score.

Job counts:
	count	jobs
	1	like_scores
	1
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
kevlar --tee --logfile Logs/simlike.log simlike --mu 30.0 --sigma 10.0 --epsilon 0.001 --case-min 6 --refr Reference/refr-counts.smallcounttable --sample-labels Proband Mother Father --out calls.scored.sorted.vcf.gz --controls Sketches/ctrl0-counts.counttable Sketches/ctrl1-counts.counttable --case Sketches/case-counts.counttable calls.0.prelim.vcf.gz calls.1.prelim.vcf.gz calls.2.prelim.vcf.gz calls.3.prelim.vcf.gz calls.4.prelim.vcf.gz calls.5.prelim.vcf.gz calls.6.prelim.vcf.gz calls.7.prelim.vcf.gz calls.8.prelim.vcf.gz calls.9.prelim.vcf.gz calls.10.prelim.vcf.gz calls.11.prelim.vcf.gz calls.12.prelim.vcf.gz calls.13.prelim.vcf.gz calls.14.prelim.vcf.gz calls.15.prelim.vcf.gz
[kevlar] running version 0.7+15.gebabd62
[kevlar::simlike] Loading k-mer counts for each sample
Traceback (most recent call last):
  File "/export/home/moldach/kavlar-test/kevlar-env/bin/kevlar", line 33, in <module>
    sys.exit(load_entry_point('biokevlar==0.7+15.gebabd62', 'console_scripts', 'kevlar')())
  File "/export/home/moldach/kavlar-test/kevlar-env/lib/python3.8/site-packages/kevlar/__main__.py", line 30, in main
    mainmethod(args)
  File "/export/home/moldach/kavlar-test/kevlar-env/lib/python3.8/site-packages/kevlar/simlike.py", line 363, in main
    refr = kevlar.sketch.load(args.refr)
  File "/export/home/moldach/kavlar-test/kevlar-env/lib/python3.8/site-packages/kevlar/sketch.py", line 92, in load
    return loadfunc(filename)
  File "khmer/_oxli/graphs.pyx", line 306, in khmer._oxli.graphs.Hashtable.load
OSError: Error reading from k-mer count file: Reference/refr-counts.smallcounttable Cannot allocate memory
[Sun Nov  1 17:23:40 2020]
Error in rule like_scores:
    jobid: 0
    output: calls.scored.sorted.vcf.gz, Logs/simlike.log

RuleException:
CalledProcessError in line 375 of /gpfs/home/moldach/projects/CG00018/Snakefile:
Command 'set -euo pipefail;  kevlar --tee --logfile Logs/simlike.log simlike --mu 30.0 --sigma 10.0 --epsilon 0.001 --case-min 6 --refr Reference/refr-counts.smallcounttable --sample-labels Proband Mother Father --out calls.scored.sorted.vcf.gz --controls Sketches/ctrl0-counts.counttable Sketches/ctrl1-counts.counttable --case Sketches/case-counts.counttable calls.0.prelim.vcf.gz calls.1.prelim.vcf.gz calls.2.prelim.vcf.gz calls.3.prelim.vcf.gz calls.4.prelim.vcf.gz calls.5.prelim.vcf.gz calls.6.prelim.vcf.gz calls.7.prelim.vcf.gz calls.8.prelim.vcf.gz calls.9.prelim.vcf.gz calls.10.prelim.vcf.gz calls.11.prelim.vcf.gz calls.12.prelim.vcf.gz calls.13.prelim.vcf.gz calls.14.prelim.vcf.gz calls.15.prelim.vcf.gz' returned non-zero exit status 1.
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
  File "/gpfs/home/moldach/projects/CG00018/Snakefile", line 375, in __rule_like_scores
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/home/moldach/miniconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/home/moldach/projects/CG00018/.snakemake/log/2020-10-29T164206.431603.snakemake.log

I'd like to be able to re-submit the job with more memory to finish these last two steps but I'm not sure which part of the config.json I should be adjusting the memory for.

Also, when I tried to re-submit, I got the following error:

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
('/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa', link_reference)
('/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa', link_mask)

standage commented on September 8, 2024

> OSError: Error reading from k-mer count file: Reference/refr-counts.smallcounttable Cannot allocate memory

This means you ran out of memory on the machine: it cannot hold all the k-mer count tables in memory. Using 80GB for the reference and the mask file is overkill. Those can (and according to this error, probably should) be kept at their original values. If you delete the mask and reference counttables/nodetables, you should be able to rebuild them and continue with the workflow without the need to start over again from scratch.
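
A minimal Python sketch of that advice: raise the memory only for the case and control samples, and leave the reference and mask entries alone. The config layout below is hypothetical and may not match the real config.json keys used by the Kevlar workflow, and "32G" is just an example value.

```python
import copy

# Hypothetical layout for illustration only; check your own config.json
# for the actual key names.
config = {
    "samples": {
        "case": {"memory": "16G"},
        "controls": [{"memory": "16G"}, {"memory": "16G"}],
    },
    "reference": {"memory": "12G"},
    "mask": {"memory": "<original value>"},  # placeholder, left untouched
}


def raise_sample_memory(config, new_memory):
    """Return a copy of the config with more memory for case/control samples.

    The reference and mask entries are deliberately not modified.
    """
    updated = copy.deepcopy(config)
    updated["samples"]["case"]["memory"] = new_memory
    for control in updated["samples"]["controls"]:
        control["memory"] = new_memory
    return updated


updated = raise_sample_memory(config, "32G")
print(updated["samples"]["case"]["memory"])  # 32G
print(updated["reference"]["memory"])        # 12G
```

Working on a copy keeps the original settings around, which makes it easy to restore the reference and mask values if they were changed by accident.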

moldach commented on September 8, 2024

Sorry I should have asked earlier.

Was it only "recountmem" which needed to be increased?

Seems like the run completed successfully with your suggestion - so, to confirm, I'll be looking at the result in calls.scored.sorted.vcf.gz?

Thanks

standage commented on September 8, 2024

> Was it only recountmem which needed to be increased?

Actually, not much memory is required for recounting. What I would recommend increasing is the memory for each case and control sample.

> so, to confirm, I'll be looking at the result in calls.scored.sorted.vcf.gz?

Yep, that's the one!
