Comments (6)
Hi @moldach. How many of those calls have a PASS
value in the FILTER
column? If that number is still high, my first guess is that many of these calls are going to be associated with common variants or repetitive regions such as segmental duplications. I'd suggest providing a BED file with the locations of common variants and segdups to the varfilter
config value so that associated variant calls can be marked appropriately. See https://kevlar.readthedocs.io/en/latest/tutorial.html for more info.
Given the amount of memory you're using for each sample, I'm wondering what your coverage is. I tested mostly trios with 30x-50x average coverage. You're using much more memory than I ever did, and if that is due to high coverage you might need to fiddle with the case/control abundance thresholds a bit as well.
I hope this helps!
from kevlar.
Didn't think of that! I think two of the GATK
callers I tried before reported everything in the VCF
and you had to filter for PASS
but not with the tools I'm using most often so I forgot about checking that.
I tried using bcftools
to get that information from kevlar
output but I get an error:
$ bcftools view -i 'ID=PASS' calls.scored.sorted.vcf > output.vcf
[E::bcf_hdr_parse_line] Could not parse the header line: "##INFO=<ID=IKMERS,Number=1,Type=Integer,Description="number of "interesting" (novel) k-mers spanning the variant alternate allele">"
[W::bcf_hdr_parse] Could not parse header line: ##INFO=<ID=IKMERS,Number=1,Type=Integer,Description="number of "interesting" (novel) k-mers spanning the variant alternate allele">
No such INFO field: PASS
from kevlar.
I can't say that I'm familiar with bcftools
, but the FILTER
column is the last of the seven "core" columns in VCF, not one of the subsequent variable number of INFO
columns described in the headers. For a less-sophisticated approach, I'd suggest the following shell one-liners.
$ # Just count the variants passing filters
$ grep -c $'\tPASS\t' calls.scored.sorted.vcf
$
$ # Actually retrieve the passing variants
$ grep $'\tPASS\t' calls.scored.sorted.vcf > output.vcf
from kevlar.
Okay thank you.
But, riddle me this, kevlar
run with max_fpr
of 0.05
and max_fpr
of 0.001
are returning exactly the same number of putative variants when filtered for PASS
. Why would this be?
max_fpr=0.05
{
"ksize": 31,
"recountmem": "250G",
"numsplit": 16,
"samples": {
"casemin": 6,
"ctrlmax": 1,
"case": {
"fastx": [
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018P_R1.fastq.gz",
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018P_R2.fastq.gz"
],
"memory": "250G",
"label": "Proband",
"max_fpr": 0.05
},
"controls": [
{
"fastx": [
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018M_R1.fastq.gz",
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018M_R2.fastq.gz"
],
"memory": "250G",
"label": "Mother",
"max_fpr": 0.05
},
{
"fastx": [
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018F_R1.fastq.gz",
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018F_R2.fastq.gz"
],
"memory": "250G",
"label": "Father",
"max_fpr": 0.05
}
],
"coverage": {
"mean": 30.0,
"stdev": 10.0
}
},
"mask": {
"fastx": [
"/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa"
],
"memory": "40G",
"max_fpr": 0.005
},
"reference": {
"fasta": "/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa",
"memory": "80G",
"max_fpr": 0.025
},
"localize": {
"seedsize": 51,
"delta": 50,
"seqpattern": ".",
"maxdiff": 10000
},
"varfilter": null
}
$ grep -c $'\tPASS\t' calls.scored.sorted.vcf
3233
max_fpr=0.001
{
"ksize": 31,
"recountmem": "950G",
"numsplit": 16,
"samples": {
"casemin": 6,
"ctrlmax": 1,
"case": {
"fastx": [
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018P_R1.fastq.gz",
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018P_R2.fastq.gz"
],
"memory": "950G",
"label": "Proband",
"max_fpr": 0.001
},
"controls": [
{
"fastx": [
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018M_R1.fastq.gz",
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018M_R2.fastq.gz"
],
"memory": "950G",
"label": "Mother",
"max_fpr": 0.001
},
{
"fastx": [
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018F_R1.fastq.gz",
"/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018F_R2.fastq.gz"
],
"memory": "950G",
"label": "Father",
"max_fpr": 0.001
}
],
"coverage": {
"mean": 30.0,
"stdev": 10.0
}
},
"mask": {
"fastx": [
"/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa"
],
"memory": "100G",
"max_fpr": 0.005
},
"reference": {
"fasta": "/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa",
"memory": "100G",
"max_fpr": 0.025
},
"localize": {
"seedsize": 51,
"delta": 50,
"seqpattern": ".",
"maxdiff": 10000
},
"varfilter": null
}
$ grep -c $'\tPASS\t' calls.scored.sorted.vcf
3233
from kevlar.
Most likely explanation: while the lower FPR offers better theoretical accuracy, both FPRs offer identical effective accuracy. A more thorough investigation of the 3k PASSing variants may reveal another more interesting story though.
from kevlar.
To be clear, I meant another additional story, not another alternative story.
from kevlar.
Related Issues (20)
- Bug with log-space transformation
- Command or script for inspecting ambiguous calls
- MemoryError HOT 3
- Execution failed due to high FPR in case HOT 4
- Failed tests HOT 8
- Dead links to tutorial files HOT 3
- Empty files produced by kevlar split cause crash of kevlar assemble HOT 3
- python 3.6 install - pytest failures - 26 failed, 347 passed (pytest 5.3.5?) HOT 5
- Tutorial: Error on running kevlar filter HOT 9
- Clarification - how to avoid high FPR HOT 3
- kevlar memory error HOT 2
- partition, node error HOT 3
- Controls for simlike HOT 2
- VCF Parsing Issue HOT 1
- Where is kevlar? HOT 3
- How to just get the Kmer?
- Kevlar novel multithreading HOT 2
- count_control error: estimated false positive rate is 0.385 (FPR too high, bailing out!!! HOT 5
- Is kevlar able to detect duplications/repeat expansions and in general CNV? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from kevlar.