csb5 / lofreq Goto Github PK
View Code? Open in Web Editor NEWLoFreq Star: Sensitive variant calling from sequencing data
Home Page: http://csb5.github.io/lofreq/
License: Other
LoFreq Star: Sensitive variant calling from sequencing data
Home Page: http://csb5.github.io/lofreq/
License: Other
Hi,
I used picard-tools-1.141 SortVcf to sort VCF file produced by lofreq (version 2.1.2). Then the sorted VCF file will be combined with indel calls from other algorithms (as suggested in somaticseq pipeline). Picard gave me
"Exception in thread "main" java.lang.IllegalStateException: Key SOMATIC found in VariantContext field INFO at 1:11113181 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default."
My VCF file example:
`##fileformat=VCFv4.0
1 5177961 . GA G 110 PASS DP=12;AF=0.416667;SB=0;DP4=2,5,2,3;INDEL;HRUN=9;SOMATIC;UQ=63`
My question is how can I modify my VCF FILE header to solve this small problem?
Best regards,
Hongen
With Lofreq, what does the AF column really represent? My assumption is that it is the minority variant allele frequency. It appears to range between 0 and 1, does that mean that a variant at 1.0 frequency is an actual SNP? Is there some documentation on interpreting these outputs? Many thanks and sorry if this is naive.
Hi,
I wanted to check with you if Lofreq can call multi-nucleotide variants (MNV). e.g. AG>ACT. I created a simulated bam with this MNP and Lofreq wasnt able to call it. I used --call-indels paramter also, thinking it can call it as SNV and indel if not MNV, but it didnt call anything. Please let me know if there are any parameters I can tweak in order to make it work.
Thanks,
Ashini
To speedup preprocessing. Precomputation also avoids memory problem in on-the-fly computation for high indel error-rate data (keeps data for all piled up reads in memory)
Is there a way to print the actual p-value that is associated with each variant?
As discussed with NN and IS on 2015-04-24
I've been using LoFreq for a while and haven't had any issues with it. However, recently, I got an error that my VCF was not valid because it did not have a sample column (it stops at INFO
). Is it possible to add the sample column? If that's not already built in, is it possible to add it somehow with an external tool?
I can't get lofreq to compile against the latest 1.2.x version of samtools. It seems to depend on kaln.h which is no longer in the samtools 1.2.x repo.
Hi,
I am getting an invalid syntax error, when I try to run lofreq somatic mode.
$lofreq somatic
File "Reference/Apps/LoFreq/lofreq_star-2.1.2/bin/lofreq2_somatic.py", line 378
def rlx_to_str(self, sample_type, (num_snv_tests, num_indel_tests)):
^
SyntaxError: invalid syntax
Can you please help me fix it.
Thanks,
Ashini
Hi,
I tried to run lofreq like this
lofreq call-parallel --pp-threads 4 -f 'path with spaces and (parenthesis)/ref_sequence.fna' bam_file.bam -o calls.vcf
and I get a bunch of
/bin/sh: -c: line 0: syntax error near unexpected token `('
I know, I know... One should not have path with spaces, but Dropbox recently renamed its folder to Dropbox (Personal)
. I guess that protecting the file names with shlex.quote
should do the trick.
When using --only-indels as argument to the call commands --call-indels should be implied.
Raised by gmy on sourceforge's lofreq:discussion mailing list.
Either really implement this as filter (i.e. also remove in DP[4] and DP) or make clear it's just temporary masking (or have both)
Hi Andreas,
After a Lofreq run, I saw messages like "Number of substitution tests and indel tests are: ... respectively". I am wondering what do these tests do? Are they necessary for Lofreq and for the output? If they can be turned off, how can I do that? I didn't find an input option that could do that.
Thank you very much,
Zheng
Will not matter much in practice (e.g. get roughly 30 pooled commands for a ecoli which makes LoFreq blind to 30-1 positions).
Reported by isovic
u_within_limits() in bam_md_ext.c handles the problem (which might be a feature) for now. Larger bandwith settings might help, but will slow things down even more.
If Python version 3 as well as Python version 2.7 are installed and a binary or link called python
is missing, configure will find and use version 3 (which is not supported by LoFreq). This is easily fixed by creating such link, but it would be even easier if the corresponding interpreter could be set as option to ./configure
.
There could be at least two cases were this is needed for the final user:
Of course, the final user could merge the BAM files to obtain a huge one and call the variant on that one whatever is the case; nevertheles, sometimes impossible to handle, because too many samples are used. I think that the htslib pileup function can handle several BAM and this could be easy to implement, just taking all the parameters after the last flag as input files and pass them to it.
Hi,
I installed LoFreq 2.1.2 and 2.1.3.1 from binary packages on Ubuntu. I don't know if it's necessary but Samtools-1.1 and Htslib-1.1 are installed in the PATH.
Both versions give me an error when I try to run LoFreq call:
./lofreq call -f /genome_hg38.fa -o /sample1.lofreq_variants.vcf -l /SeqCap_EZ.hg38.analysis.variant_calling.target.bed.merged /sample1.alnqual-lofreq.bam --verbose --no-default-filter
Alive and happily crunching away on pos 17018314 of 1...
Alive and happily crunching away on pos 121526406 of 10...
Alive and happily crunching away on pos 25227485 of 12...
Alive and happily crunching away on pos 31337657 of 17...
Alive and happily crunching away on pos 54290077 of 4...
Executing lofreq filter -i /tmp/lofreq2-call-dyn-bonf.iMIoN6 -o /sample1.lofreq_variants.vcf --no-defaults --snvqual-thresh 76 --indelqual-thresh 20
FATAL(lofreq_filter.c|main_filter:1104): Unrecognized argument found. Exiting...
ERROR(lofreq_call.c|main_call:1529): The following command failed: lofreq filter -i /tmp/lofreq2-call-dyn-bonf.iMIoN6 -o /sample1.lofreq_variants.vcf --no-defaults --snvqual-thresh 76 --indelqual-thresh 20
I already checked the bam-file that I'm using, and it was created using the same reference genome, also the qualities were checked and should be ok.
Even when I use the --no-default-filter
there is still the problem with the filter step. Is there something that I'm doing wrong?
Thanks
How can I call the variants that allele frequence is under 1% ? Is there any parameter can be set?
Ideally overlapping paired-end reads shouldn't be counted twice. This could be achieved by proper pre-processing (e.g. SeqPrep and single-end mapping for merged pairs) or internally in LoFreq by making use of samtools "read-pair overlap detection". Problem with the latter is that it bumps up one quality and sets the other to zero instead of just swallowing the base.
Hi there,
I am trying to install lofreq on Ubuntu 16.04 and get this following error msg when run make
washington@Linux:~/lofreq$ make
Making all in src/cdflib90
make[1]: Entering directory '/home/washington/lofreq/src/cdflib90'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/washington/lofreq/src/cdflib90'
Making all in src/uthash
make[1]: Entering directory '/home/washington/lofreq/src/uthash'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/washington/lofreq/src/uthash'
Making all in src/lofreq
make[1]: Entering directory '/home/washington/lofreq/src/lofreq'
gcc -DPACKAGE_NAME="LoFreq_Star" -DPACKAGE_TARNAME="lofreq_star" -DPACKAGE_VERSION="2.1.3.1" -DPACKAGE_STRING="LoFreq_Star\ 2.1.3.1" -DPACKAGE_BUGREPORT="[email protected]" -DPACKAGE_URL="" -DGIT_VERSION="v2.1.3.1-dirty" -DPACKAGE="lofreq_star" -DVERSION="2.1.3.1" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -D__EXTENSIONS__=1 -D_ALL_SOURCE=1 -D_GNU_SOURCE=1 -D_POSIX_PTHREAD_SEMANTICS=1 -D_TANDEM_SOURCE=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=".libs/" -DSTDC_HEADERS=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBM=1 -DHAVE_LIBZ=1 -DNDEBUG=/**/ -D_USE_KNETFILE=1 -I. -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -Wall -I../cdflib90/ -I../uthash -I/home/washington/htslib-1.5 -I/home/washington/samtools-1.5 -g -O2 -pthread -MT bam_md_ext.o -MD -MP -MF .deps/bam_md_ext.Tpo -c -o bam_md_ext.o bam_md_ext.c
In file included from bam_md_ext.c:38:0:
/home/washington/samtools-1.5/bam.h:189:28: error: conflicting types for ‘seq_nt16_int’
#define bam_nt16_nt4_table seq_nt16_int
^
bam_md_ext.c:54:12: note: in expansion of macro ‘bam_nt16_nt4_table’
const char bam_nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 };
^
In file included from /home/washington/htslib-1.5/htslib/sam.h:31:0,
from bam_md_ext.c:37:
/home/washington/htslib-1.5/htslib/hts.h:319:18: note: previous declaration of ‘seq_nt16_int’ was here
extern const int seq_nt16_int[];
^
In file included from bam_md_ext.c:43:0:
samutils.h:47:14: warning: ‘op_cat_str’ defined but not used [-Wunused-variable]
static char *op_cat_str[] = {
^
Makefile:467: recipe for target 'bam_md_ext.o' failed
make[1]: *** [bam_md_ext.o] Error 1
make[1]: Leaving directory '/home/washington/lofreq/src/lofreq'
Makefile:403: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1
can anyone provide and help?
Thank you,
WLD
Hi Andreas,
I already anticipated by mail this observation I made. I simulated forward and reverse reads in quite equal amount and the alignment seems to agree. Nevertheless, the counts of the INFO tag DP4 are strange. For example:
##fileformat=VCFv4.0
##fileDate=20150429
##source=lofreq call -f cns.fasta --no-default-filter -r sample_cons_Pol:1-188 -o /tmp/lofreq2_call_parallelKepDFM/0.vcf.gz hq_2_cons_sorted.bam
##reference=cns.fasta
...
#CHROM POS ID REF ALT QUAL FILTER INFO
sample_cons_Pol 50 . C G 85 PASS DP=1109;AF=0.005410;SB=0;DP4=1103,0,6,0
sample_cons_Pol 328 . A G 8400 PASS DP=3231;AF=0.100898;SB=0;DP4=2886,9,325,1
Any thought? I'm using
version: 2.1.2a
commit: v2.1.2a-92-g6f6e173
build-date: Apr 9 2015
Thanks!
I've attached a .bam that shows evidence of two variants of significant depth (here's an IGV snapshot):
While I am able to detect the first one with lofreq, the second one fails to appear, with any set of cmdline options I can think of. Could someone have a look at this and explain why the ~12% AF variant at chr7:148506396 A / C is not reported by lofreq? I am using lofreq 2.1.1.
Some flag could be provided for allow calling variants when there is an N in the reference. As we discuss by email, it could be done just replacing the N for a random reference; but it will be very useful for the final user to be allow to output this kind of variants.
Including this in the code could be done by:
The first point could be change by using the consensus variant approach as in previous versions.
I had previous success running lofreq adn now am getting a syntax error. Any suggestion? The commad I run:
$python lofreq_star-2.1.2/bin/lofreq call-parallel --pp-threads 12 -f human_g1k_v37.fa -s -S All_20161122.vcf.gz -l SureSelect.bed -o SampleA_lofreq.vcf SampleA_realigned_recal.bam
File "lofreq_star-2.1.2/bin/lofreq", line 1
SyntaxError: Non-ASCII character '\xb6' in file lofreq_star-2.1.2/bin/lofreq on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
I reinstalled the package incase something strange happend to original installation since I last used the software. Please advice how to fix the error. THanks.
I have many Illumina datasets (> 1000 samples from different individuals within a population) which I will analyze with LoFreq.
If I run "LoFreq call" one a single set of aligned reads, all goes smoothly.
However, if I submit all 1000+ independent jobs to the compute cluster (SGE), then many of them return with an error:
ERROR(lofreq_call.c|main_call:1394): Couldn't open /tmp/lofreq2-call-dyn-bonf.9pEkER
But with a different suffix: e.g. lofreq2-call-dyn-bonf.nydjvL or lofreq2-call-dyn-bonf.uHy7CC , etc.
I noticed that LoFreq uses "mktemp" in lofreq_call.c. But I have read that "mktemp" is unsafe and that "mkstemp" should be used instead.
I tried editing the source file from "mktemp" to "mkstemp" but I got a segfault at run time (I may be lacking the zlib devel version, maybe that's why)
please note that when I run the same exact command alone (i.e. without 1000 other simultaneous jobs), then it executes perfectly fine, no error message.
Maybe there can be an option for the user to give a temporary/working directory to LoFreq, to aovid the "mktemp" problem entirely?
When configure with pre-compiled samtools-1.3 and htslib-1.3, make produce the following error:
In file included from bam_md_ext.c:38:0:
samtools-1.3/bam.h:189:28: error: conflicting types for ‘seq_nt16_int’
#define bam_nt16_nt4_table seq_nt16_int
^
bam_md_ext.c:54:12: note: in expansion of macro ‘bam_nt16_nt4_table’
const char bam_nt16_nt4_table[] = { 4, 0, 1, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4 };
^
In file included from htslib-1.3/htslib/sam.h:30:0,
from bam_md_ext.c:37:
htslib-1.3/htslib/hts.h:255:18: note: previous declaration of ‘seq_nt16_int’ was here
extern const int seq_nt16_int[];
^
In file included from bam_md_ext.c:43:0:
samutils.h:47:14: warning: ‘op_cat_str’ defined but not used [-Wunused-variable]
static char *op_cat_str[] = {
^
make[1]: *** [bam_md_ext.o] Error 1
make: *** [all-recursive] Error 1
Suggested by Daniel Gómez Sánchez
Lofreq assigns a Phred-scaled score for strand bias. Why use Phred scores? Is it possible to report the raw probability scores? What does it mean when a site has a Phred score SB=0?
Hi Andreas,
Can you please explain how DP and DP4 fields differ? I know DP is raw depth and DP4 is counts for ref and alt on forward & reverse bases. So, I believe DP4 is subset of raw depth and does not include reads with strand bias. Is that correct?
We have a variant for which the allele frequency calculated by Lofreq is 3% ( I use -m20 -q 13 and -Q 13) and quality of 290, but when we calculate it from the DP4 counts for ref and alt, it comes out to be 49.19%. IGV also has VAF of 48%, so just wondering how should I go about interpreting these. Should we always calculate the VAF from DP4?
Moreover, when I run Lofreq with default settings, it calculates VAF as 47.7% and DP4 estimated VAF is still 49% with a quality of 3586. So, definitely the settings change the allele frequency calculation by Lofreq but why is DP4 not affected by this. Can you please help me solve this issue.
Thank you.
After all reads are aligned independently.
Hi
I am developer and trying to wrap up your lofreq tool for galaxyproject. I have the forms generated and the command executed.
lofreq call -f "/galaxy/database/files/000/dataset_131.dat" -o "/galaxy/database/files/000/dataset_132.dat" --min-bq "6" --min-alt-bq "6" --def-alt-bq "0" --min-jq "0" --min-alt-jq "0" --def-alt-jq "0" --no-baq --del-baq --no-ext-baq --min-mq "0" --max-mq "0" --def-nm-q "0" --min-cov "1" --max-depth "1000000" --src-qual --sig --bon f --illumina-1.3 --use-orphan --plp-summary-only --no-default-filter --force-overwrite --verbose "/galaxy/database/files/000/dataset_129.dat"
dataset_131 is the fasta file (reference_h37rv.fasta)
dataset_131 is the output vcf file
dataset_132 is the bam file (ext1.bam)
This is though the error, I receive when trying to run the tool:
FATAL(lofreq_call.c|main_call:1243): Couldn't parse sign-threshold
What does it mean, in program terms (simple terms)?
Regards
Zipho
just copy logic from somewhere else
Pre-compiled version does not work in OS X 10.8.5 (12F2560) but it works in OS X 10.9.5 (13F1507), because of the libSystem.B.dylib:
dyld: Symbol not found: ___exp10
Referenced from: lofreq
Expected in: /usr/lib/libSystem.B.dylib
I understand that it could not be compiled for every OS, but I would appreciate if it could be specified somewhere for which version could be used, because it does not complain until it needs this function.
It is not a major issue, because when I compiled the source in my computer is working perfectly, but for a final user it could be a problem.
Useful for debugging and circumventing preprocessing and duplication of BAM where not necessary
I am compiling lofreq using this
./configure SAMTOOLS=/apps/samtools/1.2.1 HTSLIB=/apps/htslib/1.2.1/ --
prefix=/apps/lofreq/2.1.3
And it starts to compile
configure: Configuring LoFreq_Star (version 2.1.3a) for your system...
| ____|
| _ \ | __| _ \ _` |
| ( | __| | __/ ( |
_____| \___/ _| _| \___| \__, |
_|
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for style of include used by make... GNU
checking for gcc... /apps/gcc/5.3.0/bin/gcc
checking whether the C compiler works... yes
. .... truncated
But then it lands on this issue.
> checking whether to build with debug information... no
> checking for /apps/samtools/1.2.1/bam.h... no
> configure: error: bam.h not found
I get this error against htslib 1.2.1 and 1.3.1 and the same with samtools 1.2.1 and 1.3.1. I am not sure how to resolve this error. What compilation steps do you recommend.
Matters esp. for high coverage data where huge numbers make any diff look significant
Implement and benchmark and first
First, thanks for the very useable and documented package. This is just a feature request/enhancement.
It would be great to have a "--overwrite" option to allow lofreq to overwrite existing output files if the user chooses. As it is, the user must go through and delete files before rerunning lofreq.
I was able to download the binary. However, I wanted to use the optional tools as well. It looks like the only way to get them is by downloading the source. I did that. I copied them to a location in my $PATH. However, when I run lofreq vcfplot
or lofreq2_vcfplot.py
, I get:
FATAL: Couldn't find LoFreq modules. Are you sure your PYTHONPATH is set correctly?
I also tried navigating to lofreq-2.1.2/src/tools
and running python setup.py install
, but then I get an error:
Traceback (most recent call last):
File "setup.py", line 8, in <module>
import setup_conf
ImportError: No module named setup_conf
How do I get setup_conf?
Is there a better way to install optional tools?
Simply add AF and possibly number of orphans for now. Take from Info in case of tumor and compute in within uniq in case of normal
This was introduced as a hack to avoid calling SNVs on consensus indels (which might not be visible due to missing indel qualities). See call_vars() in lofreq_call.c. One way of dealing with this might be to enforce indel qualities (would avoid trouble with "missing" indels as well).
Intially reported by John Jianwei Che
Hi Andreas,
In my analysis, the quality scores from Lofreq seem to reach a plateau at 49,314. Is this the maximum quality score a variant can have? And if so, is there a way to change this value and set it to something else?
Thank you so much.
Ashini
Happens because vcf file is loaded fully into memory to determine multiple testing correction threshold. Easily fixed by loading twice, once to determine thresholds and then a second time to apply them.
Initially reported by Kanika Arora (NYG)
Hi,
The command help statements for lofreq call mentions:
How do we pass a single tuple of say, chr pos-start pos-end, to lofreq call? Does this have to be in bed format, or can this be done from the command line (without needing a file)?
Hi!
I just noticed that compilation fails when passing a new version of htslib.
The error is (shortened)
bam_md_ext.c:54: error: conflicting types for ‘seq_nt16_int’
/home/ozagordi/tmp/htslib-1.2/htslib/hts.h:186: note: previous declaration of ‘seq_nt16_int’ was here
make[1]: *** [bam_md_ext.o] Error 1
make[1]: Leaving directory `/home/ozagordi/tmp/lofreq/src/lofreq'
I managed to compile successfully by following the steps in .travis.yml
Thanks for lofreq!
...for all scripts in ./src/scripts and also helpers. Should still be able to use 2.7
See FPs in Pacbio data on or next to true (consensus) indels or SNVs: resp. other type is wrongly called on same or neighboring site. This could also happen in pooled data with non-consensus variants. Can't simply disallow multiallelic calls, which is a useful feature in other contexts (e.g. viral quasispecies). Easiest but hacky approach is to remove one of the two overlapping/neighbouring variants as post filter and keep the one with higher Q or AF.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.