brentp / cyvcf2 Goto Github PK

View Code? Open in Web Editor NEW

357.0 7.0 71.0 15.04 MB

cython + htslib == fast VCF and BCF processing

License: MIT License

Python 35.40% C 5.12% Shell 1.88% Cython 57.60%

vcf htslib cython genomics bioinformatics

cyvcf2's Introduction

cyvcf2

Note: cyvcf2 versions < 0.20.0 require htslib < 1.10. cyvcf2 versions >= 0.20.0 require htslib >= 1.10

The latest documentation for cyvcf2 can be found here:

If you use cyvcf2, please cite the paper

Fast python (2 and 3) parsing of VCF and BCF including region-queries.

cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files.

Attributes like variant.gt_ref_depths work for diploid samples and return a numpy array directly so they are immediately ready for downstream use. note that the array is backed by the underlying C data, so, once variant goes out of scope. The array will contain nonsense. To persist a copy, use: cpy = np.array(variant.gt_ref_depths) instead of just arr = variant.gt_ref_depths.

Example

The example below shows much of the use of cyvcf2.

from cyvcf2 import VCF

for variant in VCF('some.vcf.gz'): # or VCF('some.bcf')
    variant.REF, variant.ALT # e.g. REF='A', ALT=['C', 'T']

    variant.CHROM, variant.start, variant.end, variant.ID, \
                variant.FILTER, variant.QUAL

    # numpy arrays of specific things we pull from the sample fields.
    # gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT
    variant.gt_types, variant.gt_ref_depths, variant.gt_alt_depths # numpy arrays
    variant.gt_phases, variant.gt_quals, variant.gt_bases # numpy array

    ## INFO Field.
    ## extract from the info field by it's name:
    variant.INFO.get('DP') # int
    variant.INFO.get('FS') # float
    variant.INFO.get('AC') # float

    # convert back to a string.
    str(variant)


    ## sample info...

    # Get a numpy array of the depth per sample:
    dp = variant.format('DP')
    # or of any other format field:
    sb = variant.format('SB')
    assert sb.shape == (n_samples, 4) # 4-values per

# to do a region-query:

vcf = VCF('some.vcf.gz')
for v in vcf('11:435345-556565'):
    if v.INFO["AF"] > 0.1: continue
    print(str(v))

Installation

pip with bundled htslib

pip install cyvcf2

pip with system htslib

Assuming you have already built and installed htslib version 1.12 or higher.

CYVCF2_HTSLIB_MODE=EXTERNAL pip install --no-binary cyvcf2 cyvcf2

windows (experimental, only test on MSYS2)

Assuming you have already built and installed htslib.

SETUPTOOLS_USE_DISTUTILS=stdlib pip install cyvcf2

github (building htslib and cyvcf2 from source)

git clone --recursive https://github.com/brentp/cyvcf2
pip install -r requirements.txt
# sometimes it can be required to remove old files:
# python setup.py clean_ext
CYVCF2_HTSLIB_MODE=BUILTIN CYTHONIZE=1 python setup.py install
# or to use a system htslib.so
CYVCF2_HTSLIB_MODE=EXTERNAL python setup.py install

On OSX, using brew, you may have to set the following as indicated by the brew install:

For compilers to find openssl you may need to set:
  export LDFLAGS="-L/usr/local/opt/openssl/lib"
  export CPPFLAGS="-I/usr/local/opt/openssl/include"

For pkg-config to find openssl you may need to set:
  export PKG_CONFIG_PATH="/usr/local/opt/openssl/lib/pkgconfig"

Testing

Install pytest, then tests can be run with:

pytest

CLI

Run with cyvcf2 path_to_vcf

$ cyvcf2 --help
Usage: cyvcf2 [OPTIONS] <vcf_file> or -

  fast vcf parsing with cython + htslib

Options:
  -c, --chrom TEXT                Specify what chromosome to include.
  -s, --start INTEGER             Specify the start of region.
  -e, --end INTEGER               Specify the end of the region.
  --include TEXT                  Specify what info field to include.
  --exclude TEXT                  Specify what info field to exclude.
  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Set the level of log output.  [default:
                                  INFO]
  --silent                        Skip printing of vcf.
  --help                          Show this message and exit.

Performance

For the performance comparison in the paper, we used thousand genomes chromosome 22 With the full comparison runner here.

cyvcf2's People

Contributors

Stargazers

Watchers

Forkers

thongnt2 chapmanb jayhesselberth ernfrid moonso guochangjiang modojo dnil arq5x drveera indraniel tomsasani phyllispeng04 raonyguimaraes johanneskoester sndrtj seqpipe ichobits everestial giladmishne east301 juliangehring christophelegendre nh13 jimhavrilla vladsavelyev juzheng87 technoslerphile kforti justinfinkle arontommi x-bioinformatics nickengland smoe hoeze literallyuniquelogin mbhall88 ismailm akaped ujjwalsh timothymillar horta tomwhite fagan2888 ruslan-abasov tubiejun nyagam stefanor thomasbrazier grahamgower piechottam dridk leoisl zengzhengcheng weixuanfu khurrammaqbool gileshall eqt astraheywood hyanwong kdm9 yang-mj tedil benjeffery cholotook yudgnahk jeromekelleher graphenn nakib103

cyvcf2's Issues

variant.FILTER is None

Hi,

when parsing vcfs with cyvcf2 there variant.FILTER is always None.
I have tried with different vcf files and the result is always the same. Is this a known bug or is there something I'm missing?
I can provide examples but are not sure if it would help...

numpy dependency in setup.py mishandled

I'm not entirely sure what the best solution is in this case, but in setup.py, numpy is specified as a dependency, but it is imported at the top of the file before the dependency can be satisfied.

I suspect that this problem can be addressed via the example here: http://docs.scipy.org/doc/numpy-dev/f2py/distutils.html and the documentation here: http://docs.scipy.org/doc/numpy/reference/distutils.html

Read VCF From StringIO or Buffer?

I'm writing an AWS lambda function that fetches a VCF remotely and converts it to JSON - using cyvcf2 to read and format input. I want to avoid having to write the VCF that is fetched remotely to disk and then having cyvcf2 read it. Is there a way for cyvcf2 to read a VCF from a string buffer (or stdout from subprocess?).

GT type for HOMALT with multiple alternative alleles

For variants with multiple alt alleles, samples with a genotype homozygous in the second/third/... alt alleles are reported as UNKNOWN. Shouldn't they rather be HOMALT?

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Tumor   Normal
7   55086956    .   A   C,G 0   .   .   GT  0/1 2/2
7   55086957    .   A   C,G,T   0   .   .   GT  1/2 3/3
7   55086958    .   A   C,G,T   0   .   .   GT  1/1 2/3

import cyvcf2 as vcf
vf = vcf.VCF("test.vcf")
v = next(vf)
v.gt_types

2/2 and 3/3 are reported as UNKNOWN (i.e. 2) instead of HOMALT
0/1, 1/2 and 2/3 are correctly as HET (i.e. 1)
1/1 is correctly reported as HOMALT (i.e. 3)

Add documentation

Hi,

thought I do some documentation while getting into cyvcf2.
Is it ok if I do it in the mkdocs way?

In that case I can push some stuff and you can just build the docs when they are done.

Bump version in pypi to 0.7.0

Is there a specific reason the version of cyvcf2 has not yet been bumped to 0.7.0?
We would like to update our Python setup with the latest changes without having to manually "build" this package.

Thanks
Filip Nollet

Enable zip and indexing from cyvcf

Would it be possible to bgzip and create index from cyvcf?

write slim VCF?

Hi,

Would it be possible to use cyvcf2 to skip particular VCF header fields (i.e. altering the VCF template) when writing records? It seems it can add headers, but I could not figure out to set existing fields to NULL.

best,
Sigve

Retrieval of FORMAT string field fails

Accessing the content of a FORMAT field with a text string content

import cyvcf2
vf = vcf.VCF("test.vcf.gz")
v = next(vf)
v.format('ADP_ALL', int) ## control, works
v.format('RULE', str) ## fails

currently (latest release version 0.5.3 as well as built from master) fails with

ValueError Traceback (most recent call last)
in ()
----> 1 v.format("RULE", str)
/usr/local/lib/python2.7/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.Variant.format (cyvcf2/cyvcf2.c:21838)()
894 shape[0] = <np.npy_intp> self.vcf.n_samples
895 shape[1] = fmt.n # values per sample
--> 896 v = np.PyArray_SimpleNewFromData(2, shape, typenum, buf)
897 ret = np.array(v)
898 if vtype == str:
ValueError: data type must provide an itemsize

Example data (derived from bug.vcf.gz test case) with a 'RULE' format field that contains a text string:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=SOMATIC_CALL,Number=R,Type=Integer,Description="Boolean value for a somatic call for each allele.">
##FORMAT=<ID=ADP_ALL,Number=R,Type=Integer,Description="Number of all unique observations for each allele, inc. both low- and high-confidence.">
##FORMAT=<ID=RULE,Number=1,Type=String,Description="Text field">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Tumor
7   55086956    .   C   G   0   .   .   ADP_ALL:RULE    6728,1:F
7   55086957    .   T   A,C,G   0   .   .   ADP_ALL:RULE    6768,2,2,1:F2

question about putting "variant" into pandas dataframe

hi,
i'm trying to parse my vcf files using cyvcf2. I am wondering how do i make the cyvcf2.cyvcf2.Variant
and put them into panda dataframe? Thanks

Update format and filter field in header

It would be a great extension to have the VCF.update function also allow the updating of format and filter fields in the VCF header; perhaps with a similar interface as the VCF.add_{info,filter,format}_to_header` functions. Or at least I have not been able to achieve this with the existing functionality, and there may already a clever way that I may have missed.

Handling multiple alternative alleles

When iterating over a (large) multisample vcf file, how are multiple alternative alleles encoded in variant.gt_types? What would be the best way to classify those correctly?

Also return contig/chromosome sizes from the seqnames function.

Hi,

Would it be possible to also return contig/chromosome sizes from the seqnames function?

Currently a list of contig/chromosome names is returned.

from cyvcf2 import VCF
reader = VCF("my.vcf.gz")
reader.seqnames
['Chr_00', 'Chr_01', 'Chr_02', 'Chr_03', 'Chr_04', 'Chr_05', 'Chr_06', 'Chr_07', 'Chr_08', 'Chr_09']

To be even more sure that the VCF was made using the expected reference genome we would like to be able compare also on contig/chromosome size. Most VCF files have contig/chromosome size defined in the header.

So than the seqnames function could return a dictionary. With None values if the contig/chromosome size was not defined in the VCF header.

from cyvcf2 import VCF
reader = VCF("my.vcf.gz")
reader.seqnames
{'Chr_00': 12345,
 'Chr_01': 12345,
 'Chr_02': 12345,
 'Chr_03': 12345,
 'Chr_04': 12345,
 'Chr_05': 12345,
 'Chr_06': 12345,
 'Chr_07': 12345,
 'Chr_08': 12345,
 'Chr_09': 12345}

Thank you.

Accessing GT field via format method

Hello,

Thanks for your effort on cyvcf2, this is a great project.
I have a question regarding accessing GT field via format function. I can access genotypes using the genotypes attribute:

In [5]: variant.genotypes
Out[5]: [[0, 1, False], [0, 1, False]]

However, when I try to access it via 'format' method, here's what I get:

In [6]: variant.format('GT')
Out[6]: 
array(['\x02\x04', '\x02\x04'],
      dtype='|S2')

How do I interpret this output? I can access other fields via format method just fine (e.g. I get a human-friendly output). I wonder if there's something obvious that I'm missing.

Any help is much appreciated. Thanks in advance!

Subset variants for samles of interest

Hi Brent,

I am not sure how feasible this is and if this is something that cyvcf2 is meant to do.
But at least I can ask.

Would it be possible to subset samples in cyvcf2?

The same functionality as that bcftools view -s sample1,sample2 does? Including updating the AN and AC INFO values?

Thank you.

variant.ID returns binary string

Hi Brent,

Thank you for the very nice and useful htslib wrapper cyvcf2.

When I parse a VCF file with cyvcf2 and try to print the ID assigned to the variant I get printed a binary string.

E.g.

reader = VCF("my_file.vcf.gz")
variant = next(reader)
print(variant.ID)

b'myID123456789'

Is this the intended behavior, or does this depend somehow on my input vcf or python version?
I am using python3.4 . Both cyvcf2 version 0.6.5 and 0.7.0 give me the binary string for the ID.

Decoding the binary string before printing solves my issue but it would be nicer if that wasn't necessary.

print(variant.ID.decode('ascii'))

'myID123456789'

Thank you.

Neill

slow speed of vcf.samples

Hi Brentp,
cyvcf2 is very good, thanks for your effort!
But I found the vcf.samples is very slow in this code:
The test.vcf is just a part of 1000k genome file (~1000 head lines )with ~2500 samples.

from cyvcf2 import VCF
vcf = VCF("test.vcf")
for variant in vcf:
    for i in range(0,len(vcf.samples)):
        print vcf.samples[i]

However, when I store vcf.samples as a new python variable, it is fast.

from cyvcf2 import VCF
vcf = VCF("test.vcf")
samplenames = vcf.samples
for variant in vcf:
    for i in range(0,len(vcf.samples)):
        print samplenames[i]

So, I infer there may be some excessive behaviour when using vcf.samples.

Enable access to genotypes using get for variant class

Support for genotype iteration and access to genotypes by sample name would be very useful.

For example:

vcf = VCF("test.vcf.gz")
for line in vcf:
    # FORMAT: = GT:DP
    # return {"GT": 3, "DP": 50}
    line["sample_name"] # return {"GT": 3, "DP": 50}

Alternatively, the above could stay inline with the properties defined in the variant class, returning something like:
{"gt_bases","T", "gt_quals":20, "is_snp": True, etc.}

Support for iteration would also be very useful:

for line in vcf:
    for call in line: 
         # Return dict of sample name + format params.

Although it is not essential, the ability to add/alter call information would be great as well.

Thanks for the hard work on this project. It is an immense help and I am integrating it within some python projects I hope to distribute.

Allow range to be none

It would be nice if one could specify a range that is None or ''

To make more clear see this example:

from cyvcf2 import VCF

path = "path/to/my.vcf.gz"
range = '1:1771120-1771130'
vcf = VCF(path)
for v in vcf(range):
    print(v.start, v.end)
>(1771128, 1771129)

range2 = ''
for v in vcf(range2):
    print(v.start, v.end)
>AttributeError                            Traceback (most recent call last)
<ipython-input-13-53ebbebc20cb> in <module>()
----> 1 for variant in vcf(range2):
      2     print(v.start, v.end)
      3

python2.7/site-packages/cyvcf2/cyvcf2.pyx in __call__ (cyvcf2/cyvcf2.c:3255)()
     49
     50         cdef hts_itr_t *itr = tbx_itr_querys(self.idx, region)
---> 51         assert itr != NULL, "error starting query for %s at %s" % (self.name, region)
     52         cdef kstring_t s
     53         cdef bcf1_t *b

AttributeError: 'cyvcf2.cyvcf2.VCF' object has no attribute 'name'

Would that be an easy fix?

gt_types returns empty list after using set_samples

Hi Brent,

I just tried to use set_samples and gt_types in combination. Here's an example VCF:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG004	HG003	HG002
chr1	60906167	.	G	A	.	PASS	.	GT	0/1	0/1	0/0

And here's what I did:

>>> from cyvcf2 import VCF
>>> vcf = VCF('trio.vcf', gts012=True)
>>> for variant in vcf:
...     print(variant.gt_types)
... 
[W::vcf_parse] contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[1 1 0]
>>> vcf = VCF('trio.vcf', gts012=True)
>>> vcf.set_samples('HG003')
>>> for variant in vcf:
...     print(variant.gt_types)
... 
[W::vcf_parse] contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[]

Why am I getting an empty list? Is this a bug or am I doing something wrong?

Add string type field in FORMAT

Is it possible to add a non-numeric FORMAT field?

For example, I'm trying to add a field called ZZ, which is similar to GT. So my VCF would look like:
GT:ZZ 0|1:p|q

Odd genotype resolution when with a single missing allele

There seems to be a misclassification when resolving a genotype containing a single missing allele. Genotypes of 0/., ./0, 1/., or ./1 seem to be classified as heterozygous ("HET" or id: 1) instead of "UNKNOWN" (or id: 2). Is this supposed to be the intended behavior for cyvcf2?

Shown below is an example test case I've constructed.

Test VCF File: test-cyvcf2-genotype-call.v2.vcf.gz

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=*:DEL,Description="Represents any possible spanning deletion allele at this location">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=MT,length=16569>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  foo1    foo2    foo3    foo4    foo5    foo6    foo7    foo8
MT      310     .       T       C       1.83648e+07     PASS    .       GT      0/0     1/1     0/1     1/0     1/.     ./1     0/.     ./0

Python genotype calling code:

#!/usr/bin/env python

from __future__ import print_function
import os

from cyvcf2 import VCF
import numpy as np

testvcf = os.path.join(os.environ['HOME'], 'test-cyvcf2-genotype-call.v2.vcf.gz')

vcf = VCF(testvcf)
variant = next(vcf)

print('variant.gt_bases: {}'.format(variant.gt_bases))
print('variant.gt_types: {}'.format(variant.gt_types))

And the corresponding output:

variant.gt_bases: ['T/T' 'C/C' 'T/C' 'C/T' 'C/.' './C' 'T/.' './T']
variant.gt_types: [0 3 1 1 1 1 1 1]

Shouldn't the above variant.gt_types be [0 3 1 1 2 2 2 2] ?

get a vep/snpeff dict

I know this is a bit of a hack but since annotations with vep or snpeff is common I'll give it a go and see what you think.

Would it be possible to create a dictionary(like) structure for variants based on the given annotations mentioned above as default for variants?

Or would it be possible to store the vep/snpeff header in a list in the vcf object like:

vcf = cyvcf2.VCF("some.vcf")
vcf.VEP_HEADER
> ['Symbol', 'BIOTYPE', ...]

Stripping of quotation marks in VCF header description

header_iter keeps the double quotation marks of the original VCF description:

Example taken from test.vcf.gz:

{'HeaderType': 'INFO', 'Description': '"Mean of all GQ values"', 'Type': 'Float', 'Number': '1', 'ID': 'GQ_MEAN'}
{'HeaderType': 'FORMAT', 'Description': '"Allelic depths for the ref and alt alleles in the order listed"', 'Type': 'Integer', 'Number': 'R', 'ID': 'AD'}

Since the data is represented as a string in cyvcf2/python anyway, there should be no need to keep them. Should header_iter strip the quotation marks automatically, i.e. return 'Mean of all GQ values' rather than '"Mean of all GQ values"'?

fail to install

Hi there,

I am running into an installation error (shown as below) and hopefully you can point me to some directions to resolve it. This is an attempt to install cyvcf2 at user's home directory on a RHEL6 (cluster) system and under anaconda v2.3.0

Thanks!

==========================

$ python setup.py install --prefix=~/.local

/net/gs/vol3/software/modules-sw/gcc/4.9.1/Linux/RHEL6/x86_64/bin/gcc -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/net/gs/vol3/software/modules-sw/samtools/1.1/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/bwa/0.7.3/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/boost/1.58.0/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/htslib/1.3.2/Linux/RHEL6/x86_64/include -I/net/gs/vol3/software/modules-sw/tabix/0.2.6/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/zlib/1.2.6/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/R/3.2.1/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/gcc/4.9.1/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/gmp/5.0.2/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/mpfr/3.1.0/Linux/RHEL6/x86_64/include/ -I/net/gs/vol3/software/modules-sw/mpc/0.8.2/Linux/RHEL6/x86_64/include/ -fPIC -Ihtslib -Icyvcf2 -I/net/eichler/vol6/software/modules-sw/anaconda/2.3.0/Linux/RHEL6/x86_64/envs/python3/lib/python3.4/site-packages/numpy/core/include -I/net/eichler/vol6/software/modules-sw/anaconda/2.3.0/Linux/RHEL6/x86_64/envs/python3/include/python3.4m -c htslib/cram/cram_io.c -o build/temp.linux-x86_64-3.4/htslib/cram/cram_io.o
htslib/cram/cram_io.c:2158:17: warning: function declaration isn’t a prototype [-Wstrict-prototypes]
 static unsigned get_int_threadid() {
                 ^
htslib/cram/cram_io.c: In function ‘cram_set_voption’:
htslib/cram/cram_io.c:4481:10: error: ‘CRAM_OPT_LOSSY_NAMES’ undeclared (first use in this function)
     case CRAM_OPT_LOSSY_NAMES:
          ^
htslib/cram/cram_io.c:4481:10: note: each undeclared identifier is reported only once for each function it appears in
error: command '/net/gs/vol3/software/modules-sw/gcc/4.9.1/Linux/RHEL6/x86_64/bin/gcc' failed with exit status 1

can't extract region without tabix or csi index

Hi there,
I followed the cyvcf2 code below, using my unzipped "mpg.snv.vcf" or "mpg.div.vcf" files but I got the error saying: "can't extract region without tabix or csi index"

1 import cyvcf2
2 import numpy as np
3 vcf = cyvcf2 . VCF ( path )
4 sample_counts = np. zeros (len (vcf . samples ), dtype
= float )
5 for variant in vcf(" chr1 :229993 -329993 "):
6 if variant . is_indel : continue
7 if variant . QUAL < 10: continue
8 depths = variant . format ("AD") # get all
depths .
9 sample_counts [( depths [:, 1] > 10) & ( variant
. gt_types == vcf .HET )] += 1
10 print (zip ( vcf . samples , sample_counts ))

I also tried with "mpg.snv.vcf.gz.tbi" file and got "Exception: bad path" error.
What am I doing wrong?

thanks

Python3 support

When trying to install this package with pip3, one gets an error message: "cyvcf2 is only for python 2.7".
I am interesting to port it to Python3. Have you already tried ? Was there any special difficulty ? Usually there are very few changes to make, and work has already been done in this direction (there is a PY_MAJOR_VERSION < 3 check for strings encoding).

This is in order to have vcf2db usable in modern Python programs.

Allow changing sample names

I'm filing this issue because despite the plethora of VCF reading/writing libraries for Python, no one of them implements this. (I filed a similar request for pysam).

how can we accomplish something like this ?

 for i in data:      
                 arrChr = i.split("\t")[0]
                 arrVal = i.split("\t")[1]
                 print (arrChr, arrVal)
                 for v in vcf(arrChr:arrVal-arrVal):
                              print (v.INFO["AF"])

above is throwing SyntaxError: invalid syntax error.
Any idea how to query like this dynamically?

genotype column unnecessarily modified with trailing missing values (".")

This:

chr1 125179503 . A AT . PASS . GT:GQ:GQX:DPI:AD:ADF:ADR:PL . 1/1:24:60:9:0,9:0,7:0,2:204,27,0 .

gets modified to:

chr1 125179503 . A AT . PASS . GT:GQ:GQX:DPI:AD:ADF:ADR:PL .:.:.:.:.:.:.:. 1/1:24:60:9:0,9:0,7:0,2:204,27,0 .:.:.:.:.:.:.:.

CSI index not recognized / used with BCF.gz files

Hi Brent,

We store our larger variant callings files in BCF.gz format. Because of the binary format bcftools queries on those files are much faster then on the same file in vcf.gz format.

When trying to select variants based on positions from that file cyvcf2 throws an error about a missing tbi index.

from cyvcf2 import VCF
reader = VCF("my_variant_calling.bcf.gz")
reader("Chr_00:252")
variant = next(reader("Chr_01:252"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cyvcf2/cyvcf2.pyx", line 355, in __call__ (cyvcf2/cyvcf2.c:9727)
AssertionError: error loading tabix index for b'my_variant_calling.bcf.gz'

A CSI index file already exists from that file, and bcftools view -r Chr_01:252 works.

It is not possible to create a TBI index for a BCF file:

bcftools index --tbi my_variant_calling.bcf.gz
[Warning] TBI-index does not work for BCF files. Generating CSI instead.
[E::main_vcfindex] the index file exists. Please use '-f' to overwrite.

Would it be possible to have cyvcf2 use the CSI index with BCF files? cyvcf2 is able to loop over the variants in the BCF file, so it understand BCF, just not the CSI index it seems.

Otherwise one would have to duplicate BCF files to VCF to make custom region queries possible via cyvcf2.

Thank you.

Accessing Format Fields

Hi,

Is there a way to access format fields through this? I decided to try using this over pysam because they currently don't support it.

I thought that perhaps the format method would work, so I tried it on a VCF produced by a simple pileup-like tool.

I have it segfault when given b.format(key, int), where key is any of my integer keys, even though the VCF is produced by htslib and seems valid.

"1\t11167450\t.\tA\tG,T\t0\t.\t.\tADP_ALL:ADP_PASS:ADPD:ADPO:ADPR:AFR:BMF_PASS:BMF_QUANT:FA_FAILED:FM_FAILED:FR_FAILED:PV_FAILED:QSS:RVF:AMBIG:OVERLAP\t755,2,1:588,0,1:231,0,0:360,0,0:415,0,0:0.996042,0.00263852,0.00131926:1,0,0:754,2,1:68,1,0:167,2,0:1,0,0:0,0,0:1280022,0,3077:0.549669,0,0:3:441"

Am I querying this incorrectly?

Memory overwriting (?) issues when using previous v.gt_types

Maybe this is expected behaviour, but it caught me by surprise and took a bit to track down:

Say you were trying to build a gt_types matrix:

from cyvcf2 import VCF
import numpy as np

vcf = VCF('test.vcf.gz')
gts = np.array([v.gt_types for v in vcf])

vcf = VCF('test.vcf.gz')
gts_good = np.array([np.copy(v.gt_types) for v in vcf])

print 'difference: '
print gts-gts_good

print 'gts:'
print gts[:3,:5]
print 'gts_good:'
print gts_good[:3,:5]

The results vary from run to run, and on my mac (but not my linux box) it sometimes works, but when it doesn't you certainly notice:

$ python test.py
difference:
[[  35738129   33686016   33686016 ...,    2818079    1769484    1900569]
 [     11776      11776      11776 ...,          0          0         46]
 [         0    3014656          0 ...,     344576 1895851264 1711276036]
 ...,
 [         0          0          0 ...,          0          0          0]
 [         0          0          0 ...,          0          0          0]
 [         0          0          0 ...,          0          0          0]]
gts:
[[35738129 33686018 33686018 33686018 33686018]
 [   11776    11776    11776    11776    11776]
 [       0  3014656        0        0       46]]
gts_good:
[[0 2 2 2 2]
 [0 0 0 0 0]
 [0 0 0 0 0]]

This doesn't seem to happen with other returned numpy arrays like gt_quals or the depths.

A way to quickly identify called / non called samples

I do a lot of sample-based filtering and excluding called/non called samples for a determined locus is an important step. Currently there's just the num_called property, but it doesn't tell which samples are called and which are not.

Possibly a numpy array of booleans called called might work?

How to check for presence of INFO or FORMAT fields in VCF that have the same ID?

Hi Brent,

I know how to check for the presence of a VCF attribute by ID:

from cyvcf2 import VCF
reader = VCF("my.vcf.gz")
"AC" in reader
True

Sometimes I need to check for the presence of a VCF FORMAT attribute that also has a INFO attribute with the same name. For example INFO/RO and FMT/RO:

##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">

In this case the FMT/RO attribute has been removed but the INFO/RO attribute is still present.

If I just check on the ID I get back true:

from cyvcf2 import VCF
reader = VCF("my.vcf.gz")
"RO" in reader
True

I would like to be able to specifcly check on the INFO and FORMAT attribute:

from cyvcf2 import VCF
reader = VCF("my.vcf.gz")
"INFO/RO" in reader
True
"FMT/RO" in reader
False

Currently this gives back False twice since the INFO and FMT prefix is not understood.

Is it already possible to check specificly for presence of INFO and FMT attributes? Or would it be possible to add this functionality?

I guess it's related to this function:
https://github.com/brentp/cyvcf2/blob/master/cyvcf2/cyvcf2.pyx#L457

Thank you!

AttributeError: attribute 'genotypes' of 'cyvcf2.cyvcf2.Variant' objects is not writable

I'm trying to change the phase information of the genotypes but getting error:

vcf = VCF('temp.vcf.gz')
for v in vcf():
    gtlist = v.genotypes
    for idx in range(len(gtlist)):
        gtlist[idx][len(gtlist[idx])-1] = False
    v.genotypes = gtlist

AttributeError: attribute 'genotypes' of 'cyvcf2.cyvcf2.Variant' objects is not writable

I need to change the phase information and write that to a new VCF file.

EDIT: could it be because of cyvcf2 package version? I am installing cyvcf2 using bioconda.

> conda list | grep cyvcf2
cyvcf2                    0.6.6a                   py27_0    bioconda

> pip freeze | grep cyvcf2
cyvcf2==0.6.5

>pip show cyvcf2
Name: cyvcf2
Version: 0.6.5
Summary: fast vcf parsing with cython + htslib
Home-page: https://github.com/brentp/cyvcf2/
Author: Brent Pedersen
Author-email: [email protected]
License: MIT
Location: /anaconda/envs/snowflakes/lib/python2.7/site-packages
Requires: numpy

Notice the two different versions using conda and pip?

Allow listing the FORMAT types from the header

Useful when you assemble VCFs from different callers and you have to check for caller-specific FORMAT types.

Currently Variant.format() needs the key to be known in advance.

Support for reading phase information

Hi Brent,

Thanks for your effort on cyvcf2. I'm one of the authors of WhatsHap, which uses pyvcf. We are increasingly annoyed by slow VCF processing and consider switching to cyvcf2. I started playing with cyvcf2 today, but couldn't figure out how to read phase information from the GT field. Here's a test VCF that I used:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20170117
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##contig=<ID=chr1>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1	sample2	sample3	sample4	sample5
chr1	15616	.	G	A	.	PASS	.	GT:PS	0/0:.	0/1:.	1/1:.	0|1:15616	1|0:1000

And here's what I tried:

>>> from cyvcf2 import VCF
>>> vcf = VCF('test.vcf')
>>> for record in vcf:
...     print(record, record.gt_types, record.format('GT'), record.format('PS'))
... 
chr1    15616   .       G       A       .       PASS    .       GT:PS   0/0:.   0/1:.   1/1:.   0|1:15616       1|0:1000
 [0 1 3 1 1] ['\x02\x02' '\x02\x04' '\x04\x04' '\x02\x05' '\x04\x03'] [[-2147483648]
 [-2147483648]
 [-2147483648]
 [      15616]
 [       1000]]

So record.format('GT') yields different values for 0/1, 0|1 or 1|0, but I'm not really sure how to interpret them properly.

Is there documentation on this somewhere (my apologies in case I should have missed it)?

Best,
Tobias

Trouble Installing

Hi - I am getting the following error when installing:

    cyvcf2/cyvcf2.c:278:10: fatal error: 'numpy/arrayobject.h' file not found
    #include "numpy/arrayobject.h"
             ^
    1 error generated.
    error: command 'clang' failed with exit status 1

    ----------------------------------------
Command "/usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;__file__='/private/var/folders/s7/7x89ld8j1570grbn6_9sqlj40000gn/T/pip-build-AV3Ff2/cyvcf2/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/s7/7x89ld8j1570grbn6_9sqlj40000gn/T/pip-PjCxnQ-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/s7/7x89ld8j1570grbn6_9sqlj40000gn/T/pip-build-AV3Ff2/cyvcf2

I have numpy installed and am using homebrew.

numpy version 1.9.2
python version 2.7.10

cyvcf2 pip install fails on travis-ci

Hi - I'm not sure if this is an error on my part (it may very well be). I'm incorporating cyvcf2 into a package I am developing and am able to install it fine locally. However, I recently started playing around with Travis CI and am unable to install cyvcf2 using the service. This is the error I am getting:

      File "<string>", line 20, in <module>

      File "/tmp/pip-build-fY1AZz/cyvcf2/setup.py", line 4, in <module>

        from Cython.Distutils import build_ext

    ImportError: No module named Cython.Distutils

Any idea why this is happening? I haven't been able to turn up much explanation googling around. And I'm not familiar with cython.

This log illustrates the issue in greater detail: https://travis-ci.org/AndersenLab/vcf-toolbox/builds/84329365

Add flag to vcf INFO field

Is it possible to add a Flag with Number=0 to a Variant?

installation using pip fails

When trying to install cyvcf2 using pip:
sudo pip3 install cyvcf2
the installation works, but unfortunately importing fails with the following error message:

In [2]: from cyvcf2 import VCF                                                                 
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-b1d6330f36e2> in <module>()
----> 1 from cyvcf2 import VCF

home/bpeter/.local/lib/python3.4/site-packages/cyvcf2/__init__.py in <module>()
----> 1 from .cyvcf2 import (VCF, Variant, Writer, r_ as r_unphased, par_relatedness,
      2                      par_het)
      3 Reader = VCFReader = VCF
      4 
      5 __version__ = "0.6.4"

ImportError: No module named 'cyvcf2.cyvcf2'

installing directly from github worked fine, however.

Seqfault when opening corrupted VCF

Trying to open a corrupted VCF (or alternatively, a file that is no VCF at all) causes a segfault:

cyvcf2.VCF("no-valid.vcf")

[E::bcf_hdr_read] invalid BCF2 magic string
[1]    89493 segmentation fault  ipython

I'm not sure if this is a rather a problem of htslib than cyvcf2, but is there a way to guard against this? Ideally, one could capture this as an IOError exception.

Install fails on centos

Hi,

the conda release (bioconda) seems to be lagging behind a little and I wanted to try installing using pip. However, I run into a "gcc" error:

...
  gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ihtslib -Icyvcf2 -Isetup-requires/numpy/core/include -I/home/glsai/miniconda2/envs/cyvcf2-test/include/python3.6m -c htslib/vcfutils.c -o build/temp.linux-x86_64-3.6/htslib/vcfutils.o
  gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Ihtslib -Icyvcf2 -Isetup-requires/numpy/core/include -I/home/glsai/miniconda2/envs/cyvcf2-test/include/python3.6m -c htslib/tbx.c -o build/temp.linux-x86_64-3.6/htslib/tbx.o
  In file included from htslib/tbx.c:33:
  htslib/htslib/bgzf.h:35:18: error: zlib.h: No such file or directory
  In file included from htslib/tbx.c:33:
  htslib/htslib/bgzf.h:69: error: expected specifier-qualifier-list before ‘z_stream’
  error: command 'gcc' failed with exit status 1

I'm running on:

Linux version 2.6.32-642.11.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) ) #1 SMP Fri Nov 18 19:25:05 UTC 2016

I don't have sudo access on this cluster so I can't install dev packages so easy. Would you recommend installing using conda in this case? Numpy (the only dependency?) seems to install fine btw 🙂

Implement methods for modifying the API - Adding filters

Following up on #4, It would be nice to be able to edit the header of vcf files in addition to adding/editing filter/info/format data. The implementation might look like this, although you may know of a better way:

v = VCF("my_vcf.vcf.gz")
# Here I am adding a format field to mark snps as validated
v.header.INFO["VALID"] = {"Number": 1, "Type":"Flag", "Description": "flag indicating a snp/indel has been validated"}

for variant in v:
    if variant in validated_variants:
        variant.INFO["VALID"] = True

# Adding a filter
v.header.FILTER["low_complexity_region"] = {"Number": 1, "Type":"Flag", "Description": "flag indicating a snp/indel observed in a low complexity region."}

for variant in v:
    if variant in low_complexity_regions:
        variant.FILTER.append("low_complexity_region") # Appends to existing filters

I also have been having trouble accessing VCFs by region. I can open as a separate issue. Python crashes when I attempt to do something like:

v = VCF("test.vcf.gz", "I:1-10000")

Accessing Variant num_called property causes segmentation fault

First of all, thanks for the speedy package, @brentp. Much appreciated!

Now, about the issue: whenever I access the num_called property of a Variant instance, Python crashes with a segmentation fault. The VCF file was generated by the Strelka SNV and indel caller. I'm not familiar with Cython, so I can't readily submit a pull request. I figured this might be an easy fix. Then again, don't we all start by thinking that...

Steps to reproduce error

Here's a one-variant test VCF file. I had to change the file extension to .txt to upload it to GitHub.

I'm using cyvcf2 v0.5.1 on Python v3.5.2.

>>> from cyvcf2 import VCF
>>> vcf = VCF('test.txt')
>>> var = next(vcf)
>>> var.num_called
[1]    43035 segmentation fault  python

Let me know if you need anything else!

Single Sample VCF errors with because .n_samples is not calculated properly?

Hope this helps!

The VCF below throws an error. It has a single sample. Trigger the error by running:

from cyvcf2 import VCF
v = VCF("in.vcf")
for record in v:
    record.gt_bases()

Results in:

ValueError: range() step argument must not be zero

Traceback - slightly different code; but same idea

Traceback (most recent call last):
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 2000, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 1991, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 1567, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask/app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/var/pyenv/versions/2.7.11/envs/cendr/lib/python2.7/site-packages/flask_debugtoolbar/__init__.py", line 125, in dispatch_request
    return view_func(**req.view_args)
  File "/Users/dancook/Documents/git/cendr/cendr/views/api/variant.py", line 93, in variant_api
    gt_set = zip(v.samples, record.gt_bases.tolist(), record.gt_types.tolist(), record.format("FT").tolist())
  File "cyvcf2/cyvcf2.pyx", line 942, in cyvcf2.cyvcf2.Variant.gt_bases.__get__ (cyvcf2/cyvcf2.c:22323)

The offending line
https://github.com/brentp/cyvcf2/blob/master/cyvcf2/cyvcf2.pyx#L942

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=I,length=15072434>
##contig=<ID=II,length=15279421>
##contig=<ID=III,length=13783801>
##contig=<ID=IV,length=17493829>
##contig=<ID=V,length=20924180>
##contig=<ID=X,length=17718942>
##contig=<ID=MtDNA,length=13794>
##FILTER=<ID=dv_dp,Description="Set if not true: (FORMAT/AD[1])/(FORMAT/DP) >= 0.5 || FORMAT/GT == '0/0'">
##FILTER=<ID=het,Description="Set if true: AC==1">
##FILTER=<ID=high_heterozygosity,Description="Apply filter if FREQUENCY(HET) > 0.1">
##FILTER=<ID=high_missing,Description="Apply filter if FREQUENCY(MISSING) > 0.9">
##FILTER=<ID=mapping_quality,Description="Set if not true: INFO/MQ > 40">
##FILTER=<ID=min_depth,Description="Set if not true: FORMAT/DP > 10">
##FILTER=<ID=quality,Description="Set if not true: QUAL >= 30 || FORMAT/GT == '0/0'">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=phastcons,Number=1,Type=String,Description="calculated by self of overlapping values in column 4 from phastcons.bed.gz">
##INFO=<ID=phylop,Number=1,Type=String,Description="calculated by self of overlapping values in column 4 from phylop.bed.gz">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=HP,Number=1,Type=String,Description="Flag used to mark whether a variant was polarized">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##samtoolsVersion=1.3.1+htslib-1.3.2
##samtoolsCommand=samtools mpileup --redo-BAQ -r I --BCF --output-tags DP,AD,ADF,ADR,INFO/AD,SP --fasta-ref /projects/b1059/data/genomes/c_elegans/WS245/WS245.fa.gz QG556.bam
##reference=file:///projects/b1059/data/genomes/c_elegans/WS245/WS245.fa.gz
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##bcftools_callVersion=1.3.1+htslib-1.3.2
##bcftools_callCommand=call -T sitelist.tsv.gz --skip-variants indels --multiallelic-caller -O z -
##bcftools_concatVersion=1.3.1+htslib-1.3.2
##bcftools_concatCommand=concat -O v QG556.I.union.vcf.gz QG556.II.union.vcf.gz QG556.III.union.vcf.gz QG556.IV.union.vcf.gz QG556.V.union.vcf.gz QG556.X.union.vcf.gz QG556.MtDNA.union.vcf.gz
##VCF-kit(version 0.1.2) command=geno het-polarization
##bcftools_filterVersion=1.3.1+htslib-1.3.2
##bcftools_filterCommand=filter -O u --threads 6 --mode + --soft-filter quality --include 'QUAL >= 30 || FORMAT/GT == '0/0''
##bcftools_filterCommand=filter -O u --threads 6 --mode + --soft-filter min_depth --include 'FORMAT/DP > 10'
##bcftools_filterCommand=filter -O u --threads 6 --mode + --soft-filter mapping_quality --include 'INFO/MQ > 40'
##bcftools_filterCommand=filter -O u --threads 6 --mode + --soft-filter dv_dp --include '(FORMAT/AD[1])/(FORMAT/DP) >= 0.5 || FORMAT/GT == '0/0''
##bcftools_filterCommand=filter --mode + --soft-filter het --exclude AC==1
##bcftools_viewVersion=1.3.1+htslib-1.3.2
##bcftools_viewCommand=view -O z
##bcftools_mergeVersion=1.3.1+htslib-1.3.2
##bcftools_mergeCommand=merge --threads 16 --regions I -O z -m all --file-list union_vcfs.txt
##bcftools_concatCommand=concat -O z I.merged.raw.vcf.gz II.merged.raw.vcf.gz III.merged.raw.vcf.gz IV.merged.raw.vcf.gz V.merged.raw.vcf.gz X.merged.raw.vcf.gz MtDNA.merged.raw.vcf.gz
##bcftools_viewCommand=view merged.raw.vcf.gz
##bcftools_viewCommand=view -O z -
##bcftools_viewVersion=1.4+htslib-1.4
##bcftools_viewCommand=view -O v merged.filtered.vcf.gz; Date=Mon Apr  3 22:03:56 2017
##SnpEffVersion="4.2 (build 2015-12-05), by Pablo Cingolani"
##SnpEffCmd="SnpEff  -no-downstream -no-intergenic -no-upstream WS258 "
##bcftools_viewCommand=view -O z; Date=Mon Apr  3 22:04:04 2017
##bcftools_viewCommand=view -O z; Date=Mon Apr  3 22:38:00 2017
##bcftools_viewVersion=1.3+htslib-1.3
##bcftools_viewCommand=view --force-samples --samples CB4856 http://storage.googleapis.com/elegansvariation.org/releases/20170312/WI.20170312.vcf.gz II:14524173-14525112
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	CB4856
II	14524236	.	C	T	999	PASS	AC=0;AD=0,65;AF=0.294872;AN=2;BQB=0.555202;DP=15777;DP4=4011,6019,1703,2364;HOB=0.5;ICB=1;MQ=60;MQ0F=0;MQB=1;MQSB=1;NS=234;RPB=0.514647;SGB=-0.693147;VDB=0.549834;ANN=T|synonymous_variant|LOW|WBGene00010195|WBGene00010195|transcript|F57C2.3|protein_coding|1/3|c.42C>T|p.Phe14Phe|64/844|42/756|14/251||;phastcons=0.25;phylop=0.582	GT:DP:SP:ADF:ADR:AD:FT:PL:HP	0/0:55:0:42,.:13,.:55,.:PASS:.:.
II	14524299	.	G	A	999	PASS	AC=0;AD=0,65;AF=0.25;AN=2;BQB=0.958856;DP=20099;DP4=8769,3477,3191,1393;HOB=0.5;ICB=1;MQ=60;MQ0F=0;MQB=1;MQSB=1;NS=234;RPB=0.436914;SGB=-0.693147;VDB=1.69288e-07;ANN=A|synonymous_variant|LOW|WBGene00010195|WBGene00010195|transcript|F57C2.3|protein_coding|1/3|c.105G>A|p.Ser35Ser|127/844|105/756|35/251||;phastcons=0.25;phylop=-0.256	GT:DP:SP:ADF:ADR:AD:FT:PL:HP	0/0:87:0:71,.:16,.:87,.:PASS:.:.
II	14524396	.	T	A	999	PASS	AC=0;AD=99,25;AF=0.0769231;AN=2;BQB=1;DP=19446;DP4=10712,5058,1001,485;MQ=60;MQ0F=0;MQB=1;MQSB=1;NS=234;RPB=1;SGB=-0.692914;VDB=0.58164;ANN=A|missense_variant|MODERATE|WBGene00010195|WBGene00010195|transcript|F57C2.3|protein_coding|1/3|c.202T>A|p.Phe68Ile|224/844|202/756|68/251||;phastcons=0.25;phylop=0.702	GT:DP:SP:ADF:ADR:AD:FT:PL	0/0:70:0:50,.:20,.:70,.:PASS:.

Writer writes incorrect GT

Whenever the GT field is missing, Writer writes the genotype as ./0

For example, if my v.genotypes[sampleid] is [-1, False], and if I write using w.write_record(v), the entry shows up as ./0 instead of ./.

EDIT, this happens if I store v.genotypes in another variable, and then put it back. To reproduce the issue, do the following:

vcf = VCF('read.vcf.gz')
w = Writer("write.vcf", vcf)
for v in vcf():
    gtlist = v.genotypes
    v.genotypes = gtlist
    w.write_record(v)
    
w.close()

Memory leak of string conversion

Following the example from the readme, it seems to me converting variants back to their string representation causes a memory leak.

As an example using public data from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/, repeated conversions to strings

import cyvcf2 as vcf
vf = vcf.VCF("ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

for v in vf:
    s = str(v)

increase the memory of the python process to several GB. Warning, this can quickly eat up the entire memory of the machine before all variants in the example have been processed! Without calling str or print, the process would require about 12MB. I have seen the same effect with other VCF files, so I don't think it is related to the data itself.

Can the str function be used differently, without consuming this much memory?

Versions:
python 2.7.12, cyvcf2 0.5.5

Add Variant FORMAT entries

I have had no issues adding new INFO and FORMAT fields to a VCF header, and also can add new INFO fields to individual Variant objects. However, there does not seem to be any way to add new FORMAT items to a Variant object. Do you know of any workarounds for this- perhaps by creating a new Variant object with updated FORMAT fields?

NOTE: This is not especially important, just a feature that would be somewhat useful to me.

brentp / cyvcf2 Goto Github PK

cyvcf2's Introduction

cyvcf2

Example

Installation

pip with bundled htslib

pip with system htslib

windows (experimental, only test on MSYS2)

github (building htslib and cyvcf2 from source)

Testing

CLI

See Also

Performance

cyvcf2's People

Contributors

Stargazers

Watchers

Forkers

cyvcf2's Issues

Steps to reproduce error

Recommend Projects

Recommend Topics

Recommend Org