Giter Site home page Giter Site logo

jamescasbon / pyvcf Goto Github PK

View Code? Open in Web Editor NEW

This project forked from jdoughertyii/pyvcf

396.0 31.0 201.0 1.64 MB

A Variant Call Format reader for Python.

Home Page: http://pyvcf.readthedocs.org/en/latest/index.html

License: Other

Python 100.00%

pyvcf's Introduction

A VCFv4.0 and 4.1 parser for Python.

Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/

The intent of this module is to mimic the csv module in the Python stdlib, as opposed to more flexible serialization formats like JSON or YAML. vcf will attempt to parse the content of each record based on the data types specified in the meta-information lines -- specifically the ##INFO and ##FORMAT lines. If these lines are missing or incomplete, it will check against the reserved types mentioned in the spec. Failing that, it will just return strings.

There main interface is the class: Reader. It takes a file-like object and acts as a reader:

>>> import vcf
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> for record in vcf_reader:
...     print record
Record(CHROM=20, POS=14370, REF=G, ALT=[A])
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])

This produces a great deal of information, but it is conveniently accessed. The attributes of a Record are the 8 fixed fields from the VCF spec:

* ``Record.CHROM``
* ``Record.POS``
* ``Record.ID``
* ``Record.REF``
* ``Record.ALT``
* ``Record.QUAL``
* ``Record.FILTER``
* ``Record.INFO``

plus attributes to handle genotype information:

  • Record.FORMAT
  • Record.samples
  • Record.genotype

samples and genotype, not being the title of any column, are left lowercase. The format of the fixed fields is from the spec. Comma-separated lists in the VCF are converted to lists. In particular, one-entry VCF lists are converted to one-entry Python lists (see, e.g., Record.ALT). Semicolon-delimited lists of key=value pairs are converted to Python dictionaries, with flags being given a True value. Integers and floats are handled exactly as you'd expect:

>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> record = next(vcf_reader)
>>> print record.POS
14370
>>> print record.ALT
[A]
>>> print record.INFO['AF']
[0.5]

There are a number of convenience methods and properties for each Record allowing you to examine properties of interest:

>>> print record.num_called, record.call_rate, record.num_unknown
3 1.0 0
>>> print record.num_hom_ref, record.num_het, record.num_hom_alt
1 1 1
>>> print record.nucl_diversity, record.aaf, record.heterozygosity
0.6 [0.5] 0.5
>>> print record.get_hets()
[Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
>>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
True False True False
>>> print record.var_type, record.var_subtype
snp ts
>>> print record.is_monomorphic
False

record.FORMAT will be a string specifying the format of the genotype fields. In case the FORMAT column does not exist, record.FORMAT is None. Finally, record.samples is a list of dictionaries containing the parsed sample column and record.genotype is a way of looking up genotypes by sample name:

>>> record = next(vcf_reader)
>>> for sample in record.samples:
...     print sample['GT']
0|0
0|1
0/0
>>> print record.genotype('NA00001')['GT']
0|0

The genotypes are represented by Call objects, which have three attributes: the corresponding Record site, the sample name in sample and a dictionary of call data in data:

>>> call = record.genotype('NA00001')
>>> print call.site
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
>>> print call.sample
NA00001
>>> print call.data
CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])

Please note that as of release 0.4.0, attributes known to have single values (such as DP and GQ above) are returned as values. Other attributes are returned as lists (such as HQ above).

There are also a number of methods:

>>> print call.called, call.gt_type, call.gt_bases, call.phased
True 0 T|T True

Metadata regarding the VCF file itself can be investigated through the following attributes:

  • Reader.metadata
  • Reader.infos
  • Reader.filters
  • Reader.formats
  • Reader.samples

For example:

>>> vcf_reader.metadata['fileDate']
'20090805'
>>> vcf_reader.samples
['NA00001', 'NA00002', 'NA00003']
>>> vcf_reader.filters
OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
>>> vcf_reader.infos['AA'].desc
'Ancestral Allele'

ALT records are actually classes, so that you can interrogate them:

>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
>>> _ = next(reader); row = next(reader)
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T

The Reader supports retrieval of records within designated regions for files with tabix indexes via the fetch method. This requires the pysam module as a dependency. Pass in a chromosome, and, optionally, start and end coordinates, for the regions of interest:

>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> # fetch all records on chromosome 20 from base 1110696 through 1230237
>>> for record in vcf_reader.fetch('20', 1110695, 1230237):  # doctest: +SKIP
...     print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])

Note that the start and end coordinates are in the zero-based, half-open coordinate system, similar to _Record.start and _Record.end. The very first base of a chromosome is index 0, and the the region includes bases up to, but not including the base at the end coordinate. For example:

>>> # fetch all records on chromosome 4 from base 11 through 20
>>> vcf_reader.fetch('4', 10, 20)   # doctest: +SKIP

would include all records overlapping a 10 base pair region from the 11th base of through the 20th base (which is at index 19) of chromosome 4. It would not include the 21st base (at index 20). (See http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more information on the zero-based, half-open coordinate system.)

The Writer class provides a way of writing a VCF file. Currently, you must specify a template Reader which provides the metadata:

>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
>>> for record in vcf_reader:
...     vcf_writer.write_record(record)

An extensible script is available to filter vcf files in vcf_filter.py. VCF filters declared by other packages will be available for use in this script. Please see :doc:`FILTERS` for full description.

pyvcf's People

Contributors

alexjironkin avatar alimanfoo avatar amwenger avatar arq5x avatar bow avatar brentp avatar cariaso avatar cgnh avatar chapmanb avatar cmclean avatar datagram avatar dzerbino avatar ericman93 avatar gotgenes avatar ian1roberts avatar jamescasbon avatar jdoughertyii avatar jsmedmar avatar krlk89 avatar libor-m avatar marcelm avatar marcofalcioni avatar martijnvermaat avatar mattions avatar msabramo avatar redmar-van-den-berg avatar rwness avatar sambrightman avatar seandavi avatar superbobry avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pyvcf's Issues

example-4.1-sv.vcf appears to be malformed

I am currently working on support for SVs in PyVcf. In testing a simple parsing of the example SV file, I find that it fails at the first record because the GT field is defined as Integer in this file.

##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">

This is in contrast to the typical spec for the GT field (String) that is found in every other VCF file I have ever seen. I checked the 1000 Genomes website for 4.1 and this file matches what is on their website, but I suspect it's a mistake. As such, I have emailed the VCF spec mailing list for clarification.

It may be that we need to update the test file in order to develop SV support.

UPDATE: it appears that this is indeed a mistake on the part of the folks that devised the example:
http://sourceforge.net/mailarchive/message.php?msg_id=29084484

PyPI 0.4.0 installation issues via setuptools

This is likely owing to my naivete, but I am working on another tool that depends upon PyVcf. As such, in my setup.py I have:

install_requires=['numpy', 'pyparsing', 'pysam', 'pyvcf>=0.4.0'],

When running setup.py develop/install, this causes the script to search PyPI for the latest version of PyPI. It succeeds in finding it, but when trying to install it, I get the following error. I am really, really new to setuptools, so I don't have many ideas here, but it looks like there is a problem compiling the pysam or tabix Extension module. Any ideas?

The same error occurs with (upgrading from 0.4.0pre to 0.4.0):

easy_install -U pyvcf

The error:

Processing dependencies for pop==0.1.0
Searching for pyvcf>=0.4.0
Reading http://pypi.python.org/simple/pyvcf/
Reading https://github.com/jamescasbon/PyVCF
Best match: PyVCF 0.4.0
Downloading http://pypi.python.org/packages/source/P/PyVCF/PyVCF-0.4.0.tar.gz#md5=165224e91a38f5a6c0fa3d63b8258b58
Processing PyVCF-0.4.0.tar.gz
Running PyVCF-0.4.0/setup.py -q bdist_egg --dist-dir /var/folders/tO/tOq24FDAF-WH1NXQLRXlhk+++TM/-Tmp-/easy_install-kE7hJH/PyVCF-0.4.0/egg-dist-tmp-1zQvoE
Traceback (most recent call last):
  File "setup.py", line 45, in <module>
    'Topic :: Scientific/Engineering :: Bio-Informatics']
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "build/bdist.linux-i686/egg/setuptools/command/develop.py", line 27, in run
  File "build/bdist.linux-i686/egg/setuptools/command/develop.py", line 102, in install_for_development
  File "build/bdist.linux-i686/egg/setuptools/command/easy_install.py", line 519, in process_distribution
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 563, in resolve
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 799, in best_match
  File "build/bdist.linux-i686/egg/pkg_resources.py", line 811, in obtain
  File "build/bdist.linux-i686/egg/setuptools/command/easy_install.py", line 446, in easy_install
  File "build/bdist.linux-i686/egg/setuptools/command/easy_install.py", line 476, in install_item
  File "build/bdist.linux-i686/egg/setuptools/command/easy_install.py", line 655, in install_eggs
  File "build/bdist.linux-i686/egg/setuptools/command/easy_install.py", line 930, in build_and_install
  File "build/bdist.linux-i686/egg/setuptools/command/easy_install.py", line 919, in run_setup
  File "build/bdist.linux-i686/egg/setuptools/sandbox.py", line 62, in run_setup
  File "build/bdist.linux-i686/egg/setuptools/sandbox.py", line 105, in run
  File "build/bdist.linux-i686/egg/setuptools/sandbox.py", line 64, in <lambda>
  File "setup.py", line 12, in <module>
    m.Extension.__dict__ = m._Extension.__dict__
  File "/var/folders/tO/tOq24FDAF-WH1NXQLRXlhk+++TM/-Tmp-/easy_install-kE7hJH/PyVCF-0.4.0/vcf/__init__.py", line 166, in <module>
  File "/var/folders/tO/tOq24FDAF-WH1NXQLRXlhk+++TM/-Tmp-/easy_install-kE7hJH/PyVCF-0.4.0/vcf/parser.py", line 10, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pysam-0.6-py2.7-macosx-10.6-intel.egg/pysam/__init__.py", line 1, in <module>
    from csamtools import *
  File "csamtools.pxd", line 60, in init csamtools (pysam/csamtools.c:32571)
TypeError: __builtin__.file is not a type object

Drop pre-compute of bases and phase

Please see branch 'performance'.

I have made a number of tweaks which result in 37% faster parsing of the 1kg VCF (see vcf/test/prof.py).

The breaking change is that most of the time is spent precomputing these. @arq5x this was your change, are you bothered if it goes?

Performance benchmarks

Does PyVCF have any? I'm trying to parse dbSNP release with the latest version and it basically takes forever. So having some kind of performance metric would be handy for further development.

Check which properties are never lists

We currently return almost every property in a _Call as a list. Some of them never are (like GQ, DP) according to the example files with the spec.

We should return values for these properties, rather than lists.

Cythonize parser

As expected, looping through the samples in _parse_samples is rather slow for files with many samples (e.g., [1]). It seems that the main bottlenecks are Python's split() and the fact that looping in Python is known to be dreadfully slow.

One potential solution would be to port the slowest looping bits to Cython. Another would be to work with BCF.

Thoughts? I think dealing with deep and wide files will be crucial for the future.

[1] ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz

Options for reporting genotypes.

Currently, we can do the following:

>>> for sample in record.samples:
...     print sample['GT']
'1|2'
'2|1'
'2/2'

It would be nice to have a built in method that looks at the ref and alt alleles and converts the encoded genotypes into DNA alleles (GTS == genotypes using Sequence).

>>> for sample in record.samples:
...     print sample['GTS']
'A|C'
'C|A'
'C/C'

Also, an option that returns the standard numeric encoding for genotypes: 0 == hom_ref, het == 1, hom_alt == 2, unknown (./.) == -1. This would allow one to easily compute useful popgen statistics such as HWE, pi_hat, and conduct multi-dimensional scaling comparisons.

>>> for sample in record.samples:
...     print sample['GTN']
1
2
0 
-1
etc.

Support VCF 4.1

Some new metadata in VCF 4.1 spec, notably contigs.

Added test data and tests, need to write code for this.

Problem reading vcf file.

Hi

I'm having trouble reading my vcf file with PyVCF. Any help would be useful the file I'm trying to read was created by vcf tools.

Here is the code:

import vcf
vcf_reader = vcf.Reader(open('all.mergedSNPs.vcf','r'))
for rec in vcf_reader:
print rec

But I get this error
Traceback (most recent call last):
File "", line 1, in
File "/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.7-intel.egg/vcf/parser.py", line 443, in next
pos = int(row[1])
IndexError: list index out of range

contig header lines are malformed and breaks GATK VariantAnnotator

Apologies if this issue has already been dealt with in a more recent version ...

When trying to add effects generating by snpEff to PyVCF filtered file using GATK VariantAnnotator receive following error

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.6-13-g91f02df):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Invalid VCFSimpleHeaderLine: key=contig name=null
##### ERROR ------------------------------------------------------------------------------------------

It appears to be the result of improperly written contig header lines

PyVCF writes as a dictionary

 (amp)[ir210@beast vars]$ grep 'contig' gatk_run2_vars.vcf
##contig={'length': '249250621', 'assembly': 'hg19', 'ID': 'chr1'}
##contig={'length': '135534747', 'assembly': 'hg19', 'ID': 'chr10'}
##contig={'length': '135006516', 'assembly': 'hg19', 'ID': 'chr11'}

but they should take the following form :

(amp)[ir210@beast vars]$ grep 'contig' gatk_t1-t2_raw.vcf
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr10,length=135534747,assembly=hg19>
##contig=<ID=chr11,length=135006516,assembly=hg19>
##contig=<ID=chr11_gl000202_random,length=40103,assembly=hg19>

lazy parsing of info

often times, we'll filter variants on the quality or something.
I think it'd be nice to send the info to _Record as a string and
have .INFO be a property that parses the string on first access.

This could improve the speed.

Crash when QUAL = .

I have some sites in my data that have no quality, because no reads mapped there. For example:

scaffold_1 1330 . T . . . Dels=0.01;MQ=6.79;MQ0=0 GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
scaffold_1 1331 . C . . . Dels=0.01;MQ=6.70;MQ0=0 GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
scaffold_1 1332 . G . . . Dels=0.00;MQ=6.70;MQ0=0 GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
scaffold_1 1333 . A . . . Dels=0.00;MQ=6.58;MQ0=0 GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.

The parser breaks on line 444. I've modified my version to give a 0 quality to these records, as a quick fix:

    qual = 0 if row[5] == '.' else (float(row[5]) if '.' in row[5] else int(row[5]))

Error in reading VCF

I'm trying to parse a VCF (generated by GATK and modified with snpEff and SnpSift). Unfortunately I'm not able to read any record because an unhandled exception:

In [29]: a = vcf.Reader(open("/Users/dawe/Downloads/Tier2.vcf", 'r'))

In [30]: for rec in a: print rec
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-e05998bcc130> in <module>()
----> 1 for rec in a: print rec

/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.8-intel.egg/vcf/parser.pyc in next(self)
    462         if filt == 'PASS':
    463             filt = None
--> 464         info = self._parse_info(row[7])
    465 
    466         try:

/Library/Python/2.7/site-packages/PyVCF-0.6.0-py2.7-macosx-10.8-intel.egg/vcf/parser.pyc in _parse_info(self, info_str)
    312             try:
    313                 if self.infos[ID].num == 1 and entry_type != 'String':
--> 314                     val = val[0]
    315             except KeyError:
    316                 pass

TypeError: 'bool' object is not subscriptable

bayes error filter removes all INDELs

May be I've misunderstood how the filter works, so might be my application error rather than a real issue.
Essentially, miseq amplicon data processed through GATK pipeline, calling glm=BOTH generates VCF that contains some nice INDELs

Post filtering thus ...

vcf_filter.py --site-quality=99 --genotype-quality=99 --eblr=-10  gatk_raw.vcf sq mgq eb > gatk_filtered.vcf

Contains no INDELs. I thought about it, and wondered if its appropriate to apply this filter to INDELs in tumour samples where the distribution of allele depths isn't expected to be negative binomial. So I added an argument to the filter to exclude it from variants typed INDEL

I hacked the ErrorBiasFilter to include the extra argparse:

parser.add_argument('--no-indels', type=bool, default=True,
                help='Apply filter to SNPs only')

and in init

self.no_indels = args.no_indels

and in call

if self.no_indels and record.is_indel:
            return None

Wanted to know if this is useful before I commit, or have I missed how this filter should work. Thanks, Ian

Incorrect uncalled call writing

Sometime between June 21 and now, the first sample in the last row of walk_left.vcf started being written as "./." instead of "./.:35:4".

Probably something I did, so I'll look into it asap.

But it makes me curious - if the call isn't called, is the data associated with it useful? I recall grepping the test directory for ./.: followed by a digit and only finding this file - is the ./.:35:4 something that never appears in nature?

If this is the case, would it be more correct to print "./.:.:."?

Support contig based sort

Where contigs are declared in the file, allow sort based on the contig order.

Prevents us having to guess the ordering of a mix of ints, strings, and non zero padded stuff.

Refactor INFO parsing code.

The INFO parsing code is very similar to the sample parsing code, and currently ignores the 'Number' property.

Refactor to use the Sample parsing code.

vcf.Writer outputs mangled vcf header

When I try out the following code the order of the lines in the original vcf file header is scrambled. The biggest problem with this is that the ##fileformat=VCFv4.1 line gets moved someplace else, and so GATK doesn't automatically recognize the file as a vcf. There may be a workaround by passing an option to GATK specifying the format, but really this is a problem with pyvcf, not GATK. Please update the next version so that the organization of the header is kept intact in the output vcf.

Python code:
"
import vcf
reader=vcf.Reader(filename='test.vcf')
writer=vcf.Writer(open('test.out.vcf', 'w'), reader)
for record in reader:
writer.write_record(record)
"

input vcf:
"

fileformat=VCFv4.1

INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">

INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">

INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">

INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">

INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">

INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">

INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">

INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">

INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">

INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">

INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">

INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">

INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">

ALT=<ID=DEL,Description="Deletion">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">

FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihoods">

INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">

INFO=<ID=AF,Number=1,Type=Float,Description="Global Allele Frequency based on AC/AN">

INFO=<ID=AMR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AMR based on AC/AN">

INFO=<ID=ASN_AF,Number=1,Type=Float,Description="Allele Frequency for samples from ASN based on AC/AN">

INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">

INFO=<ID=EUR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from EUR based on AC/AN">

INFO=<ID=VT,Number=1,Type=String,Description="indicates what type of variant the line represents">

INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">

source_20120714.1=vcf-subset(r731) -c NA18505,NA18508,NA19648,NA19704, downloads/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA18505 NA18508 NA19648 NA19704

1 10583 rs58108140 G A 100 PASS AA=.;AC=0;AF=0.14;AFR_AF=0.04;AMR_AF=0.17;AN=8;ASN_AF=0.13;AVGPOST=0.7707;ERATE=0.0161;EUR_AF=0.21;LDAF=0.2327;RSQ=0.4319;SNPSOURCE=LOWCOV;THETA=0.0046;VT=SNP GT:DS:GL 0|0:0.000:-0.07,-0.80,-5.00 0|0:0.000:-0.01,-1.68,-5.00 0|0:0.100:-0.03,-1.15,-5.00 0|0:0.150:-0.19,-0.45,-2.42
"

output vcf:
"

ALT=<ID=DEL,Description="Deletion">

fileformat=VCFv4.1

source_20120714.1=vcf-subset(r731) -c NA18505,NA18508,NA19648,NA19704, downloads/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">

INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">

INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">

INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">

INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">

INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">

INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">

INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">

INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">

INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

INFO=<ID=AF,Number=1,Type=Float,Description="Global Allele Frequency based on AC/AN">

INFO=<ID=ASN_AF,Number=1,Type=Float,Description="Allele Frequency for samples from ASN based on AC/AN">

INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">

INFO=<ID=VT,Number=1,Type=String,Description="indicates what type of variant the line represents">

INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">

INFO=<ID=EUR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from EUR based on AC/AN">

INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">

INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">

INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">

INFO=<ID=AMR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AMR based on AC/AN">

INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">

INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">

FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihoods">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA18505 NA18508 NA19648 NA19704

1 10583 rs58108140 G A 100 . AA=.;AVGPOST=0.7707;AC=0;AF=0.14;ASN_AF=0.13;AFR_AF=0.04;AMR_AF=0.17;ERATE=0.0161;AN=8;LDAF=0.2327;VT=S;SNPSOURCE=LOWCOV;THETA=0.0046;RSQ=0.4319;EUR_AF=0.21 GT:DS:GL 0|0:0.000:-0.07,-0.8,-5.0 0|0:0.000:-0.01,-1.68,-5.0 0|0:0.100:-0.03,-1.15,-5.0 0|0:0.150:-0.19,-0.45,-2.42
"

Decide python 3 strategy

Should we try and use the same codebase or use one of the tools? Should we use six to help with the codebase? Which versions of python2 so we need to support?

I ran 2to3 and applied a few more fixes on the python3 branch. Nothing much to worry about except next(gen) rather than gen.next(), also some utf problems with the gzip code.

_Record attributes for variant type and sub_type

I think it would be very helpful to add attributes to the _Record API that report:

  1. The variant type (e.g., "snp", "indel", "sv").
  2. The variant sub_type:
    • if variant == "snp"
      "ts" for transition
      "tv" for transversion
    • if variant == "indel"
      "del" for deletion
      "ins" for insertion

etc.

If you are open to this, I have some existing code for this that I could incorporate.

Order of header lines messed up

If I open a VCF file and I immediately write it back to a different file with code like this:
vcf_reader = vcf.Reader(infile)
vcf_writer = vcf.Writer(outfile,vcf_reader)
I get that the order of things in the headers is now messed up. If at first I am more like:

fileformat=VCFv4.1

FORMAT=...

INFO=...

contig=...

reference=...

In the output file things are more like:

contig=...

reference=...

contig=...

fileformat=VCFv4.1

contig=...

INFO=...

FORMAT=...

Is this even allowed by the VCF format? Is there a way to control this behavior?

vcf_filter converts reference, sample, and vcfProcessLog fields incorrectly

When vcf_filter writes the filtered VCF file, it seems to use python style serialization on some of the header fields, rather than the proper VCF style. For example, a ##reference field looked like this in the original file:

##reference=<ID="hg36.1 m5:9ebc6df9496613f373e73396d5b3b6b6 sp:homo sapiens",source="http://www.hgsc.bcm.tmc.edu/collaborations/human-reference/hsap36.1-hg18.fasta">

...ends up looking like this in the output:

##reference={'source': '"http://www.hgsc.bcm.tmc.edu/collaborations/human-reference/hsap36.1-hg18.fasta"', 'ID': '"hg36.1 m5:9ebc6df9496613f373e73396d5b3b6b6 sp:homo sapiens"'}

SAMPLE is similarly affected, as well as the fields ##vcfProcessLog which gets added.

For reference, the command line I used to call vcf_filter was:

vcf_filter.py examples/example1.vcf dps --depth-per-sample 7 > tmp.vcf

add/edit call data

I'd really like a way to add and edit format fields to the sample call data when writing a VCF file. Specifically, what I'm looking to do is add the 'FT' format field, then fill it in for each call. Any suggestions for a workaround much appreciated.

Order of INFO fields random

I am reporting something that is not properly an issue, but I realized that the order of the fields in the INFO column of the VCF file is kind of random. I am obtaining this result since I am using PyVCF to populate some new INFO fields in a given VCF file (PyVCF works great for this purpose) and I realize that these new INFO fields are inserted in random order. The UnifiedGenotyper from GATK seems instead to neatly outputting them in alphabetical order. I suggest two possible options:

  1. The order of the fields in the INFO column respect the order of the INFO fields in the header.
  2. The order of the fields, both in the header and the INFO column be alphabetically sorted.

support for negative number in FORMAT (patch).

Otherwise, can get an error on output from samtools.

diff --git a/vcf.py b/vcf.py
index e52ebd9..103bbef 100644
--- a/vcf.py
+++ b/vcf.py
@@ -130,7 +130,7 @@ class _vcf_metadata_parser(object):
             >''', re.VERBOSE)
         self.format_pattern = re.compile(r'''\#\#FORMAT=<
             ID=(?P<id>.+),
-            Number=(?P<number>\d+|\.|[AG]),
+            Number=(?P<number>-?\d+|\.|[AG]),
             Type=(?P<type>.+),
             Description="(?P<desc>.*)"
             >''', re.VERBOSE)

Problem reading the QUAL variable

I got the following error:

File "/usr/local/lib/python2.7/dist-packages/PyVCF-0.4.4-py2.7.egg/vcf/parser.py", line 745, in next
qual = float(row[5]) if '.' in row[5] else int(row[5])
ValueError: invalid literal for int() with base 10: '1e+03'

it is clear what going wrong here. PyVCF thinks the absence of a '.' is enough to exclude the QUAL field from being a float, but this is not the case as '1e+03' is a float and does not contain a '.'

Bug while reading 1000 genomes data

It seems that this lib fails to parse file:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz

Code:

import vcf
import sys

vcfReader = vcf.Reader(open(sys.argv[1]))
for record in vcfReader:
print record

Fails with:

Traceback (most recent call last):
File "C:\Users\HP\My Documents\Aptana Studio 3 Workspace\ProcSNP\CalcProb.py", line 5, in
for record in vcfReader:
File "C:\Program Files\Python27\lib\site-packages\vcf\parser.py", line 872, in next
info = self._parse_info(row[7])
File "C:\Program Files\Python27\lib\site-packages\vcf\parser.py", line 717, in _parse_info
val = entry[1]
IndexError: list index out of range

Python 2.7, Win 7 x64.

Collect stats while parsing calls

In the loop where we parse sample calls, collect aggregate numbers so that we do not need to iterate again to see, eg, call rate.

Attach this to the _Record object.

Prevent cmp on list values in Call data

If I have a list of values, say AD=10,12 and in some downstream code I say:

if call.data['AD'] > 10: 
   do_something()

Then I get unknown behaviour since python implements comparisons for all basic types. I want this to raise an exception so that I know my filter is bad.

This will be much worse since we fixed #19. Previously all values were lists. Now some are lists and some are not.

Single floats in Reader._sample_parser not being converted to float

Hi,

In _parse_sample, there's a small typo that prevents single floats from being converted to float. Here's the patch

--- a/vcf/parser.py
+++ b/vcf/parser.py
@@ -624,7 +624,7 @@ class Reader(object):

                 if entry_type == 'Integer':
                     sampdict[fmt] = int(vals)
-                elif sampdict[fmt] == 'Float':
+                elif entry_type == 'Float':
                     sampdict[fmt] = float(vals)
                 else:
                     sampdict[fmt] = vals

Thanks,

Kevin

Remove single ALT assumption

Many methods use the gt_nums property, which is only defined for single ALT calls. We should try and use other methods where possible, and not return incorrect data when there are multiple ALTs.

Multiple metadata fields are overwritten

If multiple metadata lines (other than info, format, or filter) have the same key, only one of them is stored. Example: ##contig lines in test/gatk.vcf

This is because they are stored in a dict.

Python doesn't have an associative array without a unique constraint so I'm not sure what the best way around this would be.

Lenna

Error on install

Hi,

when I try to install PyVCF 0.6.0 through python setup.py install on a mac I get the following:

error: can't copy 'vcf/cparse.c': doesn't exist or not a regular file

The file ins't in the vcf directory. Am I doing something wrong? Did I miss a step?

thanks

jm

test/null_genotype_mono.vcf missing from repository

The file test/null_genotype_mono.vcf which is referenced in the latest code was not added to the repository. This breaks the unit tests.

python -m unittest test.test_vcf.TestRegression
./full/path/test/null_genotype_mono.vcf

E

ERROR: test_null_mono (test.test_vcf.TestRegression)

Traceback (most recent call last):
File "test/test_vcf.py", line 408, in test_null_mono
p = vcf.Reader(fh('null_genotype_mono.vcf'))
File "test/test_vcf.py", line 14, in fh
return file(os.path.join(os.path.dirname(file), fname))
IOError: [Errno 2] No such file or directory: 'test/null_genotype_mono.vcf'


Ran 2 tests in 0.003s

FAILED (errors=1)

Normalize for case in sequences

Sometimes VCF files contain mixed upper- and lowercase reference and variant sequences. I don't know what this would mean in VCF context.

Anyway, we might want to normalize for this (and in that way, it is related to issue #22, giving a normalized and universal interface to all VCF files and their variants out there).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.