Giter Site home page Giter Site logo

vcf_parser's Introduction

VCF Parser

Build Status

Small library for parsing vcf files. Based on PyVCF

usage examples:

- Split variants lines with multiple alleles
- Create vcf files from a python environment and print them
- Parse through variants and retrieve relevant information

Installation

pip install vcf_parser

or

>git clone https://github.com/moonso/vcf_parser.git
>cd vcf_parser
>python setup.py install

Usage

If used within a python environment:

>from vcf_parser import VCFParser
>my_parser = VCFParser(infile='infile.vcf', split_variants=True, check_info=True)
>for variant in my_parser:
    print(variant)

or used as a command line tool

vcf_parser examples/test_vcf.vcf --split

Prints a new vcf with splitted variants to screen.

Vcf parser is really a lightweight version of PyVCF with most of it's code borrowed and modified from there.

The idea was to make a faster and more flexible tool that mostly work with python dictionaries.

It is easy to access information for each variant, edit the information and edit the headers.

Basic function

Returns dictionary with the vcf info for each variant. To split the multiallelic calls(and accurate splitting of INFO field including the VEP CSQ fiels) use:

my_parser = VCFParser(infile='infile.vcf', split_variants=True)

The ordinary vcf entrys is stored by there header names, like

variant['CHROM']
variant['ALT']

etc.

Genotypes

The genotype information is converted to a genotype object and stored in a dictionary

variant['genotypes']

and looks like:

'genotypes': {'father': <genotype_object>,
              'mother': <genotype_object>,
              'proband': <genotype_object>}

The genotype class have the following attributes for phrasing common questions:

- genotype STRING (Same as in VCF-standard)
- allele_1 STRING (Base on allele 1)
- allele_2 STRING (Base on allele 2)
- nocall BOOL
- heterozygote BOOL 
- homo_alt BOOL (If individual is homozygote alternative)
- homo_ref BOOL (If individual is homozygote reference)
- has_variant BOOL (If individual is called and not homozygote reference)
- ref_depth INT
- alt_depth INT
- phred_likelihoods LIST with FLOAT
- depth_of_coverage INT
- genotype_quality FLOAT
- phased BOOL

Vep info

Vep information, if present, is parsed into

variant['vep_dict']

and looks like (depending on how vep was run):

'vep_info': {<alternative_allele>: {
                        'Allele': 'G',
                        'Amino_acids': '',
                        'CDS_position': '',
                        'Codons': '',
                        'Consequence': 'downstream_gene_variant',
                        'DISTANCE': '3084',
                        'EXON': '',
                        'Existing_variation': '',
                        'Feature': 'NM_015658.3',
                        'Feature_type': 'Transcript',
                        'Gene': '26155',
                        'HGVSc': '',
                        'HGVSp': '',
                        'INTRON': '',
                        'PolyPhen': '',
                        'Protein_position': '',
                        'SIFT': '',
                        'STRAND': '-1',
                        'SYMBOL': 'NOC2L',
                        'SYMBOL_SOURCE': '',
                        'cDNA_position': ''},
              <alternative_allele>: {
                          'Allele': 'G',
                         'Amino_acids': '',
                         'CDS_position': '',
                         'Codons': '',
                         'Consequence': 'intron_variant',
                         'DISTANCE': '',
                         'EXON': '',
                         'Existing_variation': '',
                         'Feature': 'NM_152486.2',
                         'Feature_type': 'Transcript',
                         'Gene': '148398',
                         'HGVSc': 'NM_152486.2:c.707-25A>G',
                         'HGVSp': '',
                         'INTRON': '7/13',
                         'PolyPhen': '',
                         'Protein_position': '',
                         'SIFT': '',
                         'STRAND': '1',
                         'SYMBOL': 'SAMD11',
                         'SYMBOL_SOURCE': '',
                         'cDNA_position': ''
                     }
                 }

Info field

INFO field is parsed into a dictionary The keys are the names of the info field and values are lists separated on ','.

variant['info_dict]

and looks like

'info_dict': {'AC': ['1'],
               'AF': ['0.167'],
               'AN': ['6'],
               'BaseQRankSum': ['2.286'],
               'DB': [],
               'DP': ['1306'],
               'FS': ['1.539'],
               'InbreedingCoeff': ['0.1379'],
               'MQ': ['39.83'],
               'MQ0': ['0'],
               'MQRankSum': ['-2.146'],
               'POSITIVE_TRAIN_SITE': [],
               'QD': ['29.57'],
               'ReadPosRankSum': ['0.897'],
               'VQSLOD': ['4.52'],
               'culprit': ['FS'],
               'set': ['variant']}

Print a vcf in it´s original format:

my_parser = parser.VCFParser(infile='infile.vcf')
for line in my_parser.metadata.print_header():
    print(line)
for variant in my_parser:
    print('\t'.join([[variant[head] for head in my_parser.header]))

Build a vcf file from scratch

One can use vcf_parser to build vcf files from scratch. A vcf file must allways have the "fileformat" header, so start by initializing a vcf parser with the name of the fileformat like

from vcf_parser import VCFParser

> my_vcf = VCFParser(fileformat='VCFv4.2')

Add metadata information:

Adding INFO field:

    my_parser.add_info('my_new_id', <number>, <type>, <description>)

Where 'number', 'type' and 'description' follows the VCF specification.

Add a filter:

    my_parser.add_filter('my_filter', <description>)

Add a arbitrary metadata line:

    my_parser.add_filter(key='my_data_line', value="The description")

example:

from vcf_parser import VCFParser

> my_vcf = VCFParser(fileformat='VCFv4.2')
> my_vcf.metadata.add_filedate(filedate='20150607')
> my_vcf.metadata.add_info(info_id='MQ', number='1', entry_type='Float', description="RMS Mapping Quality")
> my_vcf.metadata.add_filter(filter_id="MY_FILTER", description="The filter description")
> my_vcf.metadata.add_format(format_id="DP", number='1', entry_type='Integer', description="The read depth")
> my_vcf.metadata.add_alt(alt_id="MY_ALTERNATIVE", description="The alternative description")
> my_vcf.metadata.add_meta_line_(key="filedate", value="20150607")



vcf_parser's People

Contributors

moonso avatar dakl avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.