mskcc / vcf2maf

Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms

vcf2maf's Introduction

vcf2maf

To convert a VCF into a MAF, each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a Missense_Mutation close enough to a Splice_Site can be labeled as either in MAF format, but not as both. This selection of a single effect per variant is often subjective, and that's what this project attempts to standardize. The vcf2maf and maf2maf scripts leave most of that responsibility to Ensembl's VEP, but allow you to override its "canonical" isoforms, or use a custom ExAC VCF for annotation. The most useful feature, though, is the extensive support for parsing a wide range of crappy MAF-like or VCF-like formats we've seen out in the wild.

Quick start

Find the latest release, download it, and view the detailed usage manuals for vcf2maf and maf2maf:

export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4`
curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-*
perl vcf2maf.pl --man
perl maf2maf.pl --man

If you don't have VEP installed, then follow this gist. Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant HGVS formats. After installing VEP, test out vcf2maf like this:

perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf

To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from matched genotype columns in the VCF, use options --tumor-id and --normal-id. Skip option --normal-id if you don't have a matched normal:

perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --tumor-id WD1309 --normal-id NB1308

VCFs from variant callers like VarScan use hardcoded sample IDs TUMOR/NORMAL to name genotype columns. To have vcf2maf correctly locate the columns to parse genotypes, while still printing proper sample IDs in the output MAF:

perl vcf2maf.pl --input-vcf tests/test_varscan.vcf --output-maf tests/test_varscan.vep.maf --tumor-id WD1309 --normal-id NB1308 --vcf-tumor-id TUMOR --vcf-normal-id NORMAL
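
As a rough illustration of what locating the genotype columns involves, the Python sketch below looks up sample names in a VCF `#CHROM` header line. The function name `find_genotype_columns` is hypothetical and not part of vcf2maf; it only assumes the standard VCF layout of eight fixed columns plus FORMAT, followed by one column per sample.

```python
def find_genotype_columns(header_line, vcf_tumor_id, vcf_normal_id=None):
    """Return 0-based column indexes of the tumor and normal genotype columns.

    Hypothetical helper for illustration; an index is None when the
    sample ID is not given or not present in the header.
    """
    cols = header_line.rstrip("\n").lstrip("#").split("\t")
    samples = cols[9:]  # sample columns start after the FORMAT column

    def locate(sample_id):
        if sample_id is None or sample_id not in samples:
            return None
        return 9 + samples.index(sample_id)

    return locate(vcf_tumor_id), locate(vcf_normal_id)


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR"
print(find_genotype_columns(header, "TUMOR", "NORMAL"))  # (10, 9)
```

Looking up WD1309 in this header would find nothing, which is why the VCF-side names (TUMOR/NORMAL) and the MAF-side sample IDs need to be given separately.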

If VEP is installed under /opt/vep and the VEP cache is under /srv/vep, there are options available to tell vcf2maf where to find them:

perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf --vep-path /opt/vep --vep-data /srv/vep

If you want to skip running VEP and need a minimalist MAF-like file listing data from the input VCF only, then use the --inhibit-vep option. If your input VCF contains VEP annotation, then vcf2maf will try to extract it. But be warned that the accuracy of your resulting MAF depends on how VEP was operated upstream. In standard operation, vcf2maf runs VEP with very specific parameters to make sure everyone produces comparable MAFs. So, it is strongly recommended to avoid --inhibit-vep unless you know what you're doing.

maf2maf

If you have a MAF or a MAF-like file that you want to reannotate, then use maf2maf, which simply runs maf2vcf followed by vcf2maf:

perl maf2maf.pl --input-maf tests/test.maf --output-maf tests/test.vep.maf

After tests on variant lists from many sources, maf2vcf and maf2maf are quite good at dealing with formatting errors or "MAF-like" files. They even support VCF-style alleles, as long as Start_Position == POS. So it's OK if the input format is imperfect. Any variants with a reference allele mismatch are kept aside in a separate file for debugging. The bare minimum columns that maf2maf expects as input are:

Chromosome	Start_Position	Reference_Allele	Tumor_Seq_Allele2	Tumor_Sample_Barcode
1	3599659	C	T	TCGA-A1-A0SF-01
1	6676836	A	AGC	TCGA-A1-A0SF-01
1	7886690	G	A	TCGA-A1-A0SI-01

See data/minimalist_test_maf.tsv for a sampler. If Tumor_Seq_Allele1 is added, it will be used to determine zygosity. Otherwise, maf2maf will try to determine zygosity from variant allele fractions, assuming that arguments --tum-vad-col and --tum-depth-col are set correctly to the names of the columns containing those read counts. Specifying the Matched_Norm_Sample_Barcode, along with its respective columns containing read counts, is also strongly recommended. Columns containing normal allele read counts can be specified using arguments --nrm-vad-col and --nrm-depth-col.
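
Before handing a MAF-like file to maf2maf, it can help to check that the minimum columns are present. The Python sketch below is one way to do that; `missing_columns` is a hypothetical helper, and the case-insensitive comparison is an assumption made for convenience, not a statement about how maf2maf matches column names.

```python
# The five minimum columns listed in the README.
REQUIRED = [
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
    "Tumor_Sample_Barcode",
]


def missing_columns(header_line):
    """List the required columns absent from a tab-delimited MAF header line.

    Assumption: names are compared case-insensitively.
    """
    present = {name.strip().lower() for name in header_line.split("\t")}
    return [col for col in REQUIRED if col.lower() not in present]


print(missing_columns("Chromosome\tStart_Position\tReference_Allele"))
# ['Tumor_Seq_Allele2', 'Tumor_Sample_Barcode']
```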

Docker

Assuming you have a recent version of docker, clone the main branch and build an image as follows:

git clone git@github.com:mskcc/vcf2maf.git
cd vcf2maf
docker build -t vcf2maf:main .
docker builder prune -f

Now you can run the scripts in docker as follows:

docker run --rm vcf2maf:main perl vcf2maf.pl --help
docker run --rm vcf2maf:main perl maf2maf.pl --help

Testing

A small standalone test dataset was created by restricting VEP v112 cache/fasta to chr21 in GRCh38 and hosting that on a private server for download by CI services. We can manually fetch those as follows:

wget -P tests https://data.cyri.ac/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
gzip -d tests/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
wget -P tests https://data.cyri.ac/homo_sapiens_vep_112_GRCh38_chr21.tar.gz
tar -zxf tests/homo_sapiens_vep_112_GRCh38_chr21.tar.gz -C tests

And the following scripts test the docker image on predefined inputs and compare outputs against expected outputs:

perl tests/vcf2maf.t
perl tests/vcf2vcf.t
perl tests/maf2vcf.t

License

Apache-2.0 | Apache License, Version 2.0 | https://www.apache.org/licenses/LICENSE-2.0

Citation

Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6. (2020). doi:10.5281/zenodo.593251

vcf2maf's Issues

maf2maf for VarScan tab delimited output

Would it be possible to make maf2maf work for the standard tab delimited varscan output? VarScan output is not the same as MAF, but it's similar, as it's tab delimited. Positions are the same as in the VCF format. I can share a sample file if that helps.

Floris

Error: unrecognized biotype "enhancer"

Hi,

Thank you for making this tool. I followed the instructions to download the latest version of vcf2maf (within the last week) and the v79 of VEP (although in the variant_effect_predictor.pl it says the version is v77).

The *.vep.vcf file is generated; however, the *.maf output terminates prematurely. The error I'm getting is 'ERROR: Unrecognized biotype "enhancer". Please update your hashes!'

I saw that this seems a bit similar to an earlier post, but the error message complains about a different biotype.

Can you please advise?

Jonathan

SNPs were annotated as DNPs

For SNPs in input MAF, e.g. a SNP with Reference_Allele being 'G' and Tumor_Seq_Allele1 being 'A', if the Tumor_Seq_Allele2 is '-', maf2maf will annotate such SNPs as DNP, i.e. in the resulting annotation file, Reference_Allele becomes 'GG' and Tumor_Seq_Allele1 is 'G' and Tumor_Seq_Allele2 is 'GA'.

Please kindly fix this issue if possible, because they are not DNP.

Qingguo

vcf2maf problem - regulatory is not available for this species

Hi, I have a problem when I use the vcf2maf Perl script, and I don't know how to solve it. Can you help me? Thanks!

ERROR: --regulatory is not available for this species at /home/xiaxy/vep/variant_effect_predictor.pl line 878.
ERROR: Failed to run the VEP annotator!

Forked process(es) died

Hi. I have a new problem.
Command line: /software/vcf2maf/vcf2maf.pl --input-vcf ./mm.vcf --output-maf ./mm.maf --tumor-id WD1000 --normal-id NB1001
By the way, I can run it successfully using your test.vcf.
wrong_information.txt

Report variants with ref mismatches [maf2maf]

It would be convenient to write variants to a separate file if they are skipped by maf2maf due to reference mismatches.

Error running vcf2maf

Hi Cyriac,

I am trying to run vcf2maf for my exome data. Here is the error I keep getting:
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 787, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 788, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 789, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 790, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 791, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 792, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 793, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 794, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 795, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 796, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 797, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 798, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 799, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 800, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 801, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 802, line 109.
Use of uninitialized value in string eq at vcf2maf.pl line 803, line 109.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 787, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 788, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 789, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 790, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 791, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 792, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 793, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 794, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 795, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 796, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 797, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 798, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 799, line 110.
Use of uninitialized value in string eq at vcf2maf.pl line 800, line 110.
Use of uninitialized value in pattern match (m//) at vcf2maf.pl line 801, line 110.

Can you please help me in this regard.

Best,
Ashiq

Add option to specify VEP species

VEP defaults to homo_sapiens if the species is not defined, so create an option for vcf2maf that allows overriding it. Also find a way to skip escape and shift_hgvs for references that don't support them.

Reference allele does not match Ensembl reference

Hey,

Thank you in advance for your time. When I run maf2maf, I get these warnings/messages. It is able to run all the way through, but the 64 warnings I get cut my final maf file short. I read this issue, so is there currently no way around this? I am using the most up to date vcf2maf and also vep v84. It says on here that Tumor_seq_allele1 is not required, so can I ignore the first ~1000 messages?

Use of uninitialized value $col_idx{"tumor_seq_allele1"} in pattern match (m//) at /home/vcf2maf/maf2vcf.pl line 120, <GEN0> line 1082.
Use of uninitialized value $col_idx{"tumor_seq_allele1"} in pattern match (m//) at /home/vcf2maf/maf2vcf.pl line 120, <GEN0> line 1083.
Use of uninitialized value $col_idx{"tumor_seq_allele1"} in pattern match (m//) at /home/vcf2maf/maf2vcf.pl line 120, <GEN0> line 1084.
Use of uninitialized value $col_idx{"tumor_seq_allele1"} in pattern match (m//) at /home/vcf2maf/maf2vcf.pl line 120, <GEN0> line 1085.

WARNING: Specified reference allele TGT does not match Ensembl reference allele GTT on line 8
WARNING: Specified reference allele CTTGATGTA does not match Ensembl reference allele TTGATGTAC on line 9
WARNING: Specified reference allele GAATTAAGA does not match Ensembl reference allele TAAGAGAAG on line 29
WARNING: Specified reference allele CAA does not match Ensembl reference allele aac on line 42
WARNING: Specified reference allele TG does not match Ensembl reference allele GT on line 51
WARNING: Specified reference allele CG does not match Ensembl reference allele GA on line 56

Best,
Tom

MAF file input

Hi,
I am trying to use your maf2maf.pl script to convert an MAF file I generated with the MUGSY aligner (http://mugsy.sourceforge.net/) into a VCF file format.

Is your script for MAFs in a specific format? I run into the error:
ERROR: Couldn't find a header line (starts with Hugo_Symbol, Chromosome, or Tumor_Sample_Barcode) in the MAF: outgroup.maf

Thank you!

The first few lines of my MAF:

maf version=1 scoring=mugsy

a score=640492 label=1 mult=3
s B31.Chromosome 4603 638466 + 910724 AAAATCTCAGGAGAGGGATCAAATGGGGGATGCATAATTCATCCTTCAAGGGTAAGAGACCCAATTACTACTTTACTTAGTATCGTAAAATTATTAAAAATGAA

ERROR: Unrecognized biotype "lncRNA". Please update your hashes!

Hi,
I used vcf2maf for the chicken genome and got the error 'Unrecognized biotype "lncRNA". Please update your hashes!'. I am not sure whether this problem is due to vcf2maf or my custom-built VEP database, so I am asking here first.
My command for vcf2maf was:
perl ${vcf2maf} --input-vcf ${vcf} --output-maf ${maf} \
  --species gallus_gallus \
  --vep-path ${vep_path} \
  --ref-fasta ${ref} \
  --ncbi-build 84 \
  --tumor-id 842-2_S20 \
  --normal-id 842-0_2_S28
My command for building VEP database was

perl gtf2vep.pl -i genes.gff -f galgal5_seq.fa -d 84 -s gallus_gallus

Thanks in advance.

Hongen

ExAC DNPs

DNPs composed of two ExAC SNPs are not annotated.

Solution: add these DNPs (and TNPs?) to the vcf.

entrez identifiers

Hi,

After running vcf2maf all the Entrez_Gene_Id values are set to '0'. Would it be possible to add the entrez identifiers?

Best regards,
Sander

Add trinuc context as extra MAF column

Add trinuc context as an extra column output to maf2maf/vcf2maf.

Here is code in maf2vcf to locate samtools binary:
https://github.com/mskcc/vcf2maf/blob/master/maf2vcf.pl#L16

Samtools faidx can be used to grab ref fasta seq for a given chr:start-stop
https://github.com/mskcc/vcf2maf/blob/master/maf2vcf.pl#L167
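
A minimal sketch of the region arithmetic this would need, written in Python for brevity. The helper name `trinuc_region` is hypothetical; it builds the `chr:start-stop` string for the base before, at, and after a 1-based SNV position, which could then be passed to `samtools faidx`.

```python
def trinuc_region(chrom, pos):
    """Region string covering the bases flanking a 1-based SNV position."""
    if pos < 2:
        raise ValueError("no upstream base available at position %d" % pos)
    return "%s:%d-%d" % (chrom, pos - 1, pos + 1)


# The result is meant for: samtools faidx <ref.fa> <region>
print(trinuc_region("17", 7579312))  # 17:7579311-7579313
```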

Unable to run Ensembl VEP in single-threaded mode

Whenever I specify --vep-forks 1 when running vcf2maf, I get the following error with Ensembl VEP.

ERROR: Fork number must be greater than 1

For some reason, the authors of VEP would rather throw an error instead of simply ignoring the --fork 1 option. If the default for vcf2maf was one thread, this wouldn't be a problem, but since the default is four, it would be convenient if I could manually run in single-threaded mode.
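
One way a wrapper could work around this is to leave the fork argument out entirely when a single thread is requested. The Python sketch below shows the idea; `vep_fork_args` is a hypothetical helper and not how vcf2maf builds its VEP command.

```python
def vep_fork_args(forks):
    """Return the VEP fork arguments, omitting them for single-threaded runs."""
    if forks < 1:
        raise ValueError("forks must be a positive integer")
    # VEP rejects "--fork 1", so pass nothing in that case.
    return ["--fork", str(forks)] if forks > 1 else []


print(vep_fork_args(1))  # []
print(vep_fork_args(4))  # ['--fork', '4']
```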

MAF v2.4 is not suitable for MuSiC 0.4

Hi,
My VCF file was annotated by VEP (version 84)
and then converted from vep.vcf to MAF (TCGA MAF Specification v2.4) using vcf2maf (latest version).
But when running these Genome MuSiC 0.4 commands:

genome music bmr calc-covg --roi-file ../ensembl_67_cds_ncrna_and_splice_sites_hg19 --bam-list bam.list --output-dir MusiC/test --reference-sequence ../human_g1k_v37.fasta --gene-covg-dir MusiC/test/gene-covg-dir
genome music bmr calc-bmr --roi-file ../ensembl_67_cds_ncrna_and_splice_sites_hg19 --bam-list bam.list --output-dir MusiC/test --reference-sequence ../human_g1k_v37.fasta --maf-file test.vep.maf  --bmr-output MusiC/test --gene-mr-file MusiC/test/test.gene-mr-file

there were some errors, like:

Unrecognized Variant_Classification Splice_Region in MAF file.
Please use TCGA MAF Specification v2.3.

Maybe MuSiC 0.4 does not support MAF v2.4. How could I fix this error?
thanks
xq

maf2vcf.pl: Write all to one vcf

Hello,

I would like to write all the mutations in my maf file to the same vcf, rather than many different vcfs that are matched. To do this, I must create one long header containing all the samples, rather than only the tumor-sample pair. Do you recommend any method to do so?

Thank you!

Jonathan

fix targeted_region mutation type for inframe mutations

Can we do the following for complex in-frame mutations?

If the var allele has more NTs than the ref allele, call it an IN_FRAME_INS. If fewer, IN_FRAME_DEL. If equal, MISSENSE_MUTATION.
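
The proposed rule can be sketched in a few lines of Python. The function name `classify_inframe` is hypothetical, the labels use the MAF spelling of the three classes, and the caller is assumed to have already established that the change is in-frame.

```python
def classify_inframe(ref_allele, var_allele):
    """Classify a complex in-frame change by comparing allele lengths.

    '-' is treated as an empty allele.
    """
    ref_len = len(ref_allele.replace("-", ""))
    var_len = len(var_allele.replace("-", ""))
    if var_len > ref_len:
        return "In_Frame_Ins"
    if var_len < ref_len:
        return "In_Frame_Del"
    return "Missense_Mutation"


print(classify_inframe("ACG", "ACGTTT"))  # In_Frame_Ins
print(classify_inframe("ACGTTT", "ACG"))  # In_Frame_Del
print(classify_inframe("ACG", "TTT"))     # Missense_Mutation
```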

vcf2maf fails when converting InDels @ GRCm38

This minimal vcf-file cannot be converted by vcf2maf 1.6.3 (the Indel-Site fails to be annotated), while VEP (online version and stand-alone) works fine:
11:96283450-96283451 deletion intron_variant, feature_truncation MODIFIER Hoxb8 ENSMUSG00000056648 T

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=REJECT,Description="Rejected as a confident somatic mutation">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=BQ,Number=A,Type=Float,Description="Average base quality for reads supporting alleles">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FA,Number=A,Type=Float,Description="Allele fraction of the alternate allele with regard to reference">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal,0=wildtype,1=germline,2=somatic,3=LOH,4=post-transcriptional modification,5=unknown">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic event">
##INFO=<ID=VT,Number=1,Type=String,Description="Variant type, can be SNP, INS or DEL">
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=1,length=195471971>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##contig=<ID=Y,length=91744698>
##reference=file:///data1/misc/Genomes/GRCm38/GRCm38.fa
##INFO=<ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  64_2B
11  74388661    .   C   A   .   PASS    AC=1;AN=2;SF=38;SOMATIC;VT=SNP  GT:BQ:DP:FA:SS:AD   0/1:34:30:0.1:2:27,3
11  96283450    .   TG  T   .   PASS    AC=1;AN=2;END=96283451;HOMLEN=8;HOMSEQ=GGGGGGGG;SF=0,2;SVLEN=-1;SVTYPE=DEL  GT:AD   0/1:18,5

Leading to this error:

~/Packages/vcf2maf-1.6.3$ perl vcf2maf.pl --vep-path ~/Packages/vep-v82/ --vep-data ~/Packages/.vep-v82/ --ref-fasta ~/Packages/.vep-v82/mus_musculus/82_GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa  --ncbi-build GRCm38 --species mus_musculus --input-vcf ~/Packages/vcf2maf-1.6.2/1.vcf --output-maf ~/Packages/vcf2maf-1.6.3/1.maf --tumor-id 64_2B --normal-id 64_2B_Normal
STATUS: Running VEP and writing to: /home/engleitner/Packages/vcf2maf-1.6.3/1.vep.vcf
2015-11-05 10:23:54 - Read existing cache info
2015-11-05 10:23:54 - Starting...
2015-11-05 10:23:54 - Detected format of input file as vcf
2015-11-05 10:23:54 - Read 2 variants into buffer
2015-11-05 10:23:54 - Calculating consequences

Use of uninitialized value in pattern match (m//) at /home/engleitner/Packages/vep-v82/Bio/EnsEMBL/Variation/Utils/VEP.pm line 1642.
2015-11-05 10:23:55 - Writing output
2015-11-05 10:23:55 - Processed 2 total variants (2 vars/sec, 2 vars/sec total)
2015-11-05 10:23:55 - Finished!

Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Argument "" isn't numeric in numeric eq (==) at vcf2maf.pl line 604, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 697, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 698, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 699, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 700, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 701, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 702, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 703, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 704, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 705, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 706, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 707, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 708, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 709, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 710, <GEN1> line 45.
Use of uninitialized value $effect in pattern match (m//) at vcf2maf.pl line 711, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 712, <GEN1> line 45.
Use of uninitialized value $effect in string eq at vcf2maf.pl line 713, <GEN1> line 45.
Use of uninitialized value in join or string at vcf2maf.pl line 689, <GEN1> line 45.

sequence_feature missing in vcf2maf

Hi Cyriac,
Running vcf2maf 1.6.6 with VEP 83, I am getting:
ERROR: Unrecognized effect "sequence_feature". Please update your hashes!

I have added sequence_feature to vcf2maf, but I just want to ask you if that is correct.

Thanks!

t_ref_count, t_alt_count, n_ref_count, n_alt_count not printing out in annotated maf

I'm having trouble trying to get t_ref_count, t_alt_count, n_ref_count, n_alt_count to print out in the maf with the v1.6.5 of maf2maf.pl using a dataset that worked with a previous version of the script.

Thanks,
Krista

ALLELE_NUM undefined for some variants

For some reason, ALLELE_NUM is not always present in the *.vep.vcf file, which causes some downstream problems (precisely this fragment is crashing:

# Skip effects on other ALT alleles
push( @all_effects, \%effect ) if( $effect{ALLELE_NUM} == $var_allele_idx );)
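
One possible guard, sketched in Python rather than the project's Perl, is to skip any effect whose ALLELE_NUM is missing or empty before comparing it. The helper `effects_for_allele` and the dict-based effect records are assumptions made for illustration, and silently skipping such effects is only one of several reasonable policies.

```python
def effects_for_allele(effects, var_allele_idx):
    """Keep effects whose ALLELE_NUM matches the variant allele index."""
    kept = []
    for effect in effects:
        allele_num = effect.get("ALLELE_NUM")
        if allele_num in (None, ""):
            continue  # assumption: skip instead of failing on a missing value
        if int(allele_num) == var_allele_idx:
            kept.append(effect)  # effect applies to the ALT allele of interest
    return kept


effects = [{"ALLELE_NUM": "1"}, {"ALLELE_NUM": ""}, {}, {"ALLELE_NUM": "2"}]
print(effects_for_allele(effects, 1))  # [{'ALLELE_NUM': '1'}]
```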

Handle empty input

Sometimes, especially when sequencing a panel rather than a full exome, no somatic mutations are detected and the input VCF only has a header. As of now, this causes vcf2maf to crash and burn:

snpEff-annotated VCF file is missing or empty!

or similar depending on what method of annotation is used.

It would be great if this was handled more gracefully. Outputting a header-only MAF would be the best case in my opinion.
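
Detecting the header-only case up front is straightforward. The Python sketch below shows one way to do it; `has_variant_records` is a hypothetical helper, and what to do when it returns False (for example, writing a header-only MAF) is left to the caller.

```python
def has_variant_records(vcf_lines):
    """True if any non-blank line is not a '#' header line."""
    return any(
        bool(line.strip()) and not line.startswith("#") for line in vcf_lines
    )


header_only = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\n"]
print(has_variant_records(header_only))  # False
print(has_variant_records(header_only + ["11\t74388661\t.\tC\tA\n"]))  # True
```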

On-line server tools?

Have you considered setting this up as an online server tool? For example, hosting it under cBioPortal?

splice_region_variant has lower priority than synonymous_variant

17:g7579312C>A is annotated as silent mutation, but according to VEP the most severe consequence is splice_region_variant.

http://grch37.rest.ensembl.org/vep/human/hgvs/17:g7579312C%3EA?content-type=application/json

I think we should assign a lower priority to synonymous_variant.

Alternatively, is it possible to assign the most_severe_consequence from VEP as the annotated consequence in MAF?
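
To make the request concrete, here is a Python sketch of picking one consequence from several using a priority table. The ranking below is made up for illustration and is not vcf2maf's actual table; `most_severe` is likewise a hypothetical helper.

```python
# Hypothetical ranking, lower number = higher priority. Not vcf2maf's real table.
PRIORITY = {
    "missense_variant": 1,
    "splice_region_variant": 2,
    "synonymous_variant": 3,
}


def most_severe(consequences):
    """Return the consequence with the best (lowest) rank; unknown terms rank last."""
    return min(consequences, key=lambda c: PRIORITY.get(c, len(PRIORITY) + 1))


print(most_severe(["synonymous_variant", "splice_region_variant"]))
# splice_region_variant
```

With this ordering, a variant carrying both terms is reported as a splice-region change rather than a silent one.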

Ensembl VEP "--regulatory" option not available for some species

Thanks for the recent added support for species other than humans. It's been helpful!

I have variants called for a canine tumour dataset that I wish to annotate and convert to MAF format. However, the --regulatory option for Ensembl VEP isn't available for the canis_familiaris species, despite Ensembl not indicating that the option is "species limited" in their documentation.

I'm fixing this by classifying that option as human-only, but you might want to include it for all supported species. Just a heads-up!

Bad tumour and normal count calculations from mutect VCFs

We've seen an issue where the VCF data from Mutect seems not to be handled correctly. Specifically, the AD field is used, and even if it provides two values, the logic carries on and attempts various other ways to calculate the depths. In the case of Mutect, this ends up overwriting these two values with a single value calculated from FA and DP, so the AD data is lost.
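
The behaviour we would expect can be sketched as follows, in Python for brevity. `allele_depths` is a hypothetical helper, and the fallback that derives the alt depth by rounding FA × DP is an assumption for illustration, not a description of vcf2maf's logic.

```python
def allele_depths(ad, fa=None, dp=None):
    """Return (ref_depth, alt_depth), preferring AD when it lists both alleles."""
    if ad is not None and len(ad) >= 2:
        return ad[0], ad[1]  # AD is complete, so stop here
    if fa is not None and dp is not None:
        alt = int(round(fa * dp))  # assumption: estimate alt depth from FA and DP
        return dp - alt, alt
    return None, None


print(allele_depths([27, 3], fa=0.1, dp=30))  # (27, 3)
print(allele_depths(None, fa=0.25, dp=40))    # (30, 10)
```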

maf2maf - Matched Normal Alleles have a preceding bp

This input:

X  44918690  44918690  A  -

Generates output:

KDM6A  0  MSKCC  GRCh37  X  44918690  44918690  +  Frame_Shift_Del  DEL  A  -  -      TUMOR  NORMAL  CA  CA  ...

The alleles for Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 should be A instead of CA
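
A sketch of the trimming we would expect, in Python for brevity: remove the leading bases shared by every allele at the site, so the normal alleles are shortened by the same amount as the reference and tumor alleles. `trim_alleles` is a hypothetical helper, and position adjustment is deliberately left out.

```python
def trim_alleles(alleles):
    """Remove the leading bases shared by every allele; empty alleles become '-'."""
    shared = 0
    while (
        alleles
        and all(len(a) > shared for a in alleles)
        and len({a[shared] for a in alleles}) == 1
    ):
        shared += 1
    return [a[shared:] or "-" for a in alleles]


# VCF-style deletion of A after C: ref, tumor alt, and two normal alleles
print(trim_alleles(["CA", "C", "CA", "CA"]))  # ['A', '-', 'A', 'A']
```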

vcf to maf conversion

Hi, I used the vcf2maf Perl script. I installed vcf2maf-master and VEP, but I am getting an error - WARNING: Chromosome 1 not found in cache on line . What could be the reason?

Problems installing VEP v84

Dear all,

I am trying to install vcf2maf on Mac OS X 10.9.5.
I followed the instructions, but I got the following error message during the installation of VEP v84:

ExtUtils::Mkbootstrap::Mkbootstrap('blib/arch/auto/Bio/DB/HTS/HTS.bs')
env MACOSX_DEPLOYMENT_TARGET=10.3 cc -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib -fstack-protector -o blib/arch/auto/Bio/DB/HTS/HTS.bundle lib/Bio/DB/HTS.o -L/User/vep/v84/htslib -Wl,-rpath,/User/vep/v84/htslib -lhts -lpthread -lz
ld: -rpath can only be used when targeting Mac OS X 10.5 or later
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error building blib/arch/auto/Bio/DB/HTS/HTS.bundle from lib/Bio/DB/HTS.o at /User/perl5/perlbrew/perls/perl-5.22.0/lib/5.22.0/ExtUtils/CBuilder/Base.pm line 320.
ERROR: Shared Bio::DB:HTS library not found

Best regards,

Annotate with GRCh38 error

Hi Cyriac,
I'm having trouble running VCFs generated by Mutect2 and HaplotypeCaller, aligned to GRCh38, through vcf2maf.

Running them through command line VEP is fine but when I try with vcf2maf using --ncbi-build GRCh38 it gives me this error.

WARNING: Could not fetch sub-slice from 1:632803-632803(1) on line 236
WARNING: Specified reference allele T does not match Ensembl reference allele on line 236
WARNING: Could not fetch sub-slice from 1:810105-810105(1) on line 237
WARNING: Specified reference allele C does not match Ensembl reference allele on line 237

The insertions are in the MAF file, but substitutions and deletions are not.

Cheers,
Phil

Use of uninitialized value $effect in pattern match

I am having a recurrent issue with vcf2maf. I have used VEP to annotate my multi sample VCF file. It keeps giving me the error below. Initially I was using SNPeff and I saw that it had problems with this so I then used VEP, but the problem still occurs:

Use of uninitialized value in numeric eq (==) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 611, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 772, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 773, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 774, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 775, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 776, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 777, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 778, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 779, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 780, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 781, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 782, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 783, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 784, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 785, line 1073.
Use of uninitialized value $effect in pattern match (m//) at /home/abeggs/vcf2maf-master/vcf2maf.pl line 786, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 787, line 1073.
Use of uninitialized value $effect in string eq at /home/abeggs/vcf2maf-master/vcf2maf.pl line 788, line 1073.

The only issue that I can think of is that I am using the new Illumina aligner Isaac followed by Strelka; however, I have stripped all the extraneous stuff it puts in with bcftools.

running vcf2maf with VEP installed as module

Hello,

I am trying to run vcf2maf on a university computer cluster that has VEP installed as a module. When I try to run vcf2maf, I get the error "ERROR: Cannot find VEP script variant_effect_predictor.pl in path:"

I saw in the manual that it is possible to tell vcf2maf where the vep perl script is installed but in this case there is no location as it is a module.

Is there any way to run vcf2maf like this?

Thanks!!

a potential bug in vcf2maf.pl

Hi Cyriac, there may be a bug in line 537 in vcf2maf.pl. For normal sample, $nrm_depths[$var_allele_idx] can be zero in many cases. However, if $nrm_depths[$var_allele_idx] is 0, lines 539 and 540 will not be executed. To calculate $nrm_info{DP}, it may make more sense to use "if (( defined $nrm_depths[0] and defined $nrm_depths[$var_allele_idx] ) ..." in this line, instead of "if(( $nrm_depths[0] and $nrm_depths[$var_allele_idx] ) ..."

Qingguo
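
The distinction raised in this report, between a count that is zero and a count that is missing, can be shown with a small Python analogue, where `is not None` plays the role of Perl's `defined`. The function `depth_from_counts` is a hypothetical stand-in for the logic around the lines mentioned, not a translation of vcf2maf.pl.

```python
def depth_from_counts(ref_depth, alt_depth):
    """Sum ref and alt read counts; 0 is a valid count, only None means missing."""
    # A plain truthiness test ("if ref_depth and alt_depth") would wrongly
    # skip this branch whenever either count is 0.
    if ref_depth is not None and alt_depth is not None:
        return ref_depth + alt_depth
    return None


print(depth_from_counts(25, 0))    # 25
print(depth_from_counts(None, 0))  # None
```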

MAF>0.04% filter is too strict in some smaller ExAC subpopulations

What the title says

Unrecognized biotype error

Hi Cyriac,

Thanks for developing this resource, it's incredibly useful. I'm currently attempting to run the 1.3.0 version of the script using the latest version of VEP (v76) on GRCh37. I keep running into this error for one of my samples:
ERROR: Unrecognized biotype "ENST00000387347.2:n.815N>T". Please update your hashes! at vcf2maf.pl line 608, <GEN1> line 21.

The script successfully completes the VEP annotation but terminates prematurely without writing anything in MAF format. My command line parameters are identical to what you have in the tutorial. Do you have any suggestions on how to troubleshoot this? Thanks!

"|" (pipe) in alt fields

Certain variants create pipes in the ALT field of vcf output, which is not valid vcf to my understanding.

e.g.

#version 2.4
##
## Oncotator v1.8.0.0 | Flat File Reference hg19 | GENCODE v19 EFFECT | UniProt_AAxform 2014_12 | ClinVar 12.03.20 | ESP 6500SI-V2 | ORegAnno UCSC Track | dbSNP build 142 | CCLE_By_GP 09292010 | COSMIC v62_291112 | 1000gp3 20130502 | UniProt_AA 2014_12 | dbNSFP v2.4 | ESP 6500SI-V2 | COSMIC_FusionGenes v62_291112 | gencode_xref_refseq metadata_v19 | CCLE_By_Gene 09292010 | ACHILLES_Lineage_Results 110303 | CGC full_2012-03-15 | UniProt 2014_12 | HumanDNARepairGenes 20110905 | HGNC Sept172014 | COSMIC_Tissue 291112 | Familial_Cancer_Genes 20110905 | TUMORScape 20100104 | Ensembl ICGC MUCOPA | TCGAScape 110405 | MutSig Published Results 20110905
Hugo_Symbol     Entrez_Gene_Id  Center  NCBI_Build      Chromosome      Start_position  End_position    Strand  Variant_Classification  Variant_Type    Reference_Allele        Tumor_Seq_Allele1       Tumor_Seq_Allele2       dbSNP_RS     dbSNP_Val_Status Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode     Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1        Tumor_Validation_Allele2        Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2 Verification_Status     Validation_Status       Mutation_Status Sequencing_Phase        Sequence_Source Validation_Method       Score   BAM_file        Sequencer       Tumor_Sample_UUID       Matched_Norm_Sample_UUID        Genome_Change Annotation_Transcript   Transcript_Strand       Transcript_Exon Transcript_Position     cDNA_Change     Codon_Change    Protein_Change  Other_Transcripts       Refseq_mRNA_Id  Refseq_prot_Id  SwissProt_acc_Id        SwissProt_entry_Id    Description     UniProt_AApos   UniProt_Region  UniProt_Site    UniProt_Natural_Variations      UniProt_Experimental_Info       GO_Biological_Process   GO_Cellular_Component   GO_Molecular_Function   COSMIC_overlapping_mutations COSMIC_fusion_genes      COSMIC_tissue_types_affected    COSMIC_total_alterations_in_gene        Tumorscape_Amplification_Peaks  Tumorscape_Deletion_Peaks       TCGAscape_Amplification_Peaks   TCGAscape_Deletion_Peaks        DrugBank     ref_context      gc_content      CCLE_ONCOMAP_overlapping_mutations      CCLE_ONCOMAP_total_mutations_in_gene    CGC_Mutation_Type       CGC_Translocation_Partner       CGC_Tumor_Types_Somatic CGC_Tumor_Types_Germline        CGC_Other_Diseases    DNARepairGenes_Role     FamilialCancerDatabase_Syndromes        MUTSIG_Published_Results        OREGANNO_ID     OREGANNO_Values i_1000gp3_AA    i_1000gp3_AC    i_1000gp3_AF    i_1000gp3_AFR_AF        i_1000gp3_AMR_AF        i_1000gp3_AN  i_1000gp3_CIEND i_1000gp3_CIPOS i_1000gp3_CS    i_1000gp3_DP    i_1000gp3_EAS_AF        
i_1000gp3_END   i_1000gp3_EUR_AF        i_1000gp3_IMPRECISE     i_1000gp3_MC    i_1000gp3_MEINFO        i_1000gp3_MEND  i_1000gp3_MLEN  i_1000gp3_MSTART      i_1000gp3_NS    i_1000gp3_SAS_AF        i_1000gp3_SVLEN i_1000gp3_SVTYPE        i_1000gp3_TSD   i_ACHILLES_Lineage_Results_Top_Genes    i_BAM_File      i_CGC_Cancer Germline Mut       i_CGC_Cancer Molecular Genetics i_CGC_Cancer Somatic Mut      i_CGC_Cancer Syndrome   i_CGC_Chr       i_CGC_Chr Band  i_CGC_GeneID    i_CGC_Name      i_CGC_Other Germline Mut        i_CGC_Tissue Type       i_COSMIC_n_overlapping_mutations        i_COSMIC_overlapping_mutation_descriptions    i_COSMIC_overlapping_primary_sites      i_ClinVar_ASSEMBLY      i_ClinVar_HGMD_ID       i_ClinVar_SYM   i_ClinVar_TYPE  i_ClinVar_rs    i_ESP_AA        i_ESP_AAC       i_ESP_AA_AC     i_ESP_AA_AGE    i_ESP_AA_GTC    i_ESP_AvgAAsampleReadDepth    i_ESP_AvgEAsampleReadDepth      i_ESP_AvgSampleReadDepth        i_ESP_CA        i_ESP_CDP       i_ESP_CG        i_ESP_CP        i_ESP_Chromosome        i_ESP_DBSNP     i_ESP_DP        i_ESP_EA_AC     i_ESP_EA_AGE i_ESP_EA_GTC     i_ESP_EXOME_CHIP        i_ESP_FG        i_ESP_GL        i_ESP_GM        i_ESP_GS        i_ESP_GTC       i_ESP_GTS       i_ESP_GWAS_PUBMED       i_ESP_MAF       i_ESP_PH        i_ESP_PP        i_ESP_Position  i_ESP_TAC    i_ESP_TotalAAsamplesCovered      i_ESP_TotalEAsamplesCovered     i_ESP_TotalSamplesCovered       i_EVS_AA        i_EVS_All       i_EVS_EA        i_Ensembl_so_accession  i_Ensembl_so_term       i_Entrez_Gene_Id        i_Familial_Cancer_Genes_Reference     i_Familial_Cancer_Genes_Synonym i_HGNC_Accession Numbers        i_HGNC_CCDS IDs i_HGNC_Chromosome       i_HGNC_Date Modified    i_HGNC_Date Name Changed        i_HGNC_Date Symbol Changed      i_HGNC_Ensembl Gene ID  i_HGNC_Ensembl ID(supplied by Ensembl)        i_HGNC_Enzyme IDs       i_HGNC_Gene family description  i_HGNC_HGNC ID  i_HGNC_Locus Group      i_HGNC_Locus Type       i_HGNC_Name Synonyms    
i_HGNC_OMIM ID(supplied by NCBI)        i_HGNC_Previous Names i_HGNC_Previous Symbols i_HGNC_Primary IDs      i_HGNC_Pubmed IDs       i_HGNC_Record Type      i_HGNC_RefSeq(supplied by NCBI) i_HGNC_Secondary IDs    i_HGNC_Status   i_HGNC_Synonyms i_HGNC_UCSC ID(supplied by UCSC)        i_HGNC_UniProt ID(supplied by UniProt)        i_HGNC_VEGA IDs i_HGVS_coding_DNA_change        i_HGVS_genomic_change   i_HGVS_protein_change   i_ORegAnno_bin  i_UniProt_alt_uniprot_accessions        i_Variant_Classification        i_Variant_Typei_all_domains_WU        i_amino_acid_change_WU  i_annotation_transcript i_build i_c_position_WU i_ccds_id       i_chromosome_name_WU    i_dbNSFP_1000Gp1_AC     i_dbNSFP_1000Gp1_AF     i_dbNSFP_1000Gp1_AFR_AC i_dbNSFP_1000Gp1_AFR_AF i_dbNSFP_1000Gp1_AMR_AC       i_dbNSFP_1000Gp1_AMR_AF i_dbNSFP_1000Gp1_ASN_AC i_dbNSFP_1000Gp1_ASN_AF i_dbNSFP_1000Gp1_EUR_AC i_dbNSFP_1000Gp1_EUR_AF i_dbNSFP_Ancestral_allele       i_dbNSFP_CADD_phred     i_dbNSFP_CADD_raw       i_dbNSFP_CADD_raw_rankscore   i_dbNSFP_ESP6500_AA_AF  i_dbNSFP_ESP6500_EA_AF  i_dbNSFP_Ensembl_geneid i_dbNSFP_Ensembl_transcriptid   i_dbNSFP_FATHMM_pred    i_dbNSFP_FATHMM_rankscore       i_dbNSFP_FATHMM_score   i_dbNSFP_GERP++_NR      i_dbNSFP_GERP++_RS    i_dbNSFP_GERP++_RS_rankscore    i_dbNSFP_Interpro_domain        i_dbNSFP_LRT_Omega      i_dbNSFP_LRT_converted_rankscore        i_dbNSFP_LRT_pred       i_dbNSFP_LRT_score      i_dbNSFP_LR_pred        i_dbNSFP_LR_rankscore   i_dbNSFP_LR_score     i_dbNSFP_MutationAssessor_pred  i_dbNSFP_MutationAssessor_rankscore     i_dbNSFP_MutationAssessor_score i_dbNSFP_MutationTaster_converted_rankscore     i_dbNSFP_MutationTaster_pred    i_dbNSFP_MutationTaster_score   i_dbNSFP_Polyphen2_HDIV_pred  i_dbNSFP_Polyphen2_HDIV_rankscore       i_dbNSFP_Polyphen2_HDIV_score   i_dbNSFP_Polyphen2_HVAR_pred    i_dbNSFP_Polyphen2_HVAR_rankscore       i_dbNSFP_Polyphen2_HVAR_score   i_dbNSFP_RadialSVM_pred i_dbNSFP_RadialSVM_rankscore  i_dbNSFP_RadialSVM_score  
      i_dbNSFP_Reliability_index      i_dbNSFP_SIFT_converted_rankscore       i_dbNSFP_SIFT_pred      i_dbNSFP_SIFT_score     i_dbNSFP_SLR_test_statistic     i_dbNSFP_SiPhy_29way_logOdds    i_dbNSFP_SiPhy_29way_logOdds_rankscore        i_dbNSFP_SiPhy_29way_pi i_dbNSFP_UniSNP_ids     i_dbNSFP_Uniprot_aapos  i_dbNSFP_Uniprot_acc    i_dbNSFP_Uniprot_id     i_dbNSFP_aaalt  i_dbNSFP_aapos  i_dbNSFP_aapos_FATHMM   i_dbNSFP_aapos_SIFT  i_dbNSFP_aaref   i_dbNSFP_cds_strand     i_dbNSFP_codonpos       i_dbNSFP_fold-degenerate        i_dbNSFP_genename       i_dbNSFP_hg18_pos(1-coor)       i_dbNSFP_phastCons100way_vertebrate     i_dbNSFP_phastCons100way_vertebrate_rankscorei_dbNSFP_phastCons46way_placental        i_dbNSFP_phastCons46way_placental_rankscore     i_dbNSFP_phastCons46way_primate i_dbNSFP_phastCons46way_primate_rankscore       i_dbNSFP_phyloP100way_vertebrate        i_dbNSFP_phyloP100way_vertebrate_rankscore    i_dbNSFP_phyloP46way_placental  i_dbNSFP_phyloP46way_placental_rankscore        i_dbNSFP_phyloP46way_primate    i_dbNSFP_phyloP46way_primate_rankscore  i_dbNSFP_refcodon       i_default_gene_name_WU  i_deletion_substructures_WU   i_domain_WU     i_ensembl_gene_id       i_entrez_gene_id        i_gc_content_full       i_gencode_transcript_name       i_gencode_transcript_status     i_gencode_transcript_tags       i_gencode_transcript_type       i_gene_name_WUi_gene_name_source_WU   i_gene_type     i_havana_transcript     i_normal_ref_reads      i_normal_vaf    i_normal_var_reads      i_reference_WU  i_refseq_mrna_id        i_secondary_variant_classification      i_start_WU      i_stop_WU    i_strand_WU      i_transcript_error_WU   i_transcript_name_WU    i_transcript_source_WU  i_transcript_species_WU i_transcript_status_WU  i_transcript_version_WU i_trv_type_WU   i_tumor_ref_reads       i_tumor_vaf     i_tumors_var_reads   i_type_WU        i_ucsc_cons_WU  i_variant_WU
GRID2   2895    genome.wustl.edu        37      4       94547528        94547529        +       Missense_Mutation       DNPCG       CG      GA                      TCGA-E2-A150-01A-11D-A12B-09    TCGA-E2-A150-10A-01D-A12B-09    C|G     C|G        Unknown  Untested        Somatic Phase_IV        WXS     none    1               Illumina GAIIx  446064de-ff64-4113-9080-360e5bf6d5e4        17e424ec-2364-4fa2-ae06-9fa4a409fe3e    g.chr4:94547528_94547529CG>GA   ENST00000282020.4       +       14 2560_2561        c.2302_2303CG>GA        c.(2302-2304)CGg>GAg    p.R768E GRID2_ENST00000510992.1_Missense_Mutation_p.R673E  NM_001510.2      NP_001501.2     O43424  GRID2_HUMAN     glutamate receptor, ionotropic, delta 2 768                        cellular protein localization (GO:0034613)|cerebellar granule cell differentiation (GO:0021707)|glutamate receptor signaling pathway (GO:0007215)|heterophilic cell-cell adhesion (GO:0007157)|ion transmembrane transport (GO:0034220)|ionotropic glutamate receptor signaling pathway (GO:0035235)|prepulse inhibition (GO:0060134)|regulation of excitatory postsynaptic membrane potential (GO:0060079)|regulation of neuron apoptotic process (GO:0043523)|regulation of neuron projection development (GO:0010975)|synaptic transmission, glutamatergic (GO:0035249)|transport (GO:0006810)        cell junction (GO:0030054)|dendrite (GO:0030425)|dendritic spine (GO:0043197)|integral component of plasma membrane (GO:0005887)|ionotropic glutamate receptor complex (GO:0008328)|plasma membrane (GO:0005886)|postsynaptic membrane (GO:0045211)|synapse (GO:0045202)        extracellular-glutamate-gated ion channel activity (GO:0005234)|glutamate receptor activity (GO:0008066)|ionotropic glutamate receptor activity (GO:0004970)|PDZ domain binding (GO:0030165)|scaffold protein binding (GO:0097110)                      
NS(2)|breast(2)|central_nervous_system(1)|endometrium(6)|kidney(4)|large_intestine(18)|lung(45)|ovary(3)|prostate(6)|skin(9)|upper_aerodigestive_tract(3)|urinary_tract(1)      100             Hepatocellular(203;0.114)|all_hematologic(202;0.177)            OV - Ovarian serous cystadenocarcinoma(123;3.22e-06)|LUSC - Lung squamous cell carcinoma(81;0.185)|Lung(65;0.191)           TGTTGCTGATCGGGGATATGGA      0.386                                                                                              dbGAP                                                                                    0                                  --       -       SO:0001583      missense        0                       AF009014        CCDS3637.1, CCDS68758.1 4q22    2012-08-29                  ENSG00000152208 ENSG00000152208         """Ligand-gated ion channels / Glutamate receptors, ionotropic"", ""Glutamate receptors"""  4576    protein-coding gene     gene with protein product               602368             9465309  Standard        NM_001510               Approved        GluD2, GluR-delta-2     uc011cdt.2      O43424  OTTHUMG00000130975  Exception_encountered   4.37:g.94547528_94547529delinsGA        ENSP00000282020:p.Arg768Glu             E9PH24|Q4KKU8|Q4KKU9|Q4KKV0|Q59FZ1  Missense_Mutation       SNP     pfam_Iontro_glu_rcpt,pfam_ANF_lig-bd_rcpt,pfam_SBP_bac_3,pfam_Glu_rcpt_Glu/Gly-bd,smart_Iontro_glu_rcpt,smart_Glu_rcpt_Glu/Gly-bd,prints_NMDA_rcpt  p.R768G|p.R768Q ENST00000282020.4       37 c.2302|c.2303    CCDS3637.1      4                                                                                          GRID2    -       pfam_Iontro_glu_rcpt,pfam_SBP_bac_3,smart_Iontro_glu_rcpt       ENSG00000152208         0.386   GRID2-001  KNOWN    basic|appris_principal|CCDS     protein_coding  GRID2   HGNC    protein_coding  OTTHUMT00000253588.2    486     0.000       C|G                     94547528|94547529       94547528|94547529       +1      no_errors       ENST00000282020 ensembl 
    human   known   69_37n  missense        293|290 18.78|18.99     68      SNP     1.000   G|A

Add ability to use VEP's custom annotations feature

For example, if someone had a tabix-indexed ExAC VCF, they could append minor allele frequencies to the resulting MAF as follows:

perl variant_effect_predictor.pl -i example_GRCh37.vcf -cache -custom ExAC.r0.3.sites.vep.vcf.gz,ExAC,vcf,exact,0,AF

Ensembl VEP "--cache-version" not available

This option is required to run the VEP scripts with a cache of a different version. The current behavior of vcf2maf is to look for a cache of the same version, and fail if it cannot find one.

maf2maf should annotate a single VCF instead of per-TN pair

For MAFs where each sample has very few variants (like MSK-IMPACT MAFs), maf2maf will create a small VCF per tumor-normal pair, and run each through VEP. Disk I/O becomes a bottleneck, and --vep-forks does not provide any speedup. So the alternative is:

maf2vcf must create a single multi-sample VCF for a given MAF, preferably with genotypes properly filled (Use GT=./. under samples without a variant). It must also generate a T-N pairing TSV file that maf2maf can use
vcf2maf must be able to skip variants (by default) where GT=./. in the given tumor/normal sample, to generate a MAF for a T-N pair, from a pre-annotated multi-sample VCF that may contain more variants than in those T-N samples
maf2maf will run maf2vcf to create a multi-sample VCF, and annotate it using VEP. The resulting big VCF will be passed to vcf2maf for each T-N pair, and the STDOUT fed into a final MAF
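
The skipping step for vcf2maf proposed above could look roughly like this. It is only a sketch: the helper name sample_has_call is hypothetical, it assumes standard tab-delimited VCF data lines with FORMAT in the ninth column, and it treats only "./." and "." as missing genotypes:

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the sample in column $sample_idx (0-based)
# of a VCF data line has a called genotype, or 0 if GT is absent, "./." or ".".
sub sample_has_call {
    my ( $vcf_line, $sample_idx ) = @_;
    chomp $vcf_line;
    my @cols = split( /\t/, $vcf_line );
    my @format_keys = split( /:/, $cols[8] ); # FORMAT is the 9th VCF column
    my @sample_vals = split( /:/, $cols[$sample_idx] );
    my %sample;
    @sample{@format_keys} = @sample_vals; # Map FORMAT keys to this sample's values
    my $gt = $sample{GT};
    return ( defined $gt and $gt ne './.' and $gt ne '.' ) ? 1 : 0;
}

# A made-up line with two sample columns: one called, one missing
my $line = join( "\t", "1", "100", ".", "A", "G", ".", "PASS", ".", "GT:AD", "0/1:10,5", "./.:." );
printf "sample 1 called: %d, sample 2 called: %d\n",
    sample_has_call( $line, 9 ), sample_has_call( $line, 10 );
```

A per-pair run would then skip any line where this check returns 0 for the tumor sample column.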

Possible bug?

I just noticed this. On ~ line 359

Currently:

$tum_info{GT} = "./." unless( defined $tum_info{GT} and $tum_info{GT} ne '.' );

Should it not be:

$nrm_info{GT} = "./." unless( defined $nrm_info{GT} and $nrm_info{GT} ne '.' );

Strand incorrectly replaced with "-1" or "1"

VEP strand replaces Strand column in maf.

Tumor_Seq_Allele1 same as Reference_Allele?

Hey, Cyriac. Thanks for developing this tool. It's been of great use to us in my lab.

I had a question regarding the allele columns, namely Reference_Allele, Tumor_Seq_Allele1 and Tumor_Seq_Allele2. For some variants, Tumor_Seq_Allele1 is the same as Reference_Allele, and Tumor_Seq_Allele2 is different. For others, you have what I would expect, which is Tumor_Seq_Allele1 and Tumor_Seq_Allele2 are both different from Reference_Allele.

I've included two MAF rows below to show the two cases I described above. Unfortunately, I didn't find the column descriptions on the NCI Wiki to be all that helpful. Note that I'm using vcf2maf v1.5.0. Am I missing something about these columns that would explain why I'm seeing this in my MAF files? Thanks for your help!

#version 2.4
Hugo_Symbol Entrez_Gene_Id  Center  NCBI_Build  Chromosome  Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele1   Tumor_Seq_Allele2   dbSNP_RS    dbSNP_Val_Status    Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1    Tumor_Validation_Allele2    Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2   Verification_Status Validation_Status   Mutation_Status Sequencing_Phase    Sequence_Source Validation_Method   Score   BAM_File    Sequencer   Tumor_Sample_UUID   Matched_Norm_Sample_UUID    HGVSc   HGVSp   HGVSp_Short Transcript_ID   Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  ALLELE_NUM  DISTANCE    STRAND  SYMBOL  SYMBOL_SOURCE   HGNC_ID BIOTYPE CANONICAL   CCDS    ENSP    SWISSPROT   TREMBL  UNIPARC RefSeq  SIFT    PolyPhen    EXON    INTRON  DOMAINS GMAF    AFR_MAF AMR_MAF ASN_MAF EUR_MAF AA_MAF  EA_MAF  CLIN_SIG    SOMATIC PUBMED  MOTIF_NAME  MOTIF_POS   HIGH_INF_POS    MOTIF_SCORE_CHANGE
PUM1    0   .   GRCh37  1   31482305    31482305    +   Intron  SNP A   G   G   novel       Tumour  Normal  A   A                                                               c.433-2356N>C           ENST00000426105     37  6   30  24  23  0   PUM1,intron_variant,,ENST00000373747;PUM1,intron_variant,,ENST00000257075;PUM1,intron_variant,,ENST00000424085;PUM1,intron_variant,,ENST00000373741;PUM1,intron_variant,,ENST00000426105;PUM1,intron_variant,,ENST00000440538;PUM1,intron_variant,,ENST00000423018;PUM1,intron_variant,,ENST00000373742;    G   ENSG00000134644 ENST00000426105 Transcript  intron_variant  -/4043  -/3567  -/1188              1       -1  PUM1    HGNC    14957   protein_coding  YES CCDS44099.1 ENSP00000391723 PUM1_HUMAN  E9PL65_HUMAN    UPI0000203D8E                   3/21                                                            
RBBP4   0   .   GRCh37  1   33129557    33129557    +   Intron  SNP T   T   G   novel       Tumour  Normal  T   T                                                               c.311-4269N>G           ENST00000373493     39  7   25  30  25  0   RBBP4,intron_variant,,ENST00000373493;RBBP4,intron_variant,,ENST00000414241;RBBP4,intron_variant,,ENST00000458695;RBBP4,intron_variant,,ENST00000544435;RBBP4,intron_variant,,ENST00000373485;  G   ENSG00000162521 ENST00000373493 Transcript  intron_variant  -/7943  -/1278  -/425               1       1   RBBP4   HGNC    9887    protein_coding  YES CCDS366.1   ENSP00000362592 RBBP4_HUMAN H0YCT5_HUMAN,E9PND5_HUMAN,C9JPP3_HUMAN,B4DRT0_HUMAN UPI000013318C   NM_005610.2,NM_001135255.1              3/11

By the way, I went back to the VCF file for these variants to confirm the REF and ALT columns.

1   31482305    .   A   G   .   PASS    NT=ref;QSS=65;QSS_NT=63;SGT=AA->AG;SOMATIC;TQSS=2;TQSS_NT=1
1   33129557    .   T   G   .   PASS    NT=ref;QSS=34;QSS_NT=34;SGT=TT->GT;SOMATIC;TQSS=2;TQSS_NT=2

Variants being assigned to wrong sample

maf2maf v1.6.1 with VEP v81 was run on our Luna cluster with the following I/O:

INPUT = /ifs/res/pwg/data/tcga_mafs/debug/tcga_blca_from_dcc.maf
OUTPUT = /ifs/res/pwg/data/tcga_mafs/debug/tcga_blca_from_dcc.vep.maf

Note that the output MAF has more variants than the input. On closer inspection, this is because maf2maf does not seem to properly handle a tumor ID that has more than one matched normal. Please fix by adding the normal ID to the variant hash keys.
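
As a rough sketch of the requested fix, a key that includes both sample IDs keeps the same variant distinct across two normals matched to one tumor. The function variant_key and the field order are assumptions for illustration only, since the actual key used inside maf2maf is not shown here:

```perl
use strict;
use warnings;

# Hypothetical key builder: include both sample IDs so that one tumor with
# two matched normals yields two distinct keys for the same locus and alleles.
sub variant_key {
    my ( $tumor_id, $normal_id, $chrom, $pos, $ref, $alt ) = @_;
    return join( "\t", $tumor_id, $normal_id, $chrom, $pos, $ref, $alt );
}

# Made-up sample IDs T1, N1 and N2: same tumor, same variant, two normals
my %seen;
$seen{ variant_key( "T1", "N1", "1", 31482305, "A", "G" ) } = 1;
$seen{ variant_key( "T1", "N2", "1", 31482305, "A", "G" ) } = 1;
print scalar( keys %seen ), " distinct keys\n";
```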

--normal-id getting passed in as value for --tumor-id intermittently

I've narrowed this down to the following section of code in maf2maf:

# For each VCF generated by maf2vcf above, construct a vcf2maf command and run it
my @vcfs = grep{ !m/.vep.vcf$/ and !m/$vcf_file/ } glob( "$tmp_dir/*.vcf" ); # Avoid reannotating annotated VCFs
foreach my $tn_vcf ( @vcfs ) {
    my ( $tumor_id, $normal_id ) = $tn_vcf=~m/^.*\/(.*)_vs_(.*)\.vcf/;
    my $tn_maf = $tn_vcf;
    $tn_maf =~ s/.vcf$/.vep.maf/;
    my $vcf2maf_cmd = "$perl_bin $vcf2maf_path --input-vcf $tn_vcf --output-maf $tn_maf " .
        "--tumor-id $tumor_id --normal-id $normal_id --vep-path $vep_path --vep-data $vep_data " .
        "--vep-forks $vep_forks --ref-fasta $ref_fasta --ncbi-build $ncbi_build --species $species";
    $vcf2maf_cmd .= " --custom-enst $custom_enst_file" if( $custom_enst_file );
    system( $vcf2maf_cmd ) == 0 or die "\nERROR: Failed to run vcf2maf!\nCommand: $vcf2maf_cmd\n";
}

What seems to be happening is if $tumor_id is empty, it gets put into the command, which causes --normal-id to be set as value for --tumor-id (tested this argument/value behavior locally with another program - it simply takes whatever follows the argument as the value).

Looking at the bad record, I also notice that what appears to be the actual Tumor_Sample_Barcode is in the Match_Norm_Seq_Allele1 column.

Here is how we are calling maf2maf in our java code:

/opt/common/CentOS_6/vcf2maf/v1.6.2/maf2maf.pl --vep-path /opt/common/CentOS_6/vep/v81 --vep-data /opt/common/CentOS_6/vep/v81 --ref-fasta
/ssd-data/cmo/opt/vep/v79/homo_sapiens/79_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
--retain-cols
hugo_symbol,entrez_gene_id,center,ncbi_build,chromosome,start_position,end_position,strand,variant_classification,variant_type,reference_allele,tumor_seq_allele1,tumor_seq_allele2,dbsnp_rs,dbsnp_val_status,tumor_sample_barcode,matched_norm_sample_barcode,match_norm_seq_allele1,match_norm_seq_allele2,tumor_validation_allele1,tumor_validation_allele2,match_norm_validation_allele1,match_norm_validation_allele2,verification_status,validation_status,mutation_status,sequencing_phase,sequence_source,validation_method,score,bam_file,sequencer,tumor_sample_uuid,matched_norm_sample_uuid,t_depth,t_ref_count,t_alt_count,n_depth,n_ref_count,n_alt_count,high_inf_pos,motif_score_change,impact
--input-maf /data/zack/tmp/import-tmp/1448309457235.sanitizedMAF --output-maf /data/zack/tmp/import-tmp/1448309446721-0/annotator_out.maf --tmp-dir /ssd-data/vep-tmp/ben --vep-forks 32 --custom-enst /opt/common/CentOS_6/vcf2maf/v1.6.2/data/isoform_overrides_at_mskcc

We are also calling this on a "sanitized" maf, which @qwangmsk has an example of.

Would it make sense, for the time being, to place an empty check on the tumor_id and normal_id scalars, simply to alert us to the data causing the problem the next time it comes up? We haven't been able to reproduce it, and at no other point in the java would it make sense for the --normal-id to show up in the data. This seems to be the only spot where it could potentially find its way in.
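
A minimal sketch of such an empty check is below. It reuses the filename pattern from the maf2maf snippet quoted above, but the helper parse_tn_ids and its error message are hypothetical, not existing maf2maf code:

```perl
use strict;
use warnings;

# Hypothetical helper: extract tumor and normal IDs from a per-pair VCF path
# of the form <dir>/<tumor>_vs_<normal>.vcf, and die if either ID is empty
# or the path does not match the pattern at all.
sub parse_tn_ids {
    my ( $tn_vcf ) = @_;
    my ( $tumor_id, $normal_id ) = $tn_vcf =~ m/^.*\/(.*)_vs_(.*)\.vcf/;
    foreach my $id ( $tumor_id, $normal_id ) {
        die "ERROR: Empty tumor or normal ID parsed from $tn_vcf\n"
            unless( defined $id and $id ne '' );
    }
    return ( $tumor_id, $normal_id );
}

# Made-up path for illustration
my ( $t, $n ) = parse_tn_ids( "/tmp/TUMOR1_vs_NORMAL1.vcf" );
print "tumor=$t normal=$n\n";
```

Dying before the vcf2maf command is built would report the offending file name instead of letting --normal-id be consumed as the value of --tumor-id.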

Write proper unit tests and plug into Jenkins

VEP setup is a chore and Travis CI or other container-based CI services cannot handle the data-size and pre-test setup time. We have Jenkins running on our compute cluster, where VEP is already set up. Write tests that trigger on each push to github, and send pass/fail status to a github badge.

Annotate hg19 vcf

Hi Cyriac,
With the previous version of vcf2maf I had no problem annotating an hg19 VCF, even with VEP 82, but with the current 1.6.3 I am getting the error below. I was trying to figure out whether you made some change, but I did not find anything obvious. Do you have any suggestions (other than sed/awk)?

Thanks!
Alessandro

WARNING: Could not fetch sub-slice from 1:888639-888639(1) on line 210
WARNING: Specified reference allele T does not match Ensembl reference allele on line 210
WARNING: Could not fetch sub-slice from 1:909238-909238(1) on line 211
WARNING: Specified reference allele G does not match Ensembl reference allele on line 211
WARNING: Could not fetch sub-slice from 1:955597-955597(1) on line 212

Unrecognized effect "exon_loss_variant"

I'm using 90968ee and get this error:

Unrecognized effect "exon_loss_variant". Please update your hashes! at tools/vcf2maf.pl line 413, <GEN0> line 177.

It's not in the hashes right now. Should it be?

BR
Daniel

_sice

really?

vcf2maf with mouse genome?

Hi,

Is it possible to use vcf2maf with a mouse genome? When I try to run it, it says:

ERROR: Reference FASTA not found: /home/.vep/homo_sapiens/83_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

The vep program seems to have put the mouse genome in this location already. Is there any option to specify mouse?

[~/ .vep]$ ls
mus_musculus mus_musculus_vep_82_GRCm38.tar.gz variant_effect_output.txt variant_effect_output.txt_summary.html

Thanks a lot!