zwdzwd / transvar Goto Github PK

View Code? Open in Web Editor NEW

115.0 115.0 34.0 5.54 MB

TransVar - multiway annotator for precision medicine

License: Other

Python 98.00% Shell 1.77% Dockerfile 0.23%

transvar's People

Contributors

Stargazers

Watchers

transvar's Issues

c.-22A>G works, c.-19-3A>G doesn't

Hi there,
Thanks for such a great tool, I've found transvar to be a massive time saver. I spotted an issue that would be great if it was addressed at some point. For intronic variants there are (at least) two notations:
NM_007294.3:c.-19-3A>G
NM_007298.3:c.-22A>G

The second one works whereas the second one produces empty fields.

See https://www.ncbi.nlm.nih.gov/clinvar/variation/125471/

Is this something you could look at?

http://transvar.info/transvar_user/annotations/ is not correct?

We can't open this website, even we use the online method to download, the prompt given is that this website does not have this database information

Querying upstream and downstream (including 5'UTR and 3'UTR)

Hi, Thanks for making this software available on Github. It is a very great tool, which makes my life easier to retrieve the genomic position when looking at variations.

I have a few questions regarding retrieving positions upstream and downstream for a gene. If the mutation is upstream of the starting codon (either 5' flanking region or 5'UTR), the HGVS notation usually starts with a negative sign. For example "KCNJ11_c.-134G>T" (http://www.ncbi.nlm.nih.gov/clinvar/variation/8674/) . The negative notation fails on the software as it complains "invalid position string". A work-around solution is to replace the negative notation with "1-". If the query is modified as KCNJ11_c.1-134G>T, it works.

On the other hand, working with the downstream mutation is a bit tricky, for example c.*143A>T. Since it is not trivial to get the position of the stop codon, I am wondering if there is a way to handle this situation. Thanks!!

Cannot index against the newest version of refseq

indexing the newest version of refseq throws an error due to some entries that lack both a 'Name' and 'product' key when the t.name property is set for 'tRNA'.
transvar index --refseq hg38.refseq.gff.gz

config.py links for raw transcript tables

link from config.py
fns[('hg38', 'raw')] = [
('raw_refseq', 'hg38.refseq.gff.gz',
'ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz'),

is already in archive
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.109/GFF/ref_GRCh38.p12_top_level.gff3.gz

m.b. it is would be better or useful to add some generic download links for resources like these? or to move them into separate "additional config-like" file.

chr8:g.145737904C>T protein change failed

Hi,
It failed to annotate the protein change correctly when transvar the following variant:
`transvar ganno --refseq -i 'chr8:g.145737904C>T'

input transcript gene strand coordinates(gDNA/cDNA/protein) region info

chr8:g.145737904C>T NM_004260.3 (protein_coding) RECQL4 - chr8:g.145737904C>T/c.2927G>A/p.976 inside_[cds_in_exon_17] CSQN=Synonymous;dbsnp=rs35070885(chr8:145737904C>T);codon_pos=145737903-145737904-145737905;ref_codon_seq=TGA;dbxref=GeneID:9401,HGNC:9949,MIM:603780;aliases=NP_004251;source=RefSeq`

transvar --version TransVar Version 2.5.9.20190830

See https://www.ncbi.nlm.nih.gov/clinvar/variation/406950/

Is this something you could look at?

How to get transcript version number?

Sometimes we need to know the version of transcript , so I want to how to get it by Transvar?

Performance issue in faidx.py fetch_sequence

I noticed a 10x performance drop compared to my older transvar version (https://bitbucket.org/wanding/transvar/commits/8a7a774618174bd591e8821b9c7c7fd5c03ce8c4) for some variants.
I traced back the performance drop to the addition of decode() the fetch_sequence function, which convert seq from str to unicode and apparently the concatenation of unicode is way slower than that of str

transvar/transvar/faidx.py

Lines 81 to 83 in 28a725d

    
           line=self.fasta_handle.readline().decode() 
        
           line=line[:-1] #Remove newline symbols 
        
           seq=seq+line

I suggest to only concatenate the unicode at the end of the loop or remove the decode()

test.vcf.gz

transvar ganno --vcf test.vcf.gz --refversion hg19 --ccds

Current version: 46.5366 s
Version without decode(): 5.30291 s
Version without one join(): 9.85117 s

Bug in annotation download code?

Good afternoon,

When I try to download the annotations for hg19 as shown in the first example in the quick start documentation, I get a series of file access errors:

% transvar config --download_anno --refversion hg19 [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.gene_idx or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.trxn_idx or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.loc_idx or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.loc_idx.tbi or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.ccds.txt.transvardb or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.ccds.txt.transvardb.gene_idx or target directory not found
(many other lines are deleted)

I think that there is a bug in the code that is generating the filenames for download, because if I strip off the transvardb suffixes, the downloads work in my web browser. Is this a known problem, and are there any plans to fix it?

Thanks,
Andy

No `CdsStartLoss` and `CdsStopLoss` functionality

It appears that whilst CdsStartLoss and CdsStopLoss appear in the documentation, this functionality does not exist.

ganno batch example

Please could you post an example of a ganno (genomic co-ordinates to HGVS) batch input example file and the command I would use to run it?

e.g. if I had a file a bit like this (only a quick example)

# chrom pos ref alt preferred_transcript
chr6 1234 A T ENST001234
chr10 4321 A C ENST004321
etc...

Whats the ganno command to run to run transvar and return the corresponding HGVS co-ordinates?

Thanks in advance!

panno mode couldn't find coordinate Ensembl transcript ID

I used panno mode to transfer protein region to genomic region in reference genome hg19.
For example, I want to get information on gene 'KMT2B' 's transcript ID ENST00000222270

When I searched:

transvar panno -i 'KMT2B' --ensembl

I got :

input   transcript      gene    strand  coordinates(gDNA/cDNA/protein)  region  info
KMT2B   .       .       .       ././.   .       no_valid_transcript_found

I downloaded gtf file from https://zhouserver.research.chop.edu/TransVar/annotations/hg19.ensembl.gtf.gz.

ENST00000222270 is in the database:

zcat hg19.ensembl.gtf.gz | grep ENST00000222270  |head
19      protein_coding  transcript      36208921        36229779        .       +       .       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055";
19      protein_coding  exon    36208921        36209283        .       +       .       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "1"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; exon_id "ENSE00001625606";
19      protein_coding  CDS     36208921        36209283        .       +       0       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "1"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; protein_id "ENSP00000222270";
19      protein_coding  start_codon     36208921        36208923        .       +       0       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "1"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055";
19      protein_coding  exon    36210371        36210443        .       +       .       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "2"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; exon_id "ENSE00000699758";
19      protein_coding  CDS     36210371        36210443        .       +       0       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "2"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; protein_id "ENSP00000222270";
19      protein_coding  exon    36210686        36212706        .       +       .       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "3"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; exon_id "ENSE00001388562";
19      protein_coding  CDS     36210686        36212706        .       +       2       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "3"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; protein_id "ENSP00000222270";
19      protein_coding  exon    36213261        36213374        .       +       .       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "4"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; exon_id "ENSE00000699705";
19      protein_coding  CDS     36213261        36213374        .       +       0       gene_id "ENSG00000272333"; transcript_id "ENST00000222270"; exon_number "4"; gene_name "KMT2B"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "KMT2B-201"; transcript_source "ensembl"; tag "CCDS"; ccds_id "CCDS46055"; protein_id "ENSP00000222270";

But not in the transvardb

 grep ENST00000222270 hg19.ensembl.gtf.gz.transvardb

Can somebody tell me why? Thanks!

Can't index from GTF files for C. elegans

Good evening,

I am trying to analyze variants in C. elegans genome. I am using genome build WS273. I have all files necessary downloaded (genome sequence, index and gtf feature file).

I tried to run
transvar index --refseq path_to_ws273_feature_file.gtf

this returned:

traceback (most recent call last):
File "/home/nik/Programs/anaconda3/bin/transvar", line 197, in
args.func(args)
File "/home/nik/Programs/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 1169, in main_index
db.index([args.refseq])
File "/home/nik/Programs/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 335, in index
self.parse_raw(*raw_fns)
File "/home/nik/Programs/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 773, in parse_raw
info = dict([_.split('=') for _ in fields[8].split(';')])
ValueError: dictionary update sequence element #0 has length 1; 2 is required

I have two feature files, gtf and gff3. Both returned the same error message.
Any help would be great.
Thank you!

Deviation from HGVS recommendations

Below a second issue concerning a called variant, which deviates from the HGVS guidelines.
Using the current version (transvar v2.4.1) we use, the output of a specific variant is the following:

SRSF2
c.284_307del24
p.(Pro95_Arg102delProProAspSerHisHisSerArg)

However, following the HGVS guidelines, the 24 should not be reported, as this is excess information.
(https://varnomen.hgvs.org/recommendations/DNA/variant/deletion/)

The same applies on protein level where the deleted amino acids should not be summed up:
(https://varnomen.hgvs.org/recommendations/protein/variant/deletion/)

I compared the updates on github (our current version 2.4.1 with the last new updated version (v2.5.9)) but could not see any commit solving this issue.

Is there any adaptation on transvar possible in order to get the correct HGVS-recommended variant description?

Thanks in advance!
Kind regards,

problem with transcript to genomic coordinates conversion when UTR spans more than 1 exon

transvar canno --refseq -i 'NM_000051.3:c.-30' --refversion hg19
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
NM_000051.3:c.-30 NM_000051.3 (protein_coding) ATM + chr11:g.108098322A/c.1-30A/. inside_[5-UTR;noncoding_exon_2] C2=NextToSpliceAcceptorOfExon2_At_chr11:108098321;dbxref=GeneID:472,HGNC:795,MIM:607585;aliases=NP_000042;source=RefSeq
transvar canno --refseq -i 'NM_000051.3:c.-31' --refversion hg19
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
NM_000051.3:c.-31 NM_000051.3 (protein_coding) ATM + chr11:g.108098321G/c.1-31G/. inside_[5-UTR;intron_between_exon_1_and_2] C2=SpliceAcceptorOfExon2_At_chr11:108098321;dbxref=GeneID:472,HGNC:795,MIM:607585;aliases=NP_000042;source=RefSeq

The position transvar gets for c.-31 is chr11:g.108098321. In fact, there is a large intron between c.-30 (the left most base of the exon 2) and c.-31 (the right most base of exon 1), and they are not next to each other. It seems transvar assumes UTR is always on the 1st exon.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Hello,

I came up with the following Exception:

input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/transvar", line 36, in <module>
    args.func(args)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/anno.py", line 211, in main
    main_one(args, db, at)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/anno.py", line 186, in main_one
    for q.gene in db.get_gene(q.tok):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/annodb.py", line 213, in get_gene
    for g in db.get(name):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 174, in get
    for g in self.get_by_alias(_name):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 224, in get_by_alias
    self.alias_idx = load(open(self.dbfn+'.alias_idx'))
  File "/home/ubuntu/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The command that I run was:

transvar canno -i 'NG_008377.1:c.-1289G>A' --refseq

Transvar version:

from transvar import version
print (version.__version__)

>>> '2.3.4.20161215'

Installation commands:

pip install transvar

Python version:

import sys
print(sys.version)

>>> '3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49) \n[GCC 7.2.0]'

Thanks.

build-in dbsnp download fail

hi, I just installed transvar (from pip) and following the manual in readthedoc, however, for transvar config --download_dbsnp, I got

[download] warning: file not available: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/00-All.vcf.gz or target directory not found
[download] warning: file not available: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/00-All.vcf.gz.tbi or target directory not found

I guess it was an out-dated link, and hope it will be fixed:)

what's the difference between gDNA and left_align_gDNA

From canno results, I noticed that mostly the gDNA position in region column is different from the 'left_align_gDNA' in info. But I didn't find more information about this.

And furthermore, there are 'unaligned_gDNA'. These concepts confuse a newbie in Bioinformatics like me, and would you please provide some definition or reference?

Thanks!

Does/will transvar support Windows?

I tried to use it on Windows. The installation failed, while looking for "unistd.h" (which is a unix file).

Thanks!

my test issue track

Something is wrong and come fix!

c. annotations don't align with 3' genomic annotations on reverse strand

When I try to annotate a cDNA such as TP53 c.872_890del19, the left_align_gDNA is the genomic coordinate that describes the 3'-most change on the reverse strand. However, the left_align_cDNA is shifted -2 base pair on the positive strand rather than returning the 3'-most cDNA position on the reverse strand (which is the query).

chr17:g.7577050_7577068del19/c.872_890del19/p.K291Tfs*48
inside_[cds_in_exon_8]
CSQN=Frameshift
left_align_gDNA=g.7577048_7577066del19
unaligned_gDNA=g.7577049_7577067del19
left_align_cDNA=c.870_888del19
unalign_cDNA=c.871_889del19
dbxref=GeneID:7157,HGNC:11998,MIM:191170
aliases=NP_000537
source=RefSeq

Why don't left align shifts increase the cDNA position when the gene is on the reverse strand?
Thanks for your help.

need more annotated intomation from vcf input

There are very simple annotation info when use vcf as input currently, as for ganno: no cDNA and protein mutation info in column 'coordinates', also no detail region info but only position between which telomere in column 'region'.

and seems canno/panno does not support --vcf but this arg is listed in command help info.

would it be possible to provide more info for vcf input, thanks!

Incomplete candidate_snv_variants and candidate_mnv_variants result

how to annotate using rsID?

how to annotate using rsID or entire chip?

can't download reference genome/annotation

Hi!

I'm trying to run transvar config --download_ref --refversion hg19 and transvar config --download_anno --refversion hg19. It will create a download folder (.transvar.download) in my home directory but nothing is in it. I've also tried to change the transvar download directory by specifying a path to it in my env variable like you all had suggested (ie. export TRANSVAR_DOWNLOAD_DIR=/diskmnt/Software/bin/transvar.download). Nothing happens and it seems to still create a folder in my home directory. Can you help me with this?

Thanks!
Sohini

Inconsistent position

When I submitted a mutation:
NM_001164277:c.925_928delinsTC

to Transvar
with these options:
Reverse annotation cDNA, GRCh37, RefSeq

I got "p.G309Sfs*3" and "chr11:g.118896734_118896737delinsGA"
for this mutation.

Although I checked this mutation on the UCSC Genome Browser,
the position chr11:g.118896734_118896737delinsGA is 308 of
NM_001164277(attached png).

Other similar services, Mutalyzer and VariantVaridator, returns "309"
, and it is consistent with the genome browser.

failed annotation complex variant

Hi,
We annotated vcf of clinical sample with a clinically-relevant insertion exon 20.

alignment bwa-mem
variant calling Vardict
annotation TransVar
It appears that TransVar cannot annotate following 'complex' mutation (in-frame insertion +adjacent SNV)

Vardict output vcf (marked line 78% VAF):

Transvar annotated and filtered:

-->no annotation. Any options possible to achieve adequate annotation?
Thanks for help!

discrepancy on genomic annotation

I have a variant located and annotated by my caller as CREBBP g.3781324_3781326del, c.5039_5041del, p.S1680del which when double-checked against your service returns CREBBP g.3781329_3781331del, c.5039_5041del, p.S1680del. The same variant checked against varsome provides a genomic start of chr6:3781324.

I was just curious as to the source of the discrepancy in the genomic positions. Is Transvar providing right alignment here instead of left?

Thanks in advance

Problems with UniProt support

I'm having problems converting amino-acid substitutions to genomic coordinates using UniProt identifiers, e.g.

$ transvar panno -i 'Q9BXW4:p.G126A' --uniprot --ccds
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q9BXW4:p.G126A	CCDS31074 (protein_coding)	MAP1LC3C	-	chr1:g.242159532C>G/c.377G>C/p.G126A	inside_[cds_in_exon_4]	CSQN=Missense;reference_codon=GGC;candidate_codons=GCA,GCC,GCG,GCT;candidate_mnv_variants=chr1:g.242159531_242159532delGCinsTG,chr1:g.242159531_242159532delGCinsCG,chr1:g.242159531_242159532delGCinsAG;source=CCDS

I'm getting the following output for quite a few proteins:

$ transvar panno -i 'Q99418:p.E156D' --uniprot --ccds
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
[wrap_exception] warning: seek out of range
Q99418:p.E156D	.	.	.	././.	.	Error_seek out of range
[wrap_exception] warning: seek out of range
Q99418:p.E156D	.	.	.	././.	.	Error_seek out of range

I've tried different locations and substitutions (e.g. -i 'Q99418:1') to no avail. Here are some more affected proteins: P04180, P45381, P45381, Q14289, P49916. Am I doing something wrong?

question concerning implementation HGVS 3´ rule of variants

Dear,

We are currently using transvar as annotator in our pipeline.
We are really satisfied with the software.

However, we have a question concerning the support of both left-alignment and right-alignment convention in reporting indels and duplications.

In a recent EQC-program, we identified a deletion in the EGFR gene as follows:
c.2240_2257del <-> HGVS c.2237_2257delinsTAT
p.(Leu747_Pro753delinsSer) <-> HGVS p.(Glu746_Pro753delinsValSer)

Following the HGVS guidelines, a 3’ rule for alignment is prefered. When reading the TRANSVAR manual, we believe reporting the correct variant is possible with transvar as it supports both left-alignment and right-alignment convention.

On p15 of the transvar manual (2.5.4 release), an example is shown in which a difference between left and right-aligned identifiers is shown. I now have 3 questions

Is the term ‘unalign_cDNA´ used for right_align_cDNA? I only see left_align_cDNA in the examples of the manual?
If we look in our raw data for this specific deletion mentioned above, there is no difference between the left_align_cDNA and unalign_cDNA description
left_align_gDNA=g.55242470_55242487del18;unaligned_gDNA=g.55242470_55242487del18;left_align_cDNA=c.2240_2257del18;unalign_cDNA=c.2240_2257del18;dbxref=GeneID:1956,HGNC:3236,HPRD:00579,MIM:131550;aliases=NP_005219;source=RefSeq
Might this issue be solved if we upgrade from 2.4.1 to 2.5.4 version of the software or is there another option we have to include in the code?

Looking forward hearing from you,
Thanks in advance,

Kind regards,
Katrien

transvar does not process hgvs expressions for inversions?

hi - in case this is checked by the developers, we are using a lot TransVar as it is such a great tool! all the praise for Wanding!

So - an issue that we recently found is that -unless we are doing something wrong- it seems that TransVar does not process inversions (ie it returns an empty result), for instance:

$ transvar ganno -i 'X:g.32386323_32386325inv' --ensembl --gseq
input transcript gene strand coordinates(gDNA/cDNA/protein) region info CHROM POS REF ALT

is there a solution for that?

thanks in advance!
/david

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I've spotted an issue with the latest release on Python 3:

Traceback (most recent call last):
  File "/Users/ilia/.cenvs/envs/mutagenesis/bin/transvar", line 36, in <module>
    args.func(args)
  File "/Users/ilia/.cenvs/envs/mutagenesis/lib/python3.6/site-packages/transvar/anno.py", line 197, in main
    db = AnnoDB(args, config)
  File "/Users/ilia/.cenvs/envs/mutagenesis/lib/python3.6/site-packages/transvar/annodb.py", line 76, in __init__
    idmap = load(open(args.uniprot))
  File "/Users/ilia/.cenvs/envs/mutagenesis/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The issue is clearly related to #20
I'm making a pull-request with a hotfix.

inconsistence

Hi,

I am working on converting variants of different formats to gDNA variants I found some items gave different results:

commandline (version : 2.3.4)

hanh@cpuserver:/data/home$ transvar panno -i "EGFR:p.D770_N771insNPG" --ccds
input   transcript      gene    strand  coordinates(gDNA/cDNA/protein)  region  info
EGFR:p.D770_N771insNPG  CCDS5514 (protein_coding)       EGFR    +       chr7:g.55249017_55249018insTGGTAACCC/c.2315_2316insTGGTAACCC/p.P772_H773insGNP  inside_[cds_in_exon_20] CSQN=InFrameInsertion;left_align_protein=p.D770_N771insNPG;unalign_protein=p.D770_N771insNPG;left_align_gDNA=g.55249012_55249013insAACCCTGGT;unalign_gDNA=g.55249012_55249013insAACCCTGGT;left_align_cDNA=c.2310_2311insAACCCTGGT;unalign_cDNA=c.2310_2311insAACCCTGGT;32_CandidatesOmitted;source=CCDS
hanh@cpuserver:/data/home$ transvar panno -i "EGFR:p.D770_N771insSVQ" --ccds
input   transcript      gene    strand  coordinates(gDNA/cDNA/protein)  region  info
EGFR:p.D770_N771insSVQ  CCDS5514 (protein_coding)       EGFR    +       chr7:g.55249013_55249014insGCGTACAAA/c.2311_2312insGCGTACAAA/p.D770_N771insSVQ  inside_[cds_in_exon_20] CSQN=InFrameInsertion;left_align_protein=p.D770_N771insSVQ;unalign_protein=p.D770_N771insSVQ;left_align_gDNA=g.55249012_55249013insAGCGTACAA;unalign_gDNA=g.55249012_55249013insAGCGTACAA;left_align_cDNA=c.2310_2311insAGCGTACAA;unalign_cDNA=c.2310_2311insAGCGTACAA;48_CandidatesOmitted;source=CCDS

So here p.D770_N771insXXX gives me different genome coordinates.

The website returns:

EGFR:p.D770_N771insSVQ	CCDS5514 (protein_coding)	EGFR	+	chr7:g.(55249009ins9)/c.(2307_2308ins9)/p.D770_N771insSVQ	cds_in_exon_20	left_align_protein=p.D770_N771insSVQ;unalign_protein=p.D770_N771insSVQ;insertion_cDNA=AGCGTACAA;insertion_gDNA=AGCGTACAA;imprecise;source=CCDS
EGFR:p.D770_N771insNPG	CCDS5514 (protein_coding)	EGFR	+	chr7:g.(55249009ins9)/c.(2307_2308ins9)/p.P772_H773insGNP	cds_in_exon_20	left_align_protein=p.D770_N771insNPG;unalign_protein=p.D770_N771insNPG;insertion_cDNA=AACCCTGGT;insertion_gDNA=AACCCTGGT;imprecise;source=CCDS

Which one is the recommended one to follow?

Thanks,

-Han

record.py format_group grouping settings

In the record.py file there is
import locale
locale.setlocale(locale.LC_ALL, '') which sets the locale for all categories to the user’s default setting.
Then the numbers in variants are grouped with coma or dot (it depends on user locale)

9:g.69820797C>G . . . chr9:g.69820797C>G/./. inside_[intergenic_between_PTAR1(60,747_bp_upstream)_and_C9ORF135(7_bp_upstream)] CSQN=IntergenicSNV

9:g.69820797C>G . . . chr9:g.69820797C>G/./. inside_[intergenic_between_PTAR1(60.747_bp_upstream)_and_C9ORF135(7_bp_upstream)] CSQN=IntergenicSNV

Could you consider to disable grouping in this case at all?

def format_group(d):
return locale.format('%d', d, grouping=True) <---- here to False?

That could simplify batch processing in parallel for different samples using multiple computer nodes for annotation and parsing afterwards (without worrying about correct locale setting on each specific node)

three different results for the same canno input - which one is correct?

Dear TransVar developers,
I have found a canno annotation problem with 2.4.8.20190122 that did not occur with 2.4.1.20180815 (same reference fasta and annotation files). After downgrading, it is working again:

With the old version, variants 'NM_001760.3:c.786_796dup' and 'NM_001760.3:c.758_759dupAG' from a SangerSeq experiment annotated without any errors, for example:
columns = input transcript gene strand coordinates(gDNA/cDNA/protein) region info.
values = 'NM_001760.3:c.758_759dupAG' 'NM_001760 (protein_coding)' 'CCND3' '-' 'chr6:g.41936061_41936062dupTC/c.758_759dupAG/p.S254Rfs*103' 'inside_[cds_in_exon_5]' 'CSQN=Frameshift;left_align_gDNA=g.41936057_41936058insCT;unalign_gDNA=g.41936058_41936059dupCT;left_align_cDNA=c.756_757insGA;unalign_cDNA=c.758_759dupAG;dbxref=GeneID:896,HGNC:HGNC:1585,MIM:123834;aliases=NP_001751;source=RefSeq'

With the new version, both dup variants result in no_valid_transcript_found. Querying SNVs 'NM_001760.3:c.758A>C' and 'NM_001760.3:c.759G>T' works with the new version, so theindividual reference bases seem to be correct (while, e.g., 'NM_001760.3:c.759C>G' or 'NM_001760.3:c.759A>G' fail as expected). Interestingly, querying the dup equivalent 'NM_001760.3:c.758_759AG>AGAG' still works with the new version:
'NM_001760.3:c.758_759AG>AGAG' 'NM_001760.4 (protein_coding)' 'CCND3' '-' 'chr6:g.41936061_41936062dupTC/c.760_761dupAG/p.S254Rfs*103' 'inside_[cds_in_exon_5]' 'CSQN=Frameshift;left_align_gDNA=g.41936057_41936058insCT;unalign_gDNA=g.41936058_41936059dupCT;left_align_cDNA=c.756_757insGA;unalign_cDNA=c.758_759dupAG;dbxref=GeneID:896,HGNC:HGNC:1585,MIM:123834;aliases=NP_001751;source=RefSeq'
This output suggests that after the transcript update from NM_001760.3 to NM_001760.4, this variant should now be named 'NM_001760.4:c.760_761dupAG'. But using this TransVar output as input (same new version 2.4.8.20190122) results in no_valid_transcript_found, again...?

Which of the three results is correct? Thanks.

some error message in the info field are not self-explanatory

Hi there,

when inputing a cdna change to transvar with an incorrect reference nucleotide sequence, as e.g. ./transvar canno -i 'PALB2:c.172_175delTTGA' --ensembl (instead of PALB2:c.172_175delTTGT) I realized that transvar returns an empty record with a 'no_valid_transcript_found' message in the INFO field

I m a bit rusty in the use of Transvar, but I assume that the INFO field is intended to contain the errors/warnings as approppriate. If so, maybe you can consider another type of message (ignore this issue otherwise!)

thanks again for such a great tool and your work!
best
d

convert variant annotations from TransVar into a basic MAF

I would like to have some (semi-official) translation table of TransVar annotated variant_class into MAF(Mutation Annotation Format) class. Could you consider to correct or/and add this preliminary info into TransVar documentation?

TransVar = MAF

3-UTRBlockSubstitution = 3'UTR
3-UTRDeletion = 3'UTR
3-UTRInsertion= 3'UTR
3-UTRSNV = 5'UTR
5-UTRBlockSubstitution = 5'UTR
5-UTRDeletion = 5'UTR
5-UTRInsertion = 5'UTR
5-UTRSNV = 5'UTR
CdsStartBlockSubstitution = Translation_Start_Site
CdsStartDeletion = Translation_Start_Site
CdsStartInsertion = Translation_Start_Site
CdsStartSNV = Translation_Start_Site
CdsStopBlockSubstitution = Nonsense_Mutation
CdsStopDeletion = Nonstop_Mutation
CdsStopInsertion = Nonsense_Mutation
CdsStopSNV = Silent
Frameshift = Frame_Shift_InDel # how to separate Frame_Shift_Del / Frame_Shift_Ins ?
InFrameDeletion = In_Frame_Del
InFrameInsertion = In_Frame_Ins
IntergenicBlockSubstitution = IGR
IntergenicDeletion = IGR
IntergenicInsertion = IGR
IntergenicSNV = IGR
IntronicBlockSubstitution = Intron
IntronicDeletion = Intron
IntronicInsertion = Intron
IntronicSNV = Intron
Missense = Missense_Mutation
MultiAAMissense = Missense_Mutation
Nonsense = Nonsense_Mutation
SpliceAcceptorBlockSubstitution = Splice_Site
SpliceAcceptorDeletion = Splice_Site
SpliceAcceptorInsertion = Splice_Site
SpliceAcceptorSNV = Splice_Site # or Splice_Region?
SpliceDonorBlockSubstitution = Splice_Site
SpliceDonorDeletion = Splice_Site
SpliceDonorInsertion = Splice_Site
SpliceDonorSNV = Splice_Site # or Splice_Region?
Synonymous = Silent
Unclassified = UNKNOWN
UnclassifiedBlockSubstitution = UNKNOWN
UnclassifiedDeletion = UNKNOWN

... and what about the following mutations? How to identify them in TransVar output?
5'Flank, 3'Flank, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame

AttributeError: 'RefGenome' object has no attribute 'faidx_handle'

Hello...
I just installed transvar and I followed the Quick starting Guide

# set up databases
transvar config --download_anno --refversion hg19/hg38

# in case you don't have a reference
transvar config --download_ref --refversion hg19/hg38

It seems that the reference isn't being downloaded (arrow below)

--> transvar panno -i 'EGFR:p.D770_N771insNPG' --ucsc
Reference sequence doesn't exist
samtools faidx file doesn't exist for reference.    <------------------------
Traceback (most recent call last):
  File "/Users/liviamoura/.pyenv/versions/3.6.13/bin/transvar", line 197, in <module>
    args.func(args)
  File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/anno.py", line 206, in main_anno
    db = AnnoDB(args, config)
  File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/annodb.py", line 52, in __init__
    faidx.init_refgenome(args.reference if args.reference else None)
  File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/faidx.py", line 103, in init_refgenome
    refgenome = RefGenome(r) if r else None
  File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/faidx.py", line 26, in __init__
    self.load_faidx()
  File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/faidx.py", line 37, in load_faidx
    for line in self.faidx_handle:
AttributeError: 'RefGenome' object has no attribute 'faidx_handle'

panno batch example

Hello,

This is so far a great tool, thanks!
I cannot, however, process proteinvariants in batches.
it works well with a single event:

transvar panno -i 'PIK3CA:p.H1047R' --refseq

input transcript gene strand coordinates(gDNA/cDNA/protein) region info PIK3CA:p.H1047R NM_006218.2 (protein_coding) PIK3CA + chr3:g.178952085A>G/c.3140A>G/p.H1047R inside_[cds_in_exon_21] CSQN=Missense;reference_codon=CAT;candidate_codons=AGG,AGA,CGA,CGC,CGG,CGT;candidate_mnv_variants=chr3:g.178952085_178952086delATinsGA,chr3:g.178952085_178952086delATinsGC,chr3:g.178952085_178952086delATinsGG,chr3:g.178952084_178952086delCATinsAGG,chr3:g.178952084_178952086delCATinsAGA;dbxref=GeneID:5290,HGNC:8975,MIM:171834;aliases=NP_006209;source=RefSeq

But if I give it a file as input (one protein HGVS per row), the return is empty (just the header:
transvar panno -i test_variants.txt --refseq
input transcript gene strand coordinates(gDNA/cDNA/protein) region info

How should i submit panno batches? Is this possible as ganno is? I really hope so crosses fingers
Thanks a lot in advance!

Problem in EnsemblDB for gene_ids with lowercase characters and no gene name

Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.

I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id and does not have a gene name:

II	PomBase	gene	177469	180628	.	-	.	gene_id "SPBC1198.04c"; gene_biotype "protein_coding";

I can create a database and retreive the gene:

db = AnnoDB(args, read_config())
print('in',list(db.get_gene('SPBC1198.04c')))

However, because of this:

transvar/transvar/anno.py

Lines 193 to 195 in f7c17a8

    
           q.tok = q.tok.upper() 
        
           genefound = False 
        
           for q.gene in db.get_gene(q.tok, args.strictversion):

Or this:

transvar/transvar/anno.py

Lines 153 to 155 in f7c17a8

    
           q.tok = q.tok.upper() 
        
           genefound = False 
        
           for q.gene in db.get_gene(q.tok, args.strictversion):

It's impossible to use it for a protein annotation such as SPBC1198.04c:p.N3A, because the q.tok is made uppercase and it does not exist in db.

A possible fix for this particular case would be to do gene_id.upper() here, although I am not sure it would break something else.

transvar/transvar/localdb.py

Lines 574 to 577 in f7c17a8

    
           if 'gene_name' in info: 
        
               g.name = info['gene_name'].upper() 
        
           else: 
        
               g.name = gene_id

Minimal example to show the problem is below, and the files to reproduce:

data/pombe_genome.fa: https://curation.pombase.org/dumps/latest_build/fasta/chromosomes/Schizosaccharomyces_pombe_all_chromosomes.fa.gz -o data/pombe_genome.fa.gz
data/pombe_genome.gtf: pombe_genome.gtf.zip

Then, to set up the transvar config:

# env vars
export TRANSVAR_CFG="$(pwd)/data/transvar.cfg"
export TRANSVAR_DOWNLOAD_DIR="$(pwd)/data/transvar_download"

transvar config -k reference -v "data/pombe_genome.fa" --refversion pombe_genome
transvar index --ensembl data/pombe_genome.gtf --reference data/pombe_genome.fa

Then run the code that gives the error (transvar_main_script below is bin/transvar from which I import the parser functions)

from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial

parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()

p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))

variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])

db = AnnoDB(args, read_config())

print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))

print('the gene is not found in the panno call')
args.func(args)

Missing some characters of amino acid name when use `panno`

Hi,
When I tried to annotate a protein by panno, I found this problem:
use --aa3 option for 3 letters code, but the output with one letter.

$ transvar panno -i 'JAK2:F537_K539delinsL' --ucsc --aa3
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
JAK2:F537_K539delinsL	NM_004972 (protein_coding)	JAK2	+	chr9:g.5070022_5070027delTCACAA/c.1611_1616delTCACAA/p.Phe537_Lys539delinsL	inside_[cds_in_exon_12]	CSQN=MultiAAMissense;left_align_gDNA=g.5070022_5070027delTCACAA;unaligned_gDNA=g.5070022_5070027delTCACAA;left_align_cDNA=c.1611_1616delTCACAA;unalign_cDNA=c.1611_1616delTCACAA;candidate_alternative_sequence=CTT/CTG/CTA/CTC/TTA/TTG;source=UCSCRefGene

submitted mutation: JAK2:F537_K539delinsL (amino acid with one letter) with option --aa3
output protein change: p.Phe537_Lys539delinsL (correct: p.Phe537_Lys539delinsLeu)
missing: eu [Leu/Leucine/L] (Phe and Lys are correct)

Is there a character length limit when output? In other words, whether the character length of protein change or coordinates column exceeds the length threashold?

Thanks!

Programatic access / Python API

It would be great if users had programmatic access to TransVar rather than having to use the command line tool and

At the moment I've modified TransVar so that we can use it from within an analytics system written in Python (rather than calling it through the command line and reading the data back again), but this is a dirty hack so I'm not really comfortable opening a PR for this - I had to make Record.formats return information as a python dictionary rather than print to stdout, and then pass this object back through the call stack to _main_one_, but I'm not sure if this is the best way to do this.

`transvar/docs/source/features.rst` content incomplete

seems those lines are cut by fixed width

Reference files missing from server

Hi,

We have a setup pipeline for transvar where we run the command:
transvar config --download_ref --refversion hg38

During this step we get the following warning:
[download] warning: file not available: https://zhouserver.research.chop.edu/TransVar/annotations//hg38.fa or target directory not found
[download] warning: file not available: https://zhouserver.research.chop.edu/TransVar/annotations//hg38.fa.fai or target directory not found

When I check the server index these files are indeed missing. Am I missing something or is there a way to run the transvar pipeline without these files? If it matters, running using [email protected]

Thanks for any help you can offer!

Different position converter result?

Hi,
I use different position converter to get my HGVS format data:

NM_173651.2:c.11438G>T

https://mutalyzer.nl/position-converter?assembly_name_or_alias=GRCh37&description=NM_173651.2%3Ac.11438G%3ET gives result :NC_000002.11:g.186665204G>T

However, ucsc genome brewer gives : chr2:186,665,466, which seems very weird to me for this type of analysis.

Can you help me ?
Thanks.

Reference sequence doesn't exist error after following example

I followed the example code to download hg19 but still get the reference sequence doesn't exist error.
$ transvar config --download_anno --refversion hg19
$ transvar config --download_ref --refversion hg19
$ transvar panno -i 'PIK3CA:p.E545K' --ucsc --ccds
Reference sequence doesn't exist
samtools faidx file doesn't exist for reference
Traceback (most recent call last):
File "/usr/local/bin/transvar", line 30, in
args.func(args)
File "/usr/local/lib/python2.7/site-packages/transvar/anno.py", line 178, in main
db = AnnoDB(args, config)
File "/usr/local/lib/python2.7/site-packages/transvar/annodb.py", line 49, in init
faidx.init_refgenome(args.reference if args.reference else None)
File "/usr/local/lib/python2.7/site-packages/transvar/faidx.py", line 93, in init_refgenome
refgenome = RefGenome(r) if r else None
File "/usr/local/lib/python2.7/site-packages/transvar/faidx.py", line 22, in init
self.load_faidx()
File "/usr/local/lib/python2.7/site-packages/transvar/faidx.py", line 33, in load_faidx
for line in self.faidx_handle:
AttributeError: RefGenome instance has no attribute 'faidx_handle'

RefSeq hg19

--refseq switch for hg19 annotation uses annotation release 105 which is a bit old. I understand that NCBI switched to GRCh38 reference for their provided refseq gff files for subsequent releases. Is there a way to have transvar work with UCSC RefSeq downloaded files?

missing and inconsistent protein annotation usage

I'm trying to annotate a protein with its genomic coordinates using transvar and for most proteins it works fine, but sometimes nothing is returned except for the header of the output. How should I interpret this result? Or am I doing something wrong?

transvar panno --ensembl --idmap uniprot -i 'W5XKT8'
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info

Also, why do some proteins need their isoform to get any output and others do not?

Here's an example:


#returns output
transvar panno -i 'Q6N069-1' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q6N069-1	ENST00000379406 (protein_coding)	NAA16	+	chr13:g.41885341_41951166/c.1_2592/p.M1_I864	whole_transcript	promoter=chr13:41884341_41885341;#exons=20;cds=chr13:41885665_41949735

#no output
transvar panno -i 'Q6N069' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info

#returns output without providing isoform number
transvar panno -i 'Q9H1K6' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q9H1K6	ENST00000267984 (protein_coding)	MESDC1	+	chr15:g.81293295_81296342/c.1_1086/p.M1_N362	whole_transcript	promoter=chr15:81292295_81293295;#exons=1;cds=chr15:81294613_81295698

Thanks!

Performance improvements (suggestion)

Hi
Are you planning to do some type of optimization to reduce the time taken by the annotations (more useful in batch mode, of course)?
I've seen that right now its only using 1 CPU core, and I was wondering whether it could be parallelized effectively (i.e., splitting the input and processing each segment in a different thread)
Thanks, Gonzalo

inconsistent InFrameInsertion annotation

Hi,
Thanks for such a great tool.
I got some inconsistent annotations using gdna and vcf input. Here is a InFrameInsertion example

test.zip
chr7 55248980 Mut C CTCCAGGAAGCCT

the command-line and partial reuslts

$ transvar ganno --refseq --vcf test.vcf
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2284-5_2290dupTCCAGGAAGCCT/. inside_[intron_between_exon_19_and_20]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2149-5_2155dupTCCAGGAAGCCT/. inside_[intron_between_exon_18_and_19]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2125-5_2131dupTCCAGGAAGCCT/. inside_[intron_between_exon_19_and_20]
$ transvar ganno -i 'chr7:g.55248981_55248992dupTCCAGGAAGCCT' --refseq --refversion hg19
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2284-5_2290dupTCCAGGAAGCCT/p.A763_Y764insFQEA  inside_[cds_in_exon_20]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2149-5_2155dupTCCAGGAAGCCT/p.A718_Y719insFQEA  inside_[cds_in_exon_19]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2125-5_2131dupTCCAGGAAGCCT/p.A710_Y711insFQEA  inside_[cds_in_exon_20]

It looks like that command line version is more reasonable. This was possibly caused by no QueryDUP in vcf version.

Best

	line=self.fasta_handle.readline().decode()
	line=line[:-1] #Remove newline symbols
	seq=seq+line

	q.tok = q.tok.upper()
	genefound = False
	for q.gene in db.get_gene(q.tok, args.strictversion):

	if 'gene_name' in info:
	g.name = info['gene_name'].upper()
	else:
	g.name = gene_id

zwdzwd / transvar Goto Github PK

transvar's People

Contributors

Stargazers

Watchers

Forkers

transvar's Issues

TransVar = MAF

Recommend Projects

Recommend Topics

Recommend Org