zwdzwd / transvar Goto Github PK
View Code? Open in Web Editor NEWTransVar - multiway annotator for precision medicine
License: Other
TransVar - multiway annotator for precision medicine
License: Other
When I submitted a mutation:
NM_001164277:c.925_928delinsTC
to Transvar
with these options:
Reverse annotation cDNA, GRCh37, RefSeq
I got "p.G309Sfs*3" and "chr11:g.118896734_118896737delinsGA"
for this mutation.
Although I checked this mutation on the UCSC Genome Browser,
the position chr11:g.118896734_118896737delinsGA is 308 of
NM_001164277(attached png).
Other similar services, Mutalyzer and VariantVaridator, returns "309"
, and it is consistent with the genome browser.
Hello...
I just installed transvar and I followed the Quick starting Guide
# set up databases
transvar config --download_anno --refversion hg19/hg38
# in case you don't have a reference
transvar config --download_ref --refversion hg19/hg38
It seems that the reference isn't being downloaded (arrow below)
--> transvar panno -i 'EGFR:p.D770_N771insNPG' --ucsc
Reference sequence doesn't exist
samtools faidx file doesn't exist for reference. <------------------------
Traceback (most recent call last):
File "/Users/liviamoura/.pyenv/versions/3.6.13/bin/transvar", line 197, in <module>
args.func(args)
File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/anno.py", line 206, in main_anno
db = AnnoDB(args, config)
File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/annodb.py", line 52, in __init__
faidx.init_refgenome(args.reference if args.reference else None)
File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/faidx.py", line 103, in init_refgenome
refgenome = RefGenome(r) if r else None
File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/faidx.py", line 26, in __init__
self.load_faidx()
File "/Users/liviamoura/.pyenv/versions/3.6.13/lib/python3.6/site-packages/transvar/faidx.py", line 37, in load_faidx
for line in self.faidx_handle:
AttributeError: 'RefGenome' object has no attribute 'faidx_handle'
I've spotted an issue with the latest release on Python 3:
Traceback (most recent call last):
File "/Users/ilia/.cenvs/envs/mutagenesis/bin/transvar", line 36, in <module>
args.func(args)
File "/Users/ilia/.cenvs/envs/mutagenesis/lib/python3.6/site-packages/transvar/anno.py", line 197, in main
db = AnnoDB(args, config)
File "/Users/ilia/.cenvs/envs/mutagenesis/lib/python3.6/site-packages/transvar/annodb.py", line 76, in __init__
idmap = load(open(args.uniprot))
File "/Users/ilia/.cenvs/envs/mutagenesis/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The issue is clearly related to #20
I'm making a pull-request with a hotfix.
indexing the newest version of refseq throws an error due to some entries that lack both a 'Name' and 'product' key when the t.name property is set for 'tRNA'.
transvar index --refseq hg38.refseq.gff.gz
We can't open this website, even we use the online method to download, the prompt given is that this website does not have this database information
Dear TransVar developers,
I have found a canno annotation problem with 2.4.8.20190122 that did not occur with 2.4.1.20180815 (same reference fasta and annotation files). After downgrading, it is working again:
With the old version, variants 'NM_001760.3:c.786_796dup' and 'NM_001760.3:c.758_759dupAG' from a SangerSeq experiment annotated without any errors, for example:
columns = input transcript gene strand coordinates(gDNA/cDNA/protein) region info.
values = 'NM_001760.3:c.758_759dupAG' 'NM_001760 (protein_coding)' 'CCND3' '-' 'chr6:g.41936061_41936062dupTC/c.758_759dupAG/p.S254Rfs*103' 'inside_[cds_in_exon_5]' 'CSQN=Frameshift;left_align_gDNA=g.41936057_41936058insCT;unalign_gDNA=g.41936058_41936059dupCT;left_align_cDNA=c.756_757insGA;unalign_cDNA=c.758_759dupAG;dbxref=GeneID:896,HGNC:HGNC:1585,MIM:123834;aliases=NP_001751;source=RefSeq'
With the new version, both dup variants result in no_valid_transcript_found. Querying SNVs 'NM_001760.3:c.758A>C' and 'NM_001760.3:c.759G>T' works with the new version, so theindividual reference bases seem to be correct (while, e.g., 'NM_001760.3:c.759C>G' or 'NM_001760.3:c.759A>G' fail as expected). Interestingly, querying the dup equivalent 'NM_001760.3:c.758_759AG>AGAG' still works with the new version:
'NM_001760.3:c.758_759AG>AGAG' 'NM_001760.4 (protein_coding)' 'CCND3' '-' 'chr6:g.41936061_41936062dupTC/c.760_761dupAG/p.S254Rfs*103' 'inside_[cds_in_exon_5]' 'CSQN=Frameshift;left_align_gDNA=g.41936057_41936058insCT;unalign_gDNA=g.41936058_41936059dupCT;left_align_cDNA=c.756_757insGA;unalign_cDNA=c.758_759dupAG;dbxref=GeneID:896,HGNC:HGNC:1585,MIM:123834;aliases=NP_001751;source=RefSeq'
This output suggests that after the transcript update from NM_001760.3 to NM_001760.4, this variant should now be named 'NM_001760.4:c.760_761dupAG'. But using this TransVar output as input (same new version 2.4.8.20190122) results in no_valid_transcript_found, again...?
Which of the three results is correct? Thanks.
I have a variant located and annotated by my caller as CREBBP g.3781324_3781326del, c.5039_5041del, p.S1680del which when double-checked against your service returns CREBBP g.3781329_3781331del, c.5039_5041del, p.S1680del. The same variant checked against varsome provides a genomic start of chr6:3781324.
I was just curious as to the source of the discrepancy in the genomic positions. Is Transvar providing right alignment here instead of left?
Thanks in advance
Good evening,
I am trying to analyze variants in C. elegans genome. I am using genome build WS273. I have all files necessary downloaded (genome sequence, index and gtf feature file).
I tried to run
transvar index --refseq path_to_ws273_feature_file.gtf
this returned:
traceback (most recent call last):
File "/home/nik/Programs/anaconda3/bin/transvar", line 197, in
args.func(args)
File "/home/nik/Programs/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 1169, in main_index
db.index([args.refseq])
File "/home/nik/Programs/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 335, in index
self.parse_raw(*raw_fns)
File "/home/nik/Programs/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 773, in parse_raw
info = dict([_.split('=') for _ in fields[8].split(';')])
ValueError: dictionary update sequence element #0 has length 1; 2 is required
I have two feature files, gtf and gff3. Both returned the same error message.
Any help would be great.
Thank you!
seems those lines are cut by fixed width
I followed the example code to download hg19 but still get the reference sequence doesn't exist error.
$ transvar config --download_anno --refversion hg19
$ transvar config --download_ref --refversion hg19
$ transvar panno -i 'PIK3CA:p.E545K' --ucsc --ccds
Reference sequence doesn't exist
samtools faidx file doesn't exist for reference
Traceback (most recent call last):
File "/usr/local/bin/transvar", line 30, in
args.func(args)
File "/usr/local/lib/python2.7/site-packages/transvar/anno.py", line 178, in main
db = AnnoDB(args, config)
File "/usr/local/lib/python2.7/site-packages/transvar/annodb.py", line 49, in init
faidx.init_refgenome(args.reference if args.reference else None)
File "/usr/local/lib/python2.7/site-packages/transvar/faidx.py", line 93, in init_refgenome
refgenome = RefGenome(r) if r else None
File "/usr/local/lib/python2.7/site-packages/transvar/faidx.py", line 22, in init
self.load_faidx()
File "/usr/local/lib/python2.7/site-packages/transvar/faidx.py", line 33, in load_faidx
for line in self.faidx_handle:
AttributeError: RefGenome instance has no attribute 'faidx_handle'
Below a second issue concerning a called variant, which deviates from the HGVS guidelines.
Using the current version (transvar v2.4.1) we use, the output of a specific variant is the following:
SRSF2
c.284_307del24
p.(Pro95_Arg102delProProAspSerHisHisSerArg)
However, following the HGVS guidelines, the 24 should not be reported, as this is excess information.
(https://varnomen.hgvs.org/recommendations/DNA/variant/deletion/)
The same applies on protein level where the deleted amino acids should not be summed up:
(https://varnomen.hgvs.org/recommendations/protein/variant/deletion/)
I compared the updates on github (our current version 2.4.1 with the last new updated version (v2.5.9)) but could not see any commit solving this issue.
Is there any adaptation on transvar possible in order to get the correct HGVS-recommended variant description?
Thanks in advance!
Kind regards,
I'm trying to annotate a protein with its genomic coordinates using transvar and for most proteins it works fine, but sometimes nothing is returned except for the header of the output. How should I interpret this result? Or am I doing something wrong?
transvar panno --ensembl --idmap uniprot -i 'W5XKT8'
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
Also, why do some proteins need their isoform to get any output and others do not?
Here's an example:
#returns output
transvar panno -i 'Q6N069-1' --uniprot --ensembl
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
Q6N069-1 ENST00000379406 (protein_coding) NAA16 + chr13:g.41885341_41951166/c.1_2592/p.M1_I864 whole_transcript promoter=chr13:41884341_41885341;#exons=20;cds=chr13:41885665_41949735
#no output
transvar panno -i 'Q6N069' --uniprot --ensembl
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
#returns output without providing isoform number
transvar panno -i 'Q9H1K6' --uniprot --ensembl
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
Q9H1K6 ENST00000267984 (protein_coding) MESDC1 + chr15:g.81293295_81296342/c.1_1086/p.M1_N362 whole_transcript promoter=chr15:81292295_81293295;#exons=1;cds=chr15:81294613_81295698
Thanks!
I would like to have some (semi-official) translation table of TransVar annotated variant_class into MAF(Mutation Annotation Format) class. Could you consider to correct or/and add this preliminary info into TransVar documentation?
3-UTRBlockSubstitution = 3'UTR
3-UTRDeletion = 3'UTR
3-UTRInsertion= 3'UTR
3-UTRSNV = 5'UTR
5-UTRBlockSubstitution = 5'UTR
5-UTRDeletion = 5'UTR
5-UTRInsertion = 5'UTR
5-UTRSNV = 5'UTR
CdsStartBlockSubstitution = Translation_Start_Site
CdsStartDeletion = Translation_Start_Site
CdsStartInsertion = Translation_Start_Site
CdsStartSNV = Translation_Start_Site
CdsStopBlockSubstitution = Nonsense_Mutation
CdsStopDeletion = Nonstop_Mutation
CdsStopInsertion = Nonsense_Mutation
CdsStopSNV = Silent
Frameshift = Frame_Shift_InDel # how to separate Frame_Shift_Del / Frame_Shift_Ins ?
InFrameDeletion = In_Frame_Del
InFrameInsertion = In_Frame_Ins
IntergenicBlockSubstitution = IGR
IntergenicDeletion = IGR
IntergenicInsertion = IGR
IntergenicSNV = IGR
IntronicBlockSubstitution = Intron
IntronicDeletion = Intron
IntronicInsertion = Intron
IntronicSNV = Intron
Missense = Missense_Mutation
MultiAAMissense = Missense_Mutation
Nonsense = Nonsense_Mutation
SpliceAcceptorBlockSubstitution = Splice_Site
SpliceAcceptorDeletion = Splice_Site
SpliceAcceptorInsertion = Splice_Site
SpliceAcceptorSNV = Splice_Site # or Splice_Region?
SpliceDonorBlockSubstitution = Splice_Site
SpliceDonorDeletion = Splice_Site
SpliceDonorInsertion = Splice_Site
SpliceDonorSNV = Splice_Site # or Splice_Region?
Synonymous = Silent
Unclassified = UNKNOWN
UnclassifiedBlockSubstitution = UNKNOWN
UnclassifiedDeletion = UNKNOWN
... and what about the following mutations? How to identify them in TransVar output?
5'Flank, 3'Flank, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame
From canno
results, I noticed that mostly the gDNA position in region
column is different from the 'left_align_gDNA' in info
. But I didn't find more information about this.
And furthermore, there are 'unaligned_gDNA'. These concepts confuse a newbie in Bioinformatics like me, and would you please provide some definition or reference?
Thanks!
Hi!
I'm trying to run transvar config --download_ref --refversion hg19 and transvar config --download_anno --refversion hg19. It will create a download folder (.transvar.download) in my home directory but nothing is in it. I've also tried to change the transvar download directory by specifying a path to it in my env variable like you all had suggested (ie. export TRANSVAR_DOWNLOAD_DIR=/diskmnt/Software/bin/transvar.download). Nothing happens and it seems to still create a folder in my home directory. Can you help me with this?
Thanks!
Sohini
I'm having problems converting amino-acid substitutions to genomic coordinates using UniProt identifiers, e.g.
$ transvar panno -i 'Q9BXW4:p.G126A' --uniprot --ccds
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
Q9BXW4:p.G126A CCDS31074 (protein_coding) MAP1LC3C - chr1:g.242159532C>G/c.377G>C/p.G126A inside_[cds_in_exon_4] CSQN=Missense;reference_codon=GGC;candidate_codons=GCA,GCC,GCG,GCT;candidate_mnv_variants=chr1:g.242159531_242159532delGCinsTG,chr1:g.242159531_242159532delGCinsCG,chr1:g.242159531_242159532delGCinsAG;source=CCDS
I'm getting the following output for quite a few proteins:
$ transvar panno -i 'Q99418:p.E156D' --uniprot --ccds
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
[wrap_exception] warning: seek out of range
Q99418:p.E156D . . . ././. . Error_seek out of range
[wrap_exception] warning: seek out of range
Q99418:p.E156D . . . ././. . Error_seek out of range
I've tried different locations and substitutions (e.g. -i 'Q99418:1'
) to no avail. Here are some more affected proteins: P04180, P45381, P45381, Q14289, P49916. Am I doing something wrong?
Sometimes we need to know the version of transcript , so I want to how to get it by Transvar?
Hello,
I came up with the following Exception:
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/bin/transvar", line 36, in <module>
args.func(args)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/anno.py", line 211, in main
main_one(args, db, at)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/anno.py", line 186, in main_one
for q.gene in db.get_gene(q.tok):
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/annodb.py", line 213, in get_gene
for g in db.get(name):
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 174, in get
for g in self.get_by_alias(_name):
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/transvar/localdb.py", line 224, in get_by_alias
self.alias_idx = load(open(self.dbfn+'.alias_idx'))
File "/home/ubuntu/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The command that I run was:
transvar canno -i 'NG_008377.1:c.-1289G>A' --refseq
Transvar version:
from transvar import version
print (version.__version__)
>>> '2.3.4.20161215'
Installation commands:
pip install transvar
Python version:
import sys
print(sys.version)
>>> '3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49) \n[GCC 7.2.0]'
Thanks.
Hi,
Thanks for such a great tool.
I got some inconsistent annotations using gdna and vcf input. Here is a InFrameInsertion example
test.zip
chr7 55248980 Mut C CTCCAGGAAGCCT
the command-line and partial reuslts
$ transvar ganno --refseq --vcf test.vcf
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2284-5_2290dupTCCAGGAAGCCT/. inside_[intron_between_exon_19_and_20]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2149-5_2155dupTCCAGGAAGCCT/. inside_[intron_between_exon_18_and_19]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2125-5_2131dupTCCAGGAAGCCT/. inside_[intron_between_exon_19_and_20]
$ transvar ganno -i 'chr7:g.55248981_55248992dupTCCAGGAAGCCT' --refseq --refversion hg19
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2284-5_2290dupTCCAGGAAGCCT/p.A763_Y764insFQEA inside_[cds_in_exon_20]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2149-5_2155dupTCCAGGAAGCCT/p.A718_Y719insFQEA inside_[cds_in_exon_19]
chr7:g.55248981_55248992dupTCCAGGAAGCCT/c.2125-5_2131dupTCCAGGAAGCCT/p.A710_Y711insFQEA inside_[cds_in_exon_20]
It looks like that command line version is more reasonable. This was possibly caused by no QueryDUP in vcf version.
Best
Hello,
This is so far a great tool, thanks!
I cannot, however, process proteinvariants in batches.
it works well with a single event:
transvar panno -i 'PIK3CA:p.H1047R' --refseq
input transcript gene strand coordinates(gDNA/cDNA/protein) region info PIK3CA:p.H1047R NM_006218.2 (protein_coding) PIK3CA + chr3:g.178952085A>G/c.3140A>G/p.H1047R inside_[cds_in_exon_21] CSQN=Missense;reference_codon=CAT;candidate_codons=AGG,AGA,CGA,CGC,CGG,CGT;candidate_mnv_variants=chr3:g.178952085_178952086delATinsGA,chr3:g.178952085_178952086delATinsGC,chr3:g.178952085_178952086delATinsGG,chr3:g.178952084_178952086delCATinsAGG,chr3:g.178952084_178952086delCATinsAGA;dbxref=GeneID:5290,HGNC:8975,MIM:171834;aliases=NP_006209;source=RefSeq
But if I give it a file as input (one protein HGVS per row), the return is empty (just the header:
transvar panno -i test_variants.txt --refseq
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
How should i submit panno batches? Is this possible as ganno is? I really hope so crosses fingers
Thanks a lot in advance!
Something is wrong and come fix!
It would be great if users had programmatic access to TransVar rather than having to use the command line tool and
At the moment I've modified TransVar so that we can use it from within an analytics system written in Python (rather than calling it through the command line and reading the data back again), but this is a dirty hack so I'm not really comfortable opening a PR for this - I had to make Record.formats
return information as a python dictionary rather than print to stdout, and then pass this object back through the call stack to _main_one_
, but I'm not sure if this is the best way to do this.
hi - in case this is checked by the developers, we are using a lot TransVar as it is such a great tool! all the praise for Wanding!
So - an issue that we recently found is that -unless we are doing something wrong- it seems that TransVar does not process inversions (ie it returns an empty result), for instance:
$ transvar ganno -i 'X:g.32386323_32386325inv' --ensembl --gseq
input transcript gene strand coordinates(gDNA/cDNA/protein) region info CHROM POS REF ALT
is there a solution for that?
thanks in advance!
/david
Hi there,
when inputing a cdna change to transvar with an incorrect reference nucleotide sequence, as e.g. ./transvar canno -i 'PALB2:c.172_175delTTGA' --ensembl (instead of PALB2:c.172_175delTTGT) I realized that transvar returns an empty record with a 'no_valid_transcript_found' message in the INFO field
I m a bit rusty in the use of Transvar, but I assume that the INFO field is intended to contain the errors/warnings as approppriate. If so, maybe you can consider another type of message (ignore this issue otherwise!)
thanks again for such a great tool and your work!
best
d
Hi,
I use different position converter to get my HGVS format data:
NM_173651.2:c.11438G>T
https://mutalyzer.nl/position-converter?assembly_name_or_alias=GRCh37&description=NM_173651.2%3Ac.11438G%3ET gives result :NC_000002.11:g.186665204G>T
However, ucsc genome brewer gives : chr2:186,665,466, which seems very weird to me for this type of analysis.
Can you help me ?
Thanks.
Hi,
When I tried to annotate a protein by panno
, I found this problem:
use --aa3
option for 3 letters code, but the output with one letter.
$ transvar panno -i 'JAK2:F537_K539delinsL' --ucsc --aa3
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
JAK2:F537_K539delinsL NM_004972 (protein_coding) JAK2 + chr9:g.5070022_5070027delTCACAA/c.1611_1616delTCACAA/p.Phe537_Lys539delinsL inside_[cds_in_exon_12] CSQN=MultiAAMissense;left_align_gDNA=g.5070022_5070027delTCACAA;unaligned_gDNA=g.5070022_5070027delTCACAA;left_align_cDNA=c.1611_1616delTCACAA;unalign_cDNA=c.1611_1616delTCACAA;candidate_alternative_sequence=CTT/CTG/CTA/CTC/TTA/TTG;source=UCSCRefGene
JAK2:F537_K539delinsL
(amino acid with one letter) with option --aa3
p.Phe537_Lys539delinsL
(correct: p.Phe537_Lys539delinsLeu
)eu
[Leu/Leucine/L] (Phe
and Lys
are correct)Is there a character length limit when output? In other words, whether the character length of protein change or coordinates
column exceeds the length threashold?
Thanks!
Please could you post an example of a ganno
(genomic co-ordinates to HGVS) batch input example file and the command I would use to run it?
e.g. if I had a file a bit like this (only a quick example)
# chrom pos ref alt preferred_transcript
chr6 1234 A T ENST001234
chr10 4321 A C ENST004321
etc...
Whats the ganno
command to run to run transvar and return the corresponding HGVS co-ordinates?
Thanks in advance!
Dear,
We are currently using transvar as annotator in our pipeline.
We are really satisfied with the software.
However, we have a question concerning the support of both left-alignment and right-alignment convention in reporting indels and duplications.
In a recent EQC-program, we identified a deletion in the EGFR gene as follows:
c.2240_2257del <-> HGVS c.2237_2257delinsTAT
p.(Leu747_Pro753delinsSer) <-> HGVS p.(Glu746_Pro753delinsValSer)
Following the HGVS guidelines, a 3’ rule for alignment is prefered. When reading the TRANSVAR manual, we believe reporting the correct variant is possible with transvar as it supports both left-alignment and right-alignment convention.
On p15 of the transvar manual (2.5.4 release), an example is shown in which a difference between left and right-aligned identifiers is shown. I now have 3 questions
Looking forward hearing from you,
Thanks in advance,
Kind regards,
Katrien
link from config.py
fns[('hg38', 'raw')] = [
('raw_refseq', 'hg38.refseq.gff.gz',
'ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz'),
is already in archive
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.109/GFF/ref_GRCh38.p12_top_level.gff3.gz
m.b. it is would be better or useful to add some generic download links for resources like these? or to move them into separate "additional config-like" file.
how to annotate using rsID or entire chip?
Hi, Thanks for making this software available on Github. It is a very great tool, which makes my life easier to retrieve the genomic position when looking at variations.
I have a few questions regarding retrieving positions upstream and downstream for a gene. If the mutation is upstream of the starting codon (either 5' flanking region or 5'UTR), the HGVS notation usually starts with a negative sign. For example "KCNJ11_c.-134G>T" (http://www.ncbi.nlm.nih.gov/clinvar/variation/8674/) . The negative notation fails on the software as it complains "invalid position string". A work-around solution is to replace the negative notation with "1-". If the query is modified as KCNJ11_c.1-134G>T, it works.
On the other hand, working with the downstream mutation is a bit tricky, for example c.*143A>T. Since it is not trivial to get the position of the stop codon, I am wondering if there is a way to handle this situation. Thanks!!
Hi,
We have a setup pipeline for transvar where we run the command:
transvar config --download_ref --refversion hg38
During this step we get the following warning:
[download] warning: file not available: https://zhouserver.research.chop.edu/TransVar/annotations//hg38.fa or target directory not found
[download] warning: file not available: https://zhouserver.research.chop.edu/TransVar/annotations//hg38.fa.fai or target directory not found
When I check the server index these files are indeed missing. Am I missing something or is there a way to run the transvar pipeline without these files? If it matters, running using [email protected]
Thanks for any help you can offer!
There are very simple annotation info when use vcf as input currently, as for ganno
: no cDNA and protein mutation info in column 'coordinates', also no detail region info but only position between which telomere in column 'region'.
and seems canno
/panno
does not support --vcf
but this arg is listed in command help info.
would it be possible to provide more info for vcf input, thanks!
Good afternoon,
When I try to download the annotations for hg19 as shown in the first example in the quick start documentation, I get a series of file access errors:
% transvar config --download_anno --refversion hg19 [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.gene_idx or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.trxn_idx or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.loc_idx or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.refseq.gff.gz.transvardb.loc_idx.tbi or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.ccds.txt.transvardb or target directory not found [_download_] warning: file not available: http://zwdzwd.io/transvar_user/annotations/hg19.ccds.txt.transvardb.gene_idx or target directory not found
(many other lines are deleted)
I think that there is a bug in the code that is generating the filenames for download, because if I strip off the transvardb suffixes, the downloads work in my web browser. Is this a known problem, and are there any plans to fix it?
Thanks,
Andy
When I try to annotate a cDNA such as TP53 c.872_890del19, the left_align_gDNA is the genomic coordinate that describes the 3'-most change on the reverse strand. However, the left_align_cDNA is shifted -2 base pair on the positive strand rather than returning the 3'-most cDNA position on the reverse strand (which is the query).
chr17:g.7577050_7577068del19/c.872_890del19/p.K291Tfs*48
inside_[cds_in_exon_8]
CSQN=Frameshift
left_align_gDNA=g.7577048_7577066del19
unaligned_gDNA=g.7577049_7577067del19
left_align_cDNA=c.870_888del19
unalign_cDNA=c.871_889del19
dbxref=GeneID:7157,HGNC:11998,MIM:191170
aliases=NP_000537
source=RefSeq
Why don't left align shifts increase the cDNA position when the gene is on the reverse strand?
Thanks for your help.
I tried to use it on Windows. The installation failed, while looking for "unistd.h" (which is a unix file).
Thanks!
Hi,
I am working on converting variants of different formats to gDNA variants I found some items gave different results:
commandline (version : 2.3.4)
hanh@cpuserver:/data/home$ transvar panno -i "EGFR:p.D770_N771insNPG" --ccds
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
EGFR:p.D770_N771insNPG CCDS5514 (protein_coding) EGFR + chr7:g.55249017_55249018insTGGTAACCC/c.2315_2316insTGGTAACCC/p.P772_H773insGNP inside_[cds_in_exon_20] CSQN=InFrameInsertion;left_align_protein=p.D770_N771insNPG;unalign_protein=p.D770_N771insNPG;left_align_gDNA=g.55249012_55249013insAACCCTGGT;unalign_gDNA=g.55249012_55249013insAACCCTGGT;left_align_cDNA=c.2310_2311insAACCCTGGT;unalign_cDNA=c.2310_2311insAACCCTGGT;32_CandidatesOmitted;source=CCDS
hanh@cpuserver:/data/home$ transvar panno -i "EGFR:p.D770_N771insSVQ" --ccds
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
EGFR:p.D770_N771insSVQ CCDS5514 (protein_coding) EGFR + chr7:g.55249013_55249014insGCGTACAAA/c.2311_2312insGCGTACAAA/p.D770_N771insSVQ inside_[cds_in_exon_20] CSQN=InFrameInsertion;left_align_protein=p.D770_N771insSVQ;unalign_protein=p.D770_N771insSVQ;left_align_gDNA=g.55249012_55249013insAGCGTACAA;unalign_gDNA=g.55249012_55249013insAGCGTACAA;left_align_cDNA=c.2310_2311insAGCGTACAA;unalign_cDNA=c.2310_2311insAGCGTACAA;48_CandidatesOmitted;source=CCDS
So here p.D770_N771insXXX gives me different genome coordinates.
The website returns:
EGFR:p.D770_N771insSVQ CCDS5514 (protein_coding) EGFR + chr7:g.(55249009ins9)/c.(2307_2308ins9)/p.D770_N771insSVQ cds_in_exon_20 left_align_protein=p.D770_N771insSVQ;unalign_protein=p.D770_N771insSVQ;insertion_cDNA=AGCGTACAA;insertion_gDNA=AGCGTACAA;imprecise;source=CCDS
EGFR:p.D770_N771insNPG CCDS5514 (protein_coding) EGFR + chr7:g.(55249009ins9)/c.(2307_2308ins9)/p.P772_H773insGNP cds_in_exon_20 left_align_protein=p.D770_N771insNPG;unalign_protein=p.D770_N771insNPG;insertion_cDNA=AACCCTGGT;insertion_gDNA=AACCCTGGT;imprecise;source=CCDS
Which one is the recommended one to follow?
Thanks,
-Han
In the record.py file there is
import locale
locale.setlocale(locale.LC_ALL, '') which sets the locale for all categories to the user’s default setting.
Then the numbers in variants are grouped with coma or dot (it depends on user locale)
9:g.69820797C>G . . . chr9:g.69820797C>G/./. inside_[intergenic_between_PTAR1(60,747_bp_upstream)_and_C9ORF135(7_bp_upstream)] CSQN=IntergenicSNV
9:g.69820797C>G . . . chr9:g.69820797C>G/./. inside_[intergenic_between_PTAR1(60.747_bp_upstream)_and_C9ORF135(7_bp_upstream)] CSQN=IntergenicSNV
Could you consider to disable grouping in this case at all?
def format_group(d):
return locale.format('%d', d, grouping=True) <---- here to False?
That could simplify batch processing in parallel for different samples using multiple computer nodes for annotation and parsing afterwards (without worrying about correct locale setting on each specific node)
Hi
Are you planning to do some type of optimization to reduce the time taken by the annotations (more useful in batch mode, of course)?
I've seen that right now its only using 1 CPU core, and I was wondering whether it could be parallelized effectively (i.e., splitting the input and processing each segment in a different thread)
Thanks, Gonzalo
hi, I just installed transvar (from pip
) and following the manual in readthedoc, however, for transvar config --download_dbsnp
, I got
[download] warning: file not available: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/00-All.vcf.gz or target directory not found
[download] warning: file not available: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/00-All.vcf.gz.tbi or target directory not found
I guess it was an out-dated link, and hope it will be fixed:)
Hi there,
Thanks for such a great tool, I've found transvar to be a massive time saver. I spotted an issue that would be great if it was addressed at some point. For intronic variants there are (at least) two notations:
NM_007294.3:c.-19-3A>G
NM_007298.3:c.-22A>G
The second one works whereas the second one produces empty fields.
See https://www.ncbi.nlm.nih.gov/clinvar/variation/125471/
Is this something you could look at?
Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.
I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id
and does not have a gene name
:
II PomBase gene 177469 180628 . - . gene_id "SPBC1198.04c"; gene_biotype "protein_coding";
I can create a database and retreive the gene:
db = AnnoDB(args, read_config())
print('in',list(db.get_gene('SPBC1198.04c')))
However, because of this:
Lines 193 to 195 in f7c17a8
Or this:
Lines 153 to 155 in f7c17a8
It's impossible to use it for a protein annotation such as SPBC1198.04c:p.N3A
, because the q.tok
is made uppercase and it does not exist in db
.
A possible fix for this particular case would be to do gene_id.upper()
here, although I am not sure it would break something else.
Lines 574 to 577 in f7c17a8
Minimal example to show the problem is below, and the files to reproduce:
data/pombe_genome.fa
: https://curation.pombase.org/dumps/latest_build/fasta/chromosomes/Schizosaccharomyces_pombe_all_chromosomes.fa.gz -o data/pombe_genome.fa.gzdata/pombe_genome.gtf
: pombe_genome.gtf.zipThen, to set up the transvar config:
# env vars
export TRANSVAR_CFG="$(pwd)/data/transvar.cfg"
export TRANSVAR_DOWNLOAD_DIR="$(pwd)/data/transvar_download"
transvar config -k reference -v "data/pombe_genome.fa" --refversion pombe_genome
transvar index --ensembl data/pombe_genome.gtf --reference data/pombe_genome.fa
Then run the code that gives the error (transvar_main_script
below is bin/transvar from which I import the parser functions)
from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial
parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()
p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))
variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])
db = AnnoDB(args, read_config())
print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))
print('the gene is not found in the panno call')
args.func(args)
I noticed a 10x performance drop compared to my older transvar version (https://bitbucket.org/wanding/transvar/commits/8a7a774618174bd591e8821b9c7c7fd5c03ce8c4) for some variants.
I traced back the performance drop to the addition of decode()
the fetch_sequence
function, which convert seq from str
to unicode
and apparently the concatenation of unicode
is way slower than that of str
Lines 81 to 83 in 28a725d
I suggest to only concatenate the unicode
at the end of the loop or remove the decode()
transvar ganno --vcf test.vcf.gz --refversion hg19 --ccds
Current version: 46.5366 s
Version without decode(): 5.30291 s
Version without one join(): 9.85117 s
Hi,
We annotated vcf of clinical sample with a clinically-relevant insertion exon 20.
Vardict output vcf (marked line 78% VAF):
Transvar annotated and filtered:
-->no annotation. Any options possible to achieve adequate annotation?
Thanks for help!
Hi,
It failed to annotate the protein change correctly when transvar the following variant:
`transvar ganno --refseq -i 'chr8:g.145737904C>T'
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
chr8:g.145737904C>T NM_004260.3 (protein_coding) RECQL4 - chr8:g.145737904C>T/c.2927G>A/p.976 inside_[cds_in_exon_17] CSQN=Synonymous;dbsnp=rs35070885(chr8:145737904C>T);codon_pos=145737903-145737904-145737905;ref_codon_seq=TGA;dbxref=GeneID:9401,HGNC:9949,MIM:603780;aliases=NP_004251;source=RefSeq`
transvar --version TransVar Version 2.5.9.20190830
See https://www.ncbi.nlm.nih.gov/clinvar/variation/406950/
Is this something you could look at?
transvar canno --refseq -i 'NM_000051.3:c.-30' --refversion hg19
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
NM_000051.3:c.-30 NM_000051.3 (protein_coding) ATM + chr11:g.108098322A/c.1-30A/. inside_[5-UTR;noncoding_exon_2] C2=NextToSpliceAcceptorOfExon2_At_chr11:108098321;dbxref=GeneID:472,HGNC:795,MIM:607585;aliases=NP_000042;source=RefSeq
transvar canno --refseq -i 'NM_000051.3:c.-31' --refversion hg19
input transcript gene strand coordinates(gDNA/cDNA/protein) region info
NM_000051.3:c.-31 NM_000051.3 (protein_coding) ATM + chr11:g.108098321G/c.1-31G/. inside_[5-UTR;intron_between_exon_1_and_2] C2=SpliceAcceptorOfExon2_At_chr11:108098321;dbxref=GeneID:472,HGNC:795,MIM:607585;aliases=NP_000042;source=RefSeq
The position transvar gets for c.-31 is chr11:g.108098321. In fact, there is a large intron between c.-30 (the left most base of the exon 2) and c.-31 (the right most base of exon 1), and they are not next to each other. It seems transvar assumes UTR is always on the 1st exon.
It appears that whilst CdsStartLoss
and CdsStopLoss
appear in the documentation, this functionality does not exist.
--refseq switch for hg19 annotation uses annotation release 105 which is a bit old. I understand that NCBI switched to GRCh38 reference for their provided refseq gff files for subsequent releases. Is there a way to have transvar work with UCSC RefSeq downloaded files?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.