ltnetcase / bedanno Goto Github PK
View Code? Open in Web Editor NEWAnnotate genomics variations of hg19 by using a BED format database, which construct from NCBI annotation release 104
License: GNU Lesser General Public License v3.0
Annotate genomics variations of hg19 by using a BED format database, which construct from NCBI annotation release 104
License: GNU Lesser General Public License v3.0
in https://varnomen.hgvs.org/recommendations/open-issues/
format like c.-237-u29A>G
is rejected
I simply change $cDot
for my own use. But I think you can go further If you will.
+d
is similar case. Since no example case to test in hand and less to occur, I have not modify it yet.
There are some of the genes which in the LRG_reference with their transcript are not list in LRG reference file, which will lead to no primary tag assigning error for them.
test case:
{"chr":"chr17","begin":"41056042","end":"41056043","referenceSequence":"G","variantSequence":"A"}
I run command perl Makefile.PL
,there is a warning like this:
Warning: prerequisite Tabix v0.2.5 not found. Generating a Unix-style Makefile Writing Makefile for BedAnno Writing MYMETA.yml and MYMETA.json
then I run make
,the result like this:
Manifying 1 pod document
next I run make test
,there is an error:
Skip blib/lib/BedAnno.pm (unchanged) PERL_DL_NONLAZY=1 "/datapool/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t t/BedAnno.t .. Can't locate Test/Most.pm in @INC (you may need to install the Test::Most module) (@INC contains: /datapool/home/bimiaomiao/tool/BedAnno-master/blib/lib /datapool/home/bimiaomiao/tool/BedAnno-master/blib/arch /datapool/soft/install/lib/perl5/site_perl/5.24.1/x86_64-linux /datapool/soft/install/lib/perl5/site_perl/5.24.1 /datapool/soft/install/lib/perl5/5.24.1/x86_64-linux /datapool/soft/install/lib/perl5/5.24.1 .) at t/BedAnno.t line 29. BEGIN failed--compilation aborted at t/BedAnno.t line 29. t/BedAnno.t .. Dubious, test returned 2 (wstat 512, 0x200) No subtests run Test Summary Report t/BedAnno.t (Wstat: 512 Tests: 0 Failed: 0) Non-zero exit status: 2 Parse errors: No plan found in TAP output Files=1, Tests=0, 1 wallclock secs ( 0.04 usr + 0.03 sys = 0.07 CPU) Result: FAIL Failed 1/1 test programs. 0/0 subtests failed. make: *** [test_dynamic] Error 2
Annotation Engine for this mutation gives a large deletion, but actually a stop gain mutation.
the correct annotation for NM_000546.5 should be NM_000546.5: c.609_610delinsTT(p.(Glu204*))
Tabix Perl API invokes a XS module to query the bgzipped database,
and in the method 'query', some new feature keys will be assigned to
the handle object, which will cause incompatible, if we share the
handle object in our module.
We'll need to impement multiple-threads by either ask the author of tabix
or ourself to rewrite Tabix API to support threads::shared, or multi-threaded
invoke many non-shared handle threads to work as daemon.
NM_002537.2,311;1
NM_004152.2,282;1
NM_001172437.1,1402;-1
NM_015068.3,1436;-1
NM_001184961.1,1436;-1
NM_016178.2,295;1
NM_001134939.1,260;1
These all lead to error parsing of coordinate.
As any change to those frameshift sites is unknown to function impact except large deletion which cross over these sites.
Our module will try parsing those deletion vars cross over these sites, and return 'abnormal-fs-site' for function impact if other bases substitution around these sites.
Test Case:
{"chr":"chr1","begin":"151739699","end":"151739700","referenceSequence":"T","variantSequence":""}
{"chr":"chr1","begin":"151739699","end":"151739700","referenceSequence":"T","variantSequence":"C"}
{"chr":"chr1","begin":"151739698","end":"151739702","referenceSequence":"CTGA","variantSequence":""}
{"chr":"chr1","begin":"151739698","end":"151739701","referenceSequence":"CTG","variantSequence":""}
{"chr":"chr1","begin":"151739640","end":"151739640","referenceSequence":"","variantSequence":"A"}
{"chr":"chr7","begin":"94293824","end":"94293825","referenceSequence":"A","variantSequence":"T"}
{"chr":"chr7","begin":"94293824","end":"94293827","referenceSequence":"AAA","variantSequence":""}
{"chr":"chr7","begin":"94293824","end":"94293826","referenceSequence":"AA","variantSequence":""}
{"chr":"chr7","begin":"94293824","end":"94293824","referenceSequence":"","variantSequence":"A"}
{"chr":"chr7","begin":"94293520","end":"94293520","referenceSequence":"","variantSequence":"TC"}
When there are some differences between transcript and reference genome and these locate around tandem of simple repeat region, map information should be correct to let every possible var calling around the region to recognize these differences.
e.g.
NM_015120.4 ALMS1 146;3;|1682;1;|1684;1;|1686;1;
The latter 3 mismatch constitute a repeat difference:
Given mapping
Name | Sequence |
---|---|
NM_015120.4:1683-1691 | TTCTCCTCT |
reference genome | _T_T_CTCT |
Suggested Mapping: 1682;9;TTCTCT
But unfortunately, these kind of repeat difference is hard to recognize, we can just map them to
1682;5;TT
Test Case:
{"chr":"chr2","begin":"73675226","end":"73675227","referenceSequence":"T","variantSequence":"T"}
{"chr":"chr2","begin":"73675227","end":"73675227","referenceSequence":"","variantSequence":"CTC"}
例子:
chr19 36120576 36120577 T TCCAGCAG chr19 g.36120577_36120578insCCAGCAG
中间过程:
初步提取了3个注释区段
36120421 36120575 'NM_024321.3|RBM42|79171|+|C2|316|EX2|205|358|129|282|154||Y' => '0' 36120575 36120577 'NM_024321.3|RBM42|79171|+|DC2|163|IVS2|358+1|358+2|282+1|282+2|2||Y' => '0' 36120577 36122039 'NM_024321.3|RBM42|79171|+|IC2|191|IVS2|358+3|359-3|282+3|283-3|1462||Y' => '0'
第一条# not reach var
跳过了
第二条不满足cal_hgvs_pos
跳过了
第三条# hit 3'downstream only
删除了
结果:
没有可用转录本信息,最终变异注释成了intergenic
模式
总结:
初步排查如上,可能有错漏,但是这个case注释成intergenic
是已发生的问题
Test case:
chr1:g.172376978_172376979insAGG
Correct Annotation should be:
NM_015569.3(DNM3): c.2589_2590insAGG(p.D863_*864insR)
An standard_gHGVS and standard_cHGVS need to be added to final returned BedAnno::Anno object keys to maintain this format of HGVS string.
Tabix Perl API use a XS module to query bgzipped database, and in the
C code, I found it will assign some new feature keys to the object,
which is out of the scope of perl coding, and will cause the incompatibility,
when we use a shared tabix handle.
We'll need to either ask the author of tabix to change the tabix module to support threads::shared, or multiple-threaded invokes standalone tabix handle to implement
multiple-threads.
Test cases should cover all of the following classes
Test cases will have to be any kind of combination of the above classes.
当氨基酸改变相同时($prRef eq $prAlt
),
$prSimpleSame
的值是氨基酸的长度
my $prSimpleSame = 0;
for (
my $pp = 0 ;
$pp < length($prRef) and $pp < length($prAlt) ;
$pp++
)
{
if (
substr( $prRef, $pp, 1 ) eq
substr( $prAlt, $pp, 1 ) )
{
$prSimpleSame++;
}
else {
last;
}
}
$no_parsed_pP
这个变量的值就是氨基酸序列最后一个氨基酸的位置
my $no_parsed_pP = $prBegin - 1 + $prSimpleSame;
当氨基酸序列长度大于1的时候,
$trannoEnt->{p} = 'p.(' . $no_parsed_pP . '_' . ( $no_parsed_pP + length($prRef) - 1 ) . '=)'
显然是错误的,
应该用$prBegin替代$no_parsed_pP
测试用例:
11 111965687 111965691 . GCTC ACTA delins SDHD C4E EX4E coding-synon NM_003002.2 NP_002993.1 c.474_477delGCTCinsACTA . [-p.(159_160=)-]{+p.(158_159=)+}
11 111965690 111965693 . CTG ATA delins SDHD C4E EX4E stop-retained NM_003002.2 NP_002993.1 c.477_479delCTGinsATA . [-p.(160_161=)-]{+p.(159_160=)+}
Annotation give frameshift format annotations to inframe substitution mutation.
Dear developer(s), in README two fields "BlockAttr" and "GenePartsSO" but I didn't find these formats in other databases. The GenePartSO I guess should be a concept similar to "Sequence Ontology" but they are different in format (SO terms looks like "SO:0001819"). For BlockAttr I just cannot figure out what is it. Could you explain more about how could they were defined and in what way can I update configurations by myself (such as update ncbi annotation database from 104 to 109) ?
5'==|>>>>|[============]|>>>|>>>|[=========]|>>>|>>>|[============]|>>>>|==3'
PROM 5U2E D5U1 I5U1 A5U1 5U1 C1 DC1 IC1 AC1 C2E 3U1 D3U1 I3U1 A3U1 3U2E
167 204 163 447 164 204 316 163 191 164 316 205 163 448 164 205
. |EX1 | IVS1 | EX2 | IVS2 | EX3 | IVS3 |EX4E|
如果上游exon末端和下游exon首端序列相同,上游exon末端的变异会错误的识别为repeat sequence,基于3‘规则被移动到下游exon的首端。
示例输入:
9 37432131 37432133 TG .
错误注释结果NM_012203.1:c.862TG[3>2] (std: c.862TG[2] alt: c.866_867delTG )
参考NM_012203.1:c.864_865delTG
Sequence "TG" at position 905_906 was not corrected to "TG" at position 907_908, since they reside in different exons.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.