ltnetcase / bedanno Goto Github PK

View Code? Open in Web Editor NEW

18.0 18.0 5.0 2.86 MB

Annotate genomics variations of hg19 by using a BED format database, which construct from NCBI annotation release 104

License: GNU Lesser General Public License v3.0

Perl 99.87% Shell 0.13%

annotate-genomics-variations annotation-tool function-prediction hgvs human-variation variant-annotation

bedanno's People

Contributors

Stargazers

Watchers

Forkers

honglongwu zilhua liserjrqlxue jinyan9293 healthvivo

bedanno's Issues

`-u` REJECTED for current HGVS version

in https://varnomen.hgvs.org/recommendations/open-issues/
format like c.-237-u29A>G is rejected

I simply change $cDot for my own use. But I think you can go further If you will.

+d is similar case. Since no example case to test in hand and less to occur, I have not modify it yet.

Primary Tag Assigning Need to Be Fixed

There are some of the genes which in the LRG_reference with their transcript are not list in LRG reference file, which will lead to no primary tag assigning error for them.

test case:
{"chr":"chr17","begin":"41056042","end":"41056043","referenceSequence":"G","variantSequence":"A"}

writing makefile error

I run command perl Makefile.PL,there is a warning like this:
Warning: prerequisite Tabix v0.2.5 not found. Generating a Unix-style Makefile Writing Makefile for BedAnno Writing MYMETA.yml and MYMETA.json
then I run make,the result like this:
Manifying 1 pod document
next I run make test,there is an error:
Skip blib/lib/BedAnno.pm (unchanged) PERL_DL_NONLAZY=1 "/datapool/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t t/BedAnno.t .. Can't locate Test/Most.pm in @INC (you may need to install the Test::Most module) (@INC contains: /datapool/home/bimiaomiao/tool/BedAnno-master/blib/lib /datapool/home/bimiaomiao/tool/BedAnno-master/blib/arch /datapool/soft/install/lib/perl5/site_perl/5.24.1/x86_64-linux /datapool/soft/install/lib/perl5/site_perl/5.24.1 /datapool/soft/install/lib/perl5/5.24.1/x86_64-linux /datapool/soft/install/lib/perl5/5.24.1 .) at t/BedAnno.t line 29. BEGIN failed--compilation aborted at t/BedAnno.t line 29. t/BedAnno.t .. Dubious, test returned 2 (wstat 512, 0x200) No subtests run Test Summary Report t/BedAnno.t (Wstat: 512 Tests: 0 Failed: 0) Non-zero exit status: 2 Parse errors: No plan found in TAP output Files=1, Tests=0, 1 wallclock secs ( 0.04 usr + 0.03 sys = 0.07 CPU) Result: FAIL Failed 1/1 test programs. 0/0 subtests failed. make: *** [test_dynamic] Error 2

Annotation incorrect for mutation g.chr17:7578239_7578240delCCinsAA

Annotation Engine for this mutation gives a large deletion, but actually a stop gain mutation.
the correct annotation for NM_000546.5 should be NM_000546.5: c.609_610delinsTT（p.(Glu204*)）

About multiple-thread implement

Tabix Perl API Does not Support threads::shared Handle

Tabix Perl API invokes a XS module to query the bgzipped database,
and in the method 'query', some new feature keys will be assigned to
the handle object, which will cause incompatible, if we share the
handle object in our module.

We'll need to impement multiple-threads by either ask the author of tabix
or ourself to rewrite Tabix API to support threads::shared, or multi-threaded
invoke many non-shared handle threads to work as daemon.

The frameshift in 7 transcript need to be fixed

NM_002537.2,311;1
NM_004152.2,282;1
NM_001172437.1,1402;-1
NM_015068.3,1436;-1
NM_001184961.1,1436;-1
NM_016178.2,295;1
NM_001134939.1,260;1

These all lead to error parsing of coordinate.
As any change to those frameshift sites is unknown to function impact except large deletion which cross over these sites.
Our module will try parsing those deletion vars cross over these sites, and return 'abnormal-fs-site' for function impact if other bases substitution around these sites.

Test Case:
{"chr":"chr1","begin":"151739699","end":"151739700","referenceSequence":"T","variantSequence":""}
{"chr":"chr1","begin":"151739699","end":"151739700","referenceSequence":"T","variantSequence":"C"}
{"chr":"chr1","begin":"151739698","end":"151739702","referenceSequence":"CTGA","variantSequence":""}
{"chr":"chr1","begin":"151739698","end":"151739701","referenceSequence":"CTG","variantSequence":""}
{"chr":"chr1","begin":"151739640","end":"151739640","referenceSequence":"","variantSequence":"A"}
{"chr":"chr7","begin":"94293824","end":"94293825","referenceSequence":"A","variantSequence":"T"}
{"chr":"chr7","begin":"94293824","end":"94293827","referenceSequence":"AAA","variantSequence":""}
{"chr":"chr7","begin":"94293824","end":"94293826","referenceSequence":"AA","variantSequence":""}
{"chr":"chr7","begin":"94293824","end":"94293824","referenceSequence":"","variantSequence":"A"}
{"chr":"chr7","begin":"94293520","end":"94293520","referenceSequence":"","variantSequence":"TC"}

Not proper mapped repeat region on reference genome for refseq transcripts

When there are some differences between transcript and reference genome and these locate around tandem of simple repeat region, map information should be correct to let every possible var calling around the region to recognize these differences.
e.g.
NM_015120.4 ALMS1 146;3;|1682;1;|1684;1;|1686;1;
The latter 3 mismatch constitute a repeat difference：

Given mapping

Name	Sequence
NM_015120.4:1683-1691	TTCTCCTCT
reference genome	_T_T_CTCT

Suggested Mapping: 1682;9;TTCTCT

But unfortunately, these kind of repeat difference is hard to recognize, we can just map them to
1682;5;TT

Test Case:
{"chr":"chr2","begin":"73675226","end":"73675227","referenceSequence":"T","variantSequence":"T"}
{"chr":"chr2","begin":"73675227","end":"73675227","referenceSequence":"","variantSequence":"CTC"}

区段之间的插入注释成了`intergenic`

例子：
chr19 36120576 36120577 T TCCAGCAG chr19 g.36120577_36120578insCCAGCAG
中间过程：
初步提取了3个注释区段
36120421 36120575 'NM_024321.3|RBM42|79171|+|C2|316|EX2|205|358|129|282|154||Y' => '0' 36120575 36120577 'NM_024321.3|RBM42|79171|+|DC2|163|IVS2|358+1|358+2|282+1|282+2|2||Y' => '0' 36120577 36122039 'NM_024321.3|RBM42|79171|+|IC2|191|IVS2|358+3|359-3|282+3|283-3|1462||Y' => '0'
第一条# not reach var跳过了
第二条不满足cal_hgvs_pos跳过了
第三条# hit 3'downstream only删除了
结果：
没有可用转录本信息，最终变异注释成了intergenic模式

总结：
初步排查如上，可能有错漏，但是这个case注释成intergenic是已发生的问题

Inframe Insertion before stop codon annotation fail

Test case:

chr1:g.172376978_172376979insAGG

Correct Annotation should be:

NM_015569.3(DNM3): c.2589_2590insAGG（p.D863_*864insR）

HGVS now requires self-rev-complement delins to be annotated as inversion format

An standard_gHGVS and standard_cHGVS need to be added to final returned BedAnno::Anno object keys to maintain this format of HGVS string.

Tabix Perl API Does NOT Support threads::shared Handle

Tabix Perl API use a XS module to query bgzipped database, and in the
C code, I found it will assign some new feature keys to the object,
which is out of the scope of perl coding, and will cause the incompatibility,
when we use a shared tabix handle.

We'll need to either ask the author of tabix to change the tabix module to support threads::shared, or multiple-threaded invokes standalone tabix handle to implement
multiple-threads.

Test Cases Summary

Test cases should cover all of the following classes

Molecular Type

non-coding RNA
protein coding RNA

Whether on mitochondrion

chrMT
non-chrMT

Different Gene Component

promoter
5'utr
3'utr
5'utr intron
3'utr intron
cds region
init-codon
stop-codon
splice-ds(5')
splice-as(3')
intron in cds region
promoter~5'utr edge
5'utr~cds edge
5'utr~5'utr-intron edge
cds~5'utr-intron edge
cds~3'utr-intron edge
cds~cds-intron edge
cds~3'utr edge
3'utr~3'utr-intron edge
3'utr~3'downstream edge
promoter~cds edge
cds~3'downstream edge
span cases
intergenic
ncRNA
annotation-fail

Variant Type

ref
snv
ins
- a. non-repeat
- b. repeat
  - i. dup
  - ii. microsatellite
del
- a. non-repeat
- b. repeat
  - i. dup
  - ii. microsatellite
delins
- a. same length
- b. different length
no-call
- a. snv
- b. ins
- c. delins

Variant's reference Length in refSeq

single base
multiple bases
zero bases (ins)

Reference genome to RefSeq mismatches

I or DI ( insertion/delins on RefSeq )
- a. var beside the left edge
- b. var beside the right edge
- c. var in the middle of the mismatch region
- d. var get across this mismatch
- e. var partly cover left end of mismatch
- f. var partly cover right end of mismatch
D ( deletion on RefSeq )
- a. var beside the left edge
- b. var beside the right edge
- c. var just at the same position of mismatch
- d. var get across the mismatch

Functional Impact which may overlap with Gene Components class

"abnormal-inseq-stop",
"abnormal-intron",
"annotation-fail",
"cds-del",
"cds-indel",
"cds-ins",
"cds-loss",
"coding-synon",
"init-loss",
"no-change",
"splice",
"splice-3",
"splice-5",
"stop-gain",
"stop-loss",
"stop-retained",
"unknown-no-call",
"utr-3",
"utr-5",
"altstart",
"frameshift",
"intron",
"knockout",
"missense",
"ncRNA",
"nonsense"
"promoter"
"unknown"

Test cases will have to be any kind of combination of the above classes.

Candidate Genes should be of the following classes

only one cds exon, without utr region
multiple cds exon, without utr region
multiple exons, with separate 5'utr and 3'utr region
multiple exons, with united 5'/3'utr-cds exons.
transcript with mismatch between reference genome and refSeq.
transcript with insertion/delins on refSeq compared with reference genome.
transcript with deletion on refSeq compared with reference genome.
transcript with abnormal-inseq-stop codon
transcript with abnormal-intron
transcript with selenocysteine
transcript with multiple-altstart codons
ncRNA on chrMT
coding RNA on chrMT with polyA-Tail

coding-synon or stop-retained for multiple amino acids result position shift

当氨基酸改变相同时（$prRef eq $prAlt），
$prSimpleSame的值是氨基酸的长度

                    my $prSimpleSame = 0;
                    for (
                        my $pp = 0 ;
                        $pp < length($prRef) and $pp < length($prAlt) ;
                        $pp++
                      )
                    {
                        if (
                            substr( $prRef, $pp, 1 ) eq
                            substr( $prAlt, $pp, 1 ) )
                        {
                            $prSimpleSame++;
                        }
                        else {
                            last;
                        }
                    }

$no_parsed_pP这个变量的值就是氨基酸序列最后一个氨基酸的位置

my $no_parsed_pP = $prBegin - 1 + $prSimpleSame;

当氨基酸序列长度大于1的时候，
$trannoEnt->{p} = 'p.(' . $no_parsed_pP . '_' . ( $no_parsed_pP + length($prRef) - 1 ) . '=)'显然是错误的，
应该用$prBegin替代$no_parsed_pP

测试用例：

11      111965687       111965691       .       GCTC    ACTA    delins  SDHD    C4E     EX4E    coding-synon    NM_003002.2     NP_002993.1     c.474_477delGCTCinsACTA .       [-p.(159_160=)-]{+p.(158_159=)+}
11      111965690       111965693       .       CTG     ATA     delins  SDHD    C4E     EX4E    stop-retained   NM_003002.2     NP_002993.1     c.477_479delCTGinsATA   .       [-p.(160_161=)-]{+p.(159_160=)+}

Bug for stop codon mutation annotated as frameshift

Annotation give frameshift format annotations to inframe substitution mutation.

What's the rule defining BlockAttr and GenePartsSO?

Dear developer(s), in README two fields "BlockAttr" and "GenePartsSO" but I didn't find these formats in other databases. The GenePartSO I guess should be a concept similar to "Sequence Ontology" but they are different in format (SO terms looks like "SO:0001819"). For BlockAttr I just cannot figure out what is it. Could you explain more about how could they were defined and in what way can I update configurations by myself (such as update ncbi annotation database from 104 to 109) ?

 5'==|>>>>|[============]|>>>|>>>|[=========]|>>>|>>>|[============]|>>>>|==3'
 PROM 5U2E D5U1 I5U1 A5U1 5U1 C1  DC1 IC1 AC1 C2E 3U1 D3U1 I3U1 A3U1 3U2E
 167  204  163  447  164  204 316 163 191 164 316 205 163  448  164  205
 .   |EX1 |     IVS1     |  EX2  |   IVS2    |  EX3  |    IVS3      |EX4E|

codon different for mitochondrion

修正repeat sequence时跨越不同exon

如果上游exon末端和下游exon首端序列相同，上游exon末端的变异会错误的识别为repeat sequence，基于3‘规则被移动到下游exon的首端。
示例输入：

9	37432131	37432133	TG	.

错误注释结果NM_012203.1:c.862TG[3>2] (std: c.862TG[2] alt: c.866_867delTG )
参考NM_012203.1:c.864_865delTG

Sequence "TG" at position 905_906 was not corrected to "TG" at position 907_908, since they reside in different exons.