Giter Site home page Giter Site logo

ltnetcase / bedanno Goto Github PK

View Code? Open in Web Editor NEW
18.0 18.0 5.0 2.86 MB

Annotate genomics variations of hg19 by using a BED format database, which construct from NCBI annotation release 104

License: GNU Lesser General Public License v3.0

Perl 99.87% Shell 0.13%
annotate-genomics-variations annotation-tool function-prediction hgvs human-variation variant-annotation

bedanno's People

Contributors

ltnetcase avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

bedanno's Issues

Primary Tag Assigning Need to Be Fixed

There are some of the genes which in the LRG_reference with their transcript are not list in LRG reference file, which will lead to no primary tag assigning error for them.

test case:
{"chr":"chr17","begin":"41056042","end":"41056043","referenceSequence":"G","variantSequence":"A"}

writing makefile error

I run command perl Makefile.PL,there is a warning like this:
Warning: prerequisite Tabix v0.2.5 not found. Generating a Unix-style Makefile Writing Makefile for BedAnno Writing MYMETA.yml and MYMETA.json
then I run make,the result like this:
Manifying 1 pod document
next I run make test,there is an error:
Skip blib/lib/BedAnno.pm (unchanged) PERL_DL_NONLAZY=1 "/datapool/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t t/BedAnno.t .. Can't locate Test/Most.pm in @INC (you may need to install the Test::Most module) (@INC contains: /datapool/home/bimiaomiao/tool/BedAnno-master/blib/lib /datapool/home/bimiaomiao/tool/BedAnno-master/blib/arch /datapool/soft/install/lib/perl5/site_perl/5.24.1/x86_64-linux /datapool/soft/install/lib/perl5/site_perl/5.24.1 /datapool/soft/install/lib/perl5/5.24.1/x86_64-linux /datapool/soft/install/lib/perl5/5.24.1 .) at t/BedAnno.t line 29. BEGIN failed--compilation aborted at t/BedAnno.t line 29. t/BedAnno.t .. Dubious, test returned 2 (wstat 512, 0x200) No subtests run Test Summary Report t/BedAnno.t (Wstat: 512 Tests: 0 Failed: 0) Non-zero exit status: 2 Parse errors: No plan found in TAP output Files=1, Tests=0, 1 wallclock secs ( 0.04 usr + 0.03 sys = 0.07 CPU) Result: FAIL Failed 1/1 test programs. 0/0 subtests failed. make: *** [test_dynamic] Error 2

About multiple-thread implement

Tabix Perl API Does not Support threads::shared Handle

Tabix Perl API invokes a XS module to query the bgzipped database,
and in the method 'query', some new feature keys will be assigned to
the handle object, which will cause incompatible, if we share the
handle object in our module.

We'll need to impement multiple-threads by either ask the author of tabix
or ourself to rewrite Tabix API to support threads::shared, or multi-threaded
invoke many non-shared handle threads to work as daemon.

The frameshift in 7 transcript need to be fixed

NM_002537.2,311;1
NM_004152.2,282;1
NM_001172437.1,1402;-1
NM_015068.3,1436;-1
NM_001184961.1,1436;-1
NM_016178.2,295;1
NM_001134939.1,260;1

These all lead to error parsing of coordinate.
As any change to those frameshift sites is unknown to function impact except large deletion which cross over these sites.
Our module will try parsing those deletion vars cross over these sites, and return 'abnormal-fs-site' for function impact if other bases substitution around these sites.

Test Case:
{"chr":"chr1","begin":"151739699","end":"151739700","referenceSequence":"T","variantSequence":""}
{"chr":"chr1","begin":"151739699","end":"151739700","referenceSequence":"T","variantSequence":"C"}
{"chr":"chr1","begin":"151739698","end":"151739702","referenceSequence":"CTGA","variantSequence":""}
{"chr":"chr1","begin":"151739698","end":"151739701","referenceSequence":"CTG","variantSequence":""}
{"chr":"chr1","begin":"151739640","end":"151739640","referenceSequence":"","variantSequence":"A"}
{"chr":"chr7","begin":"94293824","end":"94293825","referenceSequence":"A","variantSequence":"T"}
{"chr":"chr7","begin":"94293824","end":"94293827","referenceSequence":"AAA","variantSequence":""}
{"chr":"chr7","begin":"94293824","end":"94293826","referenceSequence":"AA","variantSequence":""}
{"chr":"chr7","begin":"94293824","end":"94293824","referenceSequence":"","variantSequence":"A"}
{"chr":"chr7","begin":"94293520","end":"94293520","referenceSequence":"","variantSequence":"TC"}

Not proper mapped repeat region on reference genome for refseq transcripts

When there are some differences between transcript and reference genome and these locate around tandem of simple repeat region, map information should be correct to let every possible var calling around the region to recognize these differences.
e.g.
NM_015120.4 ALMS1 146;3;|1682;1;|1684;1;|1686;1;
The latter 3 mismatch constitute a repeat difference:

Given mapping

Name Sequence
NM_015120.4:1683-1691 TTCTCCTCT
reference genome _T_T_CTCT

Suggested Mapping: 1682;9;TTCTCT

But unfortunately, these kind of repeat difference is hard to recognize, we can just map them to
1682;5;TT

Test Case:
{"chr":"chr2","begin":"73675226","end":"73675227","referenceSequence":"T","variantSequence":"T"}
{"chr":"chr2","begin":"73675227","end":"73675227","referenceSequence":"","variantSequence":"CTC"}

区段之间的插入注释成了`intergenic`

例子:
chr19 36120576 36120577 T TCCAGCAG chr19 g.36120577_36120578insCCAGCAG
中间过程:
初步提取了3个注释区段
36120421 36120575 'NM_024321.3|RBM42|79171|+|C2|316|EX2|205|358|129|282|154||Y' => '0' 36120575 36120577 'NM_024321.3|RBM42|79171|+|DC2|163|IVS2|358+1|358+2|282+1|282+2|2||Y' => '0' 36120577 36122039 'NM_024321.3|RBM42|79171|+|IC2|191|IVS2|358+3|359-3|282+3|283-3|1462||Y' => '0'
第一条# not reach var跳过了
第二条不满足cal_hgvs_pos跳过了
第三条# hit 3'downstream only删除了
结果:
没有可用转录本信息,最终变异注释成了intergenic模式

总结:
初步排查如上,可能有错漏,但是这个case注释成intergenic是已发生的问题

Tabix Perl API Does NOT Support threads::shared Handle

Tabix Perl API use a XS module to query bgzipped database, and in the
C code, I found it will assign some new feature keys to the object,
which is out of the scope of perl coding, and will cause the incompatibility,
when we use a shared tabix handle.

We'll need to either ask the author of tabix to change the tabix module to support threads::shared, or multiple-threaded invokes standalone tabix handle to implement
multiple-threads.

Test Cases Summary

Test cases should cover all of the following classes

Molecular Type

  1. non-coding RNA
  2. protein coding RNA

Whether on mitochondrion

  1. chrMT
  2. non-chrMT

Different Gene Component

  1. promoter
  2. 5'utr
  3. 3'utr
  4. 5'utr intron
  5. 3'utr intron
  6. cds region
  7. init-codon
  8. stop-codon
  9. splice-ds(5')
  10. splice-as(3')
  11. intron in cds region
  12. promoter~5'utr edge
  13. 5'utr~cds edge
  14. 5'utr~5'utr-intron edge
  15. cds~5'utr-intron edge
  16. cds~3'utr-intron edge
  17. cds~cds-intron edge
  18. cds~3'utr edge
  19. 3'utr~3'utr-intron edge
  20. 3'utr~3'downstream edge
  21. promoter~cds edge
  22. cds~3'downstream edge
  23. span cases
  24. intergenic
  25. ncRNA
  26. annotation-fail

Variant Type

  1. ref
  2. snv
  3. ins
    • a. non-repeat
    • b. repeat
      • i. dup
      • ii. microsatellite
  4. del
    • a. non-repeat
    • b. repeat
      • i. dup
      • ii. microsatellite
  5. delins
    • a. same length
    • b. different length
  6. no-call
    • a. snv
    • b. ins
    • c. delins

Variant's reference Length in refSeq

  1. single base
  2. multiple bases
  3. zero bases (ins)

Reference genome to RefSeq mismatches

  1. I or DI ( insertion/delins on RefSeq )
    • a. var beside the left edge
    • b. var beside the right edge
    • c. var in the middle of the mismatch region
    • d. var get across this mismatch
    • e. var partly cover left end of mismatch
    • f. var partly cover right end of mismatch
  2. D ( deletion on RefSeq )
    • a. var beside the left edge
    • b. var beside the right edge
    • c. var just at the same position of mismatch
    • d. var get across the mismatch

Functional Impact which may overlap with Gene Components class

  1. "abnormal-inseq-stop",
  2. "abnormal-intron",
  3. "annotation-fail",
  4. "cds-del",
  5. "cds-indel",
  6. "cds-ins",
  7. "cds-loss",
  8. "coding-synon",
  9. "init-loss",
  10. "no-change",
  11. "splice",
  12. "splice-3",
  13. "splice-5",
  14. "stop-gain",
  15. "stop-loss",
  16. "stop-retained",
  17. "unknown-no-call",
  18. "utr-3",
  19. "utr-5",
  20. "altstart",
  21. "frameshift",
  22. "intron",
  23. "knockout",
  24. "missense",
  25. "ncRNA",
  26. "nonsense"
  27. "promoter"
  28. "unknown"

Test cases will have to be any kind of combination of the above classes.

Candidate Genes should be of the following classes

  1. only one cds exon, without utr region
  2. multiple cds exon, without utr region
  3. multiple exons, with separate 5'utr and 3'utr region
  4. multiple exons, with united 5'/3'utr-cds exons.
  5. transcript with mismatch between reference genome and refSeq.
  6. transcript with insertion/delins on refSeq compared with reference genome.
  7. transcript with deletion on refSeq compared with reference genome.
  8. transcript with abnormal-inseq-stop codon
  9. transcript with abnormal-intron
  10. transcript with selenocysteine
  11. transcript with multiple-altstart codons
  12. ncRNA on chrMT
  13. coding RNA on chrMT with polyA-Tail

coding-synon or stop-retained for multiple amino acids result position shift

当氨基酸改变相同时($prRef eq $prAlt),
$prSimpleSame的值是氨基酸的长度

                    my $prSimpleSame = 0;
                    for (
                        my $pp = 0 ;
                        $pp < length($prRef) and $pp < length($prAlt) ;
                        $pp++
                      )
                    {
                        if (
                            substr( $prRef, $pp, 1 ) eq
                            substr( $prAlt, $pp, 1 ) )
                        {
                            $prSimpleSame++;
                        }
                        else {
                            last;
                        }
                    }

$no_parsed_pP这个变量的值就是氨基酸序列最后一个氨基酸的位置

my $no_parsed_pP = $prBegin - 1 + $prSimpleSame;                     

当氨基酸序列长度大于1的时候,
$trannoEnt->{p} = 'p.(' . $no_parsed_pP . '_' . ( $no_parsed_pP + length($prRef) - 1 ) . '=)'显然是错误的,
应该用$prBegin替代$no_parsed_pP

测试用例:

11      111965687       111965691       .       GCTC    ACTA    delins  SDHD    C4E     EX4E    coding-synon    NM_003002.2     NP_002993.1     c.474_477delGCTCinsACTA .       [-p.(159_160=)-]{+p.(158_159=)+}
11      111965690       111965693       .       CTG     ATA     delins  SDHD    C4E     EX4E    stop-retained   NM_003002.2     NP_002993.1     c.477_479delCTGinsATA   .       [-p.(160_161=)-]{+p.(159_160=)+}

What's the rule defining BlockAttr and GenePartsSO?

Dear developer(s), in README two fields "BlockAttr" and "GenePartsSO" but I didn't find these formats in other databases. The GenePartSO I guess should be a concept similar to "Sequence Ontology" but they are different in format (SO terms looks like "SO:0001819"). For BlockAttr I just cannot figure out what is it. Could you explain more about how could they were defined and in what way can I update configurations by myself (such as update ncbi annotation database from 104 to 109) ?

 5'==|>>>>|[============]|>>>|>>>|[=========]|>>>|>>>|[============]|>>>>|==3'
 PROM 5U2E D5U1 I5U1 A5U1 5U1 C1  DC1 IC1 AC1 C2E 3U1 D3U1 I3U1 A3U1 3U2E
 167  204  163  447  164  204 316 163 191 164 316 205 163  448  164  205
 .   |EX1 |     IVS1     |  EX2  |   IVS2    |  EX3  |    IVS3      |EX4E|              

修正repeat sequence时跨越不同exon

如果上游exon末端和下游exon首端序列相同,上游exon末端的变异会错误的识别为repeat sequence,基于3‘规则被移动到下游exon的首端。
示例输入:

9	37432131	37432133	TG	.

错误注释结果NM_012203.1:c.862TG[3>2] (std: c.862TG[2] alt: c.866_867delTG )
参考NM_012203.1:c.864_865delTG

Sequence "TG" at position 905_906 was not corrected to "TG" at position 907_908, since they reside in different exons.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.