mutalyzer / mutalyzer

Tool suite for HGVS variant descriptions

Home Page: https://mutalyzer.nl

License: MIT License

Python 100.00%

bioinformatics mutalyzer hgvs description variant convert map normalize

mutalyzer's Introduction

Mutalyzer

Package designed to check descriptions of sequence variants according to the Human Genome Variation Society (HGVS) guidelines.

Please see ReadTheDocs for the latest documentation.

mutalyzer's Issues

Position converter not working for chromosomes.

One of the examples results in an ERETR error.

Usage of legacy locus selectors.

The name checker crashes on the following description.

NG_012337.1(SDHD):c.274G>T

It would be nice if whenever a legacy locus selector is used, we try to find it in the reference model and present the user with a selectable list of options. E.g., in this particular example, we could say something like

Transcript "SDHD" not found, but a gene was found by that name. Please choose from:
NG_012337.1(NM_003002.2):c.274G>T (succinate dehydrogenase complex, subunit D, integral membrane protein)

Likewise, we could allow for the HGNC id in the same way.

Note: I do not suggest resolving the full legacy locus selectors (e.g., SDHD_v1). In this case I would discard everything after the _ and follow the same procedure described above.
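
A minimal sketch of the suggested lookup, assuming the reference model can be reduced to a mapping from gene name to transcript IDs. The function name and the gene_to_transcripts table are hypothetical and not part of Mutalyzer:

```python
def suggest_selectors(reference_id, selector, gene_to_transcripts):
    """Return candidate "reference(transcript)" prefixes for a legacy selector."""
    # Discard everything after the first underscore: "SDHD_v1" -> "SDHD".
    gene = selector.split("_", 1)[0]
    transcripts = gene_to_transcripts.get(gene, [])
    return [f"{reference_id}({transcript})" for transcript in transcripts]


# Hypothetical lookup table standing in for the reference model.
genes = {"SDHD": ["NM_003002.2"]}
suggestions = suggest_selectors("NG_012337.1", "SDHD_v1", genes)
print(suggestions)
```

An empty list would mean that neither a transcript nor a gene by that name was found, in which case the original error still applies.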

Missing affected transcripts.

For the following description:

NC_000016.9:g.15815278C>T

No affected transcripts were found, while Mutalyzer 2 finds 13.

Multiple entry points for position converter.

In the position_convert endpoint, there seem to be multiple ways of providing input (i.e., via a description and via a combination of other input fields). It would be cleaner to split this into two different endpoints.

Name checker bug.

This variant inserts two consecutive Cs. However, it is corrected to a duplication that does not contain two consecutive Cs.

Change infos into warnings

there are some errors in readme.md

git checkout refactor

The refactor branch doesn't exist. Does it matter?

git clone [email protected]:mlefter/mutalyzer-visualization-vuetify.git

It seems to be a private repository. Could it be shared?

thanks!

Start and end positions swapped.

When the following request is done to the API:

curl -X GET "http://v3.mutalyzer.nl/api/reference_model/NM_002001.2" -H  "accept: application/json"

we get the following response:

"model": {
  "id": "NM_002001.2",
  "type": "record",
  "location": {
    "type": "range",
    "start": {
      "type": "point",
      "position": 1191
    },
    "end": {
      "type": "point",
      "position": 0
    }
  },
...

The start and end positions seem to be swapped.
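
A regression check for this could look like the sketch below. It only assumes the "location" layout shown in the response above; the helper name is hypothetical:

```python
def range_is_ordered(location):
    """True if the start position of a range does not exceed its end position."""
    return location["start"]["position"] <= location["end"]["position"]


# The location as reported in the response above.
location = {
    "type": "range",
    "start": {"type": "point", "position": 1191},
    "end": {"type": "point", "position": 0},
}
ordered = range_is_ordered(location)
print(ordered)
```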

Have Name Checker convert "chr1" to proper NCBI Gene ID

Suggestion/Improvement
When entering something like "chr1:g.169519049C>T" into the Name Checker, it would be helpful for MUT to return, or go ahead and convert, "chr1" to NC_000001.10.
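
One way this could work is a lookup keyed on assembly and chromosome alias, sketched below. The table and function are hypothetical, list only chromosome 1, and assume the user states which assembly is meant, since "chr1" alone is ambiguous between GRCh37 (NC_000001.10) and GRCh38 (NC_000001.11):

```python
# Hypothetical alias table; a real one would cover every chromosome.
CHROMOSOME_ALIASES = {
    ("GRCh37", "chr1"): "NC_000001.10",
    ("GRCh38", "chr1"): "NC_000001.11",
}


def resolve_chromosome(description, assembly="GRCh37"):
    """Replace a "chrN" reference by its accession, if the alias is known."""
    reference, separator, variant = description.partition(":")
    accession = CHROMOSOME_ALIASES.get((assembly, reference))
    if accession is None:
        return description  # Unknown alias: leave the description untouched.
    return f"{accession}{separator}{variant}"


resolved = resolve_chromosome("chr1:g.169519049C>T")
print(resolved)
```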

Suggestion for performance.

Perhaps we should not send the entire reference model and reference sequence to the JavaScript client by default. This could be done on request, if it is absolutely needed.

Incorrect mapping around splice sites.

Variant

NM_002001.4:c.55_56insTTTT

is converted to:

NC_000001.11(NM_002001.4):c.55_56insTTTT

Which is not correct because there is an intron between c.55 and c.56.

It is unclear how to map this variant; for now, it would be nice to raise an error.
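
Detecting this situation could be as simple as the sketch below. It assumes the exon lengths are known and that the position is counted in the same transcript coordinate system as those lengths; the exon lengths used here are made up for illustration:

```python
from itertools import accumulate


def insertion_spans_exon_boundary(exon_lengths, position):
    """True if an insertion between position and position + 1 (1-based) sits
    exactly on a junction between two exons."""
    # Last position of every exon except the final one.
    boundaries = list(accumulate(exon_lengths))[:-1]
    return position in boundaries


# Hypothetical transcript whose first exon ends at position 55.
exon_lengths = [55, 100, 80]
on_boundary = insertion_spans_exon_boundary(exon_lengths, 55)
print(on_boundary)
```

When the check returns True, the converter could raise an error instead of emitting a genomic description.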

Genomic descriptions for NM transcripts for GRCh37 and GRCh38

Website stalls indefinitely.

When the following description is offered, the website stalls indefinitely.

CCDS4702.1:c.123C>T

Mitochondrial reference not recognised.

The following description gives an error, while none is expected.

Wrong title.

The title of description_extract says: "Convert a position".

Repeated sequences.

Add support for repeated sequences using the following format:

start _ end SEQ [ repeat_number ]

where SEQ is the repeat unit, which:

1. occurs repeat_number_seq times between the start and end locations in the reference sequence.
1.1. repeat_number_seq >= 0
1.2. (end - start + 1) % |SEQ| = 0
2. occurs repeat_number times in the observed sequence, with repeat_number >= 0.
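
The conditions above can be sketched as follows. This is an illustration only, assuming 1-based inclusive positions on a plain string; the function names and the example sequence are hypothetical:

```python
def count_reference_repeats(reference, start, end, unit):
    """Check conditions 1.1 and 1.2 and return repeat_number_seq."""
    if not unit:
        raise ValueError("empty repeat unit")
    span = reference[start - 1:end]  # 1-based, inclusive range.
    if len(span) % len(unit) != 0:
        raise ValueError("range length is not a multiple of the unit length")
    count = len(span) // len(unit)
    if span != unit * count:
        raise ValueError("range does not consist of repeats of the unit")
    return count


def apply_repeat(reference, start, end, unit, repeat_number):
    """Return the observed sequence with repeat_number copies of the unit."""
    if repeat_number < 0:
        raise ValueError("repeat_number must be >= 0")
    count_reference_repeats(reference, start, end, unit)
    return reference[:start - 1] + unit * repeat_number + reference[end:]


reference = "AACAGCAGCAGTT"  # Made-up sequence: three CAG units at 3..11.
in_reference = count_reference_repeats(reference, 3, 11, "CAG")
observed = apply_repeat(reference, 3, 11, "CAG", 5)
print(in_reference, observed)
```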

algebra should use the right backend implementation

Currently everything is converted in the frontend to sequences. If the input is provided as variants, these should be given to the backend as is for performance reasons.

Superfluous selector for transcript references.

When a transcript is used in the name checker, there is no need to use the transcript ID both as reference sequence ID and selector.

Descriptions are not clickable.

It would be nice to have the descriptions in the section "Equivalent descriptions" link to a new name check run.

Short sequence repeats.

When checking the following description:

LRG_24:g.5525C[4]

The non-informative message "Some response error occured." appears. I would expect either a message stating that the operation is not supported, or a normalised result.

cDNA to genomic converter: is the data up to date?

Hi everyone!

We are trying to use your API to convert cDNA positions to genomic positions. Overall it works fine, but we had a problem with one conversion:
https://v3.mutalyzer.nl/positionconverter?referenceId=NC_000003.11&fromSelectorId=NM_014850.4&fromCoordinateSystem=c&position=2392&toSelectorId=&toCoordinateSystem=g&includeOverlapping=true
The problem is the version of the NM: NM_014850.4 doesn't work, but NM_014850.3 works fine.
According to NCBI, version .4 has been the accepted one since November 2018 (https://www.ncbi.nlm.nih.gov/nuccore/NM_014850). Is this time gap normal? If so, where can I find the list of accepted NM transcripts for a given NC?

Thanks,

Quentin Riché-Piotaix, PhD
Bioinformatic Engineer,
CHU Poitiers

Missing warning messages.

The following descriptions are (rightfully) silently corrected. However a warning about why they were corrected would be in order.

NG_012337.1:g.7125+1G>T
NG_012337.1:g.7125G>TA

For the first description I would expect a warning about using an intronic position without a proper exon boundary.
For the second description I would expect a warning about the type (operator) used.

Normalised description model missing.

I can see the description model of the (possibly wrong) input, but the description model after normalisation is missing. Arguably, we should only offer the normalised model, if any at all.

Wrong insertion of a range.

A description like NG_123.4:g.ins100_110 is shorthand for NG_123.4:g.insNG_123.4:100_110. However, this variant differs from this one.

I suspect that the selection of a transcript may have something to do with this.

Mapping is asymmetric.

A RefSeq transcript can be mapped to a genomic transcript, but not the other way around.

`get_selectors` does not work for chromosomes.

The following request:

curl -X GET "http://v3.mutalyzer.nl/api/get_selectors/NC_000001.11" -H  "accept: application/json"

results in an ERETR error. It is unclear why.

Errors are not formatted.

It would be nice to have a more readable formatting of the errors.

RNA descriptions.

The following description (generated by Mutalyzer) is not accepted: NG_012337.1(NM_003002.2):r.([274g>u;278u>g])

bug converting cDNA to genomic position with Mutalyzer API v3

Hi team,

We are trying to use your API to convert cDNA positions to genomic positions: https://v3.mutalyzer.nl/positionconverter?referenceId=NM_000334.4&fromSelectorId&fromCoordinateSystem=c&position=9877&toSelectorId&toCoordinateSystem=g&includeOverlapping=true

The result obtained with this version of the API is not valid. In this example it should be NC_000017.10:g.62013765C>T, which we correctly obtain when using Mutalyzer v2: https://mutalyzer.nl/position-converter?assembly_name_or_alias=GRCh37&description=NM_000334.4%3Ac.9877G%3EA

Thanks!
Leslie Matalonga

--
Leslie Matalonga, PhD
Clinical Genomics Specialist
CNAG-CRG
Tel:934020828

Missing feedback.

The following description:

NC_000016.9:g.[15815278C>T;15815278del]

is normalised to:

NC_000016.9:g.15815278C>T

Part of the description is discarded, but no warning or errors are given.
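
A warning could be driven by a simple overlap check such as the sketch below. It assumes each variant has already been reduced to a label plus 1-based inclusive start and end positions; this representation is hypothetical and not Mutalyzer's internal model:

```python
def find_overlaps(variants):
    """Return pairs of labels whose (start, end) ranges overlap."""
    ordered = sorted(variants, key=lambda v: (v[1], v[2]))
    overlaps = []
    for index, (label_a, _, end_a) in enumerate(ordered):
        for label_b, start_b, _ in ordered[index + 1:]:
            if start_b > end_a:
                break  # Sorted by start, so later variants cannot overlap.
            overlaps.append((label_a, label_b))
    return overlaps


variants = [
    ("15815278C>T", 15815278, 15815278),
    ("15815278del", 15815278, 15815278),
]
overlaps = find_overlaps(variants)
print(overlaps)
```

Any non-empty result would be a reason to report a warning or an error rather than silently dropping one of the variants.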

Server error.

Some internal server error is triggered when checking the following variant description.

NC_000001.11:g.114750024_114750025ins[(123);114750025_114750040]

Wrong interpretation of conversions.

The following conversion gets corrected to LRG_199:g.=, which does not look right to me.

Wrong normalisation.

The following variant description :

NC_000016.9:g.[15815278C>A;15815279del]

is erroneously normalised to:

NC_000016.9:g.15815277_15815279dup

No exons in transcript model.

When a transcript is used in the name checker, error ESELECTORMODELNOEXONS may be raised. The name checker can and should continue in this case, by assuming that the whole transcript is one big exon.
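
The proposed fallback might look like the sketch below. The selector model layout (an "exon" key holding a list of ranges) and the function name are assumptions made for illustration:

```python
def exons_or_whole_transcript(selector_model, transcript_length):
    """Return (exons, warning); fall back to one exon spanning the transcript."""
    exons = selector_model.get("exon") or []
    if exons:
        return exons, None
    warning = "No exons in selector model; treating the transcript as one exon."
    return [(0, transcript_length)], warning


# Hypothetical selector model without exon information.
exons, warning = exons_or_whole_transcript({"id": "NM_003002.2"}, 1382)
print(exons, warning)
```

Returning the warning alongside the exons lets the name checker continue while still telling the user what was assumed.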

Halt on ambiguous descriptions.

In the following example, the description can not be interpreted because of internal inconsistencies.

NG_012337.1:g.7125delGACinsT

According to the position, one nucleotide is deleted, but according to the (optional) sequence, three nucleotides are deleted. In case of such inconsistencies, I would suggest to halt instead of silently correcting the description.
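
The inconsistency itself is cheap to detect. A minimal sketch, assuming 1-based inclusive positions and a hypothetical helper name:

```python
def deleted_sequence_is_consistent(start, end, deleted):
    """True if the optional deleted sequence has the length implied by the range."""
    return end - start + 1 == len(deleted)


# g.7125delGACinsT: one position, but three nucleotides listed as deleted.
consistent = deleted_sequence_is_consistent(7125, 7125, "GAC")
print(consistent)
```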

Variant overlap detection and variant ordering

The following allele descriptions are handled differently:
NG_008376.4:g.[6933del;6932_6933insC], and:
NG_008376.4:g.[6932_6933insC;6933del].
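
One way to make the result independent of the input order is to sort the variants on their location before processing. The sketch below assumes 0-based half-open intervals, so that an insertion between 6932 and 6933 is the empty interval [6932, 6932) and the deletion of 6933 is [6932, 6933); the dictionary layout is hypothetical:

```python
def canonical_order(variants):
    """Sort variants by start, then end, so the input order does not matter."""
    return sorted(variants, key=lambda v: (v["start"], v["end"]))


deletion = {"text": "6933del", "start": 6932, "end": 6933}
insertion = {"text": "6932_6933insC", "start": 6932, "end": 6932}

first = [v["text"] for v in canonical_order([deletion, insertion])]
second = [v["text"] for v in canonical_order([insertion, deletion])]
print(first, second)
```

Comparing the input order with the canonical order would also be enough to decide whether a warning about non-sorted variants is due.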

3' rule is not applied to some hgvs.g descriptions

NC_000012.11:g.78582566_78582568delinsGATAA should be normalized to NC_000012.11:g.78582569_78582571delinsGATAA

Check protein descriptions

http://v3.mutalyzer.nl/namechecker/NG_012337.1(NM_003002.2):c.274G%3ET

Server error for view variants endpoint

An internal server error is triggered for the following:

https://v3.mutalyzer.nl/api/view_variants/test

Issue warning on non-sorted variants

Incorrect allele descriptions.

The following variant description:

LRG_303:g.6883_6884insTTTCGCCCC

is correctly normalised to:

LRG_303:g.6875_6883dup

However, when another variant is added upstream, e.g.:

LRG_303:g.[11del;6883_6884insTTTCGCCCC]

it is incorrectly normalised to:

LRG_303:g.[11del;6883_6884insCGCCCCTTT]

Perhaps this is a bug in the mutator module?

Back translation (2).

When checking description NP_002993.1:p.Asp92Glu, no suggestions for back translations are given. Mutalyzer 2 used to do this.
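
For illustration, back translation restricted to single-nucleotide substitutions can be sketched with the standard genetic code as below. The reference codon GAC for position 92 is an assumption (the real codon would come from the transcript), and the function is not Mutalyzer's implementation:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def back_translate(reference_codon, protein_position, target):
    """List single substitutions in the codon that yield the target residue."""
    first = 3 * (protein_position - 1) + 1  # CDS position of the first base.
    suggestions = []
    for offset, ref_base in enumerate(reference_codon):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue
            codon = reference_codon[:offset] + alt_base + reference_codon[offset + 1:]
            if CODON_TABLE[codon] == target:
                suggestions.append(f"{first + offset}{ref_base}>{alt_base}")
    return suggestions


# Asp92Glu with an assumed reference codon GAC (Asp).
suggestions = back_translate("GAC", 92, "E")
print(suggestions)
```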

mutalyzer_name_checker error?

Hi,

I've installed mutalyzer 3.0.0a2.dev0 from source, but it cannot be run.

The errors are shown below.

  File "/bioinfo/software/miniconda3/bin/mutalyzer_name_checker", line 33, in <module>
    sys.exit(load_entry_point('mutalyzer==3.0.0a2.dev0', 'console_scripts', 'mutalyzer_name_checker')())
  File "/bioinfo/software/miniconda3/bin/mutalyzer_name_checker", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 167, in load
    module = import_module(match.group('module'))
  File "/bioinfo/software/miniconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/mutalyzer-3.0.0a2.dev0-py3.7.egg/mutalyzer/cli.py", line 4, in <module>
    from mutalyzer.name_checker import name_check
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/mutalyzer-3.0.0a2.dev0-py3.7.egg/mutalyzer/name_checker.py", line 1, in <module>
    from .description import Description
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/mutalyzer-3.0.0a2.dev0-py3.7.egg/mutalyzer/description.py", line 10, in <module>
    from mutalyzer_mutator import mutate
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/mutalyzer_mutator-0.2.0-py3.7.egg/mutalyzer_mutator/__init__.py", line 17, in <module>
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/mutalyzer_mutator-0.2.0-py3.7.egg/mutalyzer_mutator/__init__.py", line 7, in _get_metadata
  File "/bioinfo/software/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 482, in get_distribution
    raise TypeError("Expected string, Requirement, or Distribution", dist)
TypeError: ('Expected string, Requirement, or Distribution', None)

Any tips to fix this error?

Thanks,
Junfeng

Missing default return values.

The following pattern is found a number of times (e.g., 1, 2, 3, 4) in this project.

if something:
    return a
elif something_else:
    return b

This however leads to an inconsistency in return type when neither something nor something_else is true. A default return value is preferred here.

Also see the recommendation "Either all return statements in a function should return an expression, or none of them should." (pep8).
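
A sketch of the pattern with an explicit default, using the placeholder names from the snippet above (the function name and the default parameter are hypothetical):

```python
def choose(something, something_else, a, b, default=None):
    """Return a or b depending on the conditions, else an explicit default."""
    if something:
        return a
    if something_else:
        return b
    # Explicit fall-through value, so every path returns an expression.
    return default


result = choose(False, False, "a", "b", default="neither")
print(result)
```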

Incorrect example.

The example on the Name Checker page results in an error. It would be better to only show working examples.

Negative strand shift

For variants on the negative strand the 3' rule is not applied.

Example:

NG_008835.1(NM_001168390.2):c.*3186del should be normalized to NG_008835.1(NM_001168390.2):c.*3188del. The genomic description should be NG_008835.1:g.320804del

NC_000001.11(NM_032833.5):c.65_66insGGCTTCCGGTTCTGGCC is wrongly normalized to NC_000001.11(NM_032833.5):c.66_82dup. On the transcript reference it seems fine: NM_032833.5:c.65_66insGGCTTCCGGTTCTGGCC is normalized to NM_032833.5:c.49_65dup.
NC_000009.11:g.21974758_21974759insC should be normalized to NC_000009.11(NM_000077.5):c.68dup and not to NC_000009.11(NM_000077.5):c.69dup. Next, when NC_000009.11(NM_000077.5):c.69dup is used as input it is wrongly normalized to NC_000009.11(NM_000077.4):c.70dup. It seems like there is a shifting problem.
NG_012337.1(NM_012459.2):c.5_6dup is wrongly normalized to NG_012337.1(NM_012459.2):c.7_8dup.
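
To illustrate what strand-aware shifting means, the sketch below shifts a plain deletion on a genomic sequence. It assumes 0-based half-open intervals and handles deletions only. It is not the algorithm Mutalyzer uses, and the example sequence is made up. For a transcript on the negative strand, the 3' direction corresponds to decreasing genomic coordinates:

```python
def shift_deletion(sequence, start, end, strand="+"):
    """Shift the deleted interval [start, end) as far as possible towards the
    3' end of the given strand without changing the resulting sequence."""
    if strand == "+":
        # 3' on the positive strand: increasing genomic coordinates.
        while end < len(sequence) and sequence[start] == sequence[end]:
            start, end = start + 1, end + 1
    else:
        # 3' on the negative strand: decreasing genomic coordinates.
        while start > 0 and sequence[start - 1] == sequence[end - 1]:
            start, end = start - 1, end - 1
    return start, end


sequence = "GATTTC"
forward = shift_deletion(sequence, 2, 3, "+")
reverse = shift_deletion(sequence, 4, 5, "-")
print(forward, reverse)
```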

Back translation.

The result of the back translation of NM_003002.4:p.(Asp92Tyr) is NM_003002.4:c.(274G>T), but should this not be NM_003002.4:r.(274g>u)?
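
If the r. notation is preferred, converting a simple DNA substitution is mechanical, as in the sketch below. It covers substitutions only and the function name is hypothetical:

```python
import re

_SUBSTITUTION = re.compile(r"^(\d+)([ACGT])>([ACGT])$")
_TO_RNA = str.maketrans("ACGT", "acgu")


def substitution_to_rna(variant):
    """Convert a DNA substitution such as "274G>T" to RNA notation."""
    match = _SUBSTITUTION.match(variant)
    if match is None:
        raise ValueError(f"not a simple substitution: {variant}")
    position, reference, alternative = match.groups()
    return f"{position}{reference.translate(_TO_RNA)}>{alternative.translate(_TO_RNA)}"


rna = substitution_to_rna("274G>T")
print(rna)
```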