ga4gh / vrs-python

GA4GH Variation Representation Python Implementation

Home Page: https://github.com/ga4gh/vrs

License: Apache License 2.0

Makefile 1.60% Python 79.41% Jupyter Notebook 18.83% Shell 0.16%

genomics bioinformatics ga4gh

vrs-python's Introduction

vrs-python

VRS-Python provides Python language support and a reference implementation for the GA4GH Variation Representation Specification (VRS).

Features

Pydantic implementation of GKS core models and VRS models
Algorithm for generating consistent, globally unique identifiers for variation without a central authority
Algorithm for performing fully justified allele normalization
Translating from and to other variant formats
Annotate VCFs with VRS
Convert GA4GH objects between inlined and referenced forms

Known Issues

You are encouraged to browse issues. All known issues are listed there. Please report any issues you find.

Installing VRS-Python Locally

Prerequisites

Python >= 3.9
- Note: Python 3.10 is required for developers contributing to VRS-Python
libpq
postgresql

MacOS

You can use Homebrew to install the prerequisites. See the Homebrew documentation for how to install. Make sure Homebrew is up-to-date by running brew update.

brew install libpq
brew install python3
brew install postgresql@14

Ubuntu

sudo apt install gcc libpq-dev python3-dev

Installation Steps

1. Install VRS-Python with pip

VRS-Python is available on PyPI.

pip install 'ga4gh.vrs[extras]'

The [extras] argument tells pip to install packages to fulfill the dependencies of the ga4gh.vrs.extras package.

2. Install External Data Sources

The ga4gh.vrs.extras modules are not part of the VR spec per se. They are bundled with ga4gh.vrs for development and installation convenience. These modules depend directly and indirectly on external data sources of sequences, transcripts, and genome-transcript alignments.

First, you must install a local SeqRepo:

pip install seqrepo
export SEQREPO_VERSION=2021-01-29  # or newer if available -- check `seqrepo list-remote-instances`
sudo mkdir -p /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i $SEQREPO_VERSION
seqrepo update-latest

If you encounter a permission error similar to the one below:

PermissionError: [Error 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'

Try moving data manually with sudo:

sudo mv /usr/local/share/seqrepo/$SEQREPO_VERSION.* /usr/local/share/seqrepo/$SEQREPO_VERSION

To make installation easy, we recommend using Docker to install the other Biocommons tools - SeqRepo REST and UTA. If you would like to use local instances of UTA, see UTA directly. We do provide some additional setup help here.

Next, run the following commands:

docker volume create --name=uta_vol
docker volume create --name=seqrepo_vol
docker-compose up

This should start three containers:

seqrepo: downloads seqrepo into a docker volume and exits
seqrepo-rest-service: a REST service on seqrepo (localhost:5000)
uta: a database of transcripts and alignments (localhost:5432)

Check that the containers are running:

$ docker ps
CONTAINER ID        IMAGE                                    //  NAMES
86e872ab0c69        biocommons/seqrepo-rest-service:latest   //  vrs-python_seqrepo-rest-service_1
a40576b8cf1f        biocommons/uta:uta_20210129b              //  vrs-python_uta_1

Depending on your network and host, the first run is likely to take 5-15 minutes in order to download and install data. Subsequent startups should be nearly instantaneous.

You can test UTA and seqrepo installations like so:

$ psql -XAt postgres://anonymous@localhost/uta -c 'select count(*) from uta_20210129b.transcript'
314227

It doesn't work

Here are some things to try.

Bring up one service at a time. For example, if you haven't downloaded seqrepo yet, you might see this:

$ docker-compose up seqrepo-rest-service
Starting vrs-python_seqrepo-rest-service_1 ... done
Attaching to vrs-python_seqrepo-rest-service_1
seqrepo-rest-service_1  | 2022-07-26 15:59:59 seqrepo_rest_service.__main__[1] INFO Using seqrepo_dir='/usr/local/share/seqrepo/2021-01-29' from command line
⋮
seqrepo-rest-service_1  | OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2021-01-29
vrs-python_seqrepo-rest-service_1 exited with code 1

VRS-Python and VRS Version Correspondence

The ga4gh/vrs-python repo embeds the ga4gh/vrs repo as a git submodule for testing purposes. Each ga4gh.vrs package on PyPI embeds a particular version of VRS. The correspondences between the packages that are currently maintained may be summarized as:

vrs-python branch	vrs-python tag/version	vrs branch	vrs version
main (default branch)	2.x	2.x	2.x
1.x	0.8.x	1.x	1.x

⚠ Note: Only the 2.x branch is actively maintained. The 1.x branch will receive bug fixes only.

⚠ Developers: See the development section below for recommendations for using submodules gracefully (and without causing problems for others!).

Previous VRS-Python and VRS Version Correspondence

The correspondences between the packages that are no longer maintained may be summarized as:

vrs-python branch	vrs-python tag/version	vrs branch	vrs version
0.9	0.9.x	metaschema-update	N/A
0.7	0.7.x	1.2	1.2.x
0.6	0.6.x	1.1	1.1.x

Developers

This section is intended for developers who contribute to VRS-Python.

Installing for development

Fork the repo at https://github.com/ga4gh/vrs-python/.

git clone --recurse-submodules git@github.com:YOUR_GITHUB_ID/vrs-python.git
cd vrs-python
make devready
source venv/3.10/bin/activate

If you already cloned the repo, but forgot to include --recurse-submodules you can run:

git submodule update --init --recursive

Submodules

vrs-python embeds vrs as a submodule, only for testing purposes. When checking out vrs-python and switching branches, it is important to make sure that the submodule tracks vrs-python correctly. The recommended way to do this is git config --global submodule.recurse true. If you don't set submodule.recurse, developers and reviewers must be extremely careful to not accidentally upgrade or downgrade schemas with respect to vrs-python.

Alternatively, see misc/githooks/.

Testing

This package implements typical unit tests for ga4gh.core and ga4gh.vrs. This package also implements the compliance tests from vrs (vrs/validation) in the tests/validation/ directory.

To run tests:

make test

Running the Notebooks

The notebooks do not require you to set up SeqRepo or UTA as described in Install External Data Sources.

Running the Notebooks on Binder

Binder allows you to create custom computing environments that can be shared and used by many remote users.

You can access the notebooks on Binder here.

Running the Notebooks on the Terra platform

Terra is a cloud platform for biomedical research developed by the Broad Institute, Microsoft and Verily. The platform includes preconfigured environments that provide user-friendly access to various applications commonly used in bioinformatics, including Jupyter Notebooks.

We have created a public VRS-demo-notebooks workspace in Terra that contains the demo notebooks along with instructions for running them with minimal setup. To get started, see either the VRS-demo-notebooks workspace or the Terra.ipynb notebook in this repository.

Running the Notebooks with VS Code

VS Code is a code editor developed by Microsoft. It is lightweight, highly customizable, and supports a wide range of programming languages, with a robust extension system. You can download VS Code here.

Open VS Code.
Use Extensions view (Ctrl+Shift+X or ⌘+Shift+X) to install the Jupyter extension.
Navigate to your vrs-python project folder and open it in VS Code.
In a notebook, click Select Kernel at the top right. Select the option where the path is venv/3.10/bin/python3. See here for more information on managing Jupyter Kernels in VS Code.
After selecting the kernel you can now run the notebook.

Security Note (from the GA4GH Security Team)

A stand-alone security review has been performed on the specification itself. This implementation is offered as-is, and without any security guarantees. It will need an independent security review before it can be considered ready for use in security-critical applications. If you integrate this code into your application it is AT YOUR OWN RISK AND RESPONSIBILITY to arrange for a security audit.

vrs-python's Issues

Implement method to generate imputed sequence from variation

Generate the imputed sequence from variation. For example, imputing a Haplotype would apply all Alleles to the underlying sequence.

Only SequenceLocations with SimpleIntervals may be imputed. Users must convert before invoking.

Also, Alleles should be applied in descending position order so that indels don't require tracking net offsets.
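The descending-order idea can be sketched as follows. This is a minimal illustration, assuming each Allele is reduced to a hypothetical `(start, end, alt)` tuple in interbase coordinates; the function name and tuple layout are not part of vrs-python, and overlapping edits are not detected.

```python
def impute_sequence(reference, alleles):
    """Apply non-overlapping (start, end, alt) edits to a reference string.

    Coordinates are interbase (0-based, end-exclusive). The tuple layout is
    an assumption for illustration, not a vrs-python data structure.
    """
    seq = reference
    # Apply right-most edits first so that the coordinates of the
    # remaining (left-hand) edits are never shifted by an indel.
    for start, end, alt in sorted(alleles, key=lambda a: a[0], reverse=True):
        if not 0 <= start <= end <= len(reference):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        seq = seq[:start] + alt + seq[end:]
    return seq


# One length-changing substitution and one deletion on the same sequence.
haplotype = [(1, 2, "TT"), (4, 6, "")]
print(impute_sequence("ACGTACGT", haplotype))  # ATTGTGT
```

Because edits are applied from right to left, no running offset has to be tracked, which is the point made above.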

Extend normalize to support VOCA/SPDI

Implement NCBI's variant overprecision correction algorithm.

Probably worth considering generalizations to distinguish shuffling from extending. VOCA is essentially extending left and right.

Need to implement #16 first.

Data files not being packaged

It appears that the schema files from vr-spec (both the revised vr.* and the ga4gh.* files) are not getting packaged. Not sure yet why this isn't working.

Document use of vr-python in driver projects

VICC
ClinGen Allele Registry

docs: getting started

Thanks for a great project. Looking forward to applying this.

Using this issue as a place to collect notes for a simple getting started guide as I try this for the first time:

(python 2.7 already installed)
$ git clone git@github.com:ga4gh/vmc-python.git
$ cd vmc-python
$ git submodule init
$ git submodule update

$ make help
/bin/bash: sbin/makefile-extract-documentation: No such file or directory

$ make test
type -p python
/usr/local/bin/python
python setup.py pytest --addopts="--cov=vmc vmc tests"
running pytest
running egg_info
writing requirements to vmc.egg-info/requires.txt
writing vmc.egg-info/PKG-INFO
writing top-level names to vmc.egg-info/top_level.txt
writing dependency_links to vmc.egg-info/dependency_links.txt
writing manifest file 'vmc.egg-info/SOURCES.txt'
running build_ext
============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 2.7.13, pytest-3.4.2, py-1.5.2, pluggy-0.6.0
rootdir: /Users/walsbr/vmc-python, inifile: pytest.ini
plugins: flask-0.8.1, cov-2.5.1, catchlog-1.2.2
collected 10 items / 2 errors


---------- coverage: platform darwin, python 2.7.13-final-0 ----------
Name                 Stmts   Miss  Cover   Missing
--------------------------------------------------
vmc/__init__.py          4      0   100%
vmc/__main__.py          1      0   100%
vmc/_models.py           8      0   100%
vmc/conversions.py      31     22    29%   46-48, 52-85
vmc/digest.py           48      2    96%   34, 145
vmc/normalize.py        42     37    12%   32-40, 59-67, 92-119
vmc/seqrepo.py          18      1    94%   49
--------------------------------------------------
TOTAL                  152     62    59%

===================================================================================================== ERRORS =====================================================================================================
___________________________________________________________________________________ ERROR collecting tests/test_vmc_models.py ____________________________________________________________________________________
tests/test_vmc_models.py:75: in <module>
    vmc_version=0,
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:168: in __init__
    prop, self.__class__.__name__)), sys.exc_info()[2])
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:164: in __init__
    setattr(self, prop, props[prop])
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:183: in __setattr__
    prop.fset(self, val)
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:767: in setprop
    validator = info['type'](val)
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:331: in __init__
    self.validate()
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:366: in validate
    validator(paramval, self._value, info)
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/validators.py:112: in format
    "'{0}' is not formatted as a {1}".format(value, param)
E   ValidationError: '2018-03-15T16:58:58.174554' is not formatted as a date-time
E   while setting 'generated_at' in Meta
___________________________________________________________________________________ ERROR collecting tests/test_vmc_models.py ____________________________________________________________________________________
tests/test_vmc_models.py:75: in <module>
    vmc_version=0,
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:168: in __init__
    prop, self.__class__.__name__)), sys.exc_info()[2])
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:164: in __init__
    setattr(self, prop, props[prop])
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:183: in __setattr__
    prop.fset(self, val)
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:767: in setprop
    validator = info['type'](val)
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:331: in __init__
    self.validate()
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/classbuilder.py:366: in validate
    validator(paramval, self._value, info)
/usr/local/lib/python2.7/site-packages/python_jsonschema_objects/validators.py:112: in format
    "'{0}' is not formatted as a {1}".format(value, param)
E   ValidationError: '2018-03-15T16:58:58.371521' is not formatted as a date-time
E   while setting 'generated_at' in Meta
================================================================================================ warnings summary ================================================================================================
None
  pytest-catchlog plugin has been merged into the core, please remove it from your requirements.

The offending line seems to be here:

https://github.com/ga4gh/vmc-python/blob/master/tests/test_vmc_models.py#L74

pip freeze shows:

$ pip freeze | grep jsonschema
jsonschema==2.6.0
python-jsonschema-objects==0.2.4

SimpleInterval and SequenceState use in Translator

I'm in the process of upgrading from 0.6.4 to 0.7.2. I wanted to use SequenceInterval, as SimpleInterval is deprecated. I ran into an issue using the translator, however, as it doesn't yet support SequenceInterval (tests showing current parse behavior).

I'm happy to put in a PR to upgrade the translator to use the SequenceInterval when parsing HGVS to Allele, and supporting either the SimpleInterval or SequenceInterval when parsing from Allele to HGVS string.

Are there any other considerations here?

Update prefixes to match the spec and relocate to vr-spec

See https://vr-spec.readthedocs.io/en/1.0rc/impl-guide/computed_identifier.html#identify

Add accession mapping functionality

VMC required that sequence locations used sequence hashes.
VR will recommend it.

The reference implementation needs to provide functionality to map accessions to sequence hashes, and the converse. Something like this:

to_hashed_sequencelocation(sl): given accession-based SequenceLocation sl, return new hashed-based SequenceLocation. Update id property if present.
as_accessioned_locations(sl, namespace=None): given hash-based SequenceLocation sl, return list of equivalent accession. Limit to accessions in namespace if provided.

Based on two functions:

ac_to_sequence_id(ac): returns hashed sequence id for accession ac
sequence_id_to_acs(id): returns list of accessions for given hashed sequence id
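A rough sketch of the two base functions and one function built on them, assuming an in-memory alias table in place of SeqRepo. The digests are made-up placeholders, and the dict-based location with a `sequence_id` key is an assumption for illustration.

```python
# Hypothetical alias table; a real implementation would query SeqRepo.
# The digests below are placeholders, not real sequence digests.
_ALIASES = {
    "ga4gh:SQ.exampleDigestA": ["refseq:NC_000001.11", "GRCh38:1"],
    "ga4gh:SQ.exampleDigestB": ["refseq:NC_000002.12", "GRCh38:2"],
}


def ac_to_sequence_id(ac):
    """Return the hashed sequence id for accession ac."""
    for seq_id, aliases in _ALIASES.items():
        if ac in aliases:
            return seq_id
    raise KeyError(ac)


def sequence_id_to_acs(seq_id, namespace=None):
    """Return accessions for a hashed sequence id, optionally limited to one namespace."""
    acs = _ALIASES[seq_id]
    if namespace is None:
        return list(acs)
    return [ac for ac in acs if ac.split(":", 1)[0] == namespace]


def as_accessioned_locations(sl, namespace=None):
    """Return one copy of location dict sl per equivalent accession."""
    return [
        {**sl, "sequence_id": ac}
        for ac in sequence_id_to_acs(sl["sequence_id"], namespace)
    ]


loc = {"type": "SequenceLocation", "sequence_id": "ga4gh:SQ.exampleDigestA"}
print(as_accessioned_locations(loc, namespace="refseq"))
```

Recomputing the location's `id` after the swap is left out here because it depends on the identifier algorithm.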

Reimplement serialization method

ga4gh/vrs#31 (comment)

VCF Support

Conversation during our late October VR call confirmed that we are more interested in extracting VRS Alleles from VCF for use in search than we are recreating VCFs from VRS objects + annotations.

Our VCF parsing tooling therefore should focus on:

collecting and normalizing Alleles from each record
reporting VRS Alleles observed in each sample across entire VCF
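As a starting point, collecting alleles from VCF records might look like the sketch below. It uses only the standard library, emits hypothetical `(chrom, start, end, alt)` tuples in interbase coordinates rather than VRS Allele objects, and does not normalize or handle per-sample genotypes.

```python
def alleles_from_vcf_line(line):
    """Yield (chrom, start, end, alt) for each sequence ALT on one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts = fields[:5]
    start = int(pos) - 1      # VCF POS is 1-based; interbase start is POS - 1
    end = start + len(ref)    # interbase end is exclusive
    for alt in alts.split(","):
        # Skip missing, spanning-deletion, symbolic, and breakend alleles.
        if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
            continue
        yield (chrom, start, end, alt)


def collect_alleles(lines):
    """Return the sorted set of distinct alleles seen across all data lines."""
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        seen.update(alleles_from_vcf_line(line))
    return sorted(seen)


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG,T\t.\t.\t.",
    "1\t200\trs1\tAT\tA,<DEL>\t.\t.\t.",
]
for allele in collect_alleles(example):
    print(allele)
```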

Migrate lookup to use seqrepo `sha512t24` namespace

When biocommons/biocommons.seqrepo#82 is available, use it instead of VMC, which will be deprecated.

update vrs-python changelogs

Translate deprecated to non-deprecated classes

We should enable vrs-python to automatically translate deprecated classes to preferred ones:

SimpleInterval -> SequenceInterval
SequenceState -> LiteralSequenceExpression
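One possible shape for such a translation, working on plain dicts rather than vrs-python model classes. The field layouts (integer `start`/`end` on SimpleInterval, `Number`-wrapped values on SequenceInterval) are assumptions based on the VRS 1.2 schema, and `upgrade_deprecated` is a hypothetical name.

```python
def upgrade_deprecated(obj):
    """Recursively replace deprecated structures in a dict/list tree."""
    if isinstance(obj, list):
        return [upgrade_deprecated(item) for item in obj]
    if not isinstance(obj, dict):
        return obj
    # Upgrade children first, then the object itself.
    upgraded = {key: upgrade_deprecated(value) for key, value in obj.items()}
    if upgraded.get("type") == "SimpleInterval":
        return {
            "type": "SequenceInterval",
            "start": {"type": "Number", "value": upgraded["start"]},
            "end": {"type": "Number", "value": upgraded["end"]},
        }
    if upgraded.get("type") == "SequenceState":
        return {
            "type": "LiteralSequenceExpression",
            "sequence": upgraded["sequence"],
        }
    return upgraded


allele = {
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
        "interval": {"type": "SimpleInterval", "start": 10, "end": 11},
    },
    "state": {"type": "SequenceState", "sequence": "T"},
}
print(upgrade_deprecated(allele))
```

Any computed identifiers on the enclosing objects would need to be regenerated afterwards, since they depend on the content that changed.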

Start VR REST service for variation translation

Implement API for translation of various formats into VR. No persistent storage or registration.

Initially, implement just POST /allele/. POST body as with AnyVar.

SeqRepoDataProxy.get_sequence returns different data when qualified or not

SeqRepoDataProxy.get_sequence only works when namespace prefix is included. SeqRepoRESTDataProxy.get_sequence works when namespace prefix is included or excluded. See below for an example of this observation. Both dataproxies should work for both inputs.

OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/latest when using seqrepo

I followed the instructions to install and configure vrs-python.

$ docker ps
CONTAINER ID   IMAGE                                    COMMAND                  CREATED          STATUS          PORTS     NAMES
ba533a28fd60   biocommons/uta:uta_20180821              "docker-entrypoint.s…"   18 minutes ago   Up 18 minutes             stack_uta_1
888be6a91e11   biocommons/seqrepo-rest-service:latest   "seqrepo-rest-service"   18 minutes ago   Up 18 minutes             stack_seqrepo-rest-service_1

However, when I try to execute jupyter examples I get the following error:

HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:5000/seqrepo/1/sequence/NC_000019.10?start=44908821&end=44908822

I saw that there are two examples to test the containers in the docker-compose.yml file, while this works as expected:

$ psql -XAt postgres://anonymous@localhost/uta -c 'select count(*) from transcript'
249909

This returns the same error described above:

$ curl -f http://0.0.0.0:5000/seqrepo/1/sequence/NP_001274413.1
curl: (22) The requested URL returned error: 500 INTERNAL SERVER ERROR

I can access the http://localhost:5000/seqrepo/1/ui/ URL, but even when I try /ping I get the 500 error. Any ideas on how I could solve this?

Automate build process

Since changing to GitHub Actions for CI, we have not implemented an automated, tag-based build process.

This issue is for tracking requirements and discussion on that topic.

normalize.py - no right trimming?

I think that you need right trimming also at the beginning of normalization.

I would add ((3, 6), "CG") to tests (it should result in ((4,5),""))

https://github.com/ga4gh/vmc-python/blob/f519f61cf2f0f010f9beacff23e931a0d2268417/vmc/normalize.py#L117
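To make the proposed test case concrete, here is a small sketch of trimming both flanks before any shuffling. The reference sequence `AAACTGTT` is invented so that positions 3 to 6 read `CTG`; the `trim` function is illustrative and is not the code in normalize.py.

```python
def trim(sequence, start, end, alt):
    """Trim common suffix, then common prefix, between ref and alt.

    Coordinates are interbase. Returns ((start, end), alt) after trimming.
    """
    ref = sequence[start:end]

    # Right trim: count matching characters from the end.
    m = 0
    while m < len(ref) and m < len(alt) and ref[-1 - m] == alt[-1 - m]:
        m += 1
    ref, alt, end = ref[: len(ref) - m], alt[: len(alt) - m], end - m

    # Left trim: count matching characters from the start.
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    alt, start = alt[n:], start + n

    return (start, end), alt


# The case suggested above: "CTG" replaced by "CG" is a deletion of "T".
print(trim("AAACTGTT", 3, 6, "CG"))  # ((4, 5), '')
```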

use normalization in translator

The Translator class doesn't currently invoke normalization, but it should. Make it so.

File Not Found Error

I have installed ga4gh.vrs[extras] 0.6.3 release.

When I run from ga4gh.vrs import models I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/ga4gh/vrs/__init__.py", line 28, in <module>
    from ._internal.enderef import vrs_deref, vrs_enref
  File "/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/ga4gh/vrs/_internal/enderef.py", line 3, in <module>
    from .models import class_refatt_map
  File "/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/ga4gh/vrs/_internal/models.py", line 61, in <module>
    _load_vrs_models()
  File "/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/ga4gh/vrs/_internal/models.py", line 56, in _load_vrs_models
    models = build_models(schema_path, standardize_names=False)
  File "/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/ga4gh/core/_internal/jsonschema.py", line 27, in build_models
    builder = pjs.ObjectBuilder(path)
  File "/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/python_jsonschema_objects/__init__.py", line 33, in __init__
    with codecs.open(uri, "r", "utf-8") as fin:
  File "/usr/local/Cellar/python@3.9/3.9.1_8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/codecs.py", line 905, in open
    file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/kxk102/.local/share/virtualenvs/variant-normalization-Jh4I8RMk/lib/python3.9/site-packages/ga4gh/vrs/_internal/data/schema/vr.json'

I have looked in the package directory and the ga4gh/vrs/_internal/data directory is not being created.

@reece , any thoughts on how to fix this?

Thanks in advance!

Write vr-python docs

See also ga4gh/vrs#19 for schema docs + changelogs

Write Quick Start Guide

@melissacline recommended a quick start guide for using VRS and vr-python. For now, I'll park that request here in vr-python, but it might be that the notebooks or documentation should migrate to vr-spec. We can decide that later.

Test doctests and update them

Many functions/methods have docstrings with older tests. They're not tested currently because doing so requires injecting fixtures.

General plan:

add src to testpaths in setup.cfg
add doctest setup in conftest.py
add/link conftest.py into src/
exclude conftest.py from packaging
fix broken tests

Tests to implement

This issue collects tests that I think we should implement (and that I might forget about).

null values must not be included in serialization
arrays must be sorted during serialization
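The two properties could be checked against something like the sketch below. It is a stand-in serializer built on the standard `json` module, not the VRS serialization algorithm; in particular it sorts every array, whereas the specification may only require this for unordered ones.

```python
import json


def _prune(value):
    """Drop None-valued keys from dicts and sort arrays, recursively."""
    if isinstance(value, dict):
        return {k: _prune(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        items = [_prune(v) for v in value]
        # Sort by each item's canonical JSON text so mixed content is orderable.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def serialize(obj):
    """Return compact, key-sorted UTF-8 JSON bytes for obj."""
    text = json.dumps(_prune(obj), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def test_null_values_are_omitted():
    assert b"null" not in serialize({"a": 1, "b": None})


def test_arrays_are_sorted():
    assert serialize({"members": ["z", "y"]}) == serialize({"members": ["y", "z"]})


test_null_values_are_omitted()
test_arrays_are_sorted()
print(serialize({"b": None, "a": ["z", "y"], "c": 1}))
```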

implement VR webservice

Demonstrate use of VR as a web service to translate variation into VR structures.

For now, this is solely for translating variation. Registration (storage) of variation is not intended at this time. Therefore, only PUT operations are provided.

Endpoints:
/v1/

variation/ -- parse as text and allele, returning ids for both
allele/ -- parse as allele
text/ -- parse as text variation

Implement data proxy class

Depends on biocommons/seqrepo-rest-service#2.

VR implementations need access to at least three kinds of data:

sequences: Given an identifier and optional start,end range, return the sequence. Use: variation normalization.
sequence identifiers: Given an identifier, return all identifiers for the same sequence. Use: enable determination of variation that is equivalent under sequence identity
sequence metadata: Given an identifier, return (at least) the sequence length. Use: validation of location on sequences.
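The three kinds of access listed above suggest an interface along these lines. The class and method names are hypothetical, chosen for this sketch, and the in-memory implementation exists only to make the example runnable without SeqRepo.

```python
from abc import ABC, abstractmethod


class DataProxy(ABC):
    """Hypothetical interface covering the three data needs listed above."""

    @abstractmethod
    def get_sequence(self, identifier, start=None, end=None):
        """Return the sequence, or the [start, end) slice of it."""

    @abstractmethod
    def get_identifiers(self, identifier):
        """Return all identifiers that name the same sequence."""

    @abstractmethod
    def get_metadata(self, identifier):
        """Return a dict with at least the sequence length."""


class InMemoryDataProxy(DataProxy):
    """Toy implementation backed by a list of {'aliases', 'sequence'} dicts."""

    def __init__(self, records):
        self._by_alias = {
            alias: rec for rec in records for alias in rec["aliases"]
        }

    def get_sequence(self, identifier, start=None, end=None):
        return self._by_alias[identifier]["sequence"][start:end]

    def get_identifiers(self, identifier):
        return list(self._by_alias[identifier]["aliases"])

    def get_metadata(self, identifier):
        rec = self._by_alias[identifier]
        return {"length": len(rec["sequence"]), "aliases": list(rec["aliases"])}


# Placeholder identifiers and sequence, for illustration only.
proxy = InMemoryDataProxy(
    [{"aliases": ["refseq:NC_TEST.1", "ga4gh:SQ.exampleDigest"], "sequence": "ACGTACGT"}]
)
print(proxy.get_sequence("refseq:NC_TEST.1", 2, 5))
print(proxy.get_metadata("ga4gh:SQ.exampleDigest")["length"])
```

Keeping the interface this narrow would let a SeqRepo-backed class and a REST-backed class be swapped without touching normalization or validation code.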

See also #21 for second phase data proxy

seqrepo docker container does not create volumes

Under https://github.com/ga4gh/vr-python#installing-dependencies-for-ga4ghvrextras, there is a docker-compose command to fire up the seqrepo rest API. When you run this command on a fresh installation, it fails twice, instructing you to do "docker volume create" commands, for seqrepo-rest-service and uta respectively. This isn't a major setback, since the error message gives you exactly the command to execute, and if you enter that command verbatim then everything works fine. Even though seqrepo isn't part of VR per se, these commands might as well be added to the docker container or the installation instructions.

Digest objects using only the digest (not the namespaces and type prefix)

Based on gks leads meeting after VR, we agreed to digest nested objects using only their digests (without the ga4gh and type prefixes) in order to insulate computed digests from possible changes in the id structure.
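A sketch of what this means in practice, assuming identifiers of the form `ga4gh:<type prefix>.<digest>` and a truncated SHA-512 digest (first 24 bytes, base64url-encoded). The helper names, the `ref_keys` parameter, and the JSON layout are illustrative only.

```python
import base64
import hashlib
import json


def sha512t24u(blob):
    """Base64url-encode the first 24 bytes of the SHA-512 digest of blob."""
    digest = hashlib.sha512(blob).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def digest_only(identifier):
    """Reduce 'ga4gh:VSL.abc' to 'abc'; leave other strings untouched."""
    if identifier.startswith("ga4gh:") and "." in identifier:
        return identifier.split(".", 1)[1]
    return identifier


def digest_parent(obj, ref_keys=("location",)):
    """Digest obj with nested references reduced to their bare digests."""
    flat = {
        key: digest_only(value) if key in ref_keys and isinstance(value, str) else value
        for key, value in obj.items()
    }
    blob = json.dumps(flat, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


# The placeholder digest "abc123" stands in for a real location digest.
with_prefix = {"type": "Allele", "location": "ga4gh:VSL.abc123"}
bare = {"type": "Allele", "location": "abc123"}
print(digest_parent(with_prefix) == digest_parent(bare))  # True
```

Because the namespace and type prefix never enter the hashed bytes, renaming a prefix later would not change the parent's digest.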

Collect representative variation for Beacon

The goal for this issue is to document variation representations that are required to support the Beacon project.

In addition, the Beacon use case requires that we consider whether and how to represent variation in query params. For example, the query 13 : 32936732 G > C is translated into https://beacon-network.org/#/search?pos=32936732&chrom=13&allele=C&ref=G&rs=GRCh37.

vr-python currently has functionality to translate expressions like 13 : 32936732 G > C to Allele structures. However, that doesn't help with query params. Furthermore, the Allele structure is too complex to reasonably go in query params.

A few options:

Do nothing.
Provide a translation service (see #8).
Recommend that Beacon support POSTing Allele structures (in addition to GET w/params).
Write a spec for translating a subset of VR to query params.

@mbaudis: What would you find most helpful from VR?

When serializing objects, fully identify before converting to dictionary.

Currently, dictify and identify may be circularly dependent when enref=True and an object doesn't already have an identifier. This is unnecessary.

Instead, when enref is requested, we should identify all nested objects, then dictify. This breaks the circularity and simplifies the code. Also, this means that the modules can be deaggregated as they were originally.

The end result and API will be unchanged.

Implement data proxy extensions

See #20

Needed somewhere (perhaps in proxy and probably later):

maploc coordinates: Given an assembly, chr, and band, return sequence coordinates. Use: Convert maplocs to sequence locations.
gene coordinates: Given a gene name and genomic sequence identifier, return coordinates of maximal transcript bounds. Use: convert gene symbols to genomic locations.
exon structures: Given a transcript accession, return exon structure/lengths. Use: limit normalization to exon boundaries so that variation isn't accidentally normalized across junctions.

Update to match schema call

The VR-schema utility test was updated to the new "vr" convention instead of "vmc". We need to adjust this to match.

https://github.com/ga4gh/vr-python/blob/aa4c4cea83f8b08229110815e9b8cf41eb812687/tests/validation/test_utils_tests.py#L12

Update vr-python to use new validation tests

OS X failures

This is a meta-thread as I work through trying to get vr-python to work in my local OS X dev environment.

My first issue was installing setup_extras with pip, though I'm convinced that is a conda issue and out-of-scope here. My workaround was to just manually construct the pip install command from listed packages in setup.cfg.

Rewrite normalizer to support sequence fetching callback

The current normalizer expects the full-length sequence. It should instead take a callback function to fetch sequence over a region. (see bioutils seqfetcher)

Implement serialization

See ga4gh/vrs#31 for details.

Makefile xargs -r argument not supported on macosx

There are three xargs commands in the "clean" steps at the end of the Makefile that don't work on macosx. The xargs command in BSD does not support the -r (--no-run-if-empty) option.

...
#= CLEANUP
...
clean:
	find . \( -name \*~ -o -name \*.bak \) -print0 | xargs -0r rm
...
cleaner: clean
	rm ...
	find . \( -name \*.pyc -o -name \*.orig -o -name \*.rej \) -print0 | xargs -0r rm
	find . -name __pycache__ -print0 | xargs -0r rm -fr

If the -r option is dropped it works fine since the BSD version does not run the command when empty.

I'm not sure of the best workaround, but macosx users are likely to hit this causing mild frustration and delay in getting started for some.

Refactors in ga4gh.vr after release 0.2.0 breaks some symbol imports

The problem is the dependency ga4gh.vr[extras]>=0.2.0. There has been a significant amount of refactoring after version 0.2.0.

Attempted resolution by changing to ga4gh.vr[extras]>=0.2.0. However, this resulted in another breaking change when attempting to require bioutils>=1.0.0a4 from the ga4gh.vr dependencies in ga4gh.vr==0.2.0. It seems that bioutils version was either a typo, or the bioutils releases did not strictly increase or had a break in the version sequence, as the latest bioutils release is 0.5.2.post3.

Will attempt another fix by refactoring references to moved or removed symbols.

implement vr validator/linter

The VR Spec imposes requirements on data beyond the data structure. There is currently no library to validate those constraints.

This issue should implement a VR validator that checks the following (list in progress):

extract rules from implementation guidance sections
0 <= start <= end <= len
ChromosomeLocation start, end order
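The numeric rule in the list can be expressed directly. This is a minimal sketch with a hypothetical function name; it covers only the interbase bounds check and not the cytoband ordering rule.

```python
def validate_interval(start, end, sequence_length):
    """Return a list of violations of 0 <= start <= end <= len."""
    problems = []
    if start < 0:
        problems.append("start must be >= 0")
    if start > end:
        problems.append("start must be <= end")
    if end > sequence_length:
        problems.append("end must be <= sequence length")
    return problems


print(validate_interval(5, 3, 10))   # one violation: start > end
print(validate_interval(0, 10, 10))  # no violations
```

Returning a list rather than raising on the first failure suits a linter, which would usually report every problem in one pass.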

Add support for translating inversion variants from HGVS

The VR translator does not currently support translation of inversion variants from HGVS to VR, and throws exceptions upon encountering any. Sample variants include:
NC_000013.11:g.32319069_32319070inv
NC_000013.11:g.32338162_32338163inv
NC_000013.11:g.32355094_32355095inv
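One way to think about support for these variants is to express the inversion as a replacement of the affected region by its reverse complement. That representation is an assumption made for this sketch, not a stated policy of the specification, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def hgvs_range_to_interbase(first, last):
    """Convert a 1-based inclusive HGVS range to interbase (start, end)."""
    return first - 1, last


def inversion_to_replacement(reference, start, end):
    """Return ((start, end), alt), where alt is the inverted region."""
    return (start, end), reverse_complement(reference[start:end])


# g.32319069_32319070inv covers two bases.
print(hgvs_range_to_interbase(32319069, 32319070))  # (32319068, 32319070)

# Toy reference sequence, invented for the example.
print(inversion_to_replacement("TTACGG", 2, 4))  # ((2, 4), 'GT')
```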

Implement expand & contract functions

VR instances have two forms, inlined and referential.
Inlined instances are self-contained, except for a reference to an external sequence.
Referential instances reference objects by id, which requires storage that is external to the instance.

The two forms need to be convertible in the reference implementation.

to_inlined(s, ro, keep_reference_ids=False, keep_self_ids=False) builds an inlined object using the given referential object ro and instances in storage s. keep_reference_ids controls whether the reference ids are preserved. keep_self_ids controls whether instance id attributes are removed from inlined representations.
to_referential(s, io) adds inlined object io to s and returns a referential object.
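A simplified sketch of the pair, assuming objects are plain dicts, nested identifiable objects carry an `id` key, and storage is a dict keyed by id. The `keep_self_ids` option is omitted, and the example ids are placeholders.

```python
def to_referential(storage, inlined):
    """Store identified objects in storage; return a copy that refers to them by id."""
    ref = {}
    for key, value in inlined.items():
        if isinstance(value, dict) and "id" in value:
            # Replace the nested object with its id after storing it.
            ref[key] = to_referential(storage, value)["id"]
        else:
            ref[key] = value
    if "id" in ref:
        storage[ref["id"]] = dict(ref)
    return ref


def to_inlined(storage, referential, keep_reference_ids=False):
    """Rebuild a self-contained object by expanding ids found in storage."""
    result = {}
    for key, value in referential.items():
        if key != "id" and isinstance(value, str) and value in storage:
            nested = to_inlined(storage, storage[value], keep_reference_ids)
            if not keep_reference_ids:
                nested.pop("id", None)
            result[key] = nested
        else:
            result[key] = value
    return result


allele = {
    "id": "example:allele1",
    "type": "Allele",
    "location": {
        "id": "example:loc1",
        "type": "SequenceLocation",
        "sequence_id": "refseq:NC_000013.11",
    },
}
store = {}
referential = to_referential(store, allele)
print(referential)
print(to_inlined(store, referential, keep_reference_ids=True) == allele)  # True
```

Treating any string that happens to be a key in storage as a reference is a shortcut; a real implementation would use the schema to know which attributes may hold references.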

Demonstrate parsing of entire ClinVar data set via REST API

After implementing VR web service (#8), demonstrate the use of the webservice to parse ClinVar HGVS expressions.

Implement VR export formats w/reverse translation of sequence ids

The translator currently imports from beacon, hgvs, spdi, and vcf.
Now, translate to those formats as well.

Support translating to other variation formats

The translator currently imports from beacon, hgvs, spdi, and vcf.
Now, translate to those formats as well.

File Not Found Error with 0.7.1

Hey - I am trying to install the 0.7.1 package (pip install ga4gh.vrs), but I am not able to use the library. It appears that the schema is not being included.

$ pip install ga4gh.vrs
$ python
>>> import ga4gh.core
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/ga4gh/core/__init__.py", line 9, in <module>
    from ._internal.enderef import ga4gh_enref, ga4gh_deref
  File "/usr/local/lib/python3.7/site-packages/ga4gh/core/_internal/enderef.py", line 14, in <module>
    from .identifiers import ga4gh_identify, is_ga4gh_identifier
  File "/usr/local/lib/python3.7/site-packages/ga4gh/core/_internal/identifiers.py", line 39, in <module>
    cfg = yaml.safe_load(open(schema_dir + "/ga4gh.yaml"))
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/site-packages/ga4gh/core/_internal/data/schema/ga4gh.yaml'

Would be nice to have a jupyter notebook to provide CNV examples

To explain better how to represent CNVs, can we get a jupyter notebook?

Update notebooks to use VRS 1.2.0 models

Will need to re-run notebooks once #61 is resolved.

Standardize class/model/schema/instance terminology

The definitions of class, model, schema, instance, object, value object are related and overlap. The code is slightly confusing because it uses all of them in various places. Straighten this out to make it easier to understand.

Definitions:

schema: a conceptual representation of data and data relationships
class: In code, a template or prototype of data and, optionally, methods. In a schema, a group of related data.
model: a class whose primary purpose is to represent data
object: any value in a language
instance: an object that is an exemplar of a class (or model)

In vr-python, let's use these terms:

"schema" as above
"class" for any schema class or python jsonschema class (that it calls models)
"instance" as above

So, in other words, drop the "model" lingo.

Write methods to convert between inlined and referenced objects

inlined vro ⇔ referenced vro (+ object store)

See ga4gh/vrs#173 for related spec change

Add support for translating dup variants from HGVS

The VR translator currently doesn't support translation of dup variants from hgvs, and throws exceptions upon encountering one. Sample variants include:
NC_000013.11:g.32316467dup
NC_000013.11:g.32319315dup
NC_000013.11:g.32331093_32331094dup