ga4gh / ga4gh-server Goto Github PK

View Code? Open in Web Editor NEW

96.0 52.0 93.0 10.38 MB

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24

Home Page: http://ga4gh.org

License: Apache License 2.0

Python 99.29% HTML 0.68% Shell 0.03%

ga4gh genomics server global alliance health variants rna genome-annotation reference-implementation

ga4gh-server's Introduction

REPOSITORY RETIREMENT NOTICE

The Genomics API was intended to act as a suite of integrated APIs each targeting a different aspect of exchanging genomic information between data providers and consumers. The Genomics API, together with the Reference Server and Compatibility test suite, was retired on January 24, 2018 and several of the sub-APIs are now being pursued under the auspices of new GA4GH Work Streams. We would like to thank those of you who have worked on this and look forward to ongoing contributions in the GA4GH Work Streams. You may still fork this repository if you wish to pursue developments. You may read the meeting minutes of the GA4GH Engineering Committee to learn more about the decision to retire the API. For additional questions or to get involved with ongoing technical work at GA4GH, please see the website at Global Alliance for Genomics and Health

ga4gh-server's People

Contributors

Stargazers

Watchers

Forkers

jeromekelleher fnothaft hershman sidrahussain isb-cgc pashields kerrydc melaniedc diekhans dcolligan congeorges tymiles003 saupchurch cmarkello macieksmuga bioinformed shajoezhu pgrosu rkgi10 emi80 umms-biocore afirth jbu javild andrew-lshift jeltje adamnovak eelsther hjellinek danvk ohsu-comp-bio pcingola jmarshall bwalsh sguthrie alexanderdilthey adamstruck marghoob abeconnelly glennhickey jsh2134 tixii jercoh andrewyatz mp15 lijiayong ktym bartgrantham pkrusche knoxcarey nouyang-curoverse srblum ianrogers-lshift mcupak david4096 andrewjesaitis cloudxtreme snehitp kozbo sgosline bosz ljishen bjea achave11-ucsc peterolph hubertp holtlab arc39 alexanderdoria rudyzhou2 nishill philip-wu mavas dkucsc ga4ghpoc mwvaughn ejacox ellmen junaruga ksens pamelarussell emegodoyr digideskio gscuser scchess violethaze74 roshanrajapakse

ga4gh-server's Issues

dist appears to be broken

After flaskification and subsequent PRs, dist (from "python setup.py sdist") doesn't include server, and following the README instructions no longer gives us a running server.

A beacon client would be useful

Adding scriptable support for the beacon API would be very useful. It would be good to have a beacon function in the ga4gh.client.HTTPClient class, following the existing pattern. It would then be simple to add a CLI (either as part of the existing client application or as its own standalone program, if it was sufficiently useful) using this method. This would also require #25 to be solved, as Beacons will most often be at prefixed endpoints.

Client side conversion tools required

The reference server currently converts VCF data into GA4GH variant objects. It would also be useful to convert in the other direction for many purposes (for example, providing a stream to programs that do not support GA4GH input). A useful task for the reference implementation might be to provide programs to convert GA4GH data from a given server URL and dataSetId to their VCF and SAM equivalents. Is this an appropriate task for the reference implementation, or should this be left to independent projects?

Need to be precise about what Python versions are supported

The setup scripts currently do not check what Python version is being used, which leads to cryptic failures on older Python versions. This should be fixed, and the exact versions tested using Travis.

PyPI package required

The standard approach for distributing Python software is to use the Python Package Index. This would allow users to install the client and server tools simply by running pip install ga4gh.

Ultimately we want to have a pip installable package so that the server can be deployed as simply as possible. There are a few issues that should be resolved first, and this is a good place to discuss the issues:

What should the package be called? (ga4gh seems the obvious choice.)
When do we make a PyPI package available? (i.e. do we wait until we have a stable release, or do we make early alpha versions available for use, with the clear proviso that the are early alpha releases?)
Should we create a PyPI account to upload the packages run by the GA4GH or just use an account owned by one of the developers?

Create run_tests.py for local Travis imitation

Created after discussion in #95

Travis (AFAIK) does not allow manual test runs (see http://stackoverflow.com/questions/17606874/trigger-a-travis-ci-rebuild-without-pushing-a-commit -- a run can be triggered via a push, and the whole issue is we want a CI run BEFORE a push to master to ensure we don't break it!) and it would be a good practice for merging devs to ensure merges don't break master before pushing them to GitHub.

run_tests.py should imitate the Travis CI run. travis.yml and run_tests.py should be updated in parallel.

CORS unit test broken

@jeromekelleher @dcolligan : the merge in 6b9ef7e seems to have broken @melaniedc 's new CORS unit test? If I check out 399e96 or earlier SHAs, nosetests completes.

I think what's going on here is that we no longer call cors.CORS() for all invocations, but only in frontend.configure(), which is only called when we're entering through server_main().

Does nosetests not execute server_main()?

variants/search client end point needs more detail

With the changes in #102, we lost the detailed output for the variants/search function. We need to get this back.

In general, it would be good to have an option to provide either short or detailed output for each object. These do not need to be filled in for all objects to close this issue.

Protocol definitions are incomplete

Only a handful of the GA4GH API classes have been implemented in protocol.py. The full protocol must be defined here. There are a number of options:

Continue the current pattern, and hand code the Python class equivalents;
Somehow generate valid Python code from the Avro definitions, and check this code into git each time there is a protocol change;
Generate an intermediate representation of the classes from the Avro definitions which can be parsed by Python to dynamically create these classes at run time.

There are advantages to all of these, but I think that option 3 would be the best if it was possible. This intermediate representation would also be very useful for testing and verification purposes as we could include information about types and whether fields are mandatory.

Travis CI integration

Once #7 has been resolved, it would be good to implement Travis CI continuous integration testing.

Support searching over multiple variant sets

The server currently supports searching for variants over a single variant set. The protocol requires searching over a list of variant sets. This requires a rearrangement of the code and also means that we need to include more information in the nextPageToken so we can cleanly page over multiple variant sets.

A compliance tester would be useful

The compliance test suite (https://github.com/ga4gh/compliance) is extremely useful, but does not allow for easily automating the process of API compliance testing. Checking for API compliance is a minimal requirement of our test suite, so we will need to do these checks in any case. Since having a command line application to test an arbitrary API endpoint is a useful additional tool to have, we may as well wrap this in with the suite of applications we are developing as part of the reference implementation.

Here are some ideas for how we might proceed:

Create a new module called ga4gh.compliance and put the compliance testing code in here. This should make calls to the endpoints using the ga4gh.client module, and test the form of the responses. This should be done in as automated a manner as possible using the information encoded in the Avro schemas (see issue #19 for discussion on this point).
Create a new CLI to run this compliance test against an arbitrary endpoint.
Include calls to the compliance code in the test suite.

Simple WSGI deployment

Deploying the server on WSGI compliant HTTP servers is currently not straightforward. We need simple, well-documented recipies for deploying on common platforms such as Apache and Nginx. We may need to add some helper functions into ga4gh.server to make this as easy as possible.

Need to keep track of library dependencies for testing and dev

We are keeping track of library dependencies for deployment using the install_requires keyword in setup.py. We should try to keep track of our requirements for testing and development also, as we have quite a few now. One option is to use a requirements.txt file. This would allow us to install all our requirements in a virtualenv using pip install -r requirements.txt. Thoughts?

Need to implement variantsets/search method

We need to implement the variantsets/search method described in ga4gh/ga4gh-schemas#134. The variantSetIds are stored in the self._variantSets dictionary, so this should be simple to implement.

Specify configuration override file from the command line

Once #61 is addressed, we would like to be able to specify a configuration override file from the command line ("--config_file=..."?) in the same way we can with the GA4GH_CONFIGURATION environment variable.

Clean up code to improve readability

Rewriting some of the bigger classes to make variable names more readable and improve commenting would make understanding the code and contributing to the project easier.

Improve installation time/travis build time

Installation of the ga4gh package in a virtualenv is very slow because we must compile pysam each time. This also happens for our TravisCI builds, which is bad for both us and TravisCI.

Can we speed this up in any way? Would a binary wheel for pysam on Travis's infrastructure be feasible/desirable?

Support configuration

In order to support performance benchmarking we should be able control whether or not the server is logging performance statistics, particularly those that are expensive - memory use or detailed profile data. This is a good impetus to start using Flask configurations.

This can be some combination of command-line, environment variables, Python classes, and files.

(Addresses a TODO in ga4gh/server/init.py)

Frontend should display something

Maybe have a basic input form and/or usage instructions on "/", and something to display (or allow download of) results for the implemented methods.

Thoughts? What would be useful to have here?

Multiple variantNames not handled correctly

The VCF ID field has been mapped directly to the Variant.names field. This is also used in the SearchVariantsRequest object, where we only return variants that have variantName within the interval defined by start and end. This is supported by building a wormtable index over the CHROM and ID columns, which lets us seek to the first record in which the ID value is >= to the query string. This works perfectly well for VCF rows with a single ID value, but will not work correctlly when multiple ID values are associated with a single VCF row.

Is this a common requirement?

Proposed data directory layout

Currently, we only support variant data derived from VCF, in either vcf.gz or wormtable format. We need to support other types of data as we move to support the full API. I suggest we use a designated file system layout for the data directory to discover the files and their relationships. An initial pass at this might be

dataDir/
     datasetId1/
           variants/
                 variantSetId1/
                       chrom1.vcf.gz
                       chrom2.vcf.gz
                 variantSetId2/
                       variants.wt
           reads/
                  [Some directory structure laying out BAM/SAM files]
           references/
                  [Some directory structure laying out references]
     dataSetId2/
           variants/
                  ....
           reads/
                  ....
           references

That is, within the data directory, each directory d corresponds to a Dataset with ID d. Within a given dataset, we have three directories, called variants, reads and references. Within the variants directory, we then have more directories, one for each VariantSet. Within a VariantSet directory, we have a bunch of files that define these variants in what ever formats we support.

For data within a given variant set, we could follow the existing convention used for the TabixBackend: we can split the variant set over as many files as we like, so long as we don't have data for the same chromosome in two files. I've deliberately mixed up wormtable and VCF format files in here. The input should detect the file type and make an instance of the appropriate handler based on the extension. If we don't support a given file type then we complain and quit. This would get rid of the currently very clumsy model of specifying wormtable or tabix as command line arguments to determine the type of backend to allocate.

We'd then just call the server using ga4gh_server dataDir for development and testing purposes. For deployment an absolute path to the required data directory in a config file would be all that is needed.

The idea is that we use the file system hierarchy to explicitly set out the structure of the data to be served. This keeps the number of configuration variables we need as we deal with more complex datasets down. The downside is that we do require a pretty complex file system hierarchy to specify the data to be served. One way we could alleviate this problem is by providing a tool to check the layout of the hierarchy and to raise errors if it's not well formed. This could also print some summaries about the data that's in there too, so that people can get an overview of what they're about to serve.

I've not filled out the reads and references section, as I don't understand this side of things all that well. Will this sort of decomposition make sense here too? Does this hierarchy capture the structure of the API, or is it too restrictive? Any feedback would be much appreciated!

Refactor backends classes for dependency injection

Our backends package now is difficult to test, due to the dependency the VariantSet classes have on the filesystem. We should change the VariantSets to take objects in their constructors instead of file paths, and create these objects using a factory.

Generate URL endpoints from schema definitions

It would be very useful to have a list of valid URL endpoints and the types of their arguments and return values for testing purposes. This information should be derivable from the Avro schemas and could be added to the auto generated file _protocoldefinitions.py.

The general pattern is Search[Class]Request maps to the URL /[class]/search which takes a Search[Class]Request and returns a Search[Class]Response. We can scan for these class names in scripts/generate_schemas.py using a simple regex, and write a list of (url, requestClass, responseClass) triples at the end of the _protocoldefinitions.py.

A different (and inconsistent) pattern holds for GET requests, which should be dealt with later.

generalize python code generation from schema

Python support for AVRO is not very advance. The class generate in generate_schemas.py is an excellent step it the right direction.

I would like to split out the code generation from the schema fetching and create a first class Unix-style command line program (not a `script') that takes a schema and generates classes.

Support CORS

Development of Javascript clients would be greatly eased by supporting Cross-Origin Resource Sharing.

For the current prototype server.py, adding to do_POST:

       self.send_header("Access-Control-Allow-Origin", "*")

should do the trick for now.

What is the correct mapping VCF GT values to `Call` objects?

The mapping VCF genotype values to GA4GH Call objects is currently incomplete/incorrect. Questions:

If the delimiter is | then the genotype is phased. What should the value of the phaseset field be? If the delimiter is / then I assume phaseset should be null.
What does ./. indicate in terms of Call objects? No call exists an so we do not return a Call for the current CallSet?

Extend client library to include full protocol

The client library needs to support the full GA4GH protocol. We should develop and test this against existing implementations.

Response generation model is poor

The current model for generating responses in the server is to construct a complete GASearchVariantsResponse object and to then convert this to a JSON string. This is a poor model because it entails creating lots of intermediate objects and lists of GAVariants and GACalls. This can consume large amounts of memory on the server and is a factor in the current poor performance when returning large numbers of calls. It would be much better if we could generate the JSON directly and write this to a buffer, without generating the lists of GAVariants and GACalls.

We need:

An abstraction that lets us write JSON encoded attributes directly to the buffer. We should also bear in mind we'll want to implement partial responses at some point. This also feeds into issue #5 where we should keep this problem in mind when generating the protocol element classes from Avro.
Management of the number of records written to the buffer according to the requested page size and nextPageToken handling.
An approximate limit in bytes over which the response cannot grow. If we exceed this limit while generating a record, we terminate the response once the current record has been completed. This allows us to limit server memory resources and manage latency.

We should also keep the ability to directly convert protocol elements to JSON though, as this is useful for clients.

Add Beacon Server API support

With the change over to Flask (see issue #38 and PR #40 for discussion and progress) and our support for plain VCF files using the TabixBackend we have the infrastructure to very easily make a Beacon server. This could be trivial to deploy using the PyPI package (see issue #36) and would require only a path to a directory of VCF files to configure.

Handle native filetypes gracefully, without conversion

We should support .vcf, .bcf, .bam, and .sam files as input. Conversion to internal formats and choice of backend should happen on our end, with backend choice determined by performance benchmarking.

A test suite is required

We need a comprehensive test suite to ensure that our reference implementation is indeed correct. I suggest we keep tests in a separate directory called 'tests' and use nose for test discovery and running. This also integrates well with setuptools and has many plugins.

Fix our build so that we pass compliance tests with both tabix and wormtable backends

We should also update our readme with instructions on how to run both backends against the compliance suite, and the expected results with the current build.

Server performance benchmarking needed

As the reference server evolves, we need to have a mechanism to keep track of how the server's performance changes as we add new features. There is currently a crude performance benchmark built into the client program, but this is not satisfactory as it includes network latency and client side JSON parsing in its timings.

One way that we might get a better handle on query performance is to build a separate benchmarking program that takes a dataset (in the same format as the server program, whatever that turns out to be --- see discussion at #44), and runs some standard queries directly against the Backend implementation. That is, we cut out the HTTP layer entirely and run the queries against the GA4GH endpoints (e.g., searchVariants(request), where request is a SearchVariantsRequest object). We should get the response from these functions in the form of JSON strings.

This should allow us to get some good numbers for how our backend implementation is performing, without the noise of HTTP servers, network latency, JSON parsing etc, etc. We might also track memory usage, following the approach taken by @Naburimannu in #10.

What are the important aspects that we need to track? What standard queries should we use? @benedictpaten, any thoughts here?

Fix GASearchVariantsRequest semantics.

The semantics of the GASearchVariantsRequest needs to be updated to follow the update in ga4gh/ga4gh-schemas#166

Server should use WSGI framework

The server currently has a rather dumb HTTP server based on Python's http.server.HTTPServer. This is not a good idea for many reasons. A better approach would be to use the WSGI standard so that the server can be deployed at scale using any WSGI conforming web server. See http://docs.python-guide.org/en/latest/scenarios/web/ for more discussion and here for a performance comparison of WSGI servers.

Which framework (if any) should we choose?

OAuth support required

The reference implementation needs support for OAuth to allow us to perform authentication. There are many problems to be resolved in this, and this issue should provide a central location for the discussion around this topic.

Some questions:

What library should we use?
Do we use OAuth 1.0 or 2.0?
Lots of other questions I would ask if I knew more about OAuth.

Hopefully we can flesh out the actual requirements for the reference server here, and make a concrete plan for implementing these.

Fix Travis security issue

We get this warning from Travis builds:
"/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/pkg_resources.py:1045: UserWarning: /home/travis/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable)."
We should fix this.

Optionally read from encrypted data files

Allow users to specify a key and possibly encryption scheme along with data files. We should accept the key, and decrypt data before processing.
The security doc (http://genomicsandhealth.org/security-infrastructure-read-online#AccessControl) says in 4.6:
"Each stakeholder that stores genomic or clinical data will use strong encryption to encrypt the data for storage."
We should therefore cut over to requiring data encryption in the long run. I lean towards private key encryption, for simplicity and portability. Supporting AES using pycrypto would be a good first step.

Need to implement callsets/search method

We need to implement the callsets/search method. Currently, the concept of a CallSet does not exist; we just generate Calls and map the callSetId and callSetName (do we really need both of these?) to the sample name in VCF. Can we simply map CallSet.id, CallSet.name and CallSet.sampleName to the VCF sample name?

`generate_schemas.py` script should use textwrap instead of forking an `fmt` process

When generating the embedded text representing the JSON schema of a protocol element, the generate_schemas.py script pipes the text through fmt to wrap the text. The textwrap module from the standard library provides a much better way to do this.

Support multiple test buckets

We should have wicked fast unit tests; the current set is perhaps a bit slow.
Separate from those we should have integration tests.
Separate from those we should have end-to-end tests.

We should make sure Travis runs at least some of the end-to-end tests on merge.
(If we get wheels to help with #71 that'll win us some time for more aggressive testing per merge.)

Change from Werkzeug to Flask?

Supporting OAuth2 and authentication would be considerably simpler if we used Flask instead of Werkzeug, and it would probably make working with the server a lot easier from a web development perspective also. This is probably a formality since I was the only one really in favour of Werkzeug, but are there any objections to changing over to Flask?

For previous discussions see #3 and #31.

Variant simulator backend should be removed and README updated

The simulator backend should be removed, and the README documentation updated to reflect the more recent developments. The simulator isn't very useful currently, and would take quite a lot of work to do properly. Therefore, I think it should be removed.

Any objections?

Error handling is required

Currently, any exceptions that occur are not dealt with. At a minimum, we need to catch these exceptions and send back a GAException object to the client. See the TODO section of the README for further discussion of this.

Improve CLI code layout

Currently, the command line programs are defined in the ga4gh.scripts package, with a module for the server and client applications. This is a poor layout, and the scripts name is quite confusing. Another disadvantage of this approach is that when we want to develop with the server.py and client.py programs, we need to either hard link or copy the files into the root of the package hierarchy. This is confusing, and possibly a barrier to developers.

Here is one possible way to improve this:

Remove the scripts package entirely, and consolidate all the code for the CLI programs into a new ga4gh.cli module. This will make sharing of code between the various CLI programs much easier. The entry points for the client and server could be client_main and server_main respectively. We then modify setup.py so that the console_scripts endpoints are correct. From a deployment perspective, there is therefore no difference: we still have ga4gh_client and ga4gh_server.
Add some new files into the root of the project to make running these scripts as easy as possible during development. One possibility would be to have dev_client.py and dev_server.py that just contain calls to ga4gh.cli.client_main and ga4gh.cli.server_main respectively. (These files would not be distributed in a release, they're just to make developers lives a little easier.) However, there are lots of ways to do this, and this is just one possibility.

Thoughts?

Write audit logs for all requests and responses

Access to data should be logged. Flask provides a logging framework, so that may be a good start. Once we have authentication we should log user IDs along with actions.

Client CLI needs to support arbitary URL endpoints

The client program currently only supports connecting to GA4GH servers in which there are no URL prefixes (i.e., we assume that we only connect to http://server:port/API_CALLS). However, the majority of endpoints are not of this form and contain a URL prefix (e.g. http://server:port/some/arbitrary/prefix/API_CALLS). The client program (and libraries) should support this pattern.

Ideally, this would be handled in the ga4gh.client.HTTPClient constructor, where pass in an optional URL prefix as well as host and port. Or, we might decide that it would be better to simply pass in an encoded string containing this information instead.

Select among configuration subclasses from the command line

After #61 is addressed, http://flask.pocoo.org/docs/0.10/config/#development-production suggests subclassing ga4gh.server.config:DefaultConfig for test, development, production, or other use cases. Once we have a use for them, we should support a command-line argument ("--config=..."?) to select among them.

Explore using slots for protocol objects

Some of our memory burden (observed #10 ) comes from the dict allocated for every object; we can have a quarter million or more objects allocated on the server to answer a single query. Python classes can explicitly declare __slots__ = [...] with a fixed immutable list of fields to prevent dict allocation.

This might save 15% of the peak memory burden (and perhaps provide speedup due to lack of memory churn?); @pashields points out that only a couple of lines of change in the codegen script will be needed. It would be cleaner, more pythonic, and more amenable to error handling than my previously prototyped ad-hoc string aggregation.