candig / htsget_app

Htsget API implementation based on the Htsget protocol

License: GNU Lesser General Public License v3.0

Python 97.77% Dockerfile 0.37% Shell 1.86%

htsget_app's Issues

The application does not seem to be following some of the htsget specs

The application differs from the htsget spec in the following ways:

The spec passes the id as a URL path parameter, e.g., /data/<id>, but this application treats id as a query string parameter, i.e., /data?id=.
The spec writes parameters in camelCase, such as referenceName, but the application spec writes reference_name.
The server should accept both chr1 and 1 as valid references to chromosome 1, yet it only accepts 1. The application spec also declares chr as type int, while it should be of type string.
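
The third point could be handled by normalizing the reference name before it is looked up. The following is only a sketch: `normalize_reference_name` is a hypothetical helper, not a function from this repository, and it assumes the underlying files name their contigs without a `chr` prefix. It keeps the value as a string, which also covers names such as X and Y.

```python
def normalize_reference_name(name):
    """Return a contig name as a string without a leading 'chr' prefix.

    Hypothetical helper: assumes the data files use '1', '2', ... 'X'
    rather than 'chr1', 'chr2', ... 'chrX'.
    """
    name = str(name).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return name


# Both spellings resolve to the same contig name.
print(normalize_reference_name("chr1"))
print(normalize_reference_name("1"))
```

Aliases that differ by more than the prefix (for example chrM versus MT) are not handled here and would need an explicit lookup table.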

Remove .vscode, __pycache__ from version control

Add python packaging for package

Following python_model_service, create directory structure and a setup.py such that one can do the following steps:

virtualenv htsenv
source htsenv/bin/activate
python setup.py install

and have the htsget server installed in the local python environment. This will require #1 to be complete so that one can change configuration parameters without having to reinstall.

External services for dependency updates + code quality

Following the python model service, register repository with codefactor and pyup to automatically track simple code quality issues and old dependencies

create_slice returns entire chromosome for 1 slice

create_slice should respect the request's start and end parameters.

Currently, if the requested region is smaller than the chunk size, it returns the entire chromosome. This is not going to work.

E.g., when requesting a block of 10k bp with a chunk size of 100k bp, it returns the entire chromosome instead of a single 10k bp slice, which doesn't make sense.
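
One way to get the expected behaviour is to clamp every chunk to the requested interval. This is a minimal sketch rather than the repository's actual create_slice: the function name `create_slices` and its signature are assumptions, and coordinates are treated as 0-based, half-open intervals, as in the htsget protocol.

```python
def create_slices(start, end, chunk_size):
    """Split the half-open interval [start, end) into chunks.

    Each chunk is at most chunk_size long and never extends past `end`,
    so a request smaller than chunk_size yields exactly one slice.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if end < start:
        raise ValueError("end must not be smaller than start")

    slices = []
    pos = start
    while pos < end:
        slices.append((pos, min(pos + chunk_size, end)))
        pos += chunk_size
    return slices


# A 10k bp request with a 100k bp chunk size gives a single 10k bp slice.
print(create_slices(20000, 30000, 100000))
```

Because the loop stops as soon as `pos` reaches `end`, a request whose length is an exact multiple of the chunk size does not produce a trailing empty slice.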

Travis-CI tests

Once #2 and #3 are done we can easily add travis-CI testing for at least the local-file/sqlite3 service.

Invalid contig for certain bam files - reasons unknown

After #1 and #2, create a docker file for the python model server (following the python model server), and we'll register the repository w/ Quay.io (after making it public) to have docker images built automatically as part of the CI process.

Create a README.md for the repo

Create a README which describes this repo as an implementation of the HTSGET protocol (and link to it), with a very brief description of the features and how to run it. You can, but needn't, reference the python model service (https://github.com/CanDIG/python_model_service) and use its README as a model.
Add badges for travis, pyup, and code factor. You can see how that's done in the python model service README; I've activated this repo for those services.
When the docker container is successfully being built we can also include the badge for quay.io.

Use SQLAlchemy rather than sqlite directly

In _execute, rather than calling the sqlite3 library, use SQLAlchemy; this will mean creating a database engine and connection on initialization, and then calling execute on the connection (see here: https://docs.sqlalchemy.org/en/13/core/tutorial.html). This modest change will mean that this service could also be used with back ends such as MySQL, Postgres, etc.
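
A rough sketch of what that could look like follows. The `Database` class, the table name and the column names are made up for illustration and are not taken from this repository. The sketch creates the engine once on initialization but checks out a connection per call through `engine.begin()`, which commits on success; holding a single long-lived connection, as suggested above, would work too.

```python
from sqlalchemy import create_engine, text


class Database:
    """Thin wrapper around a SQLAlchemy engine (hypothetical class)."""

    def __init__(self, url="sqlite:///:memory:"):
        # The URL selects the back end, e.g. sqlite:///files.db or a
        # MySQL/Postgres URL, without changing the calling code.
        self.engine = create_engine(url)

    def execute(self, query, params=None):
        # engine.begin() opens a connection and a transaction, committing
        # on normal exit and rolling back if an exception is raised.
        with self.engine.begin() as conn:
            result = conn.execute(text(query), params or {})
            return result.fetchall() if result.returns_rows else []


if __name__ == "__main__":
    db = Database()
    db.execute("CREATE TABLE files (id TEXT PRIMARY KEY, file_type TEXT)")
    db.execute(
        "INSERT INTO files (id, file_type) VALUES (:id, :file_type)",
        {"id": "abc", "file_type": "BAM"},
    )
    print(db.execute("SELECT file_type FROM files WHERE id = :id", {"id": "abc"}))
```

Using named bind parameters (`:id`) with `text()` keeps the queries portable across database drivers, which have different native placeholder styles.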

Can we avoid the "creating a temporary file" step for getting the data?

This may or may not be possible but is worth considering, as it would avoid having to clean up temporary files when clients disconnect or do not follow up on tickets.

DRS+Minio CI testing?

With #6 completed we can start thinking about automated CI testing with DRS and Minio servers. This may require the use of Circle-CI rather than Travis, however, since it will involve 3 services plus a client running simultaneously. This will take some thought.

The /variants endpoint is not handling the request correctly

It looks like it doesn't support search for VCFs across multiple chromosomes; the implementation seems to assume that one VCF file can only contain records from 1 chromosome.

E.g., for the following sample VCF File (headers are stripped)

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	217-70-3296_sample_1
1	76569151	.	C	A	.	PASS	.	GT	0/0
1	82441079	.	C	T	.	PASS	.	GT	0/0
3	46018344	.	G	A	.	PASS	.	GT	1/1
21	34609505	.	G	A	.	PASS	.	GT	1/1

If you do a /variants search on this, it returns referenceName as None, start as the position in the first row (76569151), and end as the position in the last row (34609505), even though those rows belong to different chromosomes.
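
To illustrate what a per-chromosome result would look like for the sample above, here is a small sketch that scans VCF records and tracks the lowest and highest POS for each contig separately. `contig_ranges` is a hypothetical helper that works on plain text lines; it is not how the application currently reads VCFs, and it reports the raw POS values from the file without converting them to htsget coordinates.

```python
def contig_ranges(vcf_lines):
    """Map each contig name to the (min POS, max POS) of its records."""
    ranges = {}
    for line in vcf_lines:
        # Skip blank lines and header lines, including the #CHROM row.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        pos = int(pos)
        low, high = ranges.get(chrom, (pos, pos))
        ranges[chrom] = (min(low, pos), max(high, pos))
    return ranges


sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t217-70-3296_sample_1",
    "1\t76569151\t.\tC\tA\t.\tPASS\t.\tGT\t0/0",
    "1\t82441079\t.\tC\tT\t.\tPASS\t.\tGT\t0/0",
    "3\t46018344\t.\tG\tA\t.\tPASS\t.\tGT\t1/1",
    "21\t34609505\t.\tG\tA\t.\tPASS\t.\tGT\t1/1",
]
for chrom, (low, high) in contig_ranges(sample).items():
    print(chrom, low, high)
```

With ranges kept per contig, a search without a referenceName could return one entry per chromosome instead of a single start/end pair that mixes positions from different chromosomes.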

Add some pytest tests

Using pytest, create a small test suite using another client to successfully pull down slices off the packaged data:

Should successfully pull some prescribed slices of both VCFs and BAMs, showing no difference from a tabix-extracted subset
Should successfully pull the entire file if no start/end is given
Should fail with the expected error if a request is made for a data file which doesn't exist
Should fail with the expected error on a malformed request (e.g., end < start)
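
The error cases in the list above could start from something like the following. This is only a sketch of the test layout: `validate_request` is a hypothetical stand-in for the server-side checks, and the status codes it returns are assumptions rather than the application's actual responses. The real suite would go through an htsget client against the packaged data. The test functions use plain asserts, so pytest can collect them as-is.

```python
import os
import tempfile


def validate_request(path, start=None, end=None):
    """Return an HTTP-style status code for a slice request (hypothetical)."""
    if not os.path.exists(path):
        return 404  # requested data file does not exist
    if start is not None and start < 0:
        return 400  # negative coordinates are malformed
    if start is not None and end is not None and end < start:
        return 400  # malformed range
    return 200


def test_missing_file_is_rejected():
    with tempfile.TemporaryDirectory() as tmp_dir:
        missing = os.path.join(tmp_dir, "missing.bam")
        assert validate_request(missing) == 404


def test_end_before_start_is_rejected():
    with tempfile.NamedTemporaryFile(suffix=".bam") as handle:
        assert validate_request(handle.name, start=100, end=50) == 400


def test_whole_file_request_is_accepted():
    with tempfile.NamedTemporaryFile(suffix=".bam") as handle:
        assert validate_request(handle.name) == 200
```

Saved as a file such as test_requests.py (the name is arbitrary), these run with a plain `pytest` invocation.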

Tag latest version of htsget_app

Can you please confirm whether v0.1.4 of htsget_app is the latest version to be used for CanDIGv2? It looks like the stable branch is ahead, but I wanted to check whether we should switch to it instead of the v0.1.4 tagged version.

Bring htsget spec up to date

(Ab)use DRS `object.mime_type` to reflect file type in minio

CORS Request

Pull data directly from Minio

htsget should access the data file directly from minio

Put configuration parameters in a config file

There are a number of ways this can be done, but a simple and common one is configparser: https://docs.python.org/3/library/configparser.html

Here's an example of combining configparser with argparse to allow the config parameters to be overridden on the command line: https://stackoverflow.com/questions/3609852/which-is-the-best-way-to-allow-configuration-options-be-overridden-at-the-comman
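
A minimal sketch of that combination follows. The section name `htsget`, the option names and the default values are all placeholders chosen for illustration, not the application's actual settings. A first parser only looks for `--config`; values from that file become the defaults of the second parser, so anything given on the command line still wins.

```python
import argparse
import configparser


def load_settings(argv=None):
    """Merge built-in defaults, an optional INI file, and command-line flags."""
    # First pass: find --config without complaining about other options.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config", default=None, help="path to an INI file")
    known, remaining = pre.parse_known_args(argv)

    # Placeholder defaults; the real application would define its own.
    defaults = {"host": "localhost", "port": 3333, "chunk_size": 1000000}

    if known.config:
        ini = configparser.ConfigParser()
        ini.read(known.config)
        if ini.has_section("htsget"):
            section = ini["htsget"]
            defaults["host"] = section.get("host", fallback=defaults["host"])
            defaults["port"] = section.getint("port", fallback=defaults["port"])
            defaults["chunk_size"] = section.getint(
                "chunk_size", fallback=defaults["chunk_size"]
            )

    # Second pass: file values act as defaults, flags override them.
    parser = argparse.ArgumentParser(parents=[pre])
    parser.add_argument("--host")
    parser.add_argument("--port", type=int)
    parser.add_argument("--chunk-size", type=int, dest="chunk_size")
    parser.set_defaults(config=known.config, **defaults)
    return parser.parse_args(remaining)


if __name__ == "__main__":
    print(load_settings())
```

The resulting precedence is command line, then config file, then built-in defaults, so a deployment can be reconfigured by editing the file without reinstalling the package.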

The application does not seem to support small-range searches

http://abc.ca:3333/htsget/v1/reads?id=abc.bam&format=BAM&reference_name=21&start=14099895&end=14168318

The request only yields the following URL, which spans the entire chromosome: "http://abc.ca:3333/htsget/v1/data?id=abc.bam&reference_name=21"

While a bigger range request

http://abc.ca:3333/htsget/v1/reads?id=abc&reference_name=21&start=0&end=10000000

yields

{
  "htsget": {
    "format": "BAM",
    "urls": [
      {
        "url": "http://abc.ca:3333/htsget/v1/data?id=abc&reference_name=21&start=0&end=10000000"
      },
      {
        "url": "http://abc.ca:3333/htsget/v1/data?id=abc&reference_name=21&start=10000000&end=10000000"
      }
    ]
  }
}

The second url does not seem to be needed, but at least it includes the start and end in its response.
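
For reference, the expected ticket for both requests can be sketched as below. `build_ticket` is a hypothetical helper, not code from this repository. It assumes the slices have already been computed as (start, end) pairs, keeps the query-string style of the URLs shown above, always includes start and end, and drops empty slices such as start=10000000&end=10000000.

```python
import json
from urllib.parse import urlencode


def build_ticket(base_url, file_id, reference_name, slices, fmt="BAM"):
    """Build an htsget-style ticket with one data URL per non-empty slice."""
    urls = []
    for start, end in slices:
        if start >= end:
            continue  # an empty slice carries no data, so emit no URL for it
        query = urlencode(
            {
                "id": file_id,
                "reference_name": reference_name,
                "start": start,
                "end": end,
            }
        )
        urls.append({"url": f"{base_url}/data?{query}"})
    return {"htsget": {"format": fmt, "urls": urls}}


ticket = build_ticket(
    "http://abc.ca:3333/htsget/v1",
    "abc",
    "21",
    [(0, 10000000), (10000000, 10000000)],
)
print(json.dumps(ticket, indent=2))
```

With the small-range request from the first example, passing the single slice (14099895, 14168318) would give one URL that carries both start and end instead of a URL for the whole chromosome.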

Support returning binary data instead of plain text (bam instead of sam)

Currently, the /data endpoint defaults to returning plain text (equivalent to SAM). The endpoint should allow the client to choose between binary and plain-text output.

Additionally, the IGV browser expects binary (BAM) data, not SAM.