superphy / spfy

Spfy: an integrated graph database for real-time prediction of Escherichia coli phenotypes and downstream comparative analyses

Home Page: https://lfz.corefacility.ca/superphy/grouch/

License: Apache License 2.0

Python 99.94% Shell 0.06%
genome-annotation predictive-analytics web-app rq blazegraph spa docker

spfy's Introduction


Spfy: a platform for predicting subtypes from E. coli whole-genome sequences and building graph data for population-wide comparative analyses.

Published as: Le,K.K., Whiteside,M.D., Hopkins,J.E., Gannon,V.P.J., Laing,C.R. Spfy: an integrated graph database for real-time prediction of bacterial phenotypes and downstream comparative analyses. Database (2018) Vol. 2018: article ID bay086; doi:10.1093/database/bay086

Live: https://lfz.corefacility.ca/superphy/spfy/

screenshot of the results page

Use:

  1. Install Docker (and Docker Compose separately if you're on Linux, link). Mac/Windows users have Compose bundled with Docker Engine.
  2. git clone --recursive https://github.com/superphy/spfy.git
  3. cd spfy/
  4. docker-compose up
  5. Visit http://localhost:8090
  6. Eat cake 🍰

Submodule Build Statuses:

  • ECTyper
  • PanPredic
  • Docker Image for Conda

Stats:

Comparing different population groups:

Overall Performance

Runtimes of subtyping modules:

Runtimes of individual analyses

CLI: Generate Graph Files:

  • If you wish to only create rdf graphs (serialized as turtle files):
  1. First install miniconda and activate the environment from https://raw.githubusercontent.com/superphy/docker-flask-conda/master/app/environment.yml
  2. cd into the app folder (where RQ workers typically run from): cd app/
  3. Run savvy.py like so: python -m modules/savvy -i tests/ecoli/GCA_001894495.1_ASM189449v1_genomic.fna, where the argument after -i is your genome (FASTA) file.

CLI: Generate Ontology:

screenshot of the results page

The ontology for Spfy is available at: https://raw.githubusercontent.com/superphy/backend/master/app/scripts/spfy_ontology.ttl. It was generated using https://raw.githubusercontent.com/superphy/backend/master/app/scripts/generate_ontology.py with shared functions from Spfy's backend code. If you wish to run it:

  1. cd app/
  2. python -m scripts/generate_ontology

which will put the ontology in app/.

You can generate a pretty diagram from the .ttl file using http://www.visualdataweb.de/webvowl/

CLI: Enqueue Subtyping Tasks w/o Reactapp:

Note

currently set up for .fna files only

You can bypass the front-end website and still enqueue subtyping jobs by:

  1. First, mount the host directory with all your genome files to /datastore in the containers.

For example, if you keep your files at /home/bob/ecoli-genomes/, you'd edit the docker-compose.yml file and replace:

volumes:
- /datastore

with:

volumes:
- /home/bob/ecoli-genomes:/datastore
  2. Then take down your docker composition (if it's up) and restart it:
docker-compose down
docker-compose up -d
  3. Shell into your webserver container (though the worker containers would work too) and run the script:
docker exec -it backend_webserver_1 sh
python -m scripts/sideload
exit

Note that residual files may be created in your genome folder.

Architecture:

screenshot of the results page

| Docker Image | Ports | Name | Description |
| --- | --- | --- | --- |
| backend-rq | 80/tcp, 443/tcp | backend_worker_1 | the main redis queue workers |
| backend-rq-blazegraph | 80/tcp, 443/tcp | backend_worker-blazegraph-ids_1 | handles Spfy ID generation for the Blazegraph database |
| backend | 0.0.0.0:8000->80/tcp, 443/tcp | backend_web-nginx-uwsgi_1 | the Flask backend which handles enqueueing tasks |
| superphy/blazegraph:2.1.4-inferencing | 0.0.0.0:8080->8080/tcp | backend_blazegraph_1 | Blazegraph database |
| redis:3.2 | 6379/tcp | backend_redis_1 | Redis database |
| reactapp | 0.0.0.0:8090->5000/tcp | backend_reactapp_1 | front-end to Spfy |

Further Details:

The superphy/backend-rq:2.0.0 image is scalable: you can create as many instances as you need/have processing power for. The image is responsible for listening to the multiples queue (12 workers), which handles most of the tasks, including RGI calls, and to the singles queue (1 worker), which runs ECTyper. This split exists because RGI is the slowest part of the pipeline. Worker management is handled by supervisor.
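The queue split described above can be sketched as a simple routing rule. This is illustrative only: the queue names come from this README, but the `route_task` helper and task names are hypothetical, and the real dispatch lives in the Flask backend and supervisor configs.

```python
# Illustrative sketch of the queue layout: most tasks (including RGI)
# fan out over the "multiples" queue served by 12 workers, while
# ECTyper is serialized on the "singles" queue with 1 worker.
QUEUE_WORKERS = {'multiples': 12, 'singles': 1}

def route_task(task_name):
    """Pick a queue for a task: ECTyper is serialized, the rest fan out."""
    return 'singles' if task_name == 'ectyper' else 'multiples'
```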

The superphy/backend-rq-blazegraph:2.0.0 image is not scalable: it is responsible for querying the Blazegraph database for duplicate entries and for assigning spfyIDs in sequential order. Its functions are kept as minimal as possible to improve performance (ID generation is the one bottleneck in otherwise parallel pipelines): comparisons are done by sha1 hashes of the submitted files, and non-duplicates have their IDs reserved by linking the generated spfyID to the file hash. Worker management is handled by supervisor.
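A minimal sketch of the hash-based deduplication and sequential ID reservation described above, with an in-memory dict standing in for the Blazegraph lookups the real worker performs (class and function names are hypothetical):

```python
import hashlib

def sha1_of_file(path):
    """Hash a submitted genome file in chunks, as large FASTA files warrant."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

class SpfyIdRegistry:
    def __init__(self):
        self._by_hash = {}   # sha1 hex digest -> spfyID
        self._next_id = 1    # spfyIDs are assigned in sequential order

    def reserve(self, digest):
        """Return the existing spfyID for a duplicate, or reserve a new one."""
        if digest not in self._by_hash:
            self._by_hash[digest] = self._next_id
            self._next_id += 1
        return self._by_hash[digest]
```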

The superphy/backend:2.0.0 image, which runs the Flask endpoints, uses supervisor to manage its inner processes: nginx and uWSGI.

Blazegraph:

  • We are currently running Blazegraph version 2.1.4. If you want to run Blazegraph separately, please use the same version; otherwise there may be problems with endpoint URLs/returns (notably with version 2.1.1). See #63. Alternatively, modify the endpoint accordingly under database['blazegraph_url'] in /app/config.py

Contributing:

Steps required to add new modules are documented in the Developer Guide.

spfy's People

Contributors: chadlaing, jamez-eh, kevinkle


spfy's Issues

return all job ids including dependencies in subtyping

Currently, we only return the end task (beautify normally, or datastruct if using bulk uploading). This means less checking is required by the server when handling blobids, but it is likely the cause of #142, and it adds significantly more code complexity.

Instead, we should return the ids for every job and check all of them when polling a blobid. This would effectively force the groupresults option in the subtyping card to be removed.
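The proposed polling rule could be sketched like this (statuses follow RQ's terminology; the `poll_blob` helper is hypothetical, and the real check would fetch each rq Job by id):

```python
def poll_blob(job_statuses):
    """Aggregate status over every job id tied to a blobid.

    job_statuses: list of 'queued'/'started'/'finished'/'failed' strings.
    Any failure surfaces immediately; 'complete' only once all jobs finish.
    """
    if any(s == 'failed' for s in job_statuses):
        return 'failed'
    if all(s == 'finished' for s in job_statuses):
        return 'complete'
    return 'pending'
```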

Bug where old api works for getting results but not new api

http://10.139.14.212:8000/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea works, but
http://10.139.14.212:8000/api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea returns a 500 error (isa)

[2017-06-12 17:32:38,032] ERROR in app: Exception on /api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea [GET]
Traceback (most recent call last):
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1615, in full_dispatch_request
    return self.finalize_request(rv)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1630, in finalize_request
    response = self.make_response(rv)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1740, in make_response
    rv = self.response_class.force_type(rv, request.environ)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/werkzeug/wrappers.py", line 847, in force_type
    response = BaseResponse(*_run_wsgi_app(response, environ))
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/werkzeug/test.py", line 871, in run_wsgi_app
    app_rv = app(environ, start_response)
TypeError: 'list' object is not callable
[pid: 45|app: 0|req: 4914/7477] 10.139.14.104 () {44 vars in 822 bytes} [Mon Jun 12 17:32:38 2017] GET /api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea => generated 291 bytes in 9 msecs (HTTP/1.1 500) 3 headers in 116 bytes (1 switches on core 0)
10.139.14.104 - - [12/Jun/2017:17:32:38 +0000] "GET /api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea HTTP/1.1" 500 291 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" "-"

create a task in RQ that runs at the end and cleans up all other tasks

should also upload the genome file as bytes to the genome object, as suggested by @chadlaing

looks like multi-job dependencies may be added pretty soon to RQ (rq/rq#260), which would make this much easier

this would also partly address #94, as we could then overwrite the initial blobid:{(dict of jobids)} with something like {blobid: {status: complete, result: (someresult)}}. This is necessary because in https://github.com/superphy/backend/tree/147-return-all-jobids we start returning all jobids instead of only the end task (which has a ttl=-1), and status checking for dependent tasks fails when they hit their ttl

problem where after ~5 mins flask reports the job as not found

  • the blob id is okay; checked via docker exec -it backend_redis_1 sh, then redis-cli -h redis -p 6379 GET blob-1068307098500090648
  • figured out the problem:
    • not all tasks are stored forever - namely dependencies
      • so when they expire, a job-not-found error is thrown even though the main tasks are finished

create two endpoints, one just for status checking and the other for getting results

something like:

  1. /status/<job_id>, which simply returns pending, failed, or complete
  2. /results/<job_id>, which returns either job.exc_info (failed) or job.result (complete)
    The job id will be the same; depending on which endpoint is hit, flask will return 1 or 2.
    This would eliminate a lot of data transfer that is currently going on unnecessarily.
    Alternative option for 1: could return failed + job.exc_info, since it won't be that large.
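Sketched as plain functions (the real endpoints would be Flask routes resolving the job id via RQ; the dict-shaped job stand-in is illustrative):

```python
# /status/<job_id>: cheap status check, no payload transferred.
def status_endpoint(job):
    if job.get('failed'):
        return 'failed'
    return 'complete' if job.get('result') is not None else 'pending'

# /results/<job_id>: traceback on failure, otherwise the full result,
# so large payloads only move when explicitly requested.
def results_endpoint(job):
    if job.get('failed'):
        return job.get('exc_info')
    return job.get('result')
```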

report run-time as metadata to jobs

  • this will let us compare (in a more general fashion, given differing cpu usage at the time) different modules, for example two serotyping modules
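One way this might be done, sketched with a hypothetical decorator; in a real RQ worker the elapsed time would be written into the job's metadata (e.g. via rq's get_current_job(), job.meta, and job.save()):

```python
import time
from functools import wraps

def report_runtime(fn):
    """Time a task body so module run-times can be compared later."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        elapsed = time.time() - start
        # In a real worker: job = rq.get_current_job();
        # job.meta['runtime_s'] = elapsed; job.save()
        wrapper.last_runtime = elapsed
        return result
    return wrapper
```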

create a service to run blazegraph on corefacility

as required by #159
currently running in a screen session in /Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing using the command: java -server -Xmx4g -Dbigdata.propertyFile=/Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing/RWStore.properties -jar blazegraph.jar

address difference between faldo:Position and faldo:position

matt_whiteside [3:13 PM] 
@kevin faldo:Position is an owl:Class for labelling object types, but faldo:position is a owl:DatatypeProperty for linking bnodes to position literals


[3:13] 
same word, different capitalizations


kevin
[3:14 PM] 
ya, this is how its defined in the ontology if im not mistaken


matt_whiteside [3:14 PM] 
i noticed the Position is used when linking to the integer literals


kevin [3:14 PM] 
in theory rdf is cap-null but this doesnt seem to pan out in practice


matt_whiteside [3:14 PM] 
in the code


[3:15] 
so blazegraph doesn't distinguish between Position and position?


kevin
[3:16 PM] 
mm I think it does


[3:16] 
it shouldnt by definition of the rdf spec


[3:16] 
though*


[3:16] 
and ya we use


[3:16] 
```
graph.add((bnode_start, gu('faldo:Position'),
           Literal(gene_record['START'])))
graph.add((bnode_end, gu('faldo:Position'),
           Literal(gene_record['STOP'])))
```


[3:16] 
which are numerical literals


matt_whiteside [3:16 PM] 
ya, thats what i was getting at


kevin
[3:16 PM] 
and
```
graph.add((bnode_start, gu('rdf:type'), gu('faldo:Position')))
graph.add((bnode_start, gu('rdf:type'), gu('faldo:ExactPosition')))
graph.add((bnode_end, gu('rdf:type'), gu('faldo:Position')))
graph.add((bnode_end, gu('rdf:type'), gu('faldo:ExactPosition')))
```


[3:16] 
which are classes


[3:17] 
im not sure if/how we should change this, would have to refer back to the faldo spec


matt_whiteside [3:18 PM] 
For:
```
graph.add((bnode_start, gu('faldo:position'),
           Literal(gene_record['START'])))
```
its lower case


[3:18] 
and  
```graph.add((bnode_start, gu('rdf:type'), gu('faldo:Position')))```
uppercase


[3:20] 
also, do you need both lines:
```
graph.add((bnode_start, gu('rdf:type'), gu('faldo:Position')))
graph.add((bnode_start, gu('rdf:type'), gu('faldo:ExactPosition')))
```


[3:20] 
isn't ExactPosition a subclass of Position


kevin
[3:21 PM] 
hmm


[3:21] 
i don’t know if i ever defined that relation explicitly


[3:21] 
that would work though


matt_whiteside [3:22 PM] 
its defined in FALDO


kevin
[3:22 PM] 
faldo the ontology isnt included for performance reasons


matt_whiteside [3:22 PM] 
ohh ok


[3:22] 
then maybe it is needed
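To summarize the thread: the lowercase faldo:position (an owl:DatatypeProperty) should link a bnode to the integer literal, while the uppercase faldo:Position and faldo:ExactPosition (owl:Classes, the latter a subclass of the former in FALDO) should only appear as rdf:type objects. A sketch with plain tuples standing in for rdflib graph.add() calls (the `position_triples` helper is hypothetical):

```python
def position_triples(bnode, coord):
    """Triples for one genomic coordinate, per the FALDO convention above."""
    return [
        (bnode, 'rdf:type', 'faldo:Position'),
        (bnode, 'rdf:type', 'faldo:ExactPosition'),  # subclass of faldo:Position
        (bnode, 'faldo:position', coord),            # lowercase property -> literal
    ]
```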

"analyze this" module

Per @chadlaing 's comments:

See if it is possible within the framework of 10mins to do something like:
"We have an unknown sample, and spfy loaded with all public E. coli data. We can:
1) get serotype
2) get known VF and AMR
3) get pan-genome composition (eg. is it close to known reference strains? outbreak strains?)
4) identify Stx-type
5) identify markers for the unknown strain, or group of closely related strains

This shows the capabilities of spfy, but also the usefulness for PHAC"

This would have to wrap various existing modules:

  • subtyping for 1) & 2)
  • James' panseq code
  • fishers or phylotyper(Matt) to find closely related strains and then compute common markers (5)
  • a new module must be added for 4)

blazegraph struggles under load

        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:497)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
        at org.eclipse.jetty.server.HttpInputOverHTTP.blockForContent(HttpInputOverHTTP.java:66)
        at org.eclipse.jetty.server.HttpInput$1.waitForContent(HttpInput.java:456)
        at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:121)
        at org.apache.commons.io.input.BOMInputStream.read(BOMInputStream.java:286)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read(BufferedReader.java:182)
        at java.io.LineNumberReader.read(LineNumberReader.java:126)
        at java.io.FilterReader.read(FilterReader.java:65)
        at java.io.PushbackReader.read(PushbackReader.java:90)
        at org.openrdf.rio.turtle.TurtleParser.read(TurtleParser.java:1247)
        at org.openrdf.rio.turtle.TurtleParser.parseString(TurtleParser.java:764)
        at org.openrdf.rio.turtle.TurtleParser.parseQuotedString(TurtleParser.java:740)
        at org.openrdf.rio.turtle.TurtleParser.parseQuotedLiteral(TurtleParser.java:648)
        at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:626)
        at org.openrdf.rio.turtle.TurtleParser.parseObject(TurtleParser.java:502)
        at org.openrdf.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:428)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:421)
        at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:385)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:216)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:159)
        at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithBodyTask.call(InsertServlet.java:308)
        at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithBodyTask.call(InsertServlet.java:229)
        at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:156)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        ... 3 more

all jobs should route through grouped to avoid missing dependencies

If you submit a job without the options.groupresults flag being true, then flask returns a list of job ids, two of which are the QC and ID generation steps. However, reactapp doesn't check either of those (and, really, shouldn't have to) so if either fails, the actual analysis job is still returned as "pending".

We should store and check dependencies server-side to avoid this problem.
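A sketch of what server-side dependency tracking might look like, with an in-memory registry standing in for Redis (all names here are hypothetical):

```python
class DependencyRegistry:
    """Record each job's dependency ids (e.g. QC, ID generation) at enqueue time."""

    def __init__(self):
        self._deps = {}

    def register(self, job_id, dependency_ids):
        self._deps[job_id] = list(dependency_ids)

    def effective_status(self, job_id, statuses):
        """statuses: dict of job_id -> 'finished'/'failed'/'pending'.

        Surface a failed dependency instead of leaving the analysis
        job stuck at 'pending' forever.
        """
        for dep in self._deps.get(job_id, []):
            if statuses.get(dep) == 'failed':
                return 'failed'
        return statuses.get(job_id, 'pending')
```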

jobs that hit timeout don't return job.failed=True

for example:

 modules.amr.amr.amr('/datastore/2017-07-05-03-40-32-228811-GCA_900016125.1_EF467_contigs_genomic.fna') from backlog_multiples
e582dbda-e5e6-45b8-91f0-b376298158e7
Failed 35 minutes ago
Traceback (most recent call last):
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/worker.py", line 700, in perform_job
    rv = job.perform()
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/job.py", line 500, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "./modules/amr/amr.py", line 21, in amr
    '-o', outputname])
  File "/opt/conda/envs/backend/lib/python2.7/subprocess.py", line 168, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/opt/conda/envs/backend/lib/python2.7/subprocess.py", line 1073, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/opt/conda/envs/backend/lib/python2.7/subprocess.py", line 121, in _eintr_retry_call
    return func(*args)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/timeouts.py", line 51, in handle_death_penalty
    'value ({0} seconds)'.format(self._timeout))
JobTimeoutException: Job exceeded maximum timeout value (600 seconds)

But the if job.failed: check doesn't evaluate to True, as it normally does when the actual job fails.

We will have to look for a way to check whether a job hit its timeout.
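One possible check, sketched under the assumption that RQ stores the traceback in job.exc_info even when job.failed stays False (as in the traceback above); the `hit_timeout` helper and the job stand-in are hypothetical:

```python
def hit_timeout(job):
    """Detect a timed-out job by the JobTimeoutException in its stored traceback."""
    exc = getattr(job, 'exc_info', None) or ''
    return 'JobTimeoutException' in exc
```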
