superphy / spfy

Spfy: an integrated graph database for real-time prediction of Escherichia coli phenotypes and downstream comparative analyses

Home Page: https://lfz.corefacility.ca/superphy/grouch/

License: Apache License 2.0

Python 99.94% Shell 0.06%
genome-annotation predictive-analytics web-app rq blazegraph spa docker

spfy's Introduction


Spfy: a platform for predicting subtypes from E. coli whole-genome sequences and building graph data for population-wide comparative analyses.

Published as: Le,K.K., Whiteside,M.D., Hopkins,J.E., Gannon,V.P.J., Laing,C.R. Spfy: an integrated graph database for real-time prediction of bacterial phenotypes and downstream comparative analyses. Database (2018) Vol. 2018: article ID bay086; doi:10.1093/database/bay086

Live: https://lfz.corefacility.ca/superphy/spfy/

screenshot of the results page

Use:

  1. Install Docker (and Docker Compose separately if you're on Linux, link). Mac/Windows users have Compose bundled with Docker Engine.
  2. git clone --recursive https://github.com/superphy/spfy.git
  3. cd spfy/
  4. docker-compose up
  5. Visit http://localhost:8090
  6. Eat cake 🍰

Submodule Build Statuses:

  • ECTyper
  • PanPredic
  • Docker Image for Conda

Stats:

Comparing different population groups:

Overall Performance

Runtimes of subtyping modules:

Runtimes of individual analyses

CLI: Generate Graph Files:

  • If you wish to only create rdf graphs (serialized as turtle files):
  1. First install miniconda and activate the environment from https://raw.githubusercontent.com/superphy/docker-flask-conda/master/app/environment.yml
  2. cd into the app folder (where RQ workers typically run from): cd app/
  3. Run savvy.py like so: python -m modules/savvy -i tests/ecoli/GCA_001894495.1_ASM189449v1_genomic.fna, where the argument after -i is your genome (FASTA) file.

CLI: Generate Ontology:

screenshot of the results page

The ontology for Spfy is available at: https://raw.githubusercontent.com/superphy/backend/master/app/scripts/spfy_ontology.ttl. It was generated using https://raw.githubusercontent.com/superphy/backend/master/app/scripts/generate_ontology.py with shared functions from Spfy's backend code. If you wish to run it:

  1. cd app/
  2. python -m scripts/generate_ontology

which will put the ontology in app/.

You can generate a pretty diagram from the .ttl file using http://www.visualdataweb.de/webvowl/

CLI: Enqueue Subtyping Tasks w/o Reactapp:

Note

currently set up for .fna files only

You can bypass the front-end website and still enqueue subtyping jobs by:

  1. First, mount the host directory with all your genome files to /datastore in the containers.

For example, if you keep your files at /home/bob/ecoli-genomes/, you'd edit the docker-compose.yml file and replace:

volumes:
- /datastore

with:

volumes:
- /home/bob/ecoli-genomes:/datastore
  2. Then take down your docker composition (if it's up) and restart it:
docker-compose down
docker-compose up -d
  3. Shell into your webserver container (though the worker containers would work too) and run the script:
docker exec -it backend_webserver_1 sh
python -m scripts/sideload
exit

Note that residual files may be created in your genome folder.

Architecture:

screenshot of the results page

| Docker Image | Ports | Name | Description |
| --- | --- | --- | --- |
| backend-rq | 80/tcp, 443/tcp | backend_worker_1 | the main redis queue workers |
| backend-rq-blazegraph | 80/tcp, 443/tcp | backend_worker-blazegraph-ids_1 | handles Spfy ID generation for the Blazegraph database |
| backend | 0.0.0.0:8000->80/tcp, 443/tcp | backend_web-nginx-uwsgi_1 | the Flask backend which handles enqueueing tasks |
| superphy/blazegraph:2.1.4-inferencing | 0.0.0.0:8080->8080/tcp | backend_blazegraph_1 | Blazegraph database |
| redis:3.2 | 6379/tcp | backend_redis_1 | Redis database |
| reactapp | 0.0.0.0:8090->5000/tcp | backend_reactapp_1 | front-end to Spfy |

Further Details:

The superphy/backend-rq:2.0.0 image is scalable: you can create as many instances as you need/have processing power for. The image is responsible for listening to the multiples queue (12 workers), which handles most of the tasks, including RGI calls, and to the singles queue (1 worker), which runs ECTyper. This split exists because RGI is the slowest part of the pipeline. Worker management is handled by supervisor.
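The queue split described above can be sketched as a simple routing rule. This is illustrative only: the queue names come from this README, but the `route_task` helper and task names are hypothetical, and the real dispatch lives in the Flask backend and supervisor configs.

```python
# Illustrative sketch of the queue layout: most tasks (including RGI)
# fan out over the "multiples" queue served by 12 workers, while
# ECTyper is serialized on the "singles" queue with 1 worker.
QUEUE_WORKERS = {'multiples': 12, 'singles': 1}

def route_task(task_name):
    """Pick a queue for a task: ECTyper is serialized, the rest fan out."""
    return 'singles' if task_name == 'ectyper' else 'multiples'
```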

The superphy/backend-rq-blazegraph:2.0.0 image is not scalable: it is responsible for querying the Blazegraph database for duplicate entries and for assigning spfyIDs in sequential order. Its functions are kept as minimal as possible to improve performance (ID generation is the one bottleneck in otherwise parallel pipelines): comparisons are done by sha1 hashes of the submitted files, and non-duplicates have their IDs reserved by linking the generated spfyID to the file hash. Worker management is handled by supervisor.
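A minimal sketch of the hash-based deduplication and sequential ID reservation described above, with an in-memory dict standing in for the Blazegraph lookups the real worker performs (class and function names are hypothetical):

```python
import hashlib

def sha1_of_file(path):
    """Hash a submitted genome file in chunks, as large FASTA files warrant."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

class SpfyIdRegistry:
    def __init__(self):
        self._by_hash = {}   # sha1 hex digest -> spfyID
        self._next_id = 1    # spfyIDs are assigned in sequential order

    def reserve(self, digest):
        """Return the existing spfyID for a duplicate, or reserve a new one."""
        if digest not in self._by_hash:
            self._by_hash[digest] = self._next_id
            self._next_id += 1
        return self._by_hash[digest]
```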

The superphy/backend:2.0.0 image, which runs the Flask endpoints, uses supervisor to manage its inner processes: nginx and uWSGI.

Blazegraph:

  • We are currently running Blazegraph version 2.1.4. If you want to run Blazegraph separately, please use the same version; otherwise there may be problems with endpoint URLs/returns (notably with version 2.1.1). See #63. Alternatively, modify the endpoint accordingly under database['blazegraph_url'] in /app/config.py

Contributing:

Steps required to add new modules are documented in the Developer Guide.

spfy's People

Contributors: chadlaing, jamez-eh, kevinkle


spfy's Issues

return all job ids including dependencies in subtyping

Currently, we only return the end task (beautify normally, or datastruct if using bulk uploading). This means less checking is required by the server when handling blobids, but it is likely the cause of #142, and it adds significantly more code complexity.

Instead, we should return the ids for every job and check all of them when polling a blobid. This would effectively force the groupresults option in the subtyping card to be removed.
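The proposed polling rule could be sketched like this (statuses follow RQ's terminology; the `poll_blob` helper is hypothetical, and the real check would fetch each rq Job by id):

```python
def poll_blob(job_statuses):
    """Aggregate status over every job id tied to a blobid.

    job_statuses: list of 'queued'/'started'/'finished'/'failed' strings.
    Any failure surfaces immediately; 'complete' only once all jobs finish.
    """
    if any(s == 'failed' for s in job_statuses):
        return 'failed'
    if all(s == 'finished' for s in job_statuses):
        return 'complete'
    return 'pending'
```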

Bug where old api works for getting results but not new api

http://10.139.14.212:8000/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea works, but
http://10.139.14.212:8000/api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea returns a 500 error (isa)

[2017-06-12 17:32:38,032] ERROR in app: Exception on /api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea [GET]
Traceback (most recent call last):
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1615, in full_dispatch_request
    return self.finalize_request(rv)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1630, in finalize_request
    response = self.make_response(rv)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/flask/app.py", line 1740, in make_response
    rv = self.response_class.force_type(rv, request.environ)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/werkzeug/wrappers.py", line 847, in force_type
    response = BaseResponse(*_run_wsgi_app(response, environ))
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/werkzeug/test.py", line 871, in run_wsgi_app
    app_rv = app(environ, start_response)
TypeError: 'list' object is not callable
[pid: 45|app: 0|req: 4914/7477] 10.139.14.104 () {44 vars in 822 bytes} [Mon Jun 12 17:32:38 2017] GET /api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea => generated 291 bytes in 9 msecs (HTTP/1.1 500) 3 headers in 116 bytes (1 switches on core 0)
10.139.14.104 - - [12/Jun/2017:17:32:38 +0000] "GET /api/v0/results/5459efe1-2c8c-4bde-b12c-1d0947dc1aea HTTP/1.1" 500 291 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" "-"

create a task in RQ that runs at the end and cleans up all other tasks

should also upload the genome file as bytes to the genome object, as suggested by @chadlaing

looks like multi-job dependencies may be added pretty soon to RQ (rq/rq#260), which would make this much easier

this would also partly address #94, as we could then overwrite the initial blobid:{(dict of jobids)} with something like {blobid: {status: complete, result: (someresult)}}. This is necessary because in https://github.com/superphy/backend/tree/147-return-all-jobids we start returning all jobids instead of only the end task (which has a ttl=-1), and status checking for dependent tasks fails when they hit their ttl

problem where after ~5 mins flask reports the job as not found

  • the blob id is okay; checked via docker exec -it backend_redis_1 sh, then redis-cli -h redis -p 6379 GET blob-1068307098500090648
  • figured out the problem:
    • not all tasks are stored forever - namely dependencies
      • so when they expire, a job-not-found error is thrown even though the main tasks are finished

create two endpoints, one just for status checking and the other for getting results

something like:

  1. /status/<job_id>, which simply returns pending, failed, or complete
  2. /results/<job_id>, which returns either job.exc_info (failed) or job.result (complete)
    The job id will be the same; depending on which endpoint is hit, flask will return 1 or 2.
    This would eliminate a lot of data transfer that is currently going on unnecessarily.
    Alternative option for 1: could return failed + job.exc_info, since it won't be that large.
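Sketched as plain functions (the real endpoints would be Flask routes resolving the job id via RQ; the dict-shaped job stand-in is illustrative):

```python
# /status/<job_id>: cheap status check, no payload transferred.
def status_endpoint(job):
    if job.get('failed'):
        return 'failed'
    return 'complete' if job.get('result') is not None else 'pending'

# /results/<job_id>: traceback on failure, otherwise the full result,
# so large payloads only move when explicitly requested.
def results_endpoint(job):
    if job.get('failed'):
        return job.get('exc_info')
    return job.get('result')
```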

report run-time as metadata to jobs

  • this will let us compare (in a more general fashion, given differing cpu usage at the time) different modules, for example two serotyping modules
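One way this might be done, sketched with a hypothetical decorator; in a real RQ worker the elapsed time would be written into the job's metadata (e.g. via rq's get_current_job(), job.meta, and job.save()):

```python
import time
from functools import wraps

def report_runtime(fn):
    """Time a task body so module run-times can be compared later."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        elapsed = time.time() - start
        # In a real worker: job = rq.get_current_job();
        # job.meta['runtime_s'] = elapsed; job.save()
        wrapper.last_runtime = elapsed
        return result
    return wrapper
```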

create a service to run blazegraph on corefacility

as required by #159
currently running in a screen session in /Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing using the command: java -server -Xmx4g -Dbigdata.propertyFile=/Warehouse/Users/claing/superphy/spfy/docker-blazegraph/2.1.4-inferencing/RWStore.properties -jar blazegraph.jar

address difference between faldo:Position and faldo:position

matt_whiteside [3:13 PM] 
@kevin faldo:Position is an owl:Class for labelling object types, but faldo:position is a owl:DatatypeProperty for linking bnodes to position literals


[3:13] 
same word, different capitalizations


kevin
[3:14 PM] 
ya, this is how its defined in the ontology if im not mistaken


matt_whiteside [3:14 PM] 
i noticed the Position is used when linking to the integer literals


kevin [3:14 PM] 
in theory rdf is cap-null but this doesnt seem to pan out in practice


matt_whiteside [3:14 PM] 
in the code


[3:15] 
so blazegraph doesn't distinguish between Position and position?


kevin
[3:16 PM] 
mm I think it does


[3:16] 
it shouldnt by definition of the rdf spec


[3:16] 
though*


[3:16] 
and ya we use


[3:16] 
```
graph.add((bnode_start, gu('faldo:Position'),
           Literal(gene_record['START'])))
graph.add((bnode_end, gu('faldo:Position'),
           Literal(gene_record['STOP'])))
```


[3:16] 
which are numerical literals


matt_whiteside [3:16 PM] 
ya, thats what i was getting at


kevin
[3:16 PM] 
and
```
graph.add((bnode_start, gu('rdf:type'), gu('faldo:Position')))
graph.add((bnode_start, gu('rdf:type'), gu('faldo:ExactPosition')))
graph.add((bnode_end, gu('rdf:type'), gu('faldo:Position')))
graph.add((bnode_end, gu('rdf:type'), gu('faldo:ExactPosition')))
```


[3:16] 
which are classes


[3:17] 
im not sure if/how we should change this, would have to refer back to the faldo spec


matt_whiteside [3:18 PM] 
For:
```
graph.add((bnode_start, gu('faldo:position'),
           Literal(gene_record['START'])))
```
its lower case


[3:18] 
and  
```graph.add((bnode_start, gu('rdf:type'), gu('faldo:Position')))```
uppercase


[3:20] 
also, do you need both lines:
```
graph.add((bnode_start, gu('rdf:type'), gu('faldo:Position')))
graph.add((bnode_start, gu('rdf:type'), gu('faldo:ExactPosition')))
```


[3:20] 
isn't ExactPosition a subclass of Position


kevin
[3:21 PM] 
hmm


[3:21] 
i don’t know if i ever defined that relation explicitly


[3:21] 
that would work though


matt_whiteside [3:22 PM] 
its defined in FALDO


kevin
[3:22 PM] 
faldo the ontology isnt included for performance reasons


matt_whiteside [3:22 PM] 
ohh ok


[3:22] 
then maybe it is needed
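To summarize the thread: the lowercase faldo:position (an owl:DatatypeProperty) should link a bnode to the integer literal, while the uppercase faldo:Position and faldo:ExactPosition (owl:Classes, the latter a subclass of the former in FALDO) should only appear as rdf:type objects. A sketch with plain tuples standing in for rdflib graph.add() calls (the `position_triples` helper is hypothetical):

```python
def position_triples(bnode, coord):
    """Triples for one genomic coordinate, per the FALDO convention above."""
    return [
        (bnode, 'rdf:type', 'faldo:Position'),
        (bnode, 'rdf:type', 'faldo:ExactPosition'),  # subclass of faldo:Position
        (bnode, 'faldo:position', coord),            # lowercase property -> literal
    ]
```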

"analyze this" module

Per @chadlaing 's comments:

See if it is possible within the framework of 10mins to do something like:
"We have an unknown sample, and spfy loaded with all public E. coli data. We can:
1) get serotype
2) get known VF and AMR
3) get pan-genome composition (eg. is it close to known reference strains? outbreak strains?)
4) identify Stx-type
5) identify markers for the unknown strain, or group of closely related strains

This shows the capabilities of spfy, but also the usefulness for PHAC"

This would have to wrap various existing modules:

  • subtyping for 1) & 2)
  • James' panseq code
  • fishers or phylotyper(Matt) to find closely related strains and then compute common markers (5)
  • a new module must be added for 4)

blazegraph struggles under load

        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:497)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
        at org.eclipse.jetty.server.HttpInputOverHTTP.blockForContent(HttpInputOverHTTP.java:66)
        at org.eclipse.jetty.server.HttpInput$1.waitForContent(HttpInput.java:456)
        at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:121)
        at org.apache.commons.io.input.BOMInputStream.read(BOMInputStream.java:286)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read(BufferedReader.java:182)
        at java.io.LineNumberReader.read(LineNumberReader.java:126)
        at java.io.FilterReader.read(FilterReader.java:65)
        at java.io.PushbackReader.read(PushbackReader.java:90)
        at org.openrdf.rio.turtle.TurtleParser.read(TurtleParser.java:1247)
        at org.openrdf.rio.turtle.TurtleParser.parseString(TurtleParser.java:764)
        at org.openrdf.rio.turtle.TurtleParser.parseQuotedString(TurtleParser.java:740)
        at org.openrdf.rio.turtle.TurtleParser.parseQuotedLiteral(TurtleParser.java:648)
        at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:626)
        at org.openrdf.rio.turtle.TurtleParser.parseObject(TurtleParser.java:502)
        at org.openrdf.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:428)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:421)
        at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:385)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:261)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:216)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:159)
        at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithBodyTask.call(InsertServlet.java:308)
        at com.bigdata.rdf.sail.webapp.InsertServlet$InsertWithBodyTask.call(InsertServlet.java:229)
        at com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
        at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:156)
        at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        ... 3 more

all jobs should route through grouped to avoid missing dependencies

If you submit a job without the options.groupresults flag being true, then flask returns a list of job ids, two of which are the QC and ID generation steps. However, reactapp doesn't check either of those (and, really, shouldn't have to) so if either fails, the actual analysis job is still returned as "pending".

We should store and check dependencies server-side to avoid this problem.
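A sketch of what server-side dependency tracking might look like, with an in-memory registry standing in for Redis (all names here are hypothetical):

```python
class DependencyRegistry:
    """Record each job's dependency ids (e.g. QC, ID generation) at enqueue time."""

    def __init__(self):
        self._deps = {}

    def register(self, job_id, dependency_ids):
        self._deps[job_id] = list(dependency_ids)

    def effective_status(self, job_id, statuses):
        """statuses: dict of job_id -> 'finished'/'failed'/'pending'.

        Surface a failed dependency instead of leaving the analysis
        job stuck at 'pending' forever.
        """
        for dep in self._deps.get(job_id, []):
            if statuses.get(dep) == 'failed':
                return 'failed'
        return statuses.get(job_id, 'pending')
```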

jobs that hit timeout don't return job.failed=True

for example:

 modules.amr.amr.amr('/datastore/2017-07-05-03-40-32-228811-GCA_900016125.1_EF467_contigs_genomic.fna') from backlog_multiples
e582dbda-e5e6-45b8-91f0-b376298158e7
Failed 35 minutes ago
Traceback (most recent call last):
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/worker.py", line 700, in perform_job
    rv = job.perform()
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/job.py", line 500, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "./modules/amr/amr.py", line 21, in amr
    '-o', outputname])
  File "/opt/conda/envs/backend/lib/python2.7/subprocess.py", line 168, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/opt/conda/envs/backend/lib/python2.7/subprocess.py", line 1073, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/opt/conda/envs/backend/lib/python2.7/subprocess.py", line 121, in _eintr_retry_call
    return func(*args)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/timeouts.py", line 51, in handle_death_penalty
    'value ({0} seconds)'.format(self._timeout))
JobTimeoutException: Job exceeded maximum timeout value (600 seconds)

But the if job.failed: check doesn't evaluate to True, as it normally does when the actual job fails.

We will have to look for a way to check whether a job hit its timeout.
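One possible check, sketched under the assumption that RQ stores the traceback in job.exc_info even when job.failed stays False (as in the traceback above); the `hit_timeout` helper and the job stand-in are hypothetical:

```python
def hit_timeout(job):
    """Detect a timed-out job by the JobTimeoutException in its stored traceback."""
    exc = getattr(job, 'exc_info', None) or ''
    return 'JobTimeoutException' in exc
```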
