sourmash-bio / wort

A database for signatures of public genomic sources

Home Page: https://wort.sourmash.bio

License: Other

Python 39.64% Mako 0.45% Dockerfile 3.02% Rust 34.13% CSS 10.24% HTML 4.10% Shell 0.81% Makefile 0.63% Nix 6.99%

minhash bioinformatics bioinformatics-databases webservice sourmash

wort's Issues

SRR7130925 doesn't match regex pattern

Hello! I tried out the wort API to download this paper's data, which has a bioproject and SRR id (SRR7130925).

Searching for SRR7130925 at the wort api (https://wort.oxli.org/v1/ui/#!/default/wort_blueprints_compute_views_compute_sra) results in this invalid pattern error:

{
  "detail": "'SRR7130925' does not match '^\\\\w{3}\\\\d{6}$'\n\nFailed validating 'pattern' in schema:\n    {'description': 'SRA ID for a dataset',\n     'in': 'path',\n     'name': 'sra_id',\n     'pattern': '^\\\\w{3}\\\\d{6}$',\n     'type': 'string'}\n\nOn instance:\n    'SRR7130925'",
  "status": 400,
  "title": "Bad Request",
  "type": "about:blank"
}

SRR7130925 is \w{3}\d{7} rather than \w{3}\d{6} as in the error message.
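The mismatch can be reproduced outside the API with a few lines of Python. This is a minimal sketch: `STRICT` is the pattern quoted in the error message, while `RELAXED` is only a hypothetical loosening (six or more digits) used for illustration, not necessarily the fix wort adopted.

```python
import re

# Pattern quoted in the error message: three word characters, exactly six digits.
STRICT = re.compile(r"^\w{3}\d{6}$")
# Hypothetical relaxed pattern (an assumption for illustration): six or more digits.
RELAXED = re.compile(r"^\w{3}\d{6,}$")


def accepts(pattern, accession):
    """Return True if the accession matches the given compiled pattern."""
    return pattern.match(accession) is not None


for acc in ("DRR013902", "SRR7130925"):
    print(acc, accepts(STRICT, acc), accepts(RELAXED, acc))
# DRR013902 True True
# SRR7130925 False True
```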

Schemathesis for API testing

https://github.com/kiwicom/schemathesis

Support project file downloads

Hello,
I've been using bionode-ncbi to get SRR IDs from a PRJ** ID so I don't have to look up each file individually (example below). Would it be possible to run a similar query in wort?
Warmest,
Olga

$ docker run -it bionode/bionode-ncbi bash
root@3a16736030c9:/# bionode-ncbi search sra PRJEB23372 | wc -l
27
root@3a16736030c9:/# bionode-ncbi search sra PRJEB23372 | head -n 3
{"uid":"4915577","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"5615688","total_bases":"1403922000","total_size":"506152479","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248648","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996171","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371193"},"runs":{"Run":[{"acc":"ERR2192533","total_spots":"5615688","total_bases":"1403922000","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
{"uid":"4915578","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"6481103","total_bases":"1620275750","total_size":"576542121","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248649","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996172","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371194"},"runs":{"Run":[{"acc":"ERR2192534","total_spots":"6481103","total_bases":"1620275750","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
{"uid":"4915579","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"5274357","total_bases":"1318589250","total_size":"493826312","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248650","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996173","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371195"},"runs":{"Run":[{"acc":"ERR2192535","total_spots":"5274357","total_bases":"1318589250","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}

Calculating SAC on metagenome clusters

@luizirber, one more thing for today (not intending to distract you): it would be really interesting if you could calculate the species accumulation curve (SAC) for hash sets in clusters of metagenomes in your monster wort database.

For example, when looking at soil metagenomes as a cluster, you could build a matrix of hashes (such as here), calculate different orders of intersection between hash sets from the soil metagenomes, and then plot an SAC from the hashes. While this might be impossible with k-mers, and species tallies are corrupted by incomplete annotation due to incomplete databases, hashes might give you a chance to get an accurate SAC by incrementally adding hash sets and plotting the change in the intersection sets. See equation 3 in this paper for a definitive explanation.

Then you could efficiently use all the data in the SRA and JGI databases to estimate whether the species count based on current soil metagenomes is "open" (SAC fits a power law function) or "closed" (SAC fits an exponential function), that is, whether or not we've collected enough data to estimate an asymptote for the number of species (in this case using hashes as a proxy) in soil metagenomes (or some other interesting biome). Although I'm not a soil biologist, I think that's a major question in their field. Other biomes might be interesting too. Not sure if anyone has tried this with raw k-mers, but it would seem too gargantuan a task. Hashes might make this problem tractable?
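As a rough illustration of the idea, here is a minimal sketch of an accumulation curve over hash sets. It uses the simplest formulation (size of the running union as each sample is added), not the intersection-based estimate from equation 3 of the paper mentioned above, and the toy integer hashes are invented for the example. The result depends on the order of the samples, so a real analysis would typically average over many random orderings.

```python
def accumulation_curve(hash_sets):
    """Return the number of distinct hashes seen after adding each set in order."""
    seen = set()
    curve = []
    for hashes in hash_sets:
        seen |= set(hashes)
        curve.append(len(seen))
    return curve


# Toy "signatures": each set stands in for the hashes of one metagenome.
samples = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5, 6}, {1, 6}]
print(accumulation_curve(samples))
# [3, 4, 6, 6]
```

A curve that keeps climbing as samples are added would suggest an "open" collection, while one that flattens out would suggest a "closed" one.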

Create a feed for daily added datasets

Using the computed field from the models

some signatures that have been suppressed by RefSeq/GenBank are not in wort

For example, the following accessions have been "suppressed" in RefSeq and GenBank, and do not appear in wort (/group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs/<accession>.sig). I checked newer version numbers and still did not find these accessions (e.g. GCA_900474135*sig). Looking on GenBank, I see the record has been suppressed/removed: https://www.ncbi.nlm.nih.gov/assembly/GCF_900474135.2/

GCA_900474135.2
GCA_002798115.1
GCA_001308105.1

I couldn't find why these records were removed, but thought I'd document this here!

Describe token auth in the API once connexion supports it

wait for spec-first/connexion#732 to be merged and released

Add GenBank/RefSeq historical data?

For now only keeping the latest version for each assembly, but might be useful to add previous versions (if data is still available) too, especially because other databases (like GTDB) might be using older versions of an assembly.

lighthouse ci for frontend

https://github.com/treosh/lighthouse-ci-action

alternatives to keyring crate

It's hard to run on Linux without an X server.

hwchen/keyring-rs#25
hwchen/secret-service-rs#14
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=690530

make ascii diagrams compatible with svgbob

https://github.com/ivanceras/svgbob

Use aspera for faster downloads

See AlexsLemonade/refinebio#109 for examples on how to add to docker workers

Sharing indices

make it bagit-compatible, similar plans in sourmash-bio/sourmash#991
set the fetch.txt file with the URLs for all the signatures, but don't include the full data (it would be more than 2 TB...).
- It would be pretty cool to have IPFS addresses in fetch.txt...
By 'indices' here I mean SBT (without internal nodes, tree structure only) or a Linear index. LCA can be computed locally, tho.
Check if building a taxinfo for SRA datasets is viable, and include in the index.
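A sketch of what generating fetch.txt could look like, assuming the BagIt convention of one `url length filepath` entry per line, with `-` when the length is unknown and payload paths under `data/`. The helper name and the example URLs are hypothetical.

```python
def fetch_lines(entries):
    """Build fetch.txt lines from (url, size_in_bytes_or_None, filename) tuples."""
    lines = []
    for url, size, name in entries:
        # BagIt allows "-" when the length of the file is not known.
        length = "-" if size is None else str(size)
        lines.append(f"{url} {length} data/{name}")
    return lines


# Hypothetical URLs and sizes, for illustration only.
entries = [
    ("https://example.org/sigs/SRR7130925.sig", 12345, "SRR7130925.sig"),
    ("https://example.org/sigs/DRR013902.sig", None, "DRR013902.sig"),
]
print("\n".join(fetch_lines(entries)))
```

Because only the URLs are listed, the bag itself stays small even though the signatures add up to terabytes.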

Update dockerfiles

follow https://osf.io/fsd7t/

Track task results for datasets

At the moment the Celery task results are saved to S3 but removed without extra checks. It might be useful to add an extra model and connect it to Dataset (in the compute task case) to check which datasets are failing (and why they are failing).

analysis tab in metagenomic SRA results

The NCBI SRA Taxonomy Analysis Tool (STAT) is available in the 'analysis' tab in SRA results (alpha version). Example: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR929453
More info: sourmash-bio/sourmash#299

Create a webextension that adds another tab with the sourmash gather results for the same dataset. BioJupies does something similar, see MaayanLab/biojupies#1

Save some basic metadata about datasets in the DB

The current process to show information if a dataset is available goes like this:

on compute: check whether the dataset is already calculated (available in an S3 bucket); if it is, set an entry in the cache (to avoid checking S3 all the time).
on view: check the cache to see if the dataset was calculated. If the cache entry is not set, don't render the page with info and show a generic "dataset not available" page instead.

The main issue is that, since view doesn't check S3 and the cache is only updated on compute, compute has to be triggered twice. IPFS addresses are also not automated yet, so there is a script to update the cache after the addresses are calculated.

Because this is all in cache, information is lost from time to time, and I need to refresh the cache periodically.

The initial idea was to avoid the need for a DB, but since the cache is acting like a DB it might be better to actually define a model and save that information in postgres (which is used only for auth currently). Having some extra info about datasets (the original size, for example) is also needed for selecting which compute queue to use (see #24) and avoid hammering the SRA API for the info.
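The flow described above can be mimicked with a toy model to show why compute ends up being triggered twice. This is a sketch only: a set stands in for the S3 bucket, a dict for the cache, and the task is assumed to finish instantly. None of the names come from the wort code base.

```python
storage = set()  # stands in for the S3 bucket holding computed signatures
cache = {}       # stands in for the cache consulted by the view


def compute(dataset_id):
    """Set the cache entry only if the result is already in storage."""
    if dataset_id in storage:
        cache[dataset_id] = True
        return "already computed"
    # The real system would queue a task; here it "finishes" immediately.
    storage.add(dataset_id)
    return "submitted"


def view(dataset_id):
    """Look only at the cache, never at storage."""
    return "info page" if cache.get(dataset_id) else "dataset not available"


print(compute("SRR7130925"))  # submitted
print(view("SRR7130925"))     # dataset not available
print(compute("SRR7130925"))  # already computed
print(view("SRR7130925"))     # info page
```

Recording the dataset in a database row when the task finishes would remove the need for the second compute call.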

Add JGI assemblies?

Pointers for fungi in https://github.com/luizirber/2017-jgi-download

Check Stark for replacing postgres?

https://github.com/will-rowe/stark

Split compute queues

Some datasets are too large to fit into the allocated times for workers in some infra (HPC, mostly). Split the celery queues by size, and update HPC workers to only calculate small datasets (< 300?) or at most medium-sized (300-1600), but only connect workers to the large queue if they can run for a long time.
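A minimal sketch of the routing rule, using the thresholds suggested above. The issue does not state the unit for the sizes, and the queue names are hypothetical.

```python
def pick_queue(size, small_limit=300, medium_limit=1600):
    """Choose a queue name from the dataset size (same unit as the limits)."""
    if size < small_limit:
        return "compute_small"
    if size <= medium_limit:
        return "compute_medium"
    return "compute_large"


for size in (120, 300, 1600, 5000):
    print(size, pick_queue(size))
# 120 compute_small
# 300 compute_medium
# 1600 compute_medium
# 5000 compute_large
```

HPC workers would then subscribe only to the small (and perhaps medium) queue, while long-running workers also take the large one.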

frontend: fixed size download progress

https://twitter.com/kentcdodds/status/1166698278714396672

Use terraform for infra deploy?

https://github.com/yoshuawuyts/server
https://www.terraform.io/intro/index.html
https://github.com/dmathieu/byovpn

requester pays buckets

Make the signature buckets on S3 'requester pays', providing direct access to the data.

This WON'T BE OFFICIALLY SUPPORTED; make that clear whenever there is any suggestion to use it (don't want to tie infra to AWS).

Stop working if no tasks available

When running on the HPC, terminate the job early if there are no tasks available to run (to avoid holding resources that other jobs could use).
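The behaviour could look roughly like the loop below. This is a generic sketch built on Python's standard `queue` module rather than Celery's own worker API, and the function name and timeout values are assumptions.

```python
import queue
import time


def run_worker(tasks, idle_timeout=1.0, poll_interval=0.1):
    """Process callables from the queue; stop once idle for idle_timeout seconds."""
    processed = []
    idle_since = time.monotonic()
    while True:
        try:
            task = tasks.get(timeout=poll_interval)
        except queue.Empty:
            # Nothing arrived during this poll; give up once idle for too long.
            if time.monotonic() - idle_since >= idle_timeout:
                break
            continue
        processed.append(task())
        idle_since = time.monotonic()  # reset the idle clock after real work
    return processed


tasks = queue.Queue()
tasks.put(lambda: "sig-1")
tasks.put(lambda: "sig-2")
print(run_worker(tasks, idle_timeout=0.3, poll_interval=0.05))
# ['sig-1', 'sig-2']
```

On an HPC node the function returning would let the job script exit and release its allocation.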

running a gubaphage query

Hi Luiz, Titus,

I was wondering if you could please run a Wort query across all metagenomes for the Serratus people? I seem to have understood from your talk that the best way to ask for it is through Github issues, but let me know if you prefer another channel. The goal here is to search for gubaphages in metagenomes. I'm attaching both DNA and protein signatures: Gubaphage_genomes.sigs.zip that I computed for a collection (multifasta) of gubaphages as follows:

sourmash compute --dna Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.dna.sig
sourmash compute -k 6,9,15  --protein Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.protein.sig

thanks in advance!
Rayan

Is the website down? Everything is timing out

I tried looking at the download link for the example SRA entry: https://wort.oxli.org/v1/view/sra/DRR013902 but am getting the following:

Same thing happens for pretty much any SRA, IMG, or NCBI assembly I try.
Are the servers overloaded or the like?