sourmash-bio / wort
A database for signatures of public genomic sources
Home Page: https://wort.sourmash.bio
License: Other
Hello! I tried out the wort API to download this paper's data, which has a bioproject and SRR id (SRR7130925).
Searching for SRR7130925 at the wort api (https://wort.oxli.org/v1/ui/#!/default/wort_blueprints_compute_views_compute_sra) results in this invalid pattern error:
{
"detail": "'SRR7130925' does not match '^\\\\w{3}\\\\d{6}$'\n\nFailed validating 'pattern' in schema:\n {'description': 'SRA ID for a dataset',\n 'in': 'path',\n 'name': 'sra_id',\n 'pattern': '^\\\\w{3}\\\\d{6}$',\n 'type': 'string'}\n\nOn instance:\n 'SRR7130925'",
"status": 400,
"title": "Bad Request",
"type": "about:blank"
}
SRR7130925 matches `\w{3}\d{7}` rather than the `\w{3}\d{6}` pattern in the error message.
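A minimal sketch of one possible fix, assuming SRA run accessions are three letters followed by six *or more* digits (the exact upper bound on digits would need checking against real accession ranges):

```python
import re

# Current schema pattern: exactly 6 digits, so it rejects SRR7130925 (7 digits).
OLD_PATTERN = re.compile(r"^\w{3}\d{6}$")

# Relaxed pattern: allow 6 or more digits after the three-letter prefix.
NEW_PATTERN = re.compile(r"^\w{3}\d{6,}$")

assert OLD_PATTERN.match("SRR7130925") is None
assert NEW_PATTERN.match("SRR7130925") is not None
assert NEW_PATTERN.match("DRR013902") is not None  # 6-digit accessions still match
```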
Hello,
I've been using bionode-ncbi to get SRR IDs from PRJ** IDs so I don't have to look for each file individually (example below). Would it be possible to do a similar query in wort?
Warmest,
Olga
(base)
Fri 9 Nov - 14:01 ~
docker run -it bionode/bionode-ncbi bash
root@3a16736030c9:/# bionode-ncbi search sra PRJEB23372 | wc -l
27
root@3a16736030c9:/# bionode-ncbi search sra PRJEB23372 | head -n 3
{"uid":"4915577","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"5615688","total_bases":"1403922000","total_size":"506152479","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248648","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996171","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371193"},"runs":{"Run":[{"acc":"ERR2192533","total_spots":"5615688","total_bases":"1403922000","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
{"uid":"4915578","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"6481103","total_bases":"1620275750","total_size":"576542121","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248649","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996172","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371194"},"runs":{"Run":[{"acc":"ERR2192534","total_spots":"6481103","total_bases":"1620275750","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
{"uid":"4915579","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"5274357","total_bases":"1318589250","total_size":"493826312","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248650","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996173","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371195"},"runs":{"Run":[{"acc":"ERR2192535","total_spots":"5274357","total_bases":"1318589250","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
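The run accessions wort would need can be pulled straight out of that JSON output. A minimal sketch using only the standard library, assuming each line follows the record layout shown above:

```python
import json

def run_accessions(ndjson_lines):
    """Extract SRA run accessions (e.g. ERR2192533) from bionode-ncbi NDJSON output."""
    accessions = []
    for line in ndjson_lines:
        record = json.loads(line)
        # Each record nests its runs under expxml -> runs -> Run (a list).
        for run in record["expxml"]["runs"]["Run"]:
            accessions.append(run["acc"])
    return accessions

# Trimmed-down record in the same shape as the output above:
sample = '{"uid":"4915577","expxml":{"runs":{"Run":[{"acc":"ERR2192533"}]}}}'
print(run_accessions([sample]))  # ['ERR2192533']
```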
@luizirber, one more thing for today (not intending to distract you), it would be really interesting if you could calculate the species accumulation curve (SAC) for hash sets in clusters of metagenomes in your monster wort database. For example, when looking at soil metagenomes as a cluster, you could build a matrix of hashes (such as here), calculate different orders of intersection between hash sets from the soil metagenomes, and then plot an SAC from the hashes. While this might be impossible with kmers, and species tallies are corrupted by incomplete annotation due to incomplete databases, hashes might give you a chance to get an accurate SAC based on plotting the effect of incrementally adding hash sets and seeing the change in intersection sets. See equation 3 in this paper for a definitive explanation. Then you could efficiently use all the data in the SRA and JGI dbs to estimate if the species count based on current soil metagenome is "open" (SAC fits a power law function) or "closed" (SAC fits an exponential function), that is, whether or not we've collected enough data to estimate an asymptote for the number of species (in this case using hashes as a proxy) in soil metagenomes (or some other interesting biome). Although I'm not a soil biologist, I think that's a major question in their field. Other biomes might be interesting too. Not sure if anyone has tried this with raw kmers, but it would seem too gargantuan of a task. Hashes might make this problem tractable?
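As a toy illustration of the idea (hypothetical hash sets, not wort data): accumulate the union of hashes as each metagenome's hash set is added, recording its size at each step; the resulting curve is what would be fit against power-law ("open") vs. exponential ("closed") forms.

```python
def accumulation_curve(hash_sets):
    """Cumulative number of distinct hashes as each metagenome's hash set is added."""
    seen = set()
    curve = []
    for hashes in hash_sets:
        seen |= hashes
        curve.append(len(seen))
    return curve

# Three hypothetical metagenome hash sets with partial overlap:
metagenomes = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
print(accumulation_curve(metagenomes))  # [3, 4, 6]
```

A real SAC would average over many random orderings of the sets; this sketch shows a single ordering.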
Using the `computed` field from the models
For example, the following accessions have been "suppressed" in RefSeq and GenBank, and do not appear in wort (`/group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs/<accession>.sig`). I checked newer version numbers and still did not find these accessions (e.g. `GCA_900474135*sig`). Looking on GenBank, I see the record has been suppressed/removed: https://www.ncbi.nlm.nih.gov/assembly/GCF_900474135.2/
GCA_900474135.2
GCA_002798115.1
GCA_001308105.1
I couldn't find why these records were removed, but thought I'd document this here!
wait for spec-first/connexion#732 to be merged and released
For now we only keep the latest version of each assembly, but it might be useful to also add previous versions (if the data is still available), especially because other databases (like GTDB) might be using older versions of an assembly.
It's hard to run on Linux without an X server.
hwchen/keyring-rs#25
hwchen/secret-service-rs#14
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=690530
See AlexsLemonade/refinebio#109 for examples on how to add to docker workers
Provide a `fetch.txt` file with the URLs for all the signatures, but don't include the full data (it would be more than 2 TB...).
For the `fetch.txt` format, follow https://osf.io/fsd7t/
At the moment the Celery task results are being saved to S3 but removed without extra checks. It might be useful to add an extra model and connect it to `Dataset` (in the `compute` task case) to check which datasets are failing (and why they are failing).
The NCBI SRA Taxonomy Analysis Tool (STAT) is available in the 'analysis' tab in SRA results (alpha version). Example: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR929453
More info: sourmash-bio/sourmash#299
Create a webextension that adds another tab with the sourmash gather results for the same dataset. BioJupies does something similar, see MaayanLab/biojupies#1
The current process to show whether a dataset is available goes like this:
- `compute`: checks if the dataset is already calculated (available in an S3 bucket); if it is, sets an entry in the cache (to avoid checking S3 all the time).
- `view`: checks the cache to see if the dataset was calculated. If the cache is not set, it doesn't render the page with info, showing a generic "dataset not available" page.

The main issue is that, since `view` doesn't check S3 and the update only happens on `compute`, it needs to trigger `compute` twice. IPFS addresses are also not automated yet, so there is a script to update the cache after the addresses are calculated.
Because this is all in cache, information is lost from time to time, and I need to refresh the cache periodically.
The initial idea was to avoid the need for a DB, but since the cache is acting like a DB it might be better to actually define a model and save that information in postgres (which is used only for auth currently). Having some extra info about datasets (the original size, for example) is also needed for selecting which compute queue to use (see #24) and avoid hammering the SRA API for the info.
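A sketch of what such a model could track. The field names here are hypothetical, and in the app this would be a real ORM model backed by postgres rather than a dataclass:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DatasetInfo:
    """Per-dataset state that currently lives only in the cache (hypothetical fields)."""
    accession: str                          # e.g. an SRA run accession
    size_bytes: Optional[int] = None        # original size, for picking a compute queue
    computed: bool = False                  # signature already available in S3?
    ipfs_address: Optional[str] = None      # filled in once the IPFS address is known
    last_checked: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

info = DatasetInfo(accession="DRR013902", size_bytes=120_000_000)
print(info.computed)  # False
```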
Pointers for fungi in https://github.com/luizirber/2017-jgi-download
Some datasets are too large to fit into the allocated times for workers on some infra (HPC, mostly). Split the celery queues by size, and update HPC workers to only calculate small datasets (< 300?) or at most medium-sized ones (300-1600); only connect workers to the large queue if they can run for a long time.
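A minimal sketch of the routing decision, using the thresholds from the note above; the units, exact cutoffs, and queue names are all assumptions:

```python
def choose_queue(size, small_max=300, medium_max=1600):
    """Pick a Celery queue name based on dataset size (same units as the thresholds)."""
    if size < small_max:
        return "compute_small"
    if size <= medium_max:
        return "compute_medium"
    # Only workers that can run for a long time should consume this queue.
    return "compute_large"

print(choose_queue(120))   # compute_small
print(choose_queue(800))   # compute_medium
print(choose_queue(5000))  # compute_large
```

HPC workers would then subscribe only to `compute_small` (and perhaps `compute_medium`), leaving `compute_large` to long-running workers.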
Make the signature buckets on S3 'requester pays', providing direct access to the data.
This WON'T BE OFFICIALLY SUPPORTED, and that should be made clear whenever there is any suggestion to use it (we don't want to tie the infra to AWS).
When running on the HPC, terminate the job early if there are no tasks available to run (to avoid holding resources that other jobs could use).
Hi Luiz, Titus,
I was wondering if you could please run a Wort query across all metagenomes for the Serratus people? I seem to have understood from your talk that the best way to ask for it is through Github issues, but let me know if you prefer another channel. The goal here is to search for gubaphages in metagenomes. I'm attaching both DNA and protein signatures: Gubaphage_genomes.sigs.zip that I computed for a collection (multifasta) of gubaphages as follows:
sourmash compute --dna Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.dna.sig
sourmash compute -k 6,9,15 --protein Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.protein.sig
thanks in advance!
Rayan
I tried looking at the download link for the example SRA entry (https://wort.oxli.org/v1/view/sra/DRR013902) but am getting an error:
Same thing happens for pretty much any SRA, IMG, or NCBI assembly I try.
Are the servers overloaded or the like?