sourmash-bio / wort

A database for signatures of public genomic sources

Home Page: https://wort.sourmash.bio

License: Other

Python 39.64% Mako 0.45% Dockerfile 3.02% Rust 34.13% CSS 10.24% HTML 4.10% Shell 0.81% Makefile 0.63% Nix 6.99%
minhash bioinformatics bioinformatics-databases webservice sourmash

wort's People

Contributors

bluegenes, dependabot[bot], luizirber

wort's Issues

SRR7130925 doesn't match regex pattern

Hello! I tried out the wort API to download this paper's data, which has a bioproject and SRR id (SRR7130925).

Searching for SRR7130925 in the wort API (https://wort.oxli.org/v1/ui/#!/default/wort_blueprints_compute_views_compute_sra) results in this invalid pattern error:

{
  "detail": "'SRR7130925' does not match '^\\\\w{3}\\\\d{6}$'\n\nFailed validating 'pattern' in schema:\n    {'description': 'SRA ID for a dataset',\n     'in': 'path',\n     'name': 'sra_id',\n     'pattern': '^\\\\w{3}\\\\d{6}$',\n     'type': 'string'}\n\nOn instance:\n    'SRR7130925'",
  "status": 400,
  "title": "Bad Request",
  "type": "about:blank"
}

SRR7130925 has three letters followed by seven digits, so it matches \w{3}\d{7} rather than the \w{3}\d{6} required by the pattern in the error message.
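The mismatch can be reproduced locally with a minimal sketch. `STRICT` is the pattern quoted in the error message; `RELAXED` is a hypothetical alternative shown only for illustration, not necessarily the fix wort adopted.

```python
import re

# Pattern reported in the error message: three word characters, then exactly six digits.
STRICT = re.compile(r"^\w{3}\d{6}$")

# Hypothetical relaxed pattern (an assumption, not wort's actual fix):
# three word characters followed by six or more digits.
RELAXED = re.compile(r"^\w{3}\d{6,}$")


def matches(pattern, accession):
    """Return True if the whole accession matches the pattern."""
    return pattern.match(accession) is not None


for accession in ("SRR713092", "SRR7130925"):
    print(accession, matches(STRICT, accession), matches(RELAXED, accession))
```

The six-digit accession passes both patterns, while the seven-digit SRR7130925 passes only the relaxed one.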

Support project file downloads

Hello,
I've been using bionode-ncbi to get SRR IDs from PRJ** IDs so I don't have to look for each file individually (example below). Would it be possible to do a similar query in wort?
Warmest,
Olga

$ docker run -it bionode/bionode-ncbi bash
root@3a16736030c9:/# bionode-ncbi search sra PRJEB23372 | wc -l
27
root@3a16736030c9:/# bionode-ncbi search sra PRJEB23372 | head -n 3
{"uid":"4915577","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"5615688","total_bases":"1403922000","total_size":"506152479","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248648","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996171","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371193"},"runs":{"Run":[{"acc":"ERR2192533","total_spots":"5615688","total_bases":"1403922000","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
{"uid":"4915578","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"6481103","total_bases":"1620275750","total_size":"576542121","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248649","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996172","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371194"},"runs":{"Run":[{"acc":"ERR2192534","total_spots":"6481103","total_bases":"1620275750","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
{"uid":"4915579","expxml":{"Summary":{"Title":"Illumina HiSeq 2000 paired end sequencing","Platform":{"_":"ILLUMINA","instrument_model":"Illumina HiSeq 2000"},"Statistics":{"total_runs":"1","total_spots":"5274357","total_bases":"1318589250","total_size":"493826312","load_done":"true","cluster_name":"public"}},"Submitter":{"acc":"ERA1127236","center_name":"","contact_name":"European Nucleotide Archive","lab_name":"European Nucleotide Archive"},"Experiment":{"acc":"ERX2248650","ver":"1","status":"public","name":"Illumina HiSeq 2000 paired end sequencing"},"Study":{"acc":"ERP105122","name":"Single-cell RNA-Sequencing reveals vesicle-mediated communication between mosquito blood cells"},"Organism":{"taxid":"7165","ScientificName":"Anopheles gambiae"},"Sample":{"acc":"ERS1996173","name":""},"Instrument":{"ILLUMINA":"Illumina HiSeq 2000"},"Library_descriptor":{"LIBRARY_NAME":"unspecified","LIBRARY_STRATEGY":"RNA-Seq","LIBRARY_SOURCE":"TRANSCRIPTOMIC","LIBRARY_SELECTION":"Oligo-dT","LIBRARY_LAYOUT":{"PAIRED":{"NOMINAL_LENGTH":"500"}},"LIBRARY_CONSTRUCTION_PROTOCOL":"Smart-seq2"},"Bioproject":"PRJEB23372","Biosample":"SAMEA104371195"},"runs":{"Run":[{"acc":"ERR2192535","total_spots":"5274357","total_bases":"1318589250","load_done":"true","is_public":"true","cluster_name":"public","static_data_available":"true"}]},"extlinks":"","createdate":"2018/01/06","updatedate":"2018/01/06"}
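For reference, the run accessions can be pulled out of records shaped like the output above with a few lines of Python. This is a sketch that assumes one JSON object per line with the accessions under `runs` → `Run` → `acc`, as in the sample; `run_accessions` is a hypothetical helper name, and nothing here is part of wort or bionode-ncbi.

```python
import json


def run_accessions(lines):
    """Yield run accessions from JSON-lines records shaped like the output above."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        runs = record.get("runs")
        # Assumption: records without runs may carry an empty string here.
        run_list = runs.get("Run", []) if isinstance(runs, dict) else []
        # Assumption: tolerate a single run given as an object instead of a list.
        if isinstance(run_list, dict):
            run_list = [run_list]
        for run in run_list:
            yield run["acc"]


# Trimmed copies of the first two records shown above.
sample = [
    '{"uid":"4915577","runs":{"Run":[{"acc":"ERR2192533"}]}}',
    '{"uid":"4915578","runs":{"Run":[{"acc":"ERR2192534"}]}}',
]
print(list(run_accessions(sample)))
```

Each accession could then be submitted to wort one at a time, which is the manual step this issue asks to avoid.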

Calculating SAC on metagenome clusters

@luizirber, one more thing for today (not intending to distract you): it would be really interesting if you could calculate the species accumulation curve (SAC) for hash sets in clusters of metagenomes in your monster wort database. For example, when looking at soil metagenomes as a cluster, you could build a matrix of hashes (such as here), calculate different orders of intersection between hash sets from the soil metagenomes, and then plot an SAC from the hashes.

While this might be impossible with k-mers, and species tallies are corrupted by incomplete annotation due to incomplete databases, hashes might give you a chance to get an accurate SAC by incrementally adding hash sets and plotting the change in the intersection sets. See equation 3 in this paper for a definitive explanation.

Then you could efficiently use all the data in the SRA and JGI databases to estimate whether the species count based on current soil metagenomes is "open" (SAC fits a power law function) or "closed" (SAC fits an exponential function), that is, whether or not we've collected enough data to estimate an asymptote for the number of species (in this case using hashes as a proxy) in soil metagenomes (or some other interesting biome). Although I'm not a soil biologist, I think that's a major question in their field. Other biomes might be interesting too. Not sure if anyone has tried this with raw k-mers, but it would seem too gargantuan a task. Hashes might make this problem tractable?
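A rough sketch of the idea, under simplifying assumptions: each metagenome is represented as a plain Python set of integer hashes, and the curve is the mean number of distinct hashes seen after adding 1..n sets, averaged over random orderings. This is a simplified stand-in and not the intersection-based formula (equation 3) from the linked paper; `accumulation_curve` is a hypothetical name.

```python
import random


def accumulation_curve(hash_sets, permutations=100, seed=0):
    """Mean number of distinct hashes after adding 1..n sets, over random orderings."""
    rng = random.Random(seed)
    totals = [0] * len(hash_sets)
    for _ in range(permutations):
        order = list(hash_sets)
        rng.shuffle(order)
        seen = set()
        for i, hashes in enumerate(order):
            seen |= hashes  # cumulative union of hashes added so far
            totals[i] += len(seen)
    return [total / permutations for total in totals]


# Toy hash sets standing in for three metagenome signatures.
toy = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
print(accumulation_curve(toy))
```

Whether the resulting curve looks "open" or "closed" would then be judged by fitting it against power law and exponential forms, which this sketch leaves out.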

some signatures that have been suppressed by RefSeq/GenBank are not in wort

For example, the following accessions have been "suppressed" in RefSeq and GenBank, and do not appear in wort (/group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs/<accession>.sig). I checked newer version numbers and still did not find these accessions (e.g. GCA_900474135*sig). Looking on GenBank, I see the record has been suppressed/removed: https://www.ncbi.nlm.nih.gov/assembly/GCF_900474135.2/

GCA_900474135.2
GCA_002798115.1
GCA_001308105.1

I couldn't find why these records were removed, but thought I'd document this here!

Add GenBank/RefSeq historical data?

For now wort only keeps the latest version of each assembly, but it might be useful to add previous versions too (if the data is still available), especially because other databases (like GTDB) might be using older versions of an assembly.

Sharing indices

  • make it bagit-compatible, similar plans in sourmash-bio/sourmash#991
  • set the fetch.txt file with the URLs for all the signatures, but don't include the full data (it would be more than 2 TB...).
    • It would be pretty cool to have IPFS addresses in fetch.txt...
  • By 'indices' here I mean SBT (without internal nodes, tree structure only) or a Linear index. LCA can be computed locally, though.
  • Check if building a taxinfo for SRA datasets is viable, and include in the index.
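As an illustration of the fetch.txt idea above: in BagIt, each line of fetch.txt has the form `url length filepath`, with `-` allowed when the length is unknown and the path placed under `data/`. The sketch below formats such lines; the URLs and file names are made up, `fetch_lines` is a hypothetical helper, and percent-encoding of whitespace in URLs or paths is not handled.

```python
def fetch_lines(entries):
    """Format (url, length, relative_path) triples as BagIt fetch.txt lines."""
    lines = []
    for url, length, path in entries:
        # BagIt allows "-" when the length in octets is unknown.
        size = "-" if length is None else str(length)
        lines.append(f"{url} {size} data/{path}")
    return lines


# Hypothetical signature locations, for illustration only.
entries = [
    ("https://example.org/sigs/SRR0000001.sig", 12345, "sigs/SRR0000001.sig"),
    ("https://example.org/sigs/SRR0000002.sig", None, "sigs/SRR0000002.sig"),
]
print("\n".join(fetch_lines(entries)))
```

A bag built this way carries only the index and the list of URLs, so the signatures themselves stay out of the archive.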

Track task results for datasets

At the moment the Celery task results are saved to S3 but removed without extra checks. It might be useful to add an extra model and connect it to Dataset (in the compute task case) to track which datasets are failing, and why.

Save some basic metadata about datasets in the DB

The current process to show information if a dataset is available goes like this:

  • on compute: check whether the dataset is already calculated (available in an S3 bucket); if it is, set an entry in the cache (to avoid checking S3 all the time).
  • on view: check the cache to see if the dataset was calculated. If the cache entry is not set, the page with info is not rendered and a generic "dataset not available" page is shown instead.

The main issue is that, since view doesn't check S3 and the cache update only happens on compute, compute needs to be triggered twice. IPFS addresses are also not automated yet, so there is a script to update the cache after the addresses are calculated.

Because this is all in cache, information is lost from time to time, and I need to refresh the cache periodically.

The initial idea was to avoid the need for a DB, but since the cache is acting like a DB it might be better to actually define a model and save that information in Postgres (which is currently used only for auth). Having some extra info about datasets (the original size, for example) is also needed for selecting which compute queue to use (see #24) and to avoid hammering the SRA API for the info.
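One possible shape for such a record, as a sketch only: the field names, the `lookup` helper, the in-memory `store` dict (standing in for a Postgres table) and the `exists_in_bucket` callable (standing in for an S3 check) are all hypothetical. The point is that a view can fall back to a single bucket check and persist the result, instead of relying on compute having populated a cache.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class DatasetInfo:
    # Hypothetical fields; the issue only names the original size and IPFS address.
    sra_id: str
    computed: bool = False
    size: Optional[int] = None
    ipfs: Optional[str] = None


def lookup(sra_id: str, store: Dict[str, DatasetInfo],
           exists_in_bucket: Callable[[str], bool]) -> Optional[DatasetInfo]:
    """Return the stored record, checking the bucket once if none is stored yet."""
    info = store.get(sra_id)
    if info is None and exists_in_bucket(sra_id):
        info = DatasetInfo(sra_id=sra_id, computed=True)
        store[sra_id] = info  # persist so later views skip the bucket check
    return info


store: Dict[str, DatasetInfo] = {}
print(lookup("SRR7130925", store, lambda _id: True))
```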

Split compute queues

Some datasets are too large to fit into the time allocated to workers on some infrastructure (HPC, mostly). Split the Celery queues by size, and update HPC workers to only calculate small datasets (< 300?) or at most medium-sized ones (300-1600); only connect workers to the large queue if they can run for a long time.
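The size split could look roughly like this. The thresholds are the tentative values from the issue (the unit is not stated there, so none is assumed here), and the queue names and `pick_queue` are hypothetical.

```python
def pick_queue(size, small_max=300, medium_max=1600):
    """Map a dataset size to a queue name using the thresholds floated above."""
    if size < small_max:
        return "small"
    if size <= medium_max:
        return "medium"
    return "large"


for size in (50, 300, 1600, 5000):
    print(size, pick_queue(size))
```

With Celery, the returned name could presumably be passed as the `queue` argument of `apply_async`, and each worker started so that it consumes only the queues it can finish within its allocation.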

Stop working if no tasks available

When running on the HPC, terminate the job early if there are no tasks available to run (to avoid holding resources that other jobs could use).
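A minimal sketch of the idle-exit behaviour, using a standard-library `queue.Queue` as a stand-in for the Celery broker. The function name and the `idle_timeout` and `poll` parameters are hypothetical; a real Celery worker would need its own mechanism for this.

```python
import queue
import time


def run_until_idle(tasks, handle, idle_timeout=1.0, poll=0.1):
    """Process tasks from the queue; return the count once idle for idle_timeout seconds."""
    done = 0
    last_work = time.monotonic()
    while time.monotonic() - last_work < idle_timeout:
        try:
            task = tasks.get(timeout=poll)
        except queue.Empty:
            continue  # nothing yet; loop re-checks the idle deadline
        handle(task)
        done += 1
        last_work = time.monotonic()
    return done


tasks = queue.Queue()
for name in ("a", "b", "c"):
    tasks.put(name)
processed = []
print(run_until_idle(tasks, processed.append, idle_timeout=0.3, poll=0.05))
```

Returning from the function lets the surrounding HPC job script exit and release its allocation instead of waiting out the full time limit.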

running a gubaphage query

Hi Luiz, Titus,

I was wondering if you could please run a Wort query across all metagenomes for the Serratus people? I seem to have understood from your talk that the best way to ask for it is through GitHub issues, but let me know if you prefer another channel. The goal here is to search for gubaphages in metagenomes. I'm attaching both DNA and protein signatures (Gubaphage_genomes.sigs.zip), which I computed for a collection (multifasta) of gubaphages as follows:

sourmash compute --dna Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.dna.sig
sourmash compute -k 6,9,15  --protein Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.protein.sig

thanks in advance!
Rayan
