opencb / cellbase

High-performance NoSQL database and RESTful web services to access the most relevant biological data

License: Apache License 2.0

Perl 1.61% Shell 0.22% Python 2.69% HTML 0.16% CSS 1.39% JavaScript 36.79% Java 56.04% R 0.75% Jupyter Notebook 0.23% Dockerfile 0.07% Mustache 0.05% Smarty 0.01%

cellbase's Introduction

Welcome to CellBase!

Overview

In recent years, advances in high-throughput technologies in biology have produced unprecedented growth in the repositories and databases that store relevant biological data. Today there is more biological information than ever, but the current state of many of these repositories is far from optimal. Some of the most common problems are: a) information is spread across many small repositories and databases, b) a lack of standards between repositories, c) unsupported databases, d) specific and unconnected information, etc.

These problems make it very difficult: a) to integrate or join many different sources into a single database for working on or analysing experiments; b) to access and query this information programmatically.

To cope with these problems we have designed and developed a NoSQL database that integrates the most relevant biological information about genomic features and proteins, gene expression regulation, functional annotation, genomic variation and systems biology. We use the most relevant repositories, such as Ensembl, UniProt, ClinVar, COSMIC and IntAct, among many others (you can browse them in Data sources and species). The integrated information covers:

  • Core features: genes, transcripts, exons, proteins, genome sequence, etc.
  • Regulatory: Ensembl regulatory, TFBS, miRNA targets, CTCF, Open chromatin, etc.
  • Functional annotation: OBO ontologies (Gene Ontology, Human Disease Ontology), etc.
  • Genomic variation: Ensembl Variation, ClinVar, COSMIC, etc.
  • Systems biology: IntAct, Reactome, gene co-expression, etc.

To make the entire database accessible to researchers, an exhaustive RESTful web service API has been implemented. This API contains many methods that let researchers query and obtain different kinds of biological information from a single database, saving a lot of time. Another benefit is that researchers can easily query different biological topics and link the results together, since all the information is integrated.

Currently Homo sapiens, Mus musculus and 20 other species are available, and many others will be included soon. Results are returned in JSON format, making this information accessible to both software and web applications.

Availability

CellBase is a centralised database that integrates a large amount of information from several major genomic and biological databases used for genomic annotation and clinical variant prioritisation. See Overview for details.

CellBase is open-source and freely available at https://github.com/opencb/cellbase

You can search CellBase using your favourite programming language:

installation API docs tutorials
REST API RESTful Web Services
Python pypi
R Bioconductor Vignette
Java Installation Javadoc

Publications

CellBase was published in Nucleic Acids Research (2012):

http://nar.oxfordjournals.org/content/40/W1/W609.short

cellbase's People

Contributors

antonior26, dapregi, frasator, imedina, j-coll, javild, jperflo, jtarraga, juanfesanahuja, juanrizetta, julie-sullivan, kevinpetersavage, marnau, marrobi, mbleda, mbsimonovic, melsiddieg, pabarcgar, pfurio, phamidko, swaathik, wbari

cellbase's Issues

Reimplement parsers to follow a more general ETL model

Some parsers will be reimplemented so that they generate a general data model stored as a JSON object. 'Loaders' will be implemented to transform the data into an appropriate and efficient format for the specific DBMS (e.g. MongoDB) and to load it into the database. The objective is a data model that contains all the information regardless of the implementation for a given DBMS.
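
A minimal sketch of the parser half of this idea, assuming a tab-separated input with five columns. The `Gene` class, its fields and both function names are hypothetical and are not taken from the CellBase code base; they only illustrate a DBMS-independent model serialised as JSON.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class Gene:
    """Hypothetical DBMS-independent data model (not the real CellBase model)."""
    id: str
    name: str
    chromosome: str
    start: int
    end: int


def parse_genes(lines: Iterable[str]) -> Iterator[Gene]:
    """Extract and transform: turn tab-separated rows into Gene objects."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        gene_id, name, chromosome, start, end = line.split("\t")
        yield Gene(gene_id, name, chromosome, int(start), int(end))


def to_json_lines(genes: Iterable[Gene]) -> Iterator[str]:
    """Serialise each model object as one JSON document per line."""
    for gene in genes:
        yield json.dumps(asdict(gene), sort_keys=True)
```

Because the output is plain JSON, a loader for any DBMS can consume it without knowing how the source file was parsed.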

New Variant Annotation functionality

New variant annotation functionality can be implemented that returns all the known information about a variant in CellBase: consequence type (#26), conservation, ...
Data models must be added to biodata-models.

Consequence Type calculation

A new method is needed to calculate the consequence type of SNVs. This will be part of the new Variant Annotation functionality.
The behaviour must be as similar to Ensembl VEP as possible.

Fixes for GeneWSServer

The following Gene WS should be implemented (currently not working):

  • /{version}/{species}/feature/gene/{geneId}/tfbs
  • /{version}/{species}/feature/gene/{geneId}/mirna_target
  • /{version}/{species}/feature/gene/{geneId}/reactome

/{version}/{species}/feature/gene/{geneId}/protein returns the PPIs for the specified gene. We should rename this WS to ppi or protein_interaction.

It would also be interesting to create a proper /{version}/{species}/feature/gene/{geneId}/protein WS returning UniProt information for this gene.
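
The gene paths above all follow one template. As an illustration only, the helper below fills in that template; the function name and the set of allowed resources are assumptions made for this sketch, not part of the CellBase API.

```python
from urllib.parse import quote

# Sub-resources named in this issue. The proposed renames ("ppi",
# "protein_interaction") are left out because they do not exist yet.
GENE_RESOURCES = {"tfbs", "mirna_target", "reactome", "protein"}


def gene_ws_path(version: str, species: str, gene_id: str, resource: str) -> str:
    """Build /{version}/{species}/feature/gene/{geneId}/{resource}."""
    if resource not in GENE_RESOURCES:
        raise ValueError(f"unknown gene resource: {resource!r}")
    parts = [version, species, "feature", "gene", gene_id, resource]
    # Percent-encode every segment so that unusual IDs cannot break the path.
    return "/" + "/".join(quote(part, safe="") for part in parts)
```

For example, `gene_ws_path("v3", "hsapiens", "BRCA2", "tfbs")` builds the path for the first WS in the list; the version and species values here are placeholders.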

QueryOptions functionality to be enabled in the variant annotation WS

Several functionalities are required.

Filters:

  • geneset=gencode_basic: must only annotate against gencode-basic genes. Gencode-basic genes can be identified by looking at the GENCODE GTF:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

There is a tag="basic" for each gencode-basic transcript. Tasks:
1.- The GENCODE GTF has to be downloaded.
2.- The list of gencode-basic transcript IDs (ENSTxxx) must be loaded into a HashSet within the GeneParser.
3.- GeneParser will add a new annotationFlag "basic" to all parsed gencode-basic transcripts.
4.- getAllConsequenceTypesByVariantList in VariantAnnotationMongoDBAdaptor will check that flag before annotating the variant.

  • so=term1,term2,term3: getAnnotationByVariantList shall only return the annotation for those variants that present any of these SO terms.

Includes:

  • include={variation,clinical,consequence,conservation}: to allow enabling only certain annotation types.
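
Task 2 of the gencode_basic filter could be sketched as follows. This is a Python illustration rather than the Java GeneParser code, and the function name and the `strip_version` option are hypothetical. It assumes the usual GTF layout: nine tab-separated columns, with attributes such as `transcript_id "..."` and `tag "basic"` in the last one.

```python
import re
from typing import Iterable, Set

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
_BASIC_TAG = re.compile(r'tag "basic"')


def basic_transcript_ids(gtf_lines: Iterable[str], strip_version: bool = False) -> Set[str]:
    """Collect the IDs of transcripts tagged "basic" in a GENCODE GTF."""
    ids: Set[str] = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        # Only "transcript" rows are inspected; exon and CDS rows repeat the tag.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attributes = fields[8]
        match = _TRANSCRIPT_ID.search(attributes)
        if match and _BASIC_TAG.search(attributes):
            transcript_id = match.group(1)
            if strip_version:
                # Drop a trailing ".N" version suffix if the ID carries one.
                transcript_id = transcript_id.split(".")[0]
            ids.add(transcript_id)
    return ids
```

The returned set plays the role of the HashSet in task 2: while parsing genes, membership in the set decides whether a transcript receives the "basic" flag.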

New cellbase-mongodb module

Currently the MySQL-Hibernate implementation lives in cellbase-core. To offer a more modular implementation and a plugin-oriented framework, the interfaces (cellbase-core) must be implemented in a separate module, so a cellbase-mongodb module must be created for MongoDB.

Improve documentation

Documentation needs to be significantly improved: building, architecture, REST calls

New Variant Effect model

Until we complete the implementation of the new Variant Effect classes, OpenCGA and EVA cannot be compiled against commit 3b92d0c (7th May, branch ebi-develop).

Commit 5da5e51 must be used until then.

Fixes for SnpWSServer

The following SNP WS should be implemented (currently not working):

  • /{version}/{species}/feature/snp/{snpId}/consequence_type
  • /{version}/{species}/feature/snp/{snpId}/population_frequency
  • /{version}/{species}/feature/snp/{snpId}/xref

Interesting but not urgent:

  • /{version}/{species}/feature/snp/{snpId}/sequence
  • /{version}/{species}/feature/snp/{snpId}/regulatory

List of deprecated WS:

  • /{version}/{species}/feature/snp/{snpId}/consequence_types
  • /{version}/{species}/feature/snp/{snpId}/phenotypes

ClinVar WS should query the clinical collection

The ClinVar WS currently query the ClinVar collection. ClinVar is also loaded into the clinical collection. Only one copy of ClinVar will remain, the one in the clinical collection, and all queries should point to it.

Fix or remove outdated tests

Some unit tests do not pass because they are outdated or rely on local paths (/home/.... ). Fix those tests.

RefSeq parser

Add a RefSeq parser method to GeneParser; the data must be loaded together with the Ensembl gene set.

Create DB Loaders

Create a "load" interface in the cellbase-core module. This interface will define the operations for loading the data models, created by the cellbase-app 'build' command, into a database.

A MongoDB implementation of this interface should be provided in the cellbase-mongodb module.
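
One possible shape for such an interface, sketched in Python rather than Java. All names here (`Loader`, `InMemoryLoader`, `load_json_lines`) are hypothetical, and the sketch assumes the build step writes one JSON document per line, which the issue does not specify.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class Loader(ABC):
    """Hypothetical 'load' interface; each DBMS gets its own subclass."""

    @abstractmethod
    def load(self, collection: str, documents: Iterable[dict]) -> int:
        """Store the documents and return how many were loaded."""


class InMemoryLoader(Loader):
    """Stand-in for a real backend such as MongoDB; keeps documents in a dict."""

    def __init__(self) -> None:
        self.collections: Dict[str, List[dict]] = {}

    def load(self, collection: str, documents: Iterable[dict]) -> int:
        stored = self.collections.setdefault(collection, [])
        before = len(stored)
        stored.extend(documents)
        return len(stored) - before


def load_json_lines(loader: Loader, collection: str, lines: Iterable[str]) -> int:
    """Feed JSON-lines output (assumed build format) to any Loader."""
    documents = (json.loads(line) for line in lines if line.strip())
    return loader.load(collection, documents)
```

Keeping `load_json_lines` dependent only on the abstract `Loader` means a MongoDB-backed subclass could be swapped in without touching the code that reads the build output.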

Fixes for ExonWSServer

The following Exon WS should be implemented (currently not working):

  • /{version}/{species}/feature/exon/{exonId}/info
  • /{version}/{species}/feature/exon/{exonId}/region
  • /{version}/{species}/feature/exon/{exonId}/sequence
  • /{version}/{species}/feature/exon/{exonId}/transcript

Interesting but not urgent:

  • /{version}/{species}/feature/exon/{exonId}/aminos

/{version}/{species}/feature/exon/{exonId}/bysnp should not be there and has been marked as Deprecated.

Swagger integration

To improve the documentation, Swagger must be integrated and configured.

Migration to MongoDB

NoSQL databases offer higher performance and scalability. The document-oriented database MongoDB fits CellBase's needs very well. A new implementation based on MongoDB is needed.

New species web service

It would be great to implement a WS that returns information about all species.
This is an example of the response; more information can be added.

{
    "taxonomies":[
        {
            "name":"vertebrates",
            "species":[
                {
                    "text":"Homo Sapiens",
                    "assembly":"GRCh38",
                    "chromosomes":[
                        {
                            "name": "5",
                            "isCircular": 0,
                            "size": 180915260,
                            "end": 180915260,
                            "start": 1,
                            "cytobands": [
                                {
                                    "stain": "acen",
                                    "name": "p11.1",
                                    "end": 17600000,
                                    "start": 16100001
                                }
                            ]
                        }
                    ]
                }
            ]
        },
        {
            "name":"metazoa",
            "species":[

            ]
        }
    ]
}
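
To show how a client might consume a response shaped like the example above, here is a small sketch. It relies only on the keys that appear in the example; the function name is hypothetical and the real WS may return additional fields.

```python
from typing import Iterator, Tuple


def iter_chromosomes(response: dict) -> Iterator[Tuple[str, str, str, str, int]]:
    """Yield (taxonomy, species, assembly, chromosome, size) tuples."""
    for taxonomy in response.get("taxonomies", []):
        # A taxonomy such as "metazoa" may have an empty species list.
        for species in taxonomy.get("species", []):
            for chromosome in species.get("chromosomes", []):
                yield (
                    taxonomy["name"],
                    species["text"],
                    species["assembly"],
                    chromosome["name"],
                    chromosome["size"],
                )
```

Passing the parsed example response through `iter_chromosomes` yields a single tuple for chromosome 5 of the GRCh38 assembly and nothing for the empty "metazoa" taxonomy.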

CLI should return database stats

This new option must return the collections installed for a species, together with the indexes created and the number of documents in each. Other information may also be useful to return.
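
A rough sketch of how these stats could be gathered, assuming the species data lives in a single MongoDB database and that `db` behaves like a PyMongo `Database` object. The function name and the output layout are assumptions; the actual CLI option is not defined in this issue.

```python
from typing import Any, Dict


def database_stats(db: Any) -> Dict[str, dict]:
    """Collect per-collection document counts and index names.

    `db` is expected to offer list_collection_names() and item access to
    collections that offer count_documents() and index_information(),
    as a PyMongo Database does.
    """
    stats: Dict[str, dict] = {}
    for name in sorted(db.list_collection_names()):
        collection = db[name]
        stats[name] = {
            "documents": collection.count_documents({}),
            # index_information() returns a mapping keyed by index name.
            "indexes": sorted(collection.index_information()),
        }
    return stats
```

The resulting dictionary can be printed as JSON by the CLI, with one entry per installed collection.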
