opencb / cellbase

High-performance NoSQL database and RESTful web services for accessing the most relevant biological data

License: Apache License 2.0

cellbase's Introduction

Welcome to CellBase!

Overview

In recent years, advances in high-throughput technologies in biology have produced unprecedented growth in the repositories and databases that store relevant biological data. Today there is more biological information than ever, but the current state of many of these repositories is far from optimal. Some of the most common problems are: a) information is spread across many small repositories and databases, b) a lack of standards between repositories, c) unsupported databases, d) specific and unconnected information, etc.

These problems make it very difficult: a) to integrate or join many different sources into a single database for working with or analysing experiments; b) to access and query this information programmatically.

To cope with these problems we have designed and developed a NoSQL database that integrates the most relevant biological information about genomic features and proteins, gene expression regulation, functional annotation, genomic variation and systems biology. We use the most relevant repositories, such as Ensembl, UniProt, ClinVar, COSMIC and IntAct, among many others (you can browse them in Data sources and species). The information integrated covers:

Core features: genes, transcripts, exons, proteins, genome sequence, etc.
Regulatory: Ensembl regulatory, TFBS, miRNA targets, CTCF, Open chromatin, etc.
Functional annotation: OBO ontologies (Gene Ontology, Human Disease Ontology), etc.
Genomic variation: Ensembl Variation, ClinVar, COSMIC, etc.
Systems biology: IntAct, Reactome, gene co-expression, etc.

To make this entire database accessible to researchers, an exhaustive RESTful web service API has been implemented. This API contains many methods that help researchers query and obtain different biological information from a single database, saving a lot of time. Another benefit is that researchers can easily query different biological topics and link the results together, since all the information is integrated.

Currently Homo sapiens, Mus musculus and 20 other species are available, and many others will be included soon. Results are offered in JSON format, making all this information accessible to both software and web applications.

Availability

CellBase is a centralised database that integrates a large amount of information from several major genomic and biological databases used for genomic annotation and clinical variant prioritisation. See Overview for details.

CellBase is open-source and freely available at https://github.com/opencb/cellbase

You can search CellBase using your favourite programming language:

          installation    API        docs                    tutorials
REST API                             RESTful Web Services
Python    pypi
R         Bioconductor                                       Vignette
Java      Installation    Javadoc

Publications

CellBase was published in Nucleic Acids Research (2012):

http://nar.oxfordjournals.org/content/40/W1/W609.short

cellbase's People

Contributors

marnau pabarcgar cyenyxe javild antonior26 j-coll mbleda jpdna jtarraga mrg7 swaathik agaor jdopazob melsiddieg mermegar dapregi wbari slimmtl mh11 ebivariation nicholsn eiathom renyaoxiang babelomics gtlangseth pythseq syuki0 sosundina saifgel kevinpetersavage marrobi lucioric2000 julie-sullivan mpievolbio-scicomp osmium78 alonzomb wook2014 nono9527 p-maraver imedina immunoml kingspm phamidko squinker ealagorm pelamee juanfesanahuja genostack oalmelid-gel magdalenazz awab-ahmed

cellbase's Issues

Reimplement parsers to follow a more general ETL model

Some parsers will be reimplemented so that they generate a general data model stored in a JSON object. 'Loaders' will be implemented to transform the data into an appropriate and efficient format for the specific DBMS (e.g. MongoDB) and to load it into the DB. The objective is to obtain a data model that contains all the information regardless of the specific implementation for a given DBMS.

New Variant Annotation functionality

New variant annotation functionality can be implemented. It will return all the known information about a variant in CellBase: consequence type #26, conservation, ...
Data models must be added to biodata-models.

CLI fails when executed from outside the root directory

cellbase.sh should work when executed from any directory in the system.

Move cellbase to a module architecture

CellBase must make use of Maven modules to offer greater modularity and reduce the dependencies loaded.

Make CellBase DBAdaptors use datastore repository

Remove direct uses of the MongoDB drivers from the CellBase adaptors. Use the datastore functionality instead.

Consequence Type calculation

A new method is needed to calculate the consequence type of SNVs. This will be part of the new Variant Annotation functionality.
The behaviour must be as similar to Ensembl VEP as possible.

Fixes for GeneWSServer

The following Gene WS should be implemented (currently not working):

/{version}/{species}/feature/gene/{geneId}/tfbs
/{version}/{species}/feature/gene/{geneId}/mirna_target
/{version}/{species}/feature/gene/{geneId}/reactome

/{version}/{species}/feature/gene/{geneId}/protein returns the PPIs for the specified gene. We should rename this WS to ppi or protein_interaction.

It would also be interesting to create a proper /{version}/{species}/feature/gene/{geneId}/protein WS returning UniProt information for this gene.
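The gene WS paths above all share one template, which can be sketched as a small URL builder. This is only an illustration: the class and method names are hypothetical, and the base URL in the usage note is the one that appears in other issues on this page.

```java
/** Hypothetical helper that assembles gene WS URLs; it never contacts a server. */
final class GeneWsUrlSketch {

    private GeneWsUrlSketch() { }

    /** Builds {baseUrl}/{version}/{species}/feature/gene/{geneId}/{resource}. */
    static String geneResourceUrl(String baseUrl, String version, String species,
                                  String geneId, String resource) {
        // Drop a trailing slash so the joined URL has no empty path segment.
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return String.join("/", base, version, species, "feature", "gene", geneId, resource);
    }
}
```

For example, geneResourceUrl("http://www.ebi.ac.uk/cellbase/webservices/rest", "v3", "hsapiens", "BRCA2", "tfbs") yields a path ending in /v3/hsapiens/feature/gene/BRCA2/tfbs.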

Add a WS for biological interactions

New Gene Expression Atlas data source to be included

Add Gene Expression Atlas data to the knowledgebase. Implement corresponding code for the:

Downloader
Builder
Loader
WSs

QueryOptions functionality to be enabled in the variant annotation WS

Several functionalities are required.

Filters:

geneset=gencode_basic: must only annotate against gencode-basic genes. Gencode-basic genes can be identified by looking at the gencode gtf:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

There is a tag="basic" for each gencode-basic transcript. Tasks:
1.- The gencode gtf has to be downloaded.
2.- The list of gencode-basic transcript ids (ENSTxxx) must be loaded within the GeneParser into a HashSet.
3.- GeneParser will include a new annotationFlag "basic" for all parsed gencode-basic transcripts.
4.- getAllConsequenceTypesByVariantList at VariantAnnotationMongoDBAdaptor will check that flag before proceeding to annotate the variant.
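Task 2 could look roughly like the sketch below, which collects the ids of transcript records tagged as basic into a HashSet. The class and method names are hypothetical. It assumes the usual GTF layout (tab-separated, feature type in column 3, attributes in column 9, with the tag written as tag "basic") and takes already-decompressed lines. Ids are kept exactly as they appear in the file, so any version suffix is preserved.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not the actual GeneParser code. */
final class GencodeBasicSketch {

    private static final Pattern TRANSCRIPT_ID =
            Pattern.compile("transcript_id \"([^\"]+)\"");

    private GencodeBasicSketch() { }

    /** Returns the ids of all transcript records whose attributes contain tag "basic". */
    static Set<String> basicTranscriptIds(Iterable<String> gtfLines) {
        Set<String> ids = new HashSet<>();
        for (String line : gtfLines) {
            if (line.startsWith("#")) {
                continue; // header or comment line
            }
            String[] columns = line.split("\t");
            // Assumed layout: column 3 is the feature type, column 9 the attributes.
            if (columns.length < 9 || !"transcript".equals(columns[2])) {
                continue;
            }
            String attributes = columns[8];
            if (!attributes.contains("tag \"basic\"")) {
                continue;
            }
            Matcher matcher = TRANSCRIPT_ID.matcher(attributes);
            if (matcher.find()) {
                ids.add(matcher.group(1));
            }
        }
        return ids;
    }
}
```

A parser could then set the "basic" annotation flag on a transcript whenever the returned set contains its id.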

so=term1,term2,term3: getAnnotationByVariantList shall only return the annotation for those variants that present any of these SO terms.
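The so filter can be sketched as a set intersection test. Everything in this example is hypothetical: the class name, the use of plain strings for variant ids and SO terms, and the choice to treat an empty so value as "no filtering".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the so=term1,term2,term3 filter. */
final class SoTermFilterSketch {

    private SoTermFilterSketch() { }

    /** Splits the raw parameter value into a set of non-empty, trimmed terms. */
    static Set<String> parseSoParameter(String rawValue) {
        Set<String> terms = new HashSet<>();
        if (rawValue == null) {
            return terms;
        }
        for (String term : rawValue.split(",")) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
        }
        return terms;
    }

    /**
     * Keeps the variants that present at least one requested SO term.
     * Assumption: an empty request keeps every variant.
     */
    static List<String> filterVariants(Map<String, Set<String>> soTermsByVariant,
                                       Set<String> requested) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : soTermsByVariant.entrySet()) {
            if (requested.isEmpty() || !Collections.disjoint(entry.getValue(), requested)) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```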

Includes:

include={variation,clinical,consequence,conservation}: to allow enabling only certain annotation types.
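One possible way to interpret the include parameter is to parse it into an EnumSet that the annotation code can consult before computing each annotation type. The enum and class names below are hypothetical, and so is the assumption that a missing or blank value enables all four types.

```java
import java.util.EnumSet;
import java.util.Locale;

/** Hypothetical sketch of parsing include={variation,clinical,consequence,conservation}. */
final class IncludeParameterSketch {

    enum AnnotationType { VARIATION, CLINICAL, CONSEQUENCE, CONSERVATION }

    private IncludeParameterSketch() { }

    /**
     * Parses a comma-separated include value.
     * Assumption: null or blank input enables every annotation type.
     * An unknown name makes Enum.valueOf throw IllegalArgumentException.
     */
    static EnumSet<AnnotationType> parseInclude(String rawValue) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return EnumSet.allOf(AnnotationType.class);
        }
        EnumSet<AnnotationType> selected = EnumSet.noneOf(AnnotationType.class);
        for (String token : rawValue.split(",")) {
            selected.add(AnnotationType.valueOf(token.trim().toUpperCase(Locale.ROOT)));
        }
        return selected;
    }
}
```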

New cellbase-mongodb module

Currently the MySQL-Hibernate implementation is found in cellbase-core. To offer a more modular implementation and a plugin-oriented framework, the interfaces (cellbase-core) must be implemented in a different module, so a cellbase-mongodb module must be created for MongoDB.

HGVS shall be returned as part of the variant annotation

Transcript HGVS shall be calculated and included within the VariantAnnotation object

cellbase-server starts_with webservice not working

http://www.ebi.ac.uk/cellbase/webservices/rest/latest/hsa/feature/id/BRCA2/starts_with?of=json

Returns null instead of a JSON-serialized QueryResponse object.

Improve documentation

Documentation needs to be significantly improved: building, architecture, REST calls

Cellbase server war file should not be deployed at Central repository

Deploying war files should be avoided. Maven pom.xml files need to be properly configured

Population frequencies web services must return all frequencies

When querying population frequencies like:
http://wwwdev.ebi.ac.uk/cellbase/webservices/rest/v3/hsapiens/genomic/region/3:1166675-1166675/snp

Only frequencies different from '1' are returned. MongoDB has to contain only those.

New Variant Effect model

Until we complete the implementation of the new Variant Effect classes, commit 3b92d0c (7th May, branch ebi-develop) prevents OpenCGA and EVA from compiling.

Commit 5da5e51 must be used until then.

Add population frequencies to Variation collection

The Variation document must contain population frequencies; these can be obtained from EVA datasets.

cellbase-server latest/species not working correctly

The web service:
http://www.ebi.ac.uk/cellbase/webservices/rest/latest/species?of=json
does not show the species correctly: it returns repeated species in different formats.

Generate json schemas with jackson

Some schemas should be defined; using JSON Schema seems the simplest approach.

ChromosomeMongoDBAdaptor method uses aggregation instead of elemMatch

Method 'getAllByIdList' uses a complex aggregation when a much simpler elemMatch could be used. Also, 'supercontigs' are currently returned as well:

http://www.ebi.ac.uk/cellbase/webservices/rest/v3/hsapiens/genomic/chromosome/13/info?of=json

Fixes for SnpWSServer

The following SNP WS should be implemented (currently not working):

/{version}/{species}/feature/snp/{snpId}/consequence_type
/{version}/{species}/feature/snp/{snpId}/population_frequency
/{version}/{species}/feature/snp/{snpId}/xref

Interesting but not urgent:

/{version}/{species}/feature/snp/{snpId}/sequence
/{version}/{species}/feature/snp/{snpId}/regulatory

List of deprecated WS:

/{version}/{species}/feature/snp/{snpId}/consequence_types
/{version}/{species}/feature/snp/{snpId}/phenotypes

Implementation of new CLI using JCommander

New CLI must be implemented using JCommander, the available commands are: download, build, load and query

ClinVar WS should query the clinical collection

ClinVar WS are now querying the ClinVar collection. ClinVar is also loaded within the clinical collection. Only one ClinVar copy will remain, the one within the clinical collection, and all queries should point to this one.

Fix or remove outdated tests

Some unit tests do not pass because they are outdated or rely on local paths (/home/.... ). Fix those tests.

RefSeq parser

Add a RefSeq parser method to GeneParser; the data must be loaded together with the Ensembl gene set.

Reorganize configuration info (properties files)

Move species list and DB configuration info to the cellbase-server application.properties
Create Config object able to contain all relevant configuration info needed by cellbase-mongodb

Use new NIO from Java 7+

A new NIO API for creating directories and other file system actions was introduced in Java 7. It must be used, e.g.:

https://github.com/opencb/cellbase/blob/develop/cellbase-app/src/main/java/org/opencb/cellbase/app/cli/DownloadCommandParser.java#L363
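As a minimal illustration of the NIO style, directory creation with java.nio.file could look like the sketch below. The class and method names are hypothetical and this is not the code at the linked line. Files.createDirectories creates any missing parent directories and does not fail when the directory already exists.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of creating an output folder with the Java 7+ NIO API. */
final class NioDirectorySketch {

    private NioDirectorySketch() { }

    /** Resolves the segments under parent and creates the whole directory chain. */
    static Path ensureDirectory(Path parent, String... segments) {
        Path target = parent;
        for (String segment : segments) {
            target = target.resolve(segment);
        }
        try {
            // Creates missing parents too; an existing directory is not an error.
            return Files.createDirectories(target);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not create " + target, e);
        }
    }
}
```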

Create DB Loaders

Create a "load" interface in cellbase-core module. This interface will define the operations to load the data models, created by cellbase-app 'build' command, into a database.

A MongoDB implementation of this interface should be implemented in cellbase-mongodb module.
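The split between a DBMS-agnostic "load" interface and a per-database implementation might be sketched as follows. All names are hypothetical, and an in-memory map stands in for the database so that the example stays self-contained; a real cellbase-mongodb implementation would write to MongoDB instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract that would live in cellbase-core. */
interface DataLoaderSketch {

    /** Loads serialized (JSON) data model objects into a collection; returns how many were loaded. */
    int load(String collection, List<String> jsonDocuments);
}

/** Stand-in implementation; a MongoDB-backed class would implement the same interface. */
final class InMemoryLoaderSketch implements DataLoaderSketch {

    private final Map<String, List<String>> store = new HashMap<>();

    @Override
    public int load(String collection, List<String> jsonDocuments) {
        store.computeIfAbsent(collection, key -> new ArrayList<>()).addAll(jsonDocuments);
        return jsonDocuments.size();
    }

    /** Returns the documents loaded so far, or an empty list for an unknown collection. */
    List<String> documents(String collection) {
        return store.getOrDefault(collection, Collections.<String>emptyList());
    }
}
```

Code that calls the loader only depends on the interface, so swapping the storage backend does not affect the 'build' output format.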

ChromosomeMongoDBAdaptor runtime errors

Several errors are raised because of the new 'datastore' library integration.

Write query tutorial

Write clear guidelines for using the query command of the CLI

Species CLI parameter should not be a List

There is no need to pass different species here:
https://github.com/opencb/cellbase/blob/develop/cellbase-app/src/main/java/org/opencb/cellbase/app/cli/CliOptionsParser.java#L96

Different species can be processed in separate executions. This will make the code a bit simpler without losing any real functionality.

Fixes for ExonWSServer

The following Exon WS should be implemented (currently not working):

/{version}/{species}/feature/exon/{exonId}/info
/{version}/{species}/feature/exon/{exonId}/region
/{version}/{species}/feature/exon/{exonId}/sequence
/{version}/{species}/feature/exon/{exonId}/transcript

Interesting but not urgent:

/{version}/{species}/feature/exon/{exonId}/aminos

/{version}/{species}/feature/exon/{exonId}/bysnp should not be there and has been marked as Deprecated.

IntAct database integration

PPI from IntAct must be added, data models must be created in biodata-models

CellbaseClient should be able to call the POST WS for variant annotation

Currently, CellbaseClient can only call the GET WS for variant annotation. Include an option to allow making calls to the POST WS, so that bigger variant batches can be sent.
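On the client side, sending bigger variant sets usually still means splitting them into fixed-size batches, one per POST call. The sketch below shows only that splitting step. The class name is hypothetical, and the issue specifies neither the batch size nor the request body format, so both are left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits a variant list into batches, one per POST request. */
final class VariantBatchSketch {

    private VariantBatchSketch() { }

    /** Returns consecutive sublists of at most batchSize items, preserving order. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```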

Swagger integration

To provide better documentation, Swagger must be integrated and configured.

Ensembl Perl scripts should not use registry.conf

There is a mechanism in Ensembl Perl to avoid passing a huge registry file. Using it removes the need to maintain this file and makes the CLI simpler, since no parameter is needed for the registry file.

New DisGeNET data source to be included

The DisGeNET database needs to be downloaded and included:

http://www.disgenet.org/web/DisGeNET/v2.1

A new collection gene_disease_association must be created.

Migration to MongoDB

NoSQL databases offer higher performance and scalability. The document-oriented database MongoDB fits CellBase's needs very well. A new implementation based on MongoDB needs to be done.

Module cellbase-build must be renamed to cellbase-app

The new app module will accept different commands such as download, build and query.

New ClinVar WS to query by gene symbol (HGNC)

The WS must use the ClinicalMongoDBAdaptor and query the

referenceClinVarAssertion.measureSet.measure.measureRelationship.symbol

field within the ClinVar record.

Return UniProt's functional description of variants with variant annotation

UniProt's data is already integrated in CellBase. Link the functional description of the variants with the variant annotation WS.

New species web service

It would be great to implement a WS with all species information.
This is an example of the response. More information can be added.

{
    "taxonomies":[
        {
            "name":"vertebrates",
            "species":[
                {
                    "text":"Homo Sapiens",
                    "assembly":"GRCh38",
                    "chromosomes":[
                        {
                            "name": "5",
                            "isCircular": 0,
                            "size": 180915260,
                            "end": 180915260,
                            "start": 1,
                            "cytobands": [
                                {
                                    "stain": "acen",
                                    "name": "p11.1",
                                    "end": 17600000,
                                    "start": 16100001
                                }
                            ]
                        }
                    ]
                }
            ]
        },
        {
            "name":"metazoa",
            "species":[

            ]
        }
    ]
}