BioGrakn Covid

Overview | Installation | Datasets | Examples | How You Can Help

BioGrakn Covid is an open source project to build a knowledge graph to enable research in COVID-19 and related disease areas.

Overview

We're excited to release an open source knowledge graph to speed up research into Covid-19. Our goal is to give researchers an easy way to analyse and query large amounts of data and papers related to the virus.

BioGrakn Covid makes it easy to trace information sources quickly and to identify articles and the information they contain. This first release includes entities extracted from Covid-19 papers and from additional datasets covering proteins, genes, disease-gene associations, coronavirus proteins, protein expression, biological pathways, and drugs.

For example, by querying for the virus SARS-CoV-2, we can find the associated human protein, proteasome subunit alpha type-2 (PSMA2), a component of the proteasome, implicated in SARS-CoV-2 replication, and its encoding gene (PSMA2). Additionally, we can identify the drug carfilzomib, a known inhibitor of the proteasome that could therefore be researched as a potential treatment for patients with Covid-19. To support the plausibility of this association and its implications, we can easily identify papers in the Covid-19 literature where this protein has been mentioned.

[Figure: query_1]

By examining these specific relationships and their attributes, we are directed to the data sources, including publications. This helps researchers to study the mechanisms of coronaviral infection and the immune response, and to find targets for the development of treatments or vaccines more efficiently. We can also expand our search to include entities such as publications, organisms, proteins and genes, as shown below:

[Figure: query_3]

Our team is currently a partnership between GSK, Oxford PharmaGenesis and Vaticle.

The schema that models the underlying knowledge graph, together with the descriptive query language TypeQL, makes writing complex queries straightforward and intuitive. Furthermore, TypeDB's automated reasoning allows BioGrakn Covid to act as an intelligent database of biomedical data for the Covid research field, inferring implicit knowledge from the explicitly stored data. BioGrakn Covid can understand biological facts, infer based on new findings and enforce research constraints, all at query (run) time.

Installation

Prerequisites: Python >3.6, TypeDB Core 2.2.0, TypeDB Python Client API 2.1.1, Workbase 2.1.2.

Clone this repo:

    git clone https://github.com/vaticle/biograkn-covid.git

Manually download all source datasets and put them in the Datasets folder. You can find the links below.

Set up a virtual environment and install the dependencies:

    cd <path/to/biograkn-covid>/
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Start TypeDB:

    typedb server

Start the migrator script:

    python migrator.py -n 4 # insert using 4 threads

For help with the migrator script command line options:

    python migrator.py -h

Now grab a coffee (or two) while the migrator builds the database and schema for you!
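
The `-n` option sets how many threads insert data. As a rough illustration of that idea only, the sketch below splits a list of insert queries into batches and hands them to a thread pool. It is not the project's `migrator.py`: the function names, the batch size, the `gene-symbol` attribute and the placeholder `load_batch` are all hypothetical.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    # Hypothetical CLI mirroring the documented `-n` flag; the real script may differ.
    parser = argparse.ArgumentParser(description="Toy parallel loader")
    parser.add_argument("-n", "--num-threads", type=int, default=1,
                        help="number of threads used for inserting")
    return parser.parse_args(argv)


def split_into_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_batch(batch):
    # Placeholder for "open a write transaction, run each insert, commit".
    # Here it only reports how many queries the batch contained.
    return len(batch)


def run(queries, num_threads, batch_size=2):
    """Process all batches on a thread pool and return the number of queries handled."""
    batches = list(split_into_batches(queries, batch_size))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(load_batch, batches))


# Equivalent to passing `-n 4` on the command line.
args = parse_args(["-n", "4"])
# `gene-symbol` is an assumed attribute name, used only to make the strings look like queries.
demo_queries = [f'insert $g isa gene, has gene-symbol "G{i}";' for i in range(5)]
print(run(demo_queries, args.num_threads))
```

Batching keeps each unit of work small, so a failure in one batch would not have to abort the whole load; whether the real migrator works this way is not stated here.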

Examples

TypeQL queries can be run in the TypeDB console, in Workbase, or through the client APIs. We encourage running them in Workbase for the best visual experience.

    # Return drugs that are associated to genes, which have been mentioned in the same
    # paper as the gene which is associated to SARS.

    match
    $virus isa virus, has virus-name "SARS";
    $gene isa gene;
    $1 ($gene, $virus) isa gene-virus-association;
    $2 ($gene, $pub) isa mention;
    $3 ($pub, $gene2) isa mention;
    $gene2 isa gene;
    not {$gene2 is $gene;};
    $4 ($gene2, $drug); $drug isa drug;
    offset 0; limit 30;

[Figure: query_1]
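
For readers who prefer the client API to Workbase, here is a minimal sketch of running the example query above with the TypeDB Python client 2.x. The database name `biograkn_covid` and the helper `fetch_drug_iids` are assumptions made for illustration (check the name the migrator actually created), and `localhost:1729` assumes TypeDB is running locally on its default port.

```python
# The query text is copied from the example above.
QUERY = """
match
$virus isa virus, has virus-name "SARS";
$gene isa gene;
$1 ($gene, $virus) isa gene-virus-association;
$2 ($gene, $pub) isa mention;
$3 ($pub, $gene2) isa mention;
$gene2 isa gene;
not {$gene2 is $gene;};
$4 ($gene2, $drug); $drug isa drug;
offset 0; limit 30;
"""


def fetch_drug_iids(address="localhost:1729", database="biograkn_covid"):
    """Run QUERY in a read transaction and return the internal IDs of the matched drugs.

    `database` is an assumed name; replace it with the one used by your migration.
    """
    # Imported lazily so this module can be loaded without the client installed.
    from typedb.client import TypeDB, SessionType, TransactionType

    iids = []
    with TypeDB.core_client(address) as client:
        with client.session(database, SessionType.DATA) as session:
            with session.transaction(TransactionType.READ) as tx:
                for answer in tx.query().match(QUERY):
                    # Each answer maps variable names to concepts; "drug" is $drug.
                    iids.append(answer.get("drug").get_iid())
    return iids
```

Calling `fetch_drug_iids()` requires a running TypeDB server with the migrated database; it returns at most 30 entries because of the `limit` in the query.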

Datasets

Currently the datasets we've integrated include:

  • CORD-NER: The CORD-19 dataset that the White House released has been annotated and made publicly available. It uses various NER methods to recognise named entities on CORD-19 with distant or weak supervision.
  • UniProt: We’ve downloaded the reviewed human subset, and ingested genes, transcripts and protein identifiers.
  • Coronaviruses: This is an annotated dataset of coronaviruses and their potential drug targets put together by Oxford PharmaGenesis based on literature review.
  • DGIdb: We’ve taken the Interactions TSV which includes all drug-gene interactions.
  • Human Protein Atlas: The Normal Tissue Data includes the expression profiles for proteins in human tissues.
  • Reactome: This dataset connects pathways and their participating proteins.
  • DisGeNET: We’ve taken the curated gene-disease-associations dataset, which contains associations from UniProt, CGI, ClinGen, Genomics England, CTD, PsyGeNET, and Orphanet.
  • SemMed: This is a subset of the SemMed version 4.0 database.
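
To give a feel for what migrating a tabular source such as the DGIdb Interactions TSV can involve, the sketch below turns TSV rows into TypeQL match-insert strings. It is an illustration under stated assumptions, not the project's migrator: the column names `gene_name` and `drug_name`, the attributes `gene-symbol` and `drug-name`, the relation `drug-gene-interaction` and its role names are all hypothetical, and the two sample rows are toy input.

```python
import csv
import io


def escape(value):
    """Escape backslashes and double quotes for use inside a TypeQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def row_to_insert(row):
    """Build a match-insert query linking an existing gene and drug (hypothetical schema)."""
    gene = escape(row["gene_name"])
    drug = escape(row["drug_name"])
    return (
        f'match $g isa gene, has gene-symbol "{gene}"; '
        f'$d isa drug, has drug-name "{drug}"; '
        f'insert (gene: $g, drug: $d) isa drug-gene-interaction;'
    )


def tsv_to_queries(handle):
    """Read a tab-separated file object and return one query per complete row."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Rows with a missing gene or drug name are skipped.
    return [row_to_insert(r) for r in reader if r["gene_name"] and r["drug_name"]]


# Toy input: the second data row has no drug name and is therefore dropped.
sample = io.StringIO("gene_name\tdrug_name\nPSMA2\tCARFILZOMIB\nPSMA2\t\n")
queries = tsv_to_queries(sample)
print(len(queries))
print(queries[0])
```

Using match-insert rather than a plain insert assumes the gene and drug entities were loaded earlier, so the relation attaches to existing entities instead of creating duplicates.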

In progress:

  • CORD-19: We are incorporating the original corpus, which includes peer-reviewed publications from bioRxiv, medRxiv and others.
    • TODO: write migrator script
  • TissueNet
    • TODO: ./Migrators/TissueNet/TissueNetMigrator.py incomplete: only migrates a single data file and is not called in ./migrator.py.

We plan to add many more datasets!

How You Can Help

This is an ongoing project and we need your help! If you want to contribute, you can help us to:

  • Migrate more data sources (e.g. clinical trials, DrugBank, Excelra)
  • Extend the schema by adding relevant rules
  • Create a website
  • Write tutorials and articles for researchers to get started

If you wish to get in touch, please talk to us on the #biograkn channel on our Discord.
