Awesome biomedical knowledge graphs

A curated list of biomedical knowledge graphs and of resources for their construction.

This repository is inspired by awesome lists and follows the style guide of the awesome manifesto.

Introduction
Survey
- PDF report and accompanying website
Curated list

Introduction

The following information was compiled in two steps: 1) surveying academic and commercial projects that provide knowledge graphs in the domain of biomedicine, as well as resources for creating them, and 2) narrowing these down to a small subset that I consider awesome because of the quality or relevance of their results. I hope both collections serve you well! If you have suggestions or find an error, please don't hesitate to contact me or to contribute directly with a pull request.

Survey

A PDF report and accompanying website were created to present a comprehensive overview of available biomedical knowledge graphs and of resources for their construction.

Curated list

A carefully selected subset of the survey's entries is presented here in the style of an awesome list.

Biomedical knowledge graphs

Biomedical Data Translator – Publication (2022), Website, Code, API, Demo
- Content:
  - A collection of harmonized APIs
- Scope:
  - "integrated data from over 250 knowledge sources, each exposed via open application programming interfaces (APIs)"
  - "a diverse community of nearly 200 basic and clinical scientists, informaticians, ontologists, software developers, and practicing clinicians distributed over 11 teams and 28 institutions to form the Biomedical Data Translator Consortium"
- Goals:
  - "integrate as many datasets as possible, using a ‘knowledge graph’–based architecture, and allow them to be cross-queried and reasoned over by translational researchers"
  - "integrating existing biomedical data sets and “translating” those data into insights intended to augment human reasoning and accelerate translational science"
  - "promote serendipitous discovery and augment human reasoning in a variety of disease spaces"
  - "federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions"
- Sub-projects that construct knowledge graphs:
  - ROBOKOP – Publication (2019), Code, Data
  - RTX-KG2 – Publication (2022), Code, Data
Bioteque – Publication (2022), Website, Code, Data
- Content:
  - 450,000 nodes of 12 types
  - 30 million edges of 67 types
  - Extracted from 150 data sources
  - Provided as triples in multiple TSV files
- Scope:
  - "a resource of unprecedented size and scope that contains pre-calculated embeddings derived from a gigantic heterogeneous network"
  - "Bioteque embeddings retain the information contained in the large biological network"
- Goals:
  - "make biomedical knowledge embeddings available to the broad scientific community"
  - "evaluate, characterize and predict a wide set of experimental observations"
  - "assessment of high-throughput protein-protein interactome data"
  - "prediction of drug response and new repurposing opportunities"
CKG – Publication (2023), Website, Code, Data
- Full name: Clinical Knowledge Graph
- Content:
  - 20 million nodes
  - 220 million edges
  - Extracted from 26 databases, 10 ontologies and 7 million publications
  - Provided as Neo4j graph database
- Scope:
  - "prior knowledge, experimental data and de-identified clinical patient information"
  - "harmonization of proteomics with other omics data while integrating the relevant biomedical databases and text extracted from scientific publications"
- Goals:
  - "inform clinical decision-making"
  - "reveal candidate markers of prognosis and/or treatment"
  - "generate new hypotheses that ultimately translate into clinically actionable results"
  - "clinically meaningful queries and advanced statistical analyses"
  - "liver disease biomarker discovery"
  - "multi-proteomics data integration for cancer biomarker discovery and validation"
  - "prioritize treatment options for chemorefractory cases"
HALD – Publication (2023), Website, Code, Data
- Full name: Human Aging and Longevity Dataset
- Content:
  - 12,227 nodes of 10 types
  - 115,522 edges of various types
  - Extracted from 339,918 biomedical articles in PubMed
  - Provided as triples with additional information in multiple JSON and CSV files
- Scope:
  - "a text mining-based human aging and longevity dataset of the biomedical knowledge graph from all published literature related to human aging and longevity in PubMed"
- Goals:
  - "precision gerontology and geroscience analyses"
  - "provide predictions regarding the individuals’ lifespan under various treatment scenarios"
  - "devise novel, biologically-driven therapeutic and preventive strategies that address fundamental aging mechanisms"
Monarch KG – Publication (2024), Website, Code, Data
- Naming explanation: "The name ’Monarch Initiative’ was chosen because it is a community effort to create paths for diverse data to be put to use for disease discovery, not unlike the navigation routes that a monarch butterfly would take."
- Content:
  - 862,115 nodes of 88 types
  - 11,412,471 edges of 23 types
  - Extracted from 33 biomedical resources and biomedical ontologies and "updated with the latest data from each source once a month"
  - Provided in various formats such as SQLite, Neo4j, RDF, KGX
- Scope:
  - "Monarch App includes an ETL platform for ingesting, harmonizing, and serving diverse life science data relating genes, phenotypes, and diseases into a semantic KG for use in various downstream applications"
  - "Monarch KG integrates gene, disease, and phenotype data"
  - "Monarch Assistant, which will combine the ability of LLMs to answer questions in plain language with Monarch’s extensive KG and analysis algorithms"
- Goals:
  - "learn different things about the relationship between genotype and phenotype from different organisms"
  - "collect, integrate, and make a broad compendium of species and sources computable"
OREGANO – Publication (2023), Code, Data
- Content:
  - 88,937 nodes of 11 types
  - 824,231 edges of 19 types
  - Extracted from various drug, protein and phenotype databases
  - Provided as triples in a TSV file
- Scope:
  - "a holistically constructed knowledge graph using the broadest possible features and drug characteristics"
  - "integration of natural compounds (i.e. herbal and plant remedies)"
  - "incorporating together disease and drug information and natural compounds"
- Goals:
  - "computational drug repositioning"
  - "generate hypotheses (molecule/drug - target links) through link prediction"
  - "from the available data, determine whether a drug is potentially capable of binding to a new target"
  - "identify possible repositionable molecules using machine learning (or more specifically deep learning) algorithms"
PharMeBINet – Publication (2022), Website, Code, Data
- Full name: Pharmacological Medical Biochemical Network
- Content:
  - 2,869,407 nodes of 66 types
  - 15,883,653 edges of 208 types
  - Extracted from 48 data sources
  - Provided as Neo4j graph database and GraphML file
- Scope:
  - "heterogeneous information on drugs, ADRs, genes, proteins, gene variants, and diseases"
- Goals:
  - "analysis of ADRs [Adverse Drug Reactions]"
  - "analysis of possible existing connections between gene variants and drugs"
PrimeKG – Publication (2023), Website, Code, Data
- Full name: Precision Medicine Knowledge Graph
- Content:
  - 129,375 nodes of 10 types
  - 4,050,249 edges of 30 types
  - Extracted from 20 data sources
  - Provided as triples in a CSV file
- Scope:
  - "ten major biological scales, including disease-associated protein perturbations, biological processes and pathways, anatomical and phenotypic scales, and the entire range of approved drugs with their therapeutic action"
  - "improves on coverage of diseases, both rare and common, by one-to-two orders of magnitude compared to existing knowledge graphs"
- Goals:
  - "support research in precision medicine"
  - "linking biomedical knowledge to patient-level health information"
  - "personalized diagnostic strategies and targeted treatments"
  - "providing a holistic and multimodal view of diseases"
SPOKE – Publication (2023), Website, Code, API
- Full name: Scalable Precision Medicine Open Knowledge Engine
- Content:
  - 27,056,367 nodes of 21 types
  - 53,264,489 edges of 55 types
  - Extracted from 41 databases
  - Provided as a REST API that accepts graph queries, but "not available as a bulk download"
- Scope:
  - "ranging from molecular and cellular biology to pharmacology and clinical practice"
  - "focuses on experimentally determined information"
  - "computational predictions and text mining from the literature are not currently prioritized"
- Goals:
  - "applications relevant to precision medicine"
  - "provide insights into the understanding of diseases, discovering of drugs and proactively improving personal health"
  - "drug repurposing"
  - "disease prediction and interpretation of transcriptomic data"
  - "predict diagnosis"
  - "predict biomedical outcomes in a biologically meaningful manner"
SynLethKG – Publication (2021), Website, Code, Data
- Full name: Synthetic Lethality Knowledge Graph
- Content:
  - 54,012 nodes of 11 types
  - 2,231,921 edges of 24 types
  - Extracted from SynLethDB and various gene, drug and compound databases
  - Provided as triples in a CSV file
- Scope:
  - "genes, compounds, diseases, biological processes and 24 kinds of relationships that could be pertinent to SL"
- Goals:
  - "identify SL gene pairs"
  - "discovery of anti-cancer drug targets"
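Several of the graphs above (for example Bioteque, OREGANO, PrimeKG and SynLethKG) are provided as triples in TSV or CSV files. The following is a minimal sketch of how such a file could be read and summarized with Python's standard library. The three-column layout (subject, predicate, object, no header row), the identifiers and the function names are assumptions made up for illustration; each project documents its own schema, which will likely differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: three tab-separated triples without a header row.
# Real files use project-specific columns; check each project's documentation.
SAMPLE_TSV = (
    "drug:A\thas_target\tprotein:X\n"
    "drug:B\thas_target\tprotein:X\n"
    "drug:A\ttreats\tdisease:D\n"
)


def read_triples(handle, delimiter="\t"):
    """Return a list of (subject, predicate, object) tuples from a delimited file."""
    reader = csv.reader(handle, delimiter=delimiter)
    # Skip blank or malformed rows instead of failing on them.
    return [tuple(row) for row in reader if len(row) == 3]


def summarize(triples):
    """Count distinct nodes and the number of edges per predicate."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    edge_types = Counter(p for _, p, _ in triples)
    return len(nodes), edge_types


triples = read_triples(io.StringIO(SAMPLE_TSV))
num_nodes, edge_types = summarize(triples)
print(num_nodes, dict(edge_types))
```

For a comma-separated file, pass `delimiter=","`; for a file on disk, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` object.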

Tools

BioCypher – Publication (2023), Website, GitHub, PyPI
- Scope:
  - "a Python library that provides a low-code access point to data processing and ontology manipulation"
  - "a modular architecture that maximizes reuse of data and code in three ways: input, ontology and output"
  - "adhere to FAIR (Findable, Accessible, Interoperable and Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) principles"
- Goals:
  - "make the process of creating a biomedical knowledge graph easier than ever, but still flexible and transparent"
  - "abstracting the KG build process as a combination of modular input adapters"
  - "provides easy access to state-of-the-art KGs to the average biomedical researcher"
  - "creating a more interoperable biomedical research community"
KGX – Website, GitHub, PyPI
- Scope:
  - "a Python library and set of command line utilities"
  - "The core datamodel is a Property Graph (PG), represented internally in Python using a networkx MultiDiGraph model."
- Goals:
  - "exchanging Knowledge Graphs (KGs) that conform to or are aligned to the Biolink Model"
  - "provide validation, to ensure the KGs are conformant to the Biolink Model"
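To illustrate what a property graph modeled as a directed multigraph means, here is a small dependency-free sketch. It is not KGX's implementation and does not use networkx; the class `PropertyMultiDiGraph`, its methods and the node identifiers are hypothetical, and the Biolink-style category and predicate strings serve only as example property values. The point is that nodes and edges both carry properties and that several parallel edges may connect the same pair of nodes.

```python
from collections import defaultdict


class PropertyMultiDiGraph:
    """Tiny illustrative property graph: directed, parallel edges allowed."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(dict)  # (source, target) -> {edge key -> property dict}

    def add_node(self, node_id, **props):
        # Create the node if needed, then merge in any new properties.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, source, target, key, **props):
        # Endpoints are created on demand so edges never dangle.
        self.add_node(source)
        self.add_node(target)
        self.edges[(source, target)][key] = props

    def edge_count(self):
        return sum(len(parallel) for parallel in self.edges.values())


graph = PropertyMultiDiGraph()
graph.add_node("gene:1", category="biolink:Gene")
graph.add_node("disease:1", category="biolink:Disease")
# Two parallel edges between the same nodes, distinguished by their keys.
graph.add_edge("gene:1", "disease:1", key="e1", predicate="biolink:related_to")
graph.add_edge("gene:1", "disease:1", key="e2", predicate="biolink:contributes_to")
print(graph.edge_count())
```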

Databases

Collections

Ontologies and controlled vocabularies

Collections
- BioPortal
- Ontology Lookup Service
Biolink Model – Publication (2022), Website, Code
- Scope:
  - "a unified data model that bridges across multiple ontologies, schemas, and data models"
  - "a map for bringing together data from different sources under one unified model, and as a bridge between ontological domains"
- Goals:
  - "supported easier integration and interoperability of biomedical KGs"
  - "supports translation, integration, and harmonization across knowledge sources"

File formats

KGX (.json, .jsonl, .tsv, .ttl) – Website
Neo4j (.dump) – Website, Wikipedia
Resource Description Framework (RDF) – Website, Wikipedia
- Turtle (.ttl) – Website, Wikipedia
- N-Triples (.nt) – Website, Wikipedia
- Notation3 (.n3) – Website, Wikipedia
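Of the RDF serializations listed above, N-Triples is the simplest: one triple per line, terminated by a period. The sketch below writes and reads a restricted subset of it in which all three terms are IRIs. Literals, blank nodes, comments and escape sequences are deliberately not handled, the `example.org` IRIs are made up, and the function names are hypothetical; a full RDF library is the better choice for real data.

```python
import re

# Matches one N-Triples line whose three terms are all IRIs: <s> <p> <o> .
# Literals, blank nodes and comments are not covered by this sketch.
IRI_TRIPLE = re.compile(r"^<([^<>\s]+)>\s+<([^<>\s]+)>\s+<([^<>\s]+)>\s*\.\s*$")


def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI tuples, one triple per line."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


def from_ntriples(text):
    """Parse IRI-only lines back into tuples, ignoring lines that do not match."""
    triples = []
    for line in text.splitlines():
        match = IRI_TRIPLE.match(line)
        if match:
            triples.append(match.groups())
    return triples


# Hypothetical example IRIs.
triples = [
    (
        "http://example.org/drug/A",
        "http://example.org/treats",
        "http://example.org/disease/D",
    )
]
text = to_ntriples(triples)
print(text, end="")
```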
