SDKG-11

Multimodal Reasoning based on Knowledge Graph Embedding for Specific Diseases

Cite By

Chaoyu Zhu, Zhihao Yang, Xiaoqiong Xia, Nan Li, Fan Zhong, Lei Liu, Multimodal reasoning based on knowledge graph embedding for specific diseases, Bioinformatics, 2022, 38(8), 2235-2245.

Abstract

Motivation: Knowledge Graphs (KGs) are becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by knowledge graph embedding technology is a cutting-edge method. Some methods add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs focus on specific diseases.
Results: This work develops a construction and multimodal reasoning process for Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, an SDKG set including five cancers, six non-cancer diseases, a combined Cancer5, and a combined Diseases11, aiming to discover new reliable knowledge and provide universal pre-trained knowledge for that specific disease field. SDKG-11 is obtained through original triplet extraction, standard entity set construction, entity linking, and relation linking. We implement multimodal reasoning by reverse-hyperplane projection for SDKGs based on structure, category, and description embeddings. Multimodal reasoning improves pre-existing models on all SDKGs, using the entity prediction task as the evaluation protocol. We verify the model's reliability in discovering new knowledge by manually proofreading predicted drug-gene, gene-disease, and disease-drug pairs. Using the embedding results as initialization parameters for biomolecular interaction classification, we demonstrate the universality of the embedding models.

Files

Annotation/

E_dict_0.json ~ E_dict_5.json
get_E_dict.py : Run it first to build the complete E_dict.json
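
The file list above suggests that get_E_dict.py combines the six parts into one file. The snippet below is only a rough sketch of that idea, not the script itself. It assumes each E_dict_i.json holds a JSON object (dict) and that later parts may overwrite duplicate keys; the function name merge_e_dict_parts is hypothetical.

```python
import json
from pathlib import Path


def merge_e_dict_parts(folder, n_parts=6, out_name="E_dict.json"):
    """Merge E_dict_0.json ... E_dict_{n_parts-1}.json into a single JSON file.

    Assumption: every part is a JSON object; on duplicate keys the
    later part wins.
    """
    folder = Path(folder)
    merged = {}
    for i in range(n_parts):
        with open(folder / f"E_dict_{i}.json", encoding="utf-8") as f:
            merged.update(json.load(f))
    # Write the combined dictionary next to the parts.
    with open(folder / out_name, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return merged
```

If the real parts use a different layout (for example lists), prefer the provided get_E_dict.py.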

Dataset/

5 Cancers : Including colon_cancer, gallbladder_cancer, gastric_cancer, liver_cancer, lung_cancer
6 NonCancer : Including alzheimer_disease, copd, coronary_heart_disease, diabetes, heart_failure, rheumatoid_arthritis
Cancer5
Disease11

Model/

KGE.py : Class of processing and tool functions for Knowledge Graph Embedding
Models.py : TransE, TransH, ConvKB structure
Run_KGE.py : Run KGE.py
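
For orientation, the translation-based scores behind TransE and TransH can be sketched in NumPy. This is the textbook formulation from the two papers listed under Reference, not the TensorFlow code in Models.py, and it does not cover ConvKB or the multimodal projection. The function names and the choice of the L1 norm are assumptions made for illustration.

```python
import numpy as np


def transe_score(h, r, t, norm=1):
    """TransE distance ||h + r - t||; a lower value means a more plausible triplet."""
    return np.linalg.norm(h + r - t, ord=norm, axis=-1)


def project_to_hyperplane(e, w):
    """Project e onto the hyperplane whose normal is w (w is normalised first)."""
    w = w / np.linalg.norm(w, axis=-1, keepdims=True)
    return e - np.sum(w * e, axis=-1, keepdims=True) * w


def transh_score(h, r, t, w, norm=1):
    """TransH distance: translate between the projected head and tail."""
    h_p = project_to_hyperplane(h, w)
    t_p = project_to_hyperplane(t, w)
    return np.linalg.norm(h_p + r - t_p, ord=norm, axis=-1)


def margin_ranking_loss(pos, neg, margin=0.6):
    """Mean hinge loss that pushes positive scores below negative ones by `margin`."""
    return np.maximum(0.0, margin + pos - neg).mean()
```

The default margin of 0.6 mirrors the --margin value used in the Operating Instructions.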

C&D/

C_dict.data : Dict of entity category annotation
D_table.data : Table of entity description annotation
E_index.json : Entity index dict for C_dict and D_table
get_C_dict.py : Run it to get C_dict.data and E_index.data
D_Table.py : Structure for training description table
Optimization.py : Training optimization of BioBERT
Tokenization.py : Tokenization function of BioBERT
Run_D_Table.py : Run it to get D_table.data

Pretrained BioBERT/

bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
Download these files yourself from https://github.com/dmis-lab/biobert (the original files are named biobert_...; rename them to bert_...)

Supplementary Table/

Supplementary Table S1 (statistical analysis of entity prediction)
Supplementary Table S2 (drug-gene new inferred knowledge)
Supplementary Table S3 (gene-disease new inferred knowledge)
Supplementary Table S4 (disease-drug new inferred knowledge)
Supplementary Table S5 (closed-triplets)

Reference

(1) TransE: Translating Embeddings for Modeling Multi-relational Data
(2) TransH: Knowledge Graph Embedding by Translating on Hyperplanes
(3) ConvKB: A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network
(4) BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Code: https://github.com/google-research/bert)
(5) BioBERT: BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Version

(1) python 3.6
(2) tensorflow-gpu 1.12.0
(3) numpy 1.17.4

Operating Instructions

(1) Run get_E_dict.py to get E_dict.json in Annotation/

python get_E_dict.py

(2) Run get_C_dict.py to get C_dict.data and E_index.data in Model/C&D/ (these files are already in the folder, so this step can be skipped)

python get_C_dict.py   

(3) Run Run_D_Table.py to get D_table.data in Model/C&D/

python Run_D_Table.py --len_d 150 --dim 200 --l_r 1e-5 --batch_size 8 --epoches 10 --earlystop 1   

(4) Run Run_KGE.py to train TransE, TransH, and ConvKB in Model/

Parameter Interpretation

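lanta_c == 0 and lanta_d == 0 : S (structure only)
lanta_c != 0 and lanta_d == 0 : S + C (structure + category)
lanta_c == 0 and lanta_d != 0 : S + D (structure + description)
lanta_c != 0 and lanta_d != 0 : S + C + D (structure + category + description)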
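
The four cases above reduce to two independent switches. A small helper along these lines (the name modality_label is hypothetical and not part of the repository) can label a run from its two weights:

```python
def modality_label(lanta_c, lanta_d):
    """Return 'S', 'S + C', 'S + D' or 'S + C + D' for the given weights."""
    parts = ["S"]            # structure embeddings are always used
    if lanta_c != 0:
        parts.append("C")    # category information is switched on
    if lanta_d != 0:
        parts.append("D")    # description information is switched on
    return " + ".join(parts)
```

This can be handy for naming log files or result folders consistently across runs.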

[disease] is one of the following disease-name abbreviations:
{'ald' : 'alzheimer_disease',
'coc' : 'colon_cancer',
'cop' : 'copd',
'chd' : 'coronary_heart_disease',
'dia' : 'diabetes',
'gac' : 'gallbladder_cancer',
'gsc' : 'gastric_cancer',
'hef' : 'heart_failure',
'lic' : 'liver_cancer',
'luc' : 'lung_cancer',
'rha' : 'rheumatoid_arthritis',
'can' : '_cancer5',
'dis' : '_disease11'}

TransE:

python Run_KGE.py --model TransE --disease [disease] --dim 200 --margin 0.6 --lanta_c 0.0 --lanta_d 0.0 --l_r 5e-3 --epoches 1000

TransH:

python Run_KGE.py --model TransH --disease [disease] --dim 200 --margin 0.6 --lanta_c 0.0 --lanta_d 0.0 --l_r 5e-3 --epoches 1000

ConvKB:

python Run_KGE.py --model ConvKB --disease [disease] --dim 200 --n_filter 10 --lanta_c 0.0 --lanta_d 0.0 --l_r 1e-3 --epoches 200

The parameters above are for S (lanta_c = 0 and lanta_d = 0).
For S + C, S + D, and S + C + D, l_r = 5e-4 and epoches = 200 are recommended.
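
To avoid typing the flags by hand, the commands can be assembled by a small helper. The sketch below is not part of the repository; build_command, VALID_DISEASES and S_DEFAULTS are hypothetical names. It copies the S settings from the commands above and, as an assumption, applies the recommended l_r and epoches to every model as soon as either weight is non-zero. Non-zero values for lanta_c and lanta_d are left to the user.

```python
import shlex

# Abbreviations accepted by --disease, as listed in this README.
VALID_DISEASES = {"ald", "coc", "cop", "chd", "dia", "gac", "gsc",
                  "hef", "lic", "luc", "rha", "can", "dis"}

# Structure-only (S) settings copied from the example commands.
S_DEFAULTS = {
    "TransE": {"dim": 200, "margin": 0.6, "l_r": "5e-3", "epoches": 1000},
    "TransH": {"dim": 200, "margin": 0.6, "l_r": "5e-3", "epoches": 1000},
    "ConvKB": {"dim": 200, "n_filter": 10, "l_r": "1e-3", "epoches": 200},
}


def build_command(model, disease, lanta_c=0.0, lanta_d=0.0):
    """Return the Run_KGE.py command line for one model and one dataset."""
    if model not in S_DEFAULTS:
        raise ValueError("unknown model: %r" % model)
    if disease not in VALID_DISEASES:
        raise ValueError("unknown disease abbreviation: %r" % disease)
    params = dict(S_DEFAULTS[model])
    if lanta_c != 0 or lanta_d != 0:
        # Recommended settings for S + C, S + D and S + C + D.
        params.update({"l_r": "5e-4", "epoches": 200})
    args = ["python", "Run_KGE.py", "--model", model, "--disease", disease,
            "--dim", str(params["dim"])]
    if "margin" in params:
        args += ["--margin", str(params["margin"])]
    if "n_filter" in params:
        args += ["--n_filter", str(params["n_filter"])]
    args += ["--lanta_c", str(lanta_c), "--lanta_d", str(lanta_d),
             "--l_r", params["l_r"], "--epoches", str(params["epoches"])]
    return " ".join(shlex.quote(a) for a in args)
```

For example, build_command("TransE", "ald") reproduces the TransE command above with [disease] replaced by ald.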
