Intelligent data steward toolbox using Large Language Model embeddings for automated Data-Harmonization

Home Page: https://index.bio.scai.fraunhofer.de

License: Apache License 2.0

Python 99.54% Dockerfile 0.46%
data-harmonization data-stewardship embeddings large-language-models semantic-mapping

INDEX – the Intelligent Data Steward Toolbox

INDEX is an intelligent data steward toolbox that leverages Large Language Model embeddings for automated Data-Harmonization.

Introduction

INDEX computes vector embeddings from variable descriptions and suggests mappings between datasets based on their semantic similarity. Confirmed mappings are stored together with their vector representations in a knowledge base, where they can be reused for subsequent harmonization tasks, potentially improving later suggestions with each iteration. Both the embedding models and the storage databases are meant to be configurable and extendable, so the tool can be adapted to specific use cases.
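The idea can be illustrated with a minimal sketch in plain Python. The three-dimensional vectors are made-up placeholders standing in for real model embeddings, and the helper names `cosine_similarity` and `suggest` are hypothetical; they are not part of the datastew API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggest(knowledge_base, query_vector, limit=1):
    """Return the `limit` stored (label, similarity) pairs closest to the query."""
    scored = [
        (label, cosine_similarity(vector, query_vector))
        for label, vector in knowledge_base
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy vectors (assumption): real embeddings have hundreds of dimensions.
knowledge_base = [
    ("Diabetes mellitus (disorder)", [0.9, 0.1, 0.0]),
    ("Hypertension (disorder)", [0.1, 0.9, 0.0]),
]

query = [0.6, 0.2, 0.5]  # stand-in for the embedding of "Sugar sickness"
best_label, best_score = suggest(knowledge_base, query)[0]

# Storing the confirmed mapping adds a second vector for the same concept,
# so later queries phrased like this one match it directly.
knowledge_base.append((best_label, query))
```

After the confirmed mapping is appended, the same concept is reachable through two different phrasings, which is the sense in which suggestions may improve with each iteration.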

Installation

Using pip

pip install datastew

From source

Clone the repository:

git clone https://github.com/SCAI-BIO/index
cd index

Install python requirements:

pip install -r requirements.txt

Starting the Backend locally

The backend functionality is exposed through a REST API.

Run the Backend API on port 5000:

uvicorn datastew.api.routes:app --reload --port 5000
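Once the server is running, any HTTP client can talk to it. The sketch below builds a request with the standard library only. The route and query parameters are taken from the bug report quoted in the Issues section of this page, not from the API documentation, so treat them as assumptions; the helper name `build_mapping_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # port from the uvicorn command above


def build_mapping_request(concept_id, terminology_id, concept_name, text):
    """Build (but do not send) a PUT request that stores a mapping for a concept."""
    query = urlencode({
        "terminology_id": terminology_id,
        "concept_name": concept_name,
        "text": text,
    })
    url = f"{BASE_URL}/concepts/{concept_id}/mappings?{query}"
    return Request(url, method="PUT", headers={"accept": "application/json"})


request = build_mapping_request("id001", "test_ab3", "cough", "erkaeltung")
print(request.get_method(), request.full_url)

# To actually send it against a running backend:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.read().decode())
```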

Run the Backend via Docker

The API can also be run via Docker.

You can either build the Docker image locally or download the latest build from the index GitHub package registry.

docker build . -t ghcr.io/scai-bio/datastew/backend:latest
docker pull ghcr.io/scai-bio/datastew/backend:latest

After building or downloading the image, start the container. The INDEX API is then available on localhost:8000 by default:

docker run -p 8000:80 ghcr.io/scai-bio/datastew/backend:latest

Usage

Python

Creating and using stored mappings

A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in datastew/scripts/mapping_db_example.py:

from datastew.repository.sqllite import SQLLiteRepository
from datastew.repository.model import Terminology, Concept, Mapping
from datastew.embedding import MPNetAdapter

# omit mode to create a permanent db file instead
repository = SQLLiteRepository(mode="memory")
embedding_model = MPNetAdapter()

terminology = Terminology("snomed CT", "SNOMED")

text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, embedding_model.get_embedding(text1))

text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, embedding_model.get_embedding(text2))

repository.store_all([terminology, concept1, mapping1, concept2, mapping2])

text_to_map = "Sugar sickness"
embedding = embedding_model.get_embedding(text_to_map)
mappings, similarities = repository.get_closest_mappings(embedding, limit=2)
for mapping, similarity in zip(mappings, similarities):
    print(f"Similarity: {similarity} -> {mapping}")

output:

Similarity: 0.47353370635583486 -> Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder)
Similarity: 0.20031612264852067 -> Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder)

You can also import data from file sources (csv, tsv, xlsx) or from a public API such as OLS. An example script that downloads SNOMED from the EBI OLS and computes embeddings for it can be found in datastew/scripts/ols_snomed_retrieval.py.

Harmonizing excel/csv resources

You can import common data models, terminology sources or data dictionaries for harmonization directly from a csv, tsv or Excel file. An example of how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# variable_field and description_field refer to the corresponding column names in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

# Match every source variable against the target dictionary
df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.

Configuration

Description Embeddings

You can configure INDEX to use either a local language model or OpenAI's embedding API. While using the OpenAI API is significantly faster, you will need to provide an API key that is linked to your OpenAI account.

Currently, the following local models are implemented:

The API will default to use a local embedding model. You can adjust the model loaded on start up in the configurations.
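Because the embedding model is meant to be swappable, one way to picture the design is a small adapter interface. Only the method name `get_embedding` is taken from the usage example; the base class `EmbeddingModel`, the toy `CharCountAdapter` and the helper `embed_all` are hypothetical and do not reflect datastew's actual class hierarchy.

```python
from abc import ABC, abstractmethod


class EmbeddingModel(ABC):
    """Hypothetical common interface for embedding back ends."""

    @abstractmethod
    def get_embedding(self, text: str) -> list[float]:
        """Return a fixed-length vector for `text`."""


class CharCountAdapter(EmbeddingModel):
    """Toy stand-in for a real model: counts the letters a-z in the text."""

    def get_embedding(self, text: str) -> list[float]:
        lowered = text.lower()
        return [
            float(lowered.count(chr(code)))
            for code in range(ord("a"), ord("z") + 1)
        ]


def embed_all(model: EmbeddingModel, texts: list[str]) -> list[list[float]]:
    """Code written against the interface works with any adapter."""
    return [model.get_embedding(text) for text in texts]


vectors = embed_all(CharCountAdapter(), ["Diabetes", "Hypertension"])
```

Code that depends only on the interface does not need to change when a local model is exchanged for a remote API, which is the kind of configurability described here.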

Database

By default, INDEX stores mappings in a file-based database in the index/db directory. For testing purposes, the initial SQLite database file contains a few mappings to concepts in SNOMED CT. All available database adapter implementations can be found in index/repository.

To swap the database implementation, load your custom DB adapter or a pre-saved database file on application startup. The same can be done for any other embedding model.

Contributors

mehmetcanay, tiadams


index's Issues

Bug: Duplicate entries crash DB

curl -X PUT "https://index.bio.scai.fraunhofer.de/concepts/id001/mappings?terminology_id=test_ab3&concept_name=cough&text=erkaeltung" -H "accept: application/json"
{"detail":"Failed to create or update concept: (sqlite3.IntegrityError) UNIQUE constraint failed: concept.id\n[SQL: INSERT INTO concept (id, name, terminology_id) VALUES (?, ?, ?)]\n[parameters: ('id001', 'cough', 'test_ab3')]\n(Background on this error at: https://sqlalche.me/e/20/gkpj)"}
{"detail":"Failed to create or update terminology: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (sqlite3.IntegrityError) UNIQUE constraint failed: concept.id\n[SQL: INSERT INTO concept (id, name, terminology_id) VALUES (?, ?, ?)]\n[parameters: ('id001', 'cough', 'test_ab3')]\n(Background on this error at: https://sqlalche.me/e/20/gkpj) (Background on this error at: https://sqlalche.me/e/20/7s2a)"}
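The two responses point at two separate problems: the INSERT violates the uniqueness constraint on `concept.id`, and the session is then reused without a rollback. As a minimal sketch of one possible mitigation, the following uses the standard-library `sqlite3` module rather than the project's SQLAlchemy session. The table layout is copied from the SQL in the error message; the function name `upsert_concept` is hypothetical, and this is not the project's actual fix. The `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or newer.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE concept (id TEXT PRIMARY KEY, name TEXT, terminology_id TEXT)"
)


def upsert_concept(conn, concept_id, name, terminology_id):
    """Insert a concept, or update it if the ID already exists."""
    try:
        conn.execute(
            "INSERT INTO concept (id, name, terminology_id) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
            "terminology_id = excluded.terminology_id",
            (concept_id, name, terminology_id),
        )
        conn.commit()
    except sqlite3.Error:
        # Roll back so the connection stays usable for later requests.
        conn.rollback()
        raise


# Sending the same ID twice no longer raises an IntegrityError.
upsert_concept(connection, "id001", "cough", "test_ab3")
upsert_concept(connection, "id001", "cough", "test_ab3")
row_count = connection.execute("SELECT COUNT(*) FROM concept").fetchone()[0]
```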

Move db file to own directory

Having the db file in the same directory as a Python package will cause issues when mounting the directory in a data container or as a PVC.
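One way to decouple the database location from the package is to resolve it from an environment variable, with a default outside the source tree. The variable name `INDEX_DB_DIR`, the default directory and the file name below are all hypothetical choices made for illustration.

```python
import os
from pathlib import Path


def resolve_db_path(filename="index.db"):
    """Return the database path inside a dedicated, configurable directory."""
    # INDEX_DB_DIR is a hypothetical variable name; the default keeps the
    # file out of the Python package so the directory can be mounted alone.
    db_dir = Path(os.environ.get("INDEX_DB_DIR", Path.home() / ".index" / "db"))
    db_dir.mkdir(parents=True, exist_ok=True)
    return db_dir / filename
```

With this layout, a container or PVC mount only needs to target the directory named in the environment variable.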

Add DB adapter for Weaviate (vector db)

Implement a DB adapter for weaviate:
https://weaviate.io/developers/weaviate

Use the local in-memory / file-based DB in a first implementation

It should be possible to:

Store a computed embedding together with

  • A terminology label / ID (String)
  • Label of the Model used for generating this embedding (String)
  • The original String
  • A concept label / ID (String)

Retrieve an embedding

  • Based on the highest (cosine) similarity
  • Up to limit=n most similar vectors

Retrieve limit=n random vectors from the DB for visualization
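The requirements listed in this issue can be sketched as a small in-memory store in plain Python. This is a hypothetical stand-in for illustration only: it does not use the Weaviate client, and the names `StoredEmbedding` and `InMemoryVectorStore` as well as the toy two-dimensional vectors are assumptions.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class StoredEmbedding:
    vector: list[float]
    terminology: str  # terminology label / ID
    model: str        # label of the model that produced the vector
    text: str         # the original string
    concept: str      # concept label / ID


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB adapter; not the Weaviate client."""

    def __init__(self):
        self._items: list[StoredEmbedding] = []

    def store(self, item: StoredEmbedding) -> None:
        self._items.append(item)

    def get_closest(self, vector: list[float], limit: int = 5):
        """Return up to `limit` (item, cosine similarity) pairs, best first."""
        scored = [(item, _cosine(item.vector, vector)) for item in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]

    def get_random(self, limit: int = 5) -> list[StoredEmbedding]:
        """Return up to `limit` randomly chosen items, e.g. for plotting."""
        return random.sample(self._items, min(limit, len(self._items)))


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


store = InMemoryVectorStore()
store.store(StoredEmbedding([1.0, 0.0], "SNOMED", "toy-model", "Diabetes mellitus (disorder)", "11893007"))
store.store(StoredEmbedding([0.0, 1.0], "SNOMED", "toy-model", "Hypertension (disorder)", "73211009"))
closest_item, closest_score = store.get_closest([0.9, 0.1], limit=1)[0]
```

A real adapter would delegate the similarity search to the vector database instead of scanning every stored item, but the method surface could stay the same.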
