hamedbabaei / llms4ol

LLMs4OL:‌ Large Language Models for Ontology Learning

Python 76.11% Shell 6.81% Jupyter Notebook 16.91% Dockerfile 0.18%
large-language-models ontology-learning knowledge-graph transformers bert bloom chatgpt flan-t5 gpt-3 gpt-4

llms4ol's Introduction

| LLMs4OL Paradigm | Task A: Term Typing | Task B: Type Taxonomy Discovery | Task C: Type Non-Taxonomic Relation Extraction | Finetuning | Task A Detailed Results | Task B Detailed Results | Task C Detailed Results | Task A Datasets | Task B Datasets | Task C Datasets | Finetuning Datasets |


LLMs4OL: Large Language Models for Ontology Learning

Hamed Babaei Giglou, Jennifer D'Souza, and Sören Auer
{hamed.babaei, jennifer.dsouza, auer}@tib.eu
TIB Leibniz Information Center for Science and Technology, Hannover, Germany
Accepted for publication at ISWC 2023 - Research Track


Figure 1: The LLMs4OL task paradigm is an end-to-end conceptual framework for learning ontologies in different knowledge domains

Ontology Learning (OL) addresses the challenge of knowledge acquisition and representation in a variety of domains. Building on recent advances in NLP and the emergence of Large Language Models (LLMs), which have shown a strong capability to crystallize knowledge and patterns from vast text sources, we introduce LLMs4OL: Large Language Models for Ontology Learning, an empirical study of LLMs for the automated construction of ontologies across domains. The LLMs4OL paradigm tests the question: Does the capability of LLMs to capture intricate linguistic relationships translate effectively to OL, given that OL mainly relies on automatically extracting and structuring knowledge from natural language text?

Table of Contents

Repository Structure

.
└── LLMs4OL                             <- root directory of the repository
    ├── tuning                          <- Few-Shot finetuning directory
    │   └── ...
    ├── TaskA                           <- Term Typing task directory
    │   └── ...
    ├── TaskB                           <- Type Taxonomy Discovery task directory
    │   └── ...
    ├── TaskC                           <- Type Non-Taxonomic Relation Extraction task directory
    │   └── ...
    ├── assets                          <- artifacts directory 
    │   ├── LLMs                        <- contains pretrained LLMs
    │   ├── FSL                         <- contains fine-tuned LLMs (for training you should create this)
    │   ├── WordNetDefinitions          <- contains wordnet word definitions
    │   └── CountryCodes                <- GeoNames country codes
    ├── datasets                        <- contains datasets
    │   ├── FSL                         <- contains few-shot learning training datasets
    │   ├── TaskA                       <- contains directories for task A sources
    │   ├── TaskB                       <- contains directories for task B sources
    │   └── TaskC                       <- contains directories for task C sources
    ├── docs                            <- contains supplementary documents
    │   └── Supplementary-Material.pdf  <- supplementary material accompanying the paper
    ├── images                          <- contains the figures
    ├── README.md                       <- README file documenting the repository
    └── requirements.txt                <- Python requirements list

LLMs4OL Paradigm

The LLMs4OL paradigm offers a conceptual framework to accelerate the automated construction of ontologies, a process otherwise carried out manually by domain experts. OL tasks are based on the following ontology primitives:

  1. Corpus preparation – selecting and collecting the source texts to build the ontology.
  2. Terminology extraction – identifying and extracting relevant terms from the source text.
  3. Term typing – grouping similar terms as conceptual types.
  4. Taxonomy construction – identifying the “is-a” hierarchies between types.
  5. Relationship extraction – identifying and extracting “non-is-a” or semantic relationships between types.
  6. Axiom discovery – discovering constraints and inference rules for the ontology.

Toward realizing LLMs4OL, we empirically ground three core OL tasks, leveraging LLMs as a foundational basis for future work:

  • Term Typing
  • Type Taxonomy Discovery
  • Type Non-Taxonomic Relation Extraction

LLMs4OL Paradigm Setups

The LLMs4OL task paradigm is an end-to-end conceptual framework for learning ontologies in different knowledge domains, with the aim of automating ontology learning.

Tasks

The tasks within the blue arrow (in Figure 1) are the three OL tasks we empirically validated. For each task, we created a directory with a detailed description of the task:

Datasets

To comprehensively assess LLMs for the three OL tasks, we cover a variety of ontological knowledge domain sources, i.e. lexicosemantics – WN18RR (WordNet), geography – GeoNames, biomedicine – NCI, MEDICIN, SNOMEDCT_US, and web content types – Schema.Org. The sources differ per task, so the detailed information for each task is available as follows:

Results

The evaluation metric for Task A is mean average precision at k (MAP@k) with k = 1, and evaluations for Tasks B and C are reported in terms of the standard F1-score based on precision and recall. Complete and detailed results for the tasks are presented in the following tables:
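For reference, a minimal sketch of the two metrics described above (illustrative only; this is not the repository's evaluation code):

```python
def map_at_k(ranked_predictions, gold_labels, k=1):
    """Mean average precision at k.

    ranked_predictions: list of ranked label lists, one per test term.
    gold_labels: list of sets of correct labels, one per test term.
    """
    scores = []
    for preds, gold in zip(ranked_predictions, gold_labels):
        hits, precision_sum = 0, 0.0
        for i, p in enumerate(preds[:k], start=1):
            if p in gold:
                hits += 1
                precision_sum += hits / i
        scores.append(precision_sum / min(len(gold), k) if gold else 0.0)
    return sum(scores) / len(scores)

def f1_score(y_true, y_pred, positive=1):
    """Standard F1 from precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: MAP@1 only credits the top-ranked type.
print(map_at_k([["noun"], ["verb"]], [{"noun"}, {"adjective"}], k=1))  # 0.5
```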

Experimental LLMs

We ran experiments with five different LMs, described as follows:

Experiments

First, we created prompt templates based on the experimental language models and their nature: for Tasks A and B we created 8 templates per source, and for Task C only a single template. Next, we probed the LMs with zero-shot testing. Later, we attempted to boost the performance of two LLMs (Flan-T5-Large and Flan-T5-XL) with few-shot learning, using predefined prompt templates (different from the zero-shot testing ones), and then tested the models with the zero-shot testing prompt templates.

Prompt templates for zero-shot testing are represented as follows:

| Dataset | Task | Prompt templates path | Answer set mapper path |
|---------|------|-----------------------|------------------------|
| WN18RR | A | datasets/TaskA/WN18RR/templates.json | datasets/TaskA/WN18RR/label_mapper.json |
| GeoNames | A | datasets/TaskA/Geonames/templates.json | datasets/TaskA/Geonames/label_mapper.json |
| NCI, MEDICIN, SNOMEDCT_US | A | datasets/TaskA/UMLS/templates.json | datasets/TaskA/UMLS/label_mapper.json |
| Schema.Org, UMLS, GeoNames | B | datasets/TaskB/templates.txt | datasets/TaskB/label_mapper.json |
| UMLS | C | datasets/TaskC/templates.txt | datasets/TaskC/label_mapper.json |
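As an illustration of how such a zero-shot probe might look, here is a minimal sketch; the structure of templates.json and label_mapper.json, the model choice, and the predict_types helper are assumptions for illustration, not the repository's actual test.py:

```python
import json
from transformers import pipeline

# Assumed file layout (illustrative): templates.json holds prompt strings with
# [A] and [MASK] placeholders, and label_mapper.json maps predicted vocabulary
# words onto dataset type labels. Adjust paths and keys to the actual files.
with open("datasets/TaskA/WN18RR/templates.json") as f:
    templates = json.load(f)
with open("datasets/TaskA/WN18RR/label_mapper.json") as f:
    label_mapper = json.load(f)

# BERT-Large is one of the probed encoder LMs; fill-mask ranks candidate tokens.
fill_mask = pipeline("fill-mask", model="bert-large-uncased")

def predict_types(term: str, template: str, top_k: int = 10):
    # Fill the template placeholders, using the model's own mask token.
    prompt = (template.replace("[A]", term)
                      .replace("[MASK]", fill_mask.tokenizer.mask_token))
    candidates = fill_mask(prompt, top_k=top_k)
    # Keep only predictions that map onto a known type label (assumed mapping).
    ranked = [label_mapper.get(c["token_str"].strip()) for c in candidates]
    return [label for label in ranked if label is not None]

print(predict_types("cover", "[A] part-of-speech is a [MASK]."))
```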

Prompt templates used for training (few-shot fine-tuning) are listed below:

| Dataset | Task | Prompt templates path |
|---------|------|-----------------------|
| WN18RR, UMLS (NCI only), GeoNames, Schema.Org | A, B, C | tuning/templates.py |
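As a rough illustration of few-shot fine-tuning on such prompt/answer pairs, the sketch below uses Hugging Face's Seq2SeqTrainer with Flan-T5; the example data, phrasing, and hyperparameters are assumptions and this is not the repository's tuning/trainer.py:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Illustrative few-shot pairs: a filled training prompt and its gold answer.
examples = [
    {"prompt": "cover the child with a blanket. cover part-of-speech is a", "answer": "verb"},
    {"prompt": "berlin is a place in germany. berlin geographically is a", "answer": "capital of a political entity"},
]

def tokenize(batch):
    enc = tokenizer(batch["prompt"], truncation=True, padding="max_length", max_length=64)
    labels = tokenizer(batch["answer"], truncation=True, padding="max_length", max_length=16)
    # For simplicity, padded label tokens are not masked out of the loss here.
    enc["labels"] = labels["input_ids"]
    return enc

train_dataset = Dataset.from_list(examples).map(tokenize, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="assets/FSL/flan-t5-large",  # fine-tuned checkpoints live under assets/FSL
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```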

Results Overview

Figure 2. Comparative visual of the zero-shot and finetuned results. Unfilled shapes, filled shapes, and small filled stars represent performances in tasks A, B, and C, respectively.

How to run tasks

Requirements

Software Requirements:

  • Python 3.9
  • requirements.txt libraries

Instructions:

First, install conda following the conda installation guide, and then create and activate your environment as follows:

conda create -n yourenvname python=3.9
conda activate yourenvname

Next, clone the repository and install the requirements from requirements.txt in your environment:

git clone https://github.com/HamedBabaei/LLMs4OL.git

cd LLMs4OL

pip install -r requirements.txt

Next, add your OpenAI API key to the .env file (as OPENAI_API_KEY=...) to run experiments on the OpenAI models. Finally, start the experiments as described in the task directories.
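A minimal, hedged example of how the key could be read from .env (assuming the python-dotenv package; the repository's own loading code may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from the .env file in the working directory
api_key = os.environ["OPENAI_API_KEY"]  # the variable name the code looks up; raises KeyError if missing
```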

Running Tasks

To make each task behave as a separately encapsulated module, we created separate directories for the datasets as well as the tasks. Each task contains a test_auto.sh shell script that automatically runs zero-shot testing on all of that task's datasets and stores the results in the TaskX/results/DATASET_NAME/ directory. You can also run any model on your desired input dataset with test_manual.sh, which asks for the dataset, an output log path, the model name, and the device (CPU or GPU). For each of the important directories we provide a test.py script that is called by test_manual.sh and test_auto.sh multiple times on different datasets. The structure of the TaskA, TaskB, and TaskC directories (LLMs4OL/TaskX) is as follows:

.
└── LLMs4OL                      
    ├── tuning   
    │   ├── ....
    │   ├── trainer.py
    │   └── train_eval.sh
    ├── TaskX             
    │   ├── ...
    │   ├── results
    │   │   ├── dataset1
    │   │   └── ....
    │   ├── ...
    │   ├── test.py
    │   ├── test_auto.sh
    │   ├── test_manual.sh
    │   └── README.md
    ...

The train_eval.sh script in the tuning directory runs trainer.py on the representative datasets and then walks through the TaskX directories, calling test.py to evaluate the trained models on each dataset. How to run the models in detail is described in the task directories' README.md files.

Citations

@InProceedings{llms4ol,
  author    = "Babaei Giglou, Hamed and D'Souza, Jennifer and Auer, Sören",
  title     = "LLMs4OL: Large Language Models for Ontology Learning",
  booktitle = "International Semantic Web Conference",
  year      = "2023",
  publisher = "Springer International Publishing"
}

Preprint:

@misc{giglou2023llms4ol,
      title={LLMs4OL: Large Language Models for Ontology Learning}, 
      author={Hamed Babaei Giglou and Jennifer D'Souza and Sören Auer},
      year={2023},
      eprint={2307.16648},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}


llms4ol's Issues

Update paper with Task A, B, and C results, stats, and datasets

  • Formulations and Task Definitions section fix
  • prompt engineering section
  • add complete results for Task A
  • add complete results for Task B
  • add complete results for Task C
  • adding datasets for task A
  • adding datasets for task B
  • adding datasets for task C
  • adding evaluation metrics for tasks B and C
  • adding LMs descriptions
  • start appendixes

Inference GPT3 on Task A

  • make evaluations

Run Model for WN18RR

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for GeoNames

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for NCI

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for MEDCIN

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for SNOMED

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

download and add models artifacts

download the following models and put them on the server as well as in my local directory!

  • Flan-T5-Large
  • Flan-T5-XL
  • BERT-Large
  • BART-Large

Inference GPT2 on Task B

  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-1 for Schema.ORG
  • Run Model-2 for Schema.ORG

Model-1 is GPT2-Large, and Model-2 is GPT2-XL

Experiments on BERT-Large for baseline model creations

Initial tasks:

  • make wordnet finalized dataset
  • make geonames finalized dataset
  • make umls finalized dataset

We are about to test templates for the datasets to find the best template for each. The templates are:

Wordnet templates:

  • 1. [sentence] . [A] POS is a [MASK].
  • 2. [sentence]. [A] part-of-speech is a [MASK].
  • 3. [sentence] . '[A]' POS is a [MASK].
  • 4. [sentence]. '[A]' part-of-speech is a [MASK].
  • 5. [A] POS is a [MASK].
  • 6. [A] part-of-speech is a [MASK].
  • 7. '[A]' POS is a [MASK].
  • 8. '[A]' part-of-speech is a [MASK].

UMLS templates (for all three datasets: MEDICIN, NCI, SNOMEDCT_US):

  • 1. [sentence] . [A] in medicine is a [MASK].
  • 2. [sentence]. [A] in biomedicine is a [MASK].
  • 3. [sentence] . '[A]' in medicine is a [MASK].
  • 4. [sentence]. '[A]' in biomedicine is a [MASK].
  • 5. [A] in medicine is a [MASK].
  • 6. [A] in biomedicine is a [MASK].
  • 7. '[A]' in medicine is a [MASK].
  • 8. '[A]' in biomedicine is a [MASK].

Geonames templates

  • 1. [sentence] . [A] is a [MASK].
  • 2. [sentence]. [A] geographically is a [MASK].
  • 3. [sentence] . '[A]' is a [MASK].
  • 4. [sentence]. '[A]' geographically is a [MASK].
  • 5. [A] is a [MASK].
  • 6. [A] geographically is a [MASK].
  • 7. '[A]' is a [MASK].
  • 8. '[A]' geographically is a [MASK].

For the Geonames [sentence] we can use the more generic template that we designed: [NAME] is a place in [COUNTRY].

We are interested in this kind of template for the following reasons:

  • A template such as [A] is a [MASK] is the kind of general query anyone could type into a search engine like Google; here we want to use it as a query against the language model, treated as a knowledge base, to see whether or not it holds knowledge about [A].
  • We add the sentence at the beginning to give the model the context we are talking about; this way we are also able to fine-tune our models for better results.
  • These templates are appropriate for level 1 in the UMLS and Geonames datasets, because for level 2 more tokens might need to be generated, which is a problem for BERT models; because of this we will move on with BART.

Tasks fall into the following categories:

  • Define evaluation modules and structuring codes
  • Make datasets
  • Clean codes and make models in form of scrips
  • Model 1 is BERT-Large on 5 datasets and 8 templates
  • Model 2 is the Freq Model on 5 datasets and 8 templates

Language Models as a Knowledge base - experimentation for our datasets

It seems that they aligned with Wikipedia texts to obtain a sentence containing the specific subject or object entity, and then made predictions over an object entity using MASKs. For ConceptNet, they considered their own base dataset sentences. They then created query templates for relations (in our task I should create them for entity types as well) to query the LMs.

Following their work, our input should consist of the alignment text and the query template. For example, for WordNet we can obtain example sentences for any synset; for the entity cover we can get the sentence "cover the child with a blanket", and adding the template at the end gives: "cover the child with a blanket. cover word type is a [MASK]" (or any other template like this -- this is just an example), where the MASK is 'verb'. (This is only an idea; the first step is to test this paper's idea.)

Most of the datasets that did not have sentences from their own sources relied on Wikipedia. Having had a quick look at their code, I understood that they only used the embeddings and vocabulary obtained from each LM to calculate token probabilities, then picked the top ones and used search-engine metrics to evaluate the results.

Now, for entity type detection, the tasks are:

  • Create sample sets for the Wordnet dataset.
  • Create sample sets for the Geonames dataset - Let's consider level 1 for now.
  • Create sample sets for the UMLS dataset - Let's consider NCI for now.

Dataset Help

It's a great honor to see your masterpiece, but now I'm facing difficulties. Can you provide the nci_entities.json data file? Thank you very much.

Inference GPT2 on Task A

Prompt Engineering: #19

  • Design Prompt for WN18RR
  • Design Prompt for GeoNames
  • Design Prompt for UMLS

Inferencing

  • Run Model-1 for WN18RR
  • Run Model-2 for WN18RR
  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-1 for NCI
  • Run Model-2 for NCI
  • Run Model-1 for SNOMEDCT_US
  • Run Model-2 for SNOMEDCT_US
  • Run Model-1 for MEDCIN
  • Run Model-2 for MEDCIN

Model-1 is GPT2-Large, and Model-2 is GPT2-XL

Inference BLOOM on Task C

  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-3 for UMLS

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b, and Model-3 is BLOOM-7b1

Basic question: how to run it on Windows?

I just use the PyCharm virtual environment and can't find the .env file. I installed requirements.txt and ran test.py, and the program showed the error KeyError: 'OPENAI_API_KEY'. Can you give me some guidance?
Sincerely yours!

Inference BLOOM on Task A

Prompt Engineering: #19

  • Design Prompt for WN18RR
  • Design Prompt for GeoNames
  • Design Prompt for UMLS

Inferencing

  • Run Model-1 for WN18RR
  • Run Model-2 for WN18RR
  • make evaluation for WN18RR
  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • make evaluation for GeoNames
  • Run Model-1 for NCI
  • Run Model-2 for NCI
  • make evaluation for NCI
  • Run Model-1 for SNOMEDCT_US
  • Run Model-2 for SNOMEDCT_US
  • make evaluation for SNOMEDCT_US
  • Run Model-1 for MEDCIN
  • Run Model-2 for MEDCIN
  • make evaluation for MEDCIN

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b

Reduce the size of the GeoNames dataset

The current size of GeoNames is too large and the inference time is almost 3 days for a single template, so I am reducing the size from 1.7M to 700K.

Inference BLOOM on Task B

  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-3 for GeoNames
  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-3 for UMLS
  • Run Model-1 for Schema.ORG
  • Run Model-2 for Schema.ORG
  • Run Model-3 for Schema.ORG

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b, and Model-3 is BLOOM-7b1

Vision, Problem Formulation, Tasks, RQs, Writings, ....

LLMs4OL

Research Questions

RQ1: Can LLMs identify term/entity types? (Task A)
RQ2: Do LLMs comprehend relations?

  • RQ2.1: Can LLMs recognize type hierarchies? (is-a relations -- tree structures) (Task B)
  • RQ2.2: Can LLMs identify non-is-a relations in hierarchies? (non-is-a relations -- graph structures) (Task C)

Tasks

Task A: The goal is to find out which LLMs are capable of identifying term/entity types without being given prior knowledge about the types. Because we don't want to give the LLMs any knowledge about the types, this task is a generation task. Design considerations for this task are as follows:

  1. We are interested in the entity types at the lowest level.
  2. Since fine-tuning is one of the possible steps, splitting the data into train and test sets is required.
  3. We will only consider entities/terms.
  4. Entities at the leaves of the type hierarchy also inherit their parents' types (this matters for evaluation, since for this task we don't expect models to know the hierarchies).

Task B: The aim of this task and the next one (Task C) is to understand whether LLMs can find relations without naming those relations. The relationship can be undirected or directed. These tasks are classification tasks.

For example:

Acquired Abnormality is a location of a Virus.

location_of is a relation between the two mentioned types. Our goal is to find that Acquired Abnormality and Virus are related, not to find the name of the relation (which in this case is location_of), because naming relations would require clustering similar relations and asking experts to name them. So in Task B we are interested in the is-a relations between term/entity types.

In Task B, we only want to find the relationships between types that form a hierarchy (a structure that arranges types in a tree from top to bottom, where the top is a root -- there could be multiple roots -- and the bottom is a leaf); this kind of relationship is called an is-a relation. As an example:

C is a subclass of B.
B is a subclass of A.
D is a subclass of B.
E is a subclass of A.


Task C: However, among types it is also possible to find relations outside the tree structure, similar to relations between types in a graph. For example (continuing the Task B example):

E somehow has a relationship with C.
C somehow has another direct relation with A.

So, in this task, the goal is to analyze LLMs from this perspective.
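A hedged sketch of how Tasks B and C could be framed as true/false classification prompts for a generative LM (the prompt wording, model, and helper are illustrative, not the repository's exact templates):

```python
from transformers import pipeline

# Flan-T5 is used here purely to illustrate the true/false framing.
classifier = pipeline("text2text-generation", model="google/flan-t5-large")

def pair_holds(type_a: str, type_b: str, taxonomic: bool = True) -> bool:
    if taxonomic:
        # Task B: does an "is-a" (subclass) relation hold between the two types?
        prompt = f"{type_a} is a subclass of {type_b}. Is this statement true or false?"
    else:
        # Task C: does some non-taxonomic (semantic) relation hold between them?
        prompt = f"{type_a} is related to {type_b}. Is this statement true or false?"
    answer = classifier(prompt, max_new_tokens=5)[0]["generated_text"].lower()
    return "true" in answer

# Task B style check on two illustrative types.
print(pair_holds("acquired abnormality", "anatomical abnormality", taxonomic=True))
```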

Conference

Our Target is for ISWC 2023: https://iswc2023.semanticweb.org/call-for-research-track-papers/

Abstract submission due May 2nd, 2023
Full paper submission due May 9th, 2023
Objection and Response June 13th – 16th, 2023
Notifications July 12th, 2023
Camera ready papers due July 31st, 2023

ToDo - improve repository quality

  • Fixing LLM path from the local path to huggingface repository id #52
  • Add some documentation on how to generate Tasks datasets #52, #54
  • Upload open source datasets
  • add pre-commit
  • versioning and releasing the stable version

Tasks and conclusions from meeting 9 Jan 2023 regarding dataset preparations

WN18RR
We checked the dataset and its diagrams, and we decided on a few things and tasks for this dataset.

  • Set an upper bound on samples per entity type: $FQ_{type} < 10000$ for the train set and $FQ_{type} < 1000$ for the test and validation sets, where $FQ_{type}$ is the frequency of a type and the type can be NN, JJ, VB, or RB (see the sketch after this list).
  • Could we check which relation types we want to consider at this step? We decided to ignore also_see and consider _hypernym. What about others?
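A small sketch of how such a per-type cap could be applied with pandas; the DataFrame column name and helper are assumptions for illustration, not the notebook's actual code:

```python
import pandas as pd

def cap_per_type(df: pd.DataFrame, max_per_type: int, type_col: str = "type",
                 seed: int = 42) -> pd.DataFrame:
    """Keep at most max_per_type rows for each entity type (e.g. NN, JJ, VB, RB)."""
    return (df.groupby(type_col, group_keys=False)
              .apply(lambda g: g.sample(min(len(g), max_per_type), random_state=seed)))

# Hypothetical usage with the thresholds agreed above:
# train = cap_per_type(wn18rr_train, 10_000)
# test  = cap_per_type(wn18rr_test, 1_000)
```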

FB15K-237
We concluded that the hierarchy I extended for this dataset is effectively our contribution, and we will stick to this hierarchy moving forward with this dataset.

  • Complete my diagrams in 01-analysis of datasets.ipynb for this dataset
  • Ignore the following class types due to their low frequencies (but double-check the frequencies again before removing them). We decided to remove types with frequencies less than 1000.
Level-3-person-doctor                 213
Level-2-body_of_water                   43
Level-3-body_of_water-sea                2

Again, we need to rethink this after getting a clearer picture (i.e., after completing my diagrams).


Geonames
We talked about how level 2 is generated (see notebook 02-Geoname-levels-creation.ipynb) using a frequency matrix over the starting string for level 2, and we concluded the following tasks:

  • Consider level 2 and set an upper bound on the number of classes: a maximum of the 10 most frequent classes is considered in each level.
  • Consider level 1 and set an upper bound on the number of samples in each level-1 class based on the frequency of samples in the class, $FQ_{level-1} < 1e6$.

After these tasks, we should check whether the new version of the dataset statistics is acceptable for us in terms of the frequency of classes at each level.


UMLS

We have a lot of samples with entity types and relations, and we don't know which to consider. However, to continue we need the following information (we decided to consider only the English language):

  • Table of frequencies for types based on sources (SAB column) -- using MRREL or MRCONSO file
  • Table of frequencies for relationships based on sources (SAB column)

Either of these two tasks will allow us to proceed with cutting the samples down to smaller sizes.
