hamedbabaei / llms4ol

LLMs4OL:‌ Large Language Models for Ontology Learning

Python 76.11% Shell 6.81% Jupyter Notebook 16.91% Dockerfile 0.18%
large-language-models ontology-learning knowledge-graph transformers bert bloom chatgpt flan-t5 gpt-3 gpt-4

llms4ol's Introduction

| LLMs4OL Paradigm | Task A: Term Typing | Task B: Type Taxonomy Discovery | Task C: Type Non-Taxonomic Relation Extraction | Finetuning | Task A Detailed Results | Task B Detailed Results | Task C Detailed Results | Task A Datasets | Task B Datasets | Task C Datasets | Finetuning Datasets |


LLMs4OL: Large Language Models for Ontology Learning

Hamed Babaei Giglou, Jennifer D'Souza, and Sören Auer
{hamed.babaei, jennifer.dsouza, auer}@tib.eu
TIB Leibniz Information Center for Science and Technology, Hannover, Germany
Accepted for publication at ISWC 2023 - Research Track


Figure 1: The LLMs4OL task paradigm is an end-to-end conceptual framework for learning ontologies in different knowledge domains

Ontology Learning (OL) addresses the challenge of knowledge acquisition and representation in a variety of domains. Building on recent advances in NLP and the emergence of Large Language Models (LLMs), which have shown a strong capability to crystallize knowledge and patterns from vast text sources, we introduce LLMs4OL: Large Language Models for Ontology Learning, an empirical study of LLMs for the automated construction of ontologies across domains. The LLMs4OL paradigm tests the question: Does the capability of LLMs to capture intricate linguistic relationships translate effectively to OL, given that OL mainly relies on automatically extracting and structuring knowledge from natural language text?

Table of Contents

Repository Structure

.
└── LLMs4OL                             <- root directory of the repository
    ├── tuning                          <- Few-Shot finetuning directory
    │   └── ...
    ├── TaskA                           <- Term Typing task directory
    │   └── ...
    ├── TaskB                           <- Type Taxonomy Discovery task directory
    │   └── ...
    ├── TaskC                           <- Type Non-Taxonomic Relation Extraction task directory
    │   └── ...
    ├── assets                          <- artifacts directory 
    │   ├── LLMs                        <- contains pretrained LLMs
    │   ├── FSL                         <- contains fine-tuned LLMs (for training you should create this)
    │   ├── WordNetDefinitions          <- contains wordnet word definitions
    │   └── CountryCodes                <- GeoNames country codes
    ├── datasets                        <- contains datasets
    │   ├── FSL                         <- contains few-shot learning training datasets
    │   ├── TaskA                       <- contains directories for task A sources
    │   ├── TaskB                       <- contains directories for task B sources
    │   └── TaskC                       <- contains directories for task C sources
    ├── docs                            <- contains supplementary documents
    │   └── Supplementary-Material.pdf  <- supplementary material accompanying the paper
    ├── images                          <- contains the figures
    ├── README.md                       <- README file documenting the repository
    └── requirements.txt                <- Python requirements list

LLMs4OL Paradigm

The LLMs4OL paradigm offers a conceptual framework to accelerate the automated construction of ontologies, a process otherwise carried out manually by domain experts. OL tasks are based on the following ontology primitives:

  1. Corpus preparation – selecting and collecting the source texts to build the ontology.
  2. Terminology extraction – identifying and extracting relevant terms from the source text.
  3. Term typing – grouping similar terms as conceptual types.
  4. Taxonomy construction – identifying the “is-a” hierarchies between types.
  5. Relationship extraction – identifying and extracting “non-is-a” or semantic relationships between types.
  6. Axiom discovery – discovering constraints and inference rules for the ontology.

Toward realizing LLMs4OL, we empirically ground three core OL tasks, leveraging LLMs as a foundational basis for future work:

  • Term Typing
  • Type Taxonomy Discovery
  • Type Non-Taxonomic Relation Extraction

LLMs4OL Paradigm Setups

The LLMs4OL task paradigm is an end-to-end conceptual framework for learning ontologies in different knowledge domains, with the aim of automating ontology learning.

Tasks

The tasks within the blue arrow (in Figure 1) are the three OL tasks we empirically validated. For each task, we created a directory with a detailed description of the task:

Datasets

To comprehensively assess LLMs for the three OL tasks, we cover a variety of ontological knowledge domain sources, i.e. lexicosemantics – WN18RR (WordNet), geography – GeoNames, biomedicine – NCI, MEDICIN, SNOMEDCT_US, and web content types – Schema.Org. The sources differ per task, so the detailed information for each task is available as follows:

Results

The evaluation metric for Task A is mean average precision at k (MAP@k) with k = 1, and evaluations for Tasks B and C are reported in terms of the standard F1-score based on precision and recall. Complete and detailed results for the tasks are presented in the following tables:
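For reference, a minimal sketch of the two metrics described above (illustrative only; this is not the repository's evaluation code):

```python
def map_at_k(ranked_predictions, gold_labels, k=1):
    """Mean average precision at k.

    ranked_predictions: list of ranked label lists, one per test term.
    gold_labels: list of sets of correct labels, one per test term.
    """
    scores = []
    for preds, gold in zip(ranked_predictions, gold_labels):
        hits, precision_sum = 0, 0.0
        for i, p in enumerate(preds[:k], start=1):
            if p in gold:
                hits += 1
                precision_sum += hits / i
        scores.append(precision_sum / min(len(gold), k) if gold else 0.0)
    return sum(scores) / len(scores)

def f1_score(y_true, y_pred, positive=1):
    """Standard F1 from precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: MAP@1 only credits the top-ranked type.
print(map_at_k([["noun"], ["verb"]], [{"noun"}, {"adjective"}], k=1))  # 0.5
```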

Experimental LLMs

We ran experiments with five different LMs, described as follows:

Experiments

First, we created prompt templates based on the experimental language models and their nature: for Tasks A and B we created 8 templates per source, and for Task C only a single template. Next, we probed the LMs with zero-shot testing. Later, we attempted to boost the performance of two LLMs (Flan-T5-Large and Flan-T5-XL) with few-shot learning, using predefined prompt templates (different from the zero-shot testing ones), and then tested the models with the zero-shot testing prompt templates.

Prompt templates for zero-shot testing are represented as follows:

| Dataset | Task | Prompt templates path | Answer set mapper path |
|---------|------|-----------------------|------------------------|
| WN18RR | A | datasets/TaskA/WN18RR/templates.json | datasets/TaskA/WN18RR/label_mapper.json |
| GeoNames | A | datasets/TaskA/Geonames/templates.json | datasets/TaskA/Geonames/label_mapper.json |
| NCI, MEDICIN, SNOMEDCT_US | A | datasets/TaskA/UMLS/templates.json | datasets/TaskA/UMLS/label_mapper.json |
| Schema.Org, UMLS, GeoNames | B | datasets/TaskB/templates.txt | datasets/TaskB/label_mapper.json |
| UMLS | C | datasets/TaskC/templates.txt | datasets/TaskC/label_mapper.json |
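As an illustration of how such a zero-shot probe might look, here is a minimal sketch; the structure of templates.json and label_mapper.json, the model choice, and the predict_types helper are assumptions for illustration, not the repository's actual test.py:

```python
import json
from transformers import pipeline

# Assumed file layout (illustrative): templates.json holds prompt strings with
# [A] and [MASK] placeholders, and label_mapper.json maps predicted vocabulary
# words onto dataset type labels. Adjust paths and keys to the actual files.
with open("datasets/TaskA/WN18RR/templates.json") as f:
    templates = json.load(f)
with open("datasets/TaskA/WN18RR/label_mapper.json") as f:
    label_mapper = json.load(f)

# BERT-Large is one of the probed encoder LMs; fill-mask ranks candidate tokens.
fill_mask = pipeline("fill-mask", model="bert-large-uncased")

def predict_types(term: str, template: str, top_k: int = 10):
    # Fill the template placeholders, using the model's own mask token.
    prompt = (template.replace("[A]", term)
                      .replace("[MASK]", fill_mask.tokenizer.mask_token))
    candidates = fill_mask(prompt, top_k=top_k)
    # Keep only predictions that map onto a known type label (assumed mapping).
    ranked = [label_mapper.get(c["token_str"].strip()) for c in candidates]
    return [label for label in ranked if label is not None]

print(predict_types("cover", "[A] part-of-speech is a [MASK]."))
```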

Prompt templates used for training (few-shot fine-tuning) are listed below:

| Dataset | Task | Prompt templates path |
|---------|------|-----------------------|
| WN18RR, UMLS (NCI only), GeoNames, Schema.Org | A, B, C | tuning/templates.py |
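As a rough illustration of few-shot fine-tuning on such prompt/answer pairs, the sketch below uses Hugging Face's Seq2SeqTrainer with Flan-T5; the example data, phrasing, and hyperparameters are assumptions and this is not the repository's tuning/trainer.py:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Illustrative few-shot pairs: a filled training prompt and its gold answer.
examples = [
    {"prompt": "cover the child with a blanket. cover part-of-speech is a", "answer": "verb"},
    {"prompt": "berlin is a place in germany. berlin geographically is a", "answer": "capital of a political entity"},
]

def tokenize(batch):
    enc = tokenizer(batch["prompt"], truncation=True, padding="max_length", max_length=64)
    labels = tokenizer(batch["answer"], truncation=True, padding="max_length", max_length=16)
    # For simplicity, padded label tokens are not masked out of the loss here.
    enc["labels"] = labels["input_ids"]
    return enc

train_dataset = Dataset.from_list(examples).map(tokenize, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="assets/FSL/flan-t5-large",  # fine-tuned checkpoints live under assets/FSL
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```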

Results Overview

Figure 2. Comparative visual of the zero-shot and finetuned results. Unfilled shapes, filled shapes, and small filled stars represent performances in tasks A, B, and C, respectively.

How to run tasks

Requirements

Software Requirements:

  • Python 3.9
  • requirements.txt libraries

Instructions:

First, install conda following the conda installation guide, and then create and activate your environment as follows:

conda create -n yourenvname python=3.9
conda activate yourenvname

Next, clone the repository and install the requirements from requirements.txt in your environment:

git clone https://github.com/HamedBabaei/LLMs4OL.git

cd LLMs4OL

pip install -r requirements.txt

Next, add your OpenAI API key to the .env file (as OPENAI_API_KEY=...) to run experiments on the OpenAI models. Finally, start the experiments as described in the task directories.
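A minimal, hedged example of how the key could be read from .env (assuming the python-dotenv package; the repository's own loading code may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from the .env file in the working directory
api_key = os.environ["OPENAI_API_KEY"]  # the variable name the code looks up; raises KeyError if missing
```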

Running Tasks

To make each task behave as a separately encapsulated module, we created separate directories for the datasets as well as the tasks. Each task contains a test_auto.sh shell script that automatically runs zero-shot testing on all of that task's datasets and stores the results in the TaskX/results/DATASET_NAME/ directory. You can also run any model on your desired input dataset with test_manual.sh, which asks for the dataset, an output log path, the model name, and the device (CPU or GPU). For each of the important directories we provide a test.py script that is called by test_manual.sh and test_auto.sh multiple times on different datasets. The structure of the TaskA, TaskB, and TaskC directories (LLMs4OL/TaskX) is as follows:

.
└── LLMs4OL                      
    ├── tuning   
    │   ├── ....
    │   ├── trainer.py
    │   └── train_eval.sh
    ├── TaskX             
    │   ├── ...
    │   ├── results
    │   │   ├── dataset1
    │   │   └── ....
    │   ├── ...
    │   ├── test.py
    │   ├── test_auto.sh
    │   ├── test_manual.sh
    │   └── README.md
    ...

The train_eval.sh script in the tuning directory runs trainer.py on the representative datasets and then walks through the TaskX directories, calling test.py to evaluate the trained models on each dataset. How to run the models in detail is described in the task directories' README.md files.

Citations

@InProceedings{llms4ol,
  author    = "Babaei Giglou, Hamed and D'Souza, Jennifer and Auer, Sören",
  title     = "LLMs4OL: Large Language Models for Ontology Learning",
  booktitle = "International Semantic Web Conference",
  year      = "2023",
  publisher = "Springer International Publishing"
}

Preprint:

@misc{giglou2023llms4ol,
      title={LLMs4OL: Large Language Models for Ontology Learning}, 
      author={Hamed Babaei Giglou and Jennifer D'Souza and Sören Auer},
      year={2023},
      eprint={2307.16648},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}


llms4ol's Issues

Update paper with Task A, B, and C results, stats, and datasets

  • Formulations and Task Definitions section fix
  • prompt engineering section
  • add complete results for Task A
  • add complete results for Task B
  • add complete results for Task C
  • adding datasets for task A
  • adding datasets for task B
  • adding datasets for task C
  • adding evaluation metrics for tasks B and C
  • adding LMs descriptions
  • start appendixes

Inference GPT3 on Task A

  • make evaluations

Run Model for WN18RR

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for GeoNames

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for NCI

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for MEDCIN

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for SNOMED

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

download and add models artifacts

download the following models and put them on the server as well as in my local directory!

  • Flan-T5-Large
  • Flan-T5-XL
  • BERT-Large
  • BART-Large

Inference GPT2 on Task B

  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-1 for Schema.ORG
  • Run Model-2 for Schema.ORG

Model-1 is GPT2-Large, and Model-2 is GPT2-XL

Experiments on BERT-Large for baseline model creations

Initial tasks:

  • make wordnet finalized dataset
  • make geonames finalized dataset
  • make umls finalized dataset

We are about to test templates for the datasets to find the best template for each. The templates are:

Wordnet templates:

  • 1. [sentence] . [A] POS is a [MASK].
  • 2. [sentence]. [A] part-of-speech is a [MASK].
  • 3. [sentence] . '[A]' POS is a [MASK].
  • 4. [sentence]. '[A]' part-of-speech is a [MASK].
  • 5. [A] POS is a [MASK].
  • 6. [A] part-of-speech is a [MASK].
  • 7. '[A]' POS is a [MASK].
  • 8. '[A]' part-of-speech is a [MASK].

UMLS templates (for all three datasets: MEDICIN, NCI, SNOMEDCT_US):

  • 1. [sentence] . [A] in medicine is a [MASK].
  • 2. [sentence]. [A] in biomedicine is a [MASK].
  • 3. [sentence] . '[A]' in medicine is a [MASK].
  • 4. [sentence]. '[A]' in biomedicine is a [MASK].
  • 5. [A] in medicine is a [MASK].
  • 6. [A] in biomedicine is a [MASK].
  • 7. '[A]' in medicine is a [MASK].
  • 8. '[A]' in biomedicine is a [MASK].

Geonames templates

  • 1. [sentence] . [A] is a [MASK].
  • 2. [sentence]. [A] geographically is a [MASK].
  • 3. [sentence] . '[A]' is a [MASK].
  • 4. [sentence]. '[A]' geographically is a [MASK].
  • 5. [A] is a [MASK].
  • 6. [A] geographically is a [MASK].
  • 7. '[A]' is a [MASK].
  • 8. '[A]' geographically is a [MASK].

For the Geonames [sentence] we can use the more generic template that we designed: [NAME] is a place in [COUNTRY].

We are interested in this kind of template for the following reasons:

  • A template such as [A] is a [MASK] is the kind of general query anyone could type into a search engine like Google; here we want to use it as a query against the language model, treated as a knowledge base, to see whether or not it holds knowledge about [A].
  • We add the sentence at the beginning to give the model the context we are talking about; this way we are also able to fine-tune our models for better results.
  • These templates are appropriate for level 1 in the UMLS and Geonames datasets, because for level 2 more tokens might need to be generated, which is a problem for BERT models; because of this we will move on with BART.

Tasks fall into the following categories:

  • Define evaluation modules and structuring codes
  • Make datasets
  • Clean codes and make models in form of scrips
  • Model 1 is BERT-Large on 5 datasets and 8 templates
  • Model 2 is the Freq Model on 5 datasets and 8 templates

Language Models as a Knowledge base - experimentation for our datasets

It seems that they aligned with Wikipedia texts to obtain a sentence containing the specific subject or object entity, and then made predictions over an object entity using MASKs. For ConceptNet, they considered their own base dataset sentences. They then created query templates for relations (in our task I should create them for entity types as well) to query the LMs.

Following their work, our input should consist of the alignment text and the query template. For example, for WordNet we can obtain example sentences for any synset; for the entity cover we can get the sentence "cover the child with a blanket", and adding the template at the end gives: "cover the child with a blanket. cover word type is a [MASK]" (or any other template like this -- this is just an example), where the MASK is 'verb'. (This is only an idea; the first step is to test this paper's idea.)

Most of the datasets that did not have sentences from their own sources relied on Wikipedia. Having had a quick look at their code, I understood that they only used the embeddings and vocabulary obtained from each LM to calculate token probabilities, then picked the top ones and used search-engine metrics to evaluate the results.

Now, for entity type detection, the tasks are:

  • Create sample sets for the Wordnet dataset.
  • Create sample sets for the Geonames dataset - Let's consider level 1 for now.
  • Create sample sets for the UMLS dataset - Let's consider NCI for now.

Dataset Help

It's a great honor to see your masterpiece, but now I'm facing difficulties. Can you provide the nci_entities.json data file? Thank you very much.

Inference GPT2 on Task A

Prompt Engineering: #19

  • Design Prompt for WN18RR
  • Design Prompt for GeoNames
  • Design Prompt for UMLS

Inferencing

  • Run Model-1 for WN18RR
  • Run Model-2 for WN18RR
  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-1 for NCI
  • Run Model-2 for NCI
  • Run Model-1 for SNOMEDCT_US
  • Run Model-2 for SNOMEDCT_US
  • Run Model-1 for MEDCIN
  • Run Model-2 for MEDCIN

Model-1 is GPT2-Large, and Model-2 is GPT2-XL

Inference BLOOM on Task C

  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-3 for UMLS

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b, and Model-3 is BLOOM-7b1

Basic question: how to run it on Windows?

I just use the PyCharm virtual environment and can't find the .env file. I installed requirements.txt and ran test.py, and the program showed the error KeyError: 'OPENAI_API_KEY'. Can you give me some guidance?
Sincerely yours!

Inference BLOOM on Task A

Prompt Engineering: #19

  • Design Prompt for WN18RR
  • Design Prompt for GeoNames
  • Design Prompt for UMLS

Inferencing

  • Run Model-1 for WN18RR
  • Run Model-2 for WN18RR
  • make evaluation for WN18RR
  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • make evaluation for GeoNames
  • Run Model-1 for NCI
  • Run Model-2 for NCI
  • make evaluation for NCI
  • Run Model-1 for SNOMEDCT_US
  • Run Model-2 for SNOMEDCT_US
  • make evaluation for SNOMEDCT_US
  • Run Model-1 for MEDCIN
  • Run Model-2 for MEDCIN
  • make evaluation for MEDCIN

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b

Reduce the size of the GeoNames dataset

The current size of GeoNames is too large and the inference time is almost 3 days for a single template, so I am reducing the size from 1.7M to 700K.

Inference BLOOM on Task B

  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-3 for GeoNames
  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-3 for UMLS
  • Run Model-1 for Schema.ORG
  • Run Model-2 for Schema.ORG
  • Run Model-3 for Schema.ORG

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b, and Model-3 is BLOOM-7b1

Vision, Problem Formulation, Tasks, RQs, Writings, ....

LLMs4OL

Research Questions

RQ1: Can LLMs identify term/entity types? (Task A)
RQ2: Do LLMs comprehend relations?

  • RQ2.1: Can LLMs recognize type hierarchies? (is-a relations -- tree structures) (Task B)
  • RQ2.2: Can LLMs identify non-is-a relations in hierarchies? (non-is-a relations -- graph structures) (Task C)

Tasks

Task A: The goal is to find out which LLMs are capable of identifying term/entity types without being given prior knowledge about the types. Because we don't want to give the LLMs any knowledge about the types, this task is a generation task. Design considerations for this task are as follows:

  1. We are interested in the entity types at the lowest level.
  2. Since fine-tuning is one of the possible steps, splitting the data into train and test sets is required.
  3. We will only consider entities/terms.
  4. Entities at the leaves of the type hierarchy also inherit their parents' types (this matters for evaluation, since for this task we don't expect models to know the hierarchies).

Task B: The aim of this task and the next one (Task C) is to understand whether LLMs can find relations without naming those relations. The relationship can be undirected or directed. These tasks are classification tasks.

For example:

Acquired Abnormality is a location of a Virus.

location_of is a relation between the two mentioned types. Our goal is to find that Acquired Abnormality and Virus are related, not to find the name of the relation (which in this case is location_of), because naming relations would require clustering similar relations and asking experts to name them. So in Task B we are interested in the is-a relations between term/entity types.

In Task B, we only want to find the relationships between types that form a hierarchy (a structure that arranges types in a tree from top to bottom, where the top is a root -- there could be multiple roots -- and the bottom is a leaf); this kind of relationship is called an is-a relation. As an example:

C is a subclass of B.
B is a subclass of A.
D is a subclass of B.
E is a subclass of A.


Task C: However, among types it is also possible to find relations outside the tree structure, similar to relations between types in a graph. For example (continuing the Task B example):

E somehow has a relationship with C.
C somehow has another direct relation with A.

So, in this task, the goal is to analyze LLMs from this perspective.
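A hedged sketch of how Tasks B and C could be framed as true/false classification prompts for a generative LM (the prompt wording, model, and helper are illustrative, not the repository's exact templates):

```python
from transformers import pipeline

# Flan-T5 is used here purely to illustrate the true/false framing.
classifier = pipeline("text2text-generation", model="google/flan-t5-large")

def pair_holds(type_a: str, type_b: str, taxonomic: bool = True) -> bool:
    if taxonomic:
        # Task B: does an "is-a" (subclass) relation hold between the two types?
        prompt = f"{type_a} is a subclass of {type_b}. Is this statement true or false?"
    else:
        # Task C: does some non-taxonomic (semantic) relation hold between them?
        prompt = f"{type_a} is related to {type_b}. Is this statement true or false?"
    answer = classifier(prompt, max_new_tokens=5)[0]["generated_text"].lower()
    return "true" in answer

# Task B style check on two illustrative types.
print(pair_holds("acquired abnormality", "anatomical abnormality", taxonomic=True))
```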

Conference

Our Target is for ISWC 2023: https://iswc2023.semanticweb.org/call-for-research-track-papers/

Abstract submission due May 2nd, 2023
Full paper submission due May 9th, 2023
Objection and Response June 13th – 16th, 2023
Notifications July 12th, 2023
Camera ready papers due July 31st, 2023

ToDo - improve repository quality

  • Fixing LLM path from the local path to huggingface repository id #52
  • Add some documentation on how to generate Tasks datasets #52, #54
  • Upload open source datasets
  • add pre-commit
  • versioning and releasing the stable version

Tasks and conclusions from meeting 9 Jan 2023 regarding dataset preparations

WN18RR
We checked the dataset and its diagrams, and we decided on a few things and tasks for this dataset.

  • Set an upper bound on samples per entity type: $FQ_{type} < 10000$ for the train set and $FQ_{type} < 1000$ for the test and validation sets, where $FQ_{type}$ is the frequency of a type and the type can be NN, JJ, VB, or RB (see the sketch after this list).
  • Could we check which relation types we want to consider at this step? We decided to ignore also_see and consider _hypernym. What about others?
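A small sketch of how such a per-type cap could be applied with pandas; the DataFrame column name and helper are assumptions for illustration, not the notebook's actual code:

```python
import pandas as pd

def cap_per_type(df: pd.DataFrame, max_per_type: int, type_col: str = "type",
                 seed: int = 42) -> pd.DataFrame:
    """Keep at most max_per_type rows for each entity type (e.g. NN, JJ, VB, RB)."""
    return (df.groupby(type_col, group_keys=False)
              .apply(lambda g: g.sample(min(len(g), max_per_type), random_state=seed)))

# Hypothetical usage with the thresholds agreed above:
# train = cap_per_type(wn18rr_train, 10_000)
# test  = cap_per_type(wn18rr_test, 1_000)
```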

FB15K-237
We concluded that the hierarchy I extended for this dataset is effectively our contribution, and we will stick to this hierarchy moving forward with this dataset.

  • Complete my diagrams in 01-analysis of datasets.ipynb for this dataset
  • Ignore the following class types due to their low frequencies (but double-check the frequencies again before removing them). We decided to remove types with frequencies less than 1000.
Level-3-person-doctor                 213
Level-2-body_of_water                   43
Level-3-body_of_water-sea                2

Again, we need to rethink this after getting a clearer picture (i.e., after completing my diagrams).


Geonames
We talked about how level 2 is generated (see notebook 02-Geoname-levels-creation.ipynb) using a frequency matrix over the starting string for level 2, and we concluded the following tasks:

  • Consider level 2 and set an upper bound on the number of classes: a maximum of the 10 most frequent classes is considered in each level.
  • Consider level 1 and set an upper bound on the number of samples in each level-1 class based on the frequency of samples in the class, $FQ_{level-1} < 1e6$.

After these tasks, we should check whether the new version of the dataset statistics is acceptable for us in terms of the frequency of classes at each level.


UMLS

We have a lot of samples with entity types and relations, and we don't know which to consider. However, to continue we need the following information (we decided to consider only the English language):

  • Table of frequencies for types based on sources (SAB column) -- using MRREL or MRCONSO file
  • Table of frequencies for relationships based on sources (SAB column)

Either of these two tasks will allow us to proceed with cutting the samples down to smaller sizes.
