
hamedbabaei / llms4ol


LLMs4OL: Large Language Models for Ontology Learning

Python 76.11% Shell 6.81% Jupyter Notebook 16.91% Dockerfile 0.18%
large-language-models ontology-learning knowledge-graph transformers bert bloom chatgpt flan-t5 gpt-3 gpt-4

llms4ol's Issues

Download and add model artifacts

Download the following models and store them on the server as well as in my local directory:

  • Flan-T5-Large
  • Flan-T5-XL
  • BERT-Large
  • BART-Large

Experiments on BERT-Large for baseline model creation

Initial tasks:

  • make wordnet finalized dataset
  • make geonames finalized dataset
  • make umls finalized dataset

We are about to test several templates on each dataset to find the best-performing one. The templates are:

Wordnet templates:

  • 1. [sentence] . [A] POS is a [MASK].
  • 2. [sentence]. [A] part-of-speech is a [MASK].
  • 3. [sentence] . '[A]' POS is a [MASK].
  • 4. [sentence]. '[A]' part-of-speech is a [MASK].
  • 5. [A] POS is a [MASK].
  • 6. [A] part-of-speech is a [MASK].
  • 7. '[A]' POS is a [MASK].
  • 8. '[A]' part-of-speech is a [MASK].

UMLS templates (for all three datasets: MEDCIN, USMODE, SNOMEDCT_US):

  • 1. [sentence] . [A] in medicine is a [MASK].
  • 2. [sentence]. [A] in biomedicine is a [MASK].
  • 3. [sentence] . '[A]' in medicine is a [MASK].
  • 4. [sentence]. '[A]' in biomedicine is a [MASK].
  • 5. [A] in medicine is a [MASK].
  • 6. [A] in biomedicine is a [MASK].
  • 7. '[A]' in medicine is a [MASK].
  • 8. '[A]' in biomedicine is a [MASK].

Geonames templates

  • 1. [sentence] . [A] is a [MASK].
  • 2. [sentence]. [A] geographically is a [MASK].
  • 3. [sentence] . '[A]' is a [MASK].
  • 4. [sentence]. '[A]' geographically is a [MASK].
  • 5. [A] is a [MASK].
  • 6. [A] geographically is a [MASK].
  • 7. '[A]' is a [MASK].
  • 8. '[A]' geographically is a [MASK].

For the Geonames [sentence], we can use the more generic template that we designed: [NAME] is a place in [COUNTRY].
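The eight templates per dataset follow one pattern: with or without a leading sentence, term quoted or unquoted, and one of two query phrases. A minimal sketch of generating them in the listed order is shown here. The names `PHRASES` and `build_templates` are hypothetical, and the spacing before the period is normalized, which the lists above do not do consistently.

```python
# Hypothetical helper: the two query phrases used by each dataset's templates.
PHRASES = {
    "wordnet": ("POS is a", "part-of-speech is a"),
    "umls": ("in medicine is a", "in biomedicine is a"),
    "geonames": ("is a", "geographically is a"),
}


def build_templates(dataset, term, sentence, mask="[MASK]"):
    """Return the 8 cloze prompts for one term, in the order listed above.

    Templates 1-4 are prefixed with the context sentence, 5-8 are not.
    Within each group: unquoted term (two phrases), then quoted term.
    """
    first, second = PHRASES[dataset]
    prompts = []
    for context in (sentence, None):
        for quoted in (False, True):
            for phrase in (first, second):
                subject = f"'{term}'" if quoted else term
                query = f"{subject} {phrase} {mask}."
                if context:
                    # Assumption: exactly one period between sentence and query.
                    query = context.rstrip(" .") + ". " + query
                prompts.append(query)
    return prompts


if __name__ == "__main__":
    for i, prompt in enumerate(
        build_templates("wordnet", "cover", "cover the child with a blanket"), 1
    ):
        print(i, prompt)
```

For Geonames, the context sentence itself could be produced from the generic pattern, for example `f"{name} is a place in {country}."`, before it is passed in as `sentence`.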

We are interested in this kind of template for the following reasons:

  • A template such as [A] is a [MASK] is a general query that anyone could type into a search engine like Google. Here we want to use it as a query for a knowledge graph, to see whether or not it has knowledge about [A].
  • We put a sentence first to tell the model what information we are talking about, and in this way we are able to fine-tune our models for the better.
  • These templates are appropriate for level 1 of the UMLS and Geonames datasets. For level 2 we might need to generate more tokens, which is a limitation of BERT models, so for that we will move on to BART.

Tasks fall into the following categories:

  • Define evaluation modules and structuring codes
  • Make datasets
  • Clean up the code and turn the models into scripts
  • Model 1 is BERT-Large on 5 datasets and 8 templates
  • [ ] Model 2 is Freq Model on 5 datasets and 8 templates

Reduce the size of the GeoNames dataset

The current GeoNames dataset is too large: inference takes almost 3 days for a single template, so I am reducing its size from 1.7M to 700K.
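The issue does not say how the reduction is done. One possible approach, sketched here purely as an assumption, is to downsample proportionally per class so that the class distribution stays roughly the same. `stratified_downsample` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_downsample(records, label_of, target_size, seed=0):
    """Downsample records to roughly target_size, keeping class proportions.

    label_of maps a record to its class label. Every class keeps at least
    one record, so the result can differ slightly from target_size.
    """
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)

    fraction = min(1.0, target_size / len(records))
    sample = []
    for label in sorted(by_label):
        items = by_label[label]
        keep = min(len(items), max(1, round(len(items) * fraction)))
        sample.extend(rng.sample(items, keep))
    return sample


if __name__ == "__main__":
    # Toy data standing in for (class, name) GeoNames rows.
    toy = [("A", i) for i in range(100)] + [("B", i) for i in range(70)]
    print(len(stratified_downsample(toy, lambda r: r[0], 70)))
```

With the numbers from this issue, `target_size` would be 700_000 for about 1.7M records, which keeps roughly 41% of each class.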

Update the paper with results, stats, and datasets for Tasks A, B, and C

  • Formulations and Task Definitions section fix
  • prompt engineering section
  • add complete results for Task A
  • add complete results for Task B
  • add complete results for Task C
  • adding datasets for task A
  • adding datasets for task B
  • adding datasets for task C
  • adding evaluation metrics for tasks B and C
  • adding LMs descriptions
  • start appendixes

Inference GPT3 on Task A

  • make evaluations

Run Model for WN18RR

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for GeoNames

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for NCI

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for MEDCIN

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Run Model for SNOMED

  • Template 1
  • Template 2
  • Template 3
  • Template 4
  • Template 5
  • Template 6
  • Template 7
  • Template 8

Dataset Help

It's a great honor to see your masterpiece, but I'm now facing difficulties. Could you provide the nci_entities.json data file? Thank you very much.

Language Models as a Knowledge base - experimentation for our datasets

It seems that they aligned entities with Wikipedia text to obtain a sentence containing the specific subject or object entity. Then, to make a prediction over an object entity, they used [MASK] tokens. For ConceptNet, they used the sentences from the base dataset itself. They then created query templates for relations (in our task I should create them for entity types as well) to query the LMs.

Following their work, our input should consist of the alignment text and the query template. For WordNet, for example, we can obtain example sentences for any synset. For the entity cover we can get the sentence "cover the child with a blanket", and appending the template gives: "cover the child with a blanket. cover word type is a [MASK]" (or any other similar template; this is just an example), where the MASK is 'verb'. This is only an idea; the first step is to test the idea from this paper.

Most of the datasets that did not have sentences from their own sources relied on Wikipedia. I also had a quick look at their code and understood that they only use the embeddings and vocabulary obtained from each LM to calculate a probability for each token. They then pick the top tokens and use search-engine metrics to evaluate the results.

Now, for the entity type detection task, let's:

  • Create sample sets for the Wordnet dataset.
  • Create sample sets for the Geonames dataset - Let's consider level 1 for now.
  • Create sample sets for the UMLS dataset - Let's consider NCI for now.

Vision, Problem Formulation, Tasks, RQs, Writings, ....

LLMs4OL

Research Questions

RQ1: Can LLMs identify term/entity types? Task A
RQ2: Do LLMs comprehend relations?

  • RQ2.1: Can LLMs recognize types hierarchies? (is a relations -- tree structures) Task B
  • RQ2.2: Can LLMs identify Non-Is-A relations in hierarchies? (non is a relations -- graph structures) Task C

Tasks

Task A: The goal is to find out which LLMs are capable of finding the types of terms/entities without being given prior knowledge about the types. Because we do not want to give LLMs any knowledge about the types, this is a generation task. The design considerations for this task are as follows:

  1. We are interested in the entity types at the lowest level.
  2. Since fine-tuning is one of the possible steps, splitting the data into train and test sets is required.
  3. We will only consider entities/terms.
  4. Entities at the leaves of the type hierarchy also inherit the types of their parents (this matters for the evaluation, since in this task we do not expect models to know the hierarchies).
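The fourth consideration can be illustrated with a small sketch: a prediction counts as correct if it matches the gold leaf type or any of its ancestors. This is only one reading of the rule. The helper names and the toy hierarchy are hypothetical, and a single parent per type is assumed.

```python
def ancestors(type_name, parent_of):
    """Return all ancestors of a type, nearest first (single-parent assumption)."""
    chain = []
    current = parent_of.get(type_name)
    while current is not None and current not in chain:  # guard against cycles
        chain.append(current)
        current = parent_of.get(current)
    return chain


def is_correct(predicted, gold_leaf, parent_of):
    """Accept the gold leaf type itself or any type it inherits from."""
    accepted = {gold_leaf, *ancestors(gold_leaf, parent_of)}
    return predicted in accepted


if __name__ == "__main__":
    # Hypothetical toy hierarchy: child type -> parent type.
    parent_of = {"sea": "body_of_water", "body_of_water": "location"}
    print(is_correct("location", "sea", parent_of))  # ancestor of the gold type
    print(is_correct("person", "sea", parent_of))    # unrelated type
```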

Task B: The aim of this task and the next one (Task C) is to understand whether LLMs can find relations without naming them. The relationship can be undirected or directed. These tasks are classification tasks.

For example:

Acquired Abnormality is a location of a Virus.

location_of is the relation between the two mentioned types. Our goal is to find that Acquired Abnormality and Virus are related, not to find the name of the relation (which in this case is location of), because naming relations requires clustering similar relations and asking experts to name them. So in this task, we are interested in finding the is a relations among term/entity types.

In Task B, we only want to find the relationships between types that form a hierarchy (a top-down tree structure over types, where the top is a root, possibly one of several roots, and the bottom is a leaf). Relationships of this kind are called is a relations. As an example:

C is a subclass of B.
B is a subclass of A.
D is a subclass of B.
E is a subclass of A.
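As a sketch of how this example could be turned into classification instances, the snippet below labels every ordered pair of types as is-a or not. It assumes that indirect (transitive) subclass links also count as positive, which the issue does not state. All names are hypothetical.

```python
# The four subclass statements from the example: child -> parent.
SUBCLASS_OF = {"C": "B", "B": "A", "D": "B", "E": "A"}


def is_a(child, parent, subclass_of=SUBCLASS_OF):
    """True if parent is reachable from child by following subclass links."""
    seen = set()
    current = subclass_of.get(child)
    while current is not None and current not in seen:
        if current == parent:
            return True
        seen.add(current)  # guard against accidental cycles
        current = subclass_of.get(current)
    return False


def labelled_pairs(subclass_of=SUBCLASS_OF):
    """Label every ordered pair of distinct types as is-a (True) or not (False)."""
    types = sorted(set(subclass_of) | set(subclass_of.values()))
    return {
        (x, y): is_a(x, y, subclass_of)
        for x in types
        for y in types
        if x != y
    }


if __name__ == "__main__":
    positives = [pair for pair, label in labelled_pairs().items() if label]
    print(positives)
```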


Task C: However, types can also have relations outside the tree structure, similar to relations between types in a graph. For example (continuing the Task B example):

E somehow has a relationship with C.
C somehow has another direct relation with A.

So, in this task, the goal is to analyze LLMs from this perspective.

Conference

Our target is ISWC 2023: https://iswc2023.semanticweb.org/call-for-research-track-papers/

Abstract submission due May 2nd, 2023
Full paper submission due May 9th, 2023
Objection and Response June 13th – 16th, 2023
Notifications July 12th, 2023
Camera ready papers due July 31st, 2023

Inference BLOOM on Task A

Prompt Engineering: #19

  • Design Prompt for WN18RR
  • Design Prompt for GeoNames
  • Design Prompt for UMLS

Inferencing

  • Run Model-1 for WN18RR
  • Run Model-2 for WN18RR
  • make evaluation for WN18RR
  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • make evaluation for GeoNames
  • Run Model-1 for NCI
  • Run Model-2 for NCI
  • make evaluation for NCI
  • Run Model-1 for SNOMEDCT_US
  • Run Model-2 for SNOMEDCT_US
  • make evaluation for SNOMEDCT_US
  • Run Model-1 for MEDCIN
  • Run Model-2 for MEDCIN
  • make evaluation for MEDCIN

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b

Inference BLOOM on Task B

  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-3 for GeoNames
  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-3 for UMLS
  • Run Model-1 for Schema.ORG
  • Run Model-2 for Schema.ORG
  • Run Model-3 for Schema.ORG

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b, and Model-3 is BLOOM-7b1

Basic question: how to run it on Windows?

I am just using a PyCharm virtual environment and can't find the .env file. I installed requirements.txt and ran test.py, and the program failed with KeyError: 'OPENAI_API_KEY'. Can you give me some guidance?
Sincerely yours!
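A `KeyError: 'OPENAI_API_KEY'` means the environment variable is not set in the process that runs the script. How the repository expects the key to be supplied is not stated in this issue. As a minimal, platform-independent sketch, a `.env` file with a line such as `OPENAI_API_KEY=...` could be created next to the script and loaded with a small helper like the hypothetical `load_env_file` below, which uses only the standard library.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Variables that are already set are left untouched. Returns the parsed
    pairs, or an empty dict if the file does not exist.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    if "OPENAI_API_KEY" not in os.environ:
        print("OPENAI_API_KEY is not set; add it to .env or the environment.")
```

Alternatively, the variable can be set in the PyCharm run configuration under its environment variables field, which avoids the file altogether.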

Tasks and conclusions from meeting 9 Jan 2023 regarding dataset preparations

WN18RR
We checked the dataset and its diagrams and decided on a few things and tasks for this dataset.

  • Set an upper bound on the number of samples per entity type: $FQ_{type}<10000$ for the train set, and $FQ_{type}<1000$ for the test and validation sets ($FQ_{type}$ is the frequency of a type, where a type can be NN, JJ, VB, or RB).
  • Could we check which relation types we want to consider at this step? We decided to ignore also_see and consider _hypernym. What about the others?
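The per-type upper bound can be sketched as a simple cap: group the samples by type and randomly keep at most a fixed number from each group. `cap_per_type` is a hypothetical helper, and random selection is an assumption, since the notes do not say how the retained samples are chosen.

```python
import random
from collections import defaultdict


def cap_per_type(samples, type_of, max_per_type, seed=0):
    """Keep at most max_per_type samples for every entity type."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    grouped = defaultdict(list)
    for sample in samples:
        grouped[type_of(sample)].append(sample)

    capped = []
    for entity_type in sorted(grouped):
        items = grouped[entity_type]
        if len(items) > max_per_type:
            items = rng.sample(items, max_per_type)
        capped.extend(items)
    return capped


if __name__ == "__main__":
    # Toy (type, id) pairs standing in for WN18RR entities.
    toy = [("NN", i) for i in range(5)] + [("VB", i) for i in range(2)]
    print(cap_per_type(toy, lambda s: s[0], max_per_type=3))
```

Because the bounds above are strict inequalities, the train set would use `max_per_type=9999` and the test and validation sets `max_per_type=999`.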

FB15K-237
We concluded that the hierarchy I extended for this dataset is part of our contribution, and we will stick to this hierarchy going forward with this dataset.

  • Complete my diagrams in 01-analysis of datasets.ipynb for this dataset
  • Ignore the following class types due to their low frequencies (but double-check the frequencies again before removing them). We decided to remove types with frequencies less than 1000.
Level-3-person-doctor                 213
Level-2-body_of_water                   43
Level-3-body_of_water-sea                2

We need to rethink this again once we have a clear picture (that is, after completing my diagrams).
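The frequency threshold can be checked with a few lines. The sketch below splits a type-to-frequency table at 1000. The three low-frequency entries are the ones listed above, while `Level-1-person` and its count are a made-up placeholder for a frequent type.

```python
MIN_FREQUENCY = 1000  # threshold decided in the meeting


def split_by_frequency(counts, min_frequency=MIN_FREQUENCY):
    """Split a type -> frequency mapping into kept and dropped types."""
    kept = {t: n for t, n in counts.items() if n >= min_frequency}
    dropped = {t: n for t, n in counts.items() if n < min_frequency}
    return kept, dropped


if __name__ == "__main__":
    counts = {
        "Level-3-person-doctor": 213,
        "Level-2-body_of_water": 43,
        "Level-3-body_of_water-sea": 2,
        "Level-1-person": 5000,  # hypothetical frequent type, not from the notes
    }
    kept, dropped = split_by_frequency(counts)
    print("kept:", sorted(kept))
    print("dropped:", sorted(dropped))
```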


Geonames
We talked about how level 2 is generated in notebook 02-Geoname-levels-creation.ipynb, using a frequency matrix over the start string for level 2. We also settled on the following tasks:

  • Consider level 2 and set an upper bound on the number of classes: keep at most the 10 most frequent classes in each level.
  • Consider level 1 and set an upper bound on the number of samples in each level-1 class based on the class frequency: $FQ_{level-1}<1e6$.

After these tasks, we should check whether the stats of the new dataset version are fine for us in terms of class frequencies at each level.
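The class upper bound for level 2 can be sketched with `collections.Counter`: count the classes, keep the most frequent ones, and drop samples from all other classes. `keep_top_classes` and the `class_of` accessor are hypothetical names.

```python
from collections import Counter


def keep_top_classes(samples, class_of, k=10):
    """Keep only samples whose class is among the k most frequent classes."""
    counts = Counter(class_of(sample) for sample in samples)
    top = {label for label, _ in counts.most_common(k)}
    return [sample for sample in samples if class_of(sample) in top]


if __name__ == "__main__":
    # Toy (class, name) pairs; real GeoNames class codes are not assumed here.
    toy = [("x", 1), ("x", 2), ("x", 3), ("y", 4), ("y", 5), ("z", 6)]
    print(keep_top_classes(toy, lambda s: s[0], k=2))
```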


UMLS

We have a lot of samples with entity types and relations, and we don't know which ones to consider. To continue, we need the following information (we decided to consider only the English language):

  • Table of frequencies for types based on sources (SAB column) -- using MRREL or MRCONSO file
  • Table of frequencies for relationships based on sources (SAB column)

Either of these two tasks will allow us to proceed with cutting the samples down to a smaller size.
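A per-source frequency table can be sketched by streaming the pipe-delimited UMLS files and counting rows per SAB value. The column positions below (LAT at index 1 and SAB at index 11 in MRCONSO, SAB at index 10 in MRREL) are assumptions based on the standard RRF layout and should be checked against the release in use. `count_by_source` is a hypothetical helper.

```python
from collections import Counter

# Assumed 0-based column positions in the pipe-delimited RRF files.
MRCONSO_LAT, MRCONSO_SAB = 1, 11
MRREL_SAB = 10


def count_by_source(lines, sab_index, lat_index=None, language="ENG"):
    """Count rows per source vocabulary (SAB).

    If lat_index is given, only rows in the requested language are counted.
    MRREL has no language column, so lat_index stays None for it.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= sab_index:
            continue  # skip truncated rows
        if lat_index is not None and fields[lat_index] != language:
            continue
        counts[fields[sab_index]] += 1
    return counts


if __name__ == "__main__":
    # Synthetic rows; only the LAT and SAB columns carry meaningful values.
    def row(lat, sab):
        fields = [""] * 18
        fields[MRCONSO_LAT], fields[MRCONSO_SAB] = lat, sab
        return "|".join(fields) + "|"

    rows = [row("ENG", "NCI"), row("ENG", "NCI"), row("SPA", "SNOMEDCT_US")]
    print(count_by_source(rows, MRCONSO_SAB, lat_index=MRCONSO_LAT))
```

For a real file, `lines` can be the open file handle itself, for example `open("MRCONSO.RRF", encoding="utf-8")`, so the file is never loaded into memory at once.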

ToDo - improve repository quality

  • Change the LLM path from a local path to the Hugging Face repository id #52
  • Add some documentation on how to generate the task datasets #52, #54
  • Upload open source datasets
  • add pre-commit
  • versioning and releasing the stable version

Inference GPT2 on Task A

Prompt Engineering: #19

  • Design Prompt for WN18RR
  • Design Prompt for GeoNames
  • Design Prompt for UMLS

Inferencing

  • Run Model-1 for WN18RR
  • Run Model-2 for WN18RR
  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-1 for NCI
  • Run Model-2 for NCI
  • Run Model-1 for SNOMEDCT_US
  • Run Model-2 for SNOMEDCT_US
  • Run Model-1 for MEDCIN
  • Run Model-2 for MEDCIN

Model-1 is GPT2-Large, and Model-2 is GPT2-XL

Inference BLOOM on Task C

  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-3 for UMLS

Model-1 is BLOOM-1b7, and Model-2 is BLOOM-3b, and Model-3 is BLOOM-7b1

Inference GPT2 on Task B

  • Run Model-1 for GeoNames
  • Run Model-2 for GeoNames
  • Run Model-1 for UMLS
  • Run Model-2 for UMLS
  • Run Model-1 for Schema.ORG
  • Run Model-2 for Schema.ORG

Model-1 is GPT2-Large, and Model-2 is GPT2-XL
