dice-group / multpax

A Multitask Framework for Present and Absent Keyphrase Generation using Knowledge Graphs

License: MIT License

Jupyter Notebook 91.77% Python 8.23%
keyphrase-extraction absent-keyphrases disaster-tweets

MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs

This repository contains the source code of our paper "MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs". The paper has been accepted at the ISWC 2022 conference.

Fig. 1: The architecture of the MultPAX framework

Summary:

  • Keyphrase extraction is the process of extracting a small set of phrases that best describe an input corpus.
  • The automatic generation of keyphrases has become essential for many natural language applications such as text categorization, indexing, and summarization.
  • In this paper, we propose MultPAX, a multitask framework for extracting present and absent keyphrases using pretrained language models and knowledge graphs. In particular, our framework contains three components:
    1. MultPAX identifies present keyphrases from the input corpus.
    2. MultPAX then links the input corpus with external knowledge graphs to get more relevant phrases.
    3. MultPAX ranks the extracted phrases based on their semantic relatedness to the input corpus.
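
The ranking step (component 3) can be illustrated with a minimal sketch. It assumes that the document and each candidate phrase have already been embedded as vectors and that relatedness is measured by cosine similarity; the function names, the toy 3-dimensional vectors, and the example phrases are hypothetical and are not taken from the MultPAX code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_phrases(doc_vec, phrase_vecs, top_k=5):
    """Return the top_k (phrase, score) pairs, most similar to the document first."""
    scored = [(phrase, cosine(doc_vec, vec)) for phrase, vec in phrase_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumption): in practice these would come from a
# pretrained language model rather than being written by hand.
doc_vec = [1.0, 1.0, 0.0]
phrase_vecs = {
    "knowledge graph": [1.0, 0.9, 0.1],  # stands in for a present keyphrase
    "semantic web": [0.8, 1.0, 0.3],     # stands in for a phrase obtained from a knowledge graph
    "football": [0.0, 0.1, 1.0],         # unrelated candidate
}

ranked = rank_phrases(doc_vec, phrase_vecs, top_k=2)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

In this sketch, present and absent candidates are pooled and scored the same way, so the unrelated candidate drops out of the top two.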

Our Contributions:

1) We propose an *unsupervised* multitask framework that not only extracts present keyphrases, but also generates absent ones.
    
2) To the best of our knowledge, our approach is the first attempt that leverages existing knowledge graphs for keyphrase extraction without the need to create keyphrase vocabularies or phrase banks.
    
3) We introduce an embedding-based F1 score that considers semantic similarity between generated and ground-truth keyphrases rather than the existing exact-matching. 
    
4) We carried out several experiments on four benchmark datasets. The evaluation results show that our approach is more accurate than state-of-the-art baselines.
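
To make the idea behind the embedding-based F1 score (contribution 3) concrete, here is one possible formulation as a sketch. It assumes that a generated keyphrase counts as correct when its cosine similarity to some ground-truth keyphrase reaches a threshold, and symmetrically for recall. The threshold value, the function name `soft_f1`, and the toy vectors are assumptions for illustration; the exact definition used in the paper may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_f1(pred_vecs, gold_vecs, threshold=0.8):
    """F1 where a match means cosine similarity >= threshold (assumed rule)."""
    if not pred_vecs or not gold_vecs:
        return 0.0
    # Precision: share of predictions that are close to at least one gold phrase.
    matched_pred = sum(
        1 for p in pred_vecs if max(cosine(p, g) for g in gold_vecs) >= threshold
    )
    # Recall: share of gold phrases that are close to at least one prediction.
    matched_gold = sum(
        1 for g in gold_vecs if max(cosine(g, p) for p in pred_vecs) >= threshold
    )
    precision = matched_pred / len(pred_vecs)
    recall = matched_gold / len(gold_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 2-d embeddings (assumption): one prediction is a near-paraphrase of a
# gold phrase, the other is unrelated to both gold phrases.
gold_vecs = [[1.0, 0.0], [0.0, 1.0]]
pred_vecs = [[0.9, 0.1], [-1.0, 0.0]]

score = soft_f1(pred_vecs, gold_vecs, threshold=0.8)
print(f"embedding-based F1 (sketch): {score:.2f}")
```

With exact matching, the near-paraphrase would count as a miss; under this thresholded similarity it counts as a hit, which is the behavior the contribution describes.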

Repository Structure:

.
├── Baselines
│   ├── EmbedRank-Baseline.ipynb
│   ├── EmbedRank(Wordwise)- Baseline.ipynb
│   ├── TextRank-Baseline.ipynb
│   └── YAKE-Baseline.ipynb
├── Inspec experiment
│   └── MltPAX-Inspec.ipynb
├── Krapivin2009 experiment
│   └── MltPAX-Krapivin2009.ipynb
├── NUS experiment
│   └── MltPAX-NUS.ipynb
├── SemEval2010 experiment
│   └── MltPAX-SemEval2010.ipynb
└── .DS_Store

How to run:

We conduct several experiments on four benchmark datasets, namely: Inspec, SemEval2010, NUS and Krapivin2009. The datasets are available at the Dropbox Folder.

To set up the experiments, install the following libraries via pip install -r requirments.txt, or install them manually:

Python 3.7
keybert
sentence-transformers 2.2.0
SPARQLWrapper 2.0.0
SciPy 1.8.0
NumPy 1.21.5
Pandas 1.4.2
NLTK 3.6.6 
requests 2.27.1
py-babelnet

We provide our experiments as Jupyter notebooks (see the Experiments folder) and source files (see the src folder). We recommend using the Jupyter notebooks for interactive execution of our experiments. Furthermore, we provide a Jupyter notebook for each experiment:

Baselines:

We obtain the implementations of the TextRank and YAKE baselines from the open-source library PKE. The source code for these baselines is available at:

Furthermore, we implemented EmbedRank using the BERT pretrained model from the spaCycake library. Our implementation can be found at:

For the baseline AutoGen, we obtain the implementation from its official GitHub repository.

For the baseline CopyRNN, the implementation can be obtained from its GitHub repository.

Evaluation

The following notebooks contain the implementation of the evaluation metrics used in our experiments:


Citation

@INPROCEEDINGS{zahera2022multpax,
author = "Hamada M. Zahera and Daniel Vollmers and Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo",
title = "MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs",
booktitle = "The 21st International Semantic Web Conference (ISWC) 2022",
year = "2022",
series = "Springer"}
