SimCKP

Source code for our Findings of EMNLP 2023 paper SimCKP: Simple Contrastive Learning of Keyphrase Representations.

Requirements

Python (tested on 3.8.18)
CUDA (tested on 11.8)
PyTorch (tested on 2.0.1)
Transformers (tested on 4.34.0)
nltk (tested on 3.7)
fasttext (tested on 0.9.2)
datasketch (tested on 1.6.4)
wandb
tqdm

Setting Up POS Tagger

To use Stanford POS tagger,

Install OpenJDK 1.8 in server:

>> sudo apt-get update
>> sudo apt-get install openjdk-8-jdk

Install Stanford CoreNLP (may be a different version; v4.4.0 was used in paper):

>> wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
>> unzip stanford-corenlp-latest.zip && cd stanford-corenlp-4.4.0

Run Java server in the Stanford CoreNLP directory:

>> nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000 1> /dev/null 2>&1 &

Ref: https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK
Install OpenJDK without sudo: https://stackoverflow.com/questions/2549873/installing-jdk-without-sudo

Preprocessing

Before running the preprocessing script, make sure the raw data is present as follows:

[working directory]
 |-- data/
 |    |-- raw_data/
 |    |    |-- inspec_test.json
 |    |    |-- kp20k_test.json
 |    |    |-- kp20k_train.json
 |    |    |-- kp20k_valid.json
 |    |    |-- krapivin_test.json
 |    |    |-- nus_test.json
 |    |    |-- semeval_test.json
 |    |-- stanford-corenlp-4.4.0/
 |    |-- lid.176.bin

Raw dataset can be downloaded here. lid.176.bin is a pre-trained fasttext model for detecting languages in text and can be downloaded here.

To preprocess raw data, run

>> python cleanse.py

Then, annotate part-of-speech tags by running

>> python pos_tagging.py

To skip data cleansing and POS tagging, just download our POS tagged data.

Training

Train the extractor (stage 1):

>> bash scripts/train_extractor.sh

Extract present keyphrases and generate absent candidates:

>> bash scripts/extract.sh

Train the reranker (stage 2):

>> bash scripts/train_ranker.sh

Rerank absent keyphrases:

>> bash scripts/rank.sh

Evaluation

Evaluate both present and absent keyphrase predictions:

>> bash scripts/evaluate.sh

baochi0212 / gen-keykp Goto Github PK

gen-keykp's Introduction

SimCKP

Requirements

Setting Up POS Tagger

Preprocessing

Training

Evaluation

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent