

SimCKP

[Figure: SimCKP architecture]

Source code for our Findings of EMNLP 2023 paper SimCKP: Simple Contrastive Learning of Keyphrase Representations.

Requirements

  • Python (tested on 3.8.18)
  • CUDA (tested on 11.8)
  • PyTorch (tested on 2.0.1)
  • Transformers (tested on 4.34.0)
  • nltk (tested on 3.7)
  • fasttext (tested on 0.9.2)
  • datasketch (tested on 1.6.4)
  • wandb
  • tqdm
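The version list above can be checked against the local environment with a short script. This is a minimal sketch, assuming the packages are installed under their usual PyPI distribution names (for example, PyTorch as torch); the helper name compare_versions is hypothetical and not part of this repository.

```python
from importlib import metadata

# Versions the README reports as tested. The distribution names are an
# assumption (the usual PyPI names), not something the repository specifies.
TESTED = {
    "torch": "2.0.1",
    "transformers": "4.34.0",
    "nltk": "3.7",
    "fasttext": "0.9.2",
    "datasketch": "1.6.4",
}


def compare_versions(tested=TESTED):
    """Return {package: (tested_version, installed_version_or_None)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions().items():
        found = "missing" if installed is None else installed
        print(f"{name}: tested on {wanted}, found {found}")
```

A mismatch is not necessarily a problem; the listed versions are only the ones the code was tested on.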

Setting Up POS Tagger

To use the Stanford POS tagger:

  1. Install OpenJDK 1.8 on the server:
    >> sudo apt-get update
    >> sudo apt-get install openjdk-8-jdk
  2. Install Stanford CoreNLP (the latest archive may contain a different version; v4.4.0 was used in the paper, so adjust the directory name below if it differs):
    >> wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
    >> unzip stanford-corenlp-latest.zip && cd stanford-corenlp-4.4.0
  3. Run the Java server from the Stanford CoreNLP directory:
    >> nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
        -preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
        -status_port 9000 -port 9000 -timeout 15000 1> /dev/null 2>&1 &
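Because the server is started in the background with its output discarded, it can help to confirm that it is actually listening before running the POS tagging step. The following is a minimal sketch that only tests whether a TCP connection to port 9000 succeeds; it does not verify that the annotators have finished loading. The function name server_is_listening is hypothetical.

```python
import socket


def server_is_listening(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("CoreNLP server port 9000 is accepting connections.")
    else:
        print("Nothing is listening on port 9000; check the Java process.")
```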

Preprocessing

Before running the preprocessing script, make sure the raw data is present as follows:

[working directory]
 |-- data/
 |    |-- raw_data/
 |    |    |-- inspec_test.json
 |    |    |-- kp20k_test.json
 |    |    |-- kp20k_train.json
 |    |    |-- kp20k_valid.json
 |    |    |-- krapivin_test.json
 |    |    |-- nus_test.json
 |    |    |-- semeval_test.json
 |    |-- stanford-corenlp-4.4.0/
 |    |-- lid.176.bin

The raw datasets can be downloaded here. lid.176.bin is a pre-trained fastText model for language identification and can be downloaded here.
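The directory layout above can be verified before preprocessing starts. This is a minimal sketch that checks only for the existence of the listed paths relative to the working directory; the helper name missing_paths is hypothetical, and the CoreNLP directory name assumes v4.4.0 as shown in the tree.

```python
from pathlib import Path

# Paths taken from the directory tree in this README, relative to the
# working directory.
EXPECTED = [
    "data/raw_data/inspec_test.json",
    "data/raw_data/kp20k_test.json",
    "data/raw_data/kp20k_train.json",
    "data/raw_data/kp20k_valid.json",
    "data/raw_data/krapivin_test.json",
    "data/raw_data/nus_test.json",
    "data/raw_data/semeval_test.json",
    "data/stanford-corenlp-4.4.0",
    "data/lid.176.bin",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing before preprocessing:")
        for p in missing:
            print("  " + p)
    else:
        print("All expected files and directories are present.")
```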

To preprocess raw data, run

>> python cleanse.py

Then, annotate part-of-speech tags by running

>> python pos_tagging.py

To skip data cleansing and POS tagging, download our POS-tagged data instead.

Training

Train the extractor (stage 1):

>> bash scripts/train_extractor.sh

Extract present keyphrases and generate absent candidates:

>> bash scripts/extract.sh

Train the reranker (stage 2):

>> bash scripts/train_ranker.sh

Rerank absent keyphrases:

>> bash scripts/rank.sh
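The four steps above depend on each other's outputs, so they need to run in this order. As a convenience, they could be chained from a small driver that stops at the first failing step. This is a sketch only: the script paths are the ones listed above, while run_pipeline and its runner parameter are hypothetical names introduced here.

```python
import subprocess

# Stage labels and script paths as listed in this README, in order.
STAGES = [
    ("Train the extractor (stage 1)", "scripts/train_extractor.sh"),
    ("Extract present keyphrases and generate absent candidates", "scripts/extract.sh"),
    ("Train the reranker (stage 2)", "scripts/train_ranker.sh"),
    ("Rerank absent keyphrases", "scripts/rank.sh"),
]


def run_pipeline(stages=STAGES, runner=subprocess.run):
    """Run each stage in order; check=True raises on a non-zero exit code."""
    for label, script in stages:
        print(f"==> {label}")
        runner(["bash", script], check=True)


if __name__ == "__main__":
    run_pipeline()
```

The runner parameter exists only so the ordering can be exercised without launching training; by default it is subprocess.run.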

Evaluation

Evaluate both present and absent keyphrase predictions:

>> bash scripts/evaluate.sh
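For readers unfamiliar with the present/absent distinction, the sketch below illustrates one common convention: a keyphrase counts as present when its tokens occur contiguously in the document, and as absent otherwise. This is an assumption for illustration only; the repository's evaluation script may apply stemming or other normalization, and the function names here are hypothetical.

```python
def is_present(phrase_tokens, doc_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def split_present_absent(keyphrases, document):
    """Split keyphrases by whether they appear verbatim in the document.

    Assumes lowercase, whitespace tokenization with no stemming.
    """
    doc_tokens = document.lower().split()
    present, absent = [], []
    for kp in keyphrases:
        tokens = kp.lower().split()
        (present if is_present(tokens, doc_tokens) else absent).append(kp)
    return present, absent


if __name__ == "__main__":
    doc = "Contrastive learning improves keyphrase extraction"
    kps = ["contrastive learning", "keyphrase generation", "keyphrase extraction"]
    present, absent = split_present_absent(kps, doc)
    print("present:", present)
    print("absent:", absent)
```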
