Aspect-based Document Similarity for Research Papers

Implementation, trained models and result data for the paper Aspect-based Document Similarity for Research Papers (PDF on arXiv). The supplemental material is available for download under GitHub Releases or Zenodo.

Demo

Google Colab

You can try our trained models directly on Google Colab with any paper available on Semantic Scholar (via DOI, arXiv ID, ACL ID, or PubMed ID):

Click here for demo

Requirements

  • Python 3.7
  • CUDA GPU (for Transformers)

Datasets

Installation

Create a new virtual environment for Python 3.7 with Conda:

conda create -n paper python=3.7
conda activate paper

Clone repository and install dependencies:

git clone https://github.com/malteos/aspect-document-similarity.git repo
cd repo
pip install -r requirements.txt

Experiments

To reproduce our experiments, follow these steps (if you just want to train and test the models, skip the first two steps):

Prepare

export DIR=./output

# ACL Anthology
# Get ParsCit files from: https://acl-arc.comp.nus.edu.sg/archives/acl-arc-160301-parscit/
sh ./sbin/download_parsecit.sh

# CORD-19
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-03-13.tar.gz

# Get additional data (collected from Semantic Scholar API)
wget https://github.com/malteos/aspect-document-similarity/releases/download/1.0/acl_s2.tar
wget https://github.com/malteos/aspect-document-similarity/releases/download/1.0/cord19_s2.tar

Build datasets

# ACL
python -m acl.dataset save_dataset <input_dir> <parscit_dir> <output_dir>

# CORD-19
python -m cord19.dataset save_dataset <input_dir> <output_dir>

Use dataset

The datasets are built with the Huggingface NLP library (and will soon be available in the official repository):

from nlp import load_dataset

# Training data for first CV split
train_dataset = load_dataset(
    './datasets/cord19_docrel/cord19_docrel.py',
    name='relations',
    split='fold_1_train'
)                   
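The split names follow a `fold_{k}_train` / `fold_{k}_test` naming scheme for the cross-validation folds. The following is an illustrative sketch of how such splits partition the document pairs; it mirrors only the naming convention, not the repository's actual split logic:

```python
from itertools import chain

def make_cv_splits(pairs, k=4):
    """Partition document pairs into k cross-validation splits.

    Each fold serves once as the test set while the remaining folds
    form the training set. Illustrative only -- the dataset scripts
    in this repository build their splits differently.
    """
    folds = [pairs[i::k] for i in range(k)]
    splits = {}
    for i in range(k):
        splits[f"fold_{i + 1}_test"] = folds[i]
        splits[f"fold_{i + 1}_train"] = list(
            chain.from_iterable(f for j, f in enumerate(folds) if j != i)
        )
    return splits

# With 8 pairs and k=4, each test split holds 2 pairs, each train split 6.
splits = make_cv_splits([("doc_a", "doc_b", n) for n in range(8)], k=4)
```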

Use models

from models.auto_modelling import AutoModelForMultiLabelSequenceClassification

# Load models with pretrained weights from the Huggingface model hub
acl_model = AutoModelForMultiLabelSequenceClassification.from_pretrained('malteos/aspect-acl-scibert-scivocab-uncased')
cord19_model = AutoModelForMultiLabelSequenceClassification.from_pretrained('malteos/aspect-cord19-scibert-scivocab-uncased')

# Use the models in standard Huggingface fashion ...
# acl_model(input_ids, token_type_ids, ...)
# cord19_model(input_ids, token_type_ids, ...)
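Because the task is multi-label, the model scores every aspect independently (a sigmoid per label) rather than picking a single class, so a document pair can match several aspects at once, or none. A minimal sketch of that post-processing step; the label names and the 0.5 threshold are illustrative assumptions, not values taken from the repository:

```python
import math

def predict_aspects(logits, labels, threshold=0.5):
    """Turn raw multi-label logits into aspect predictions.

    Applies a sigmoid to each logit independently and keeps every
    label whose probability exceeds the threshold. Label names and
    threshold are illustrative, not the repository's actual values.
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [label for label, p in zip(labels, probs) if p > threshold]

labels = ["introduction", "methods", "results", "discussion", "none"]
predict_aspects([2.1, -0.3, 1.4, -1.8, -2.5], labels)
# -> ["introduction", "results"]
```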

Train models

All models are trained with the trainer_cli.py script:

python trainer_cli.py --cv_fold $CV_FOLD \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME \
    --doc_id_col $DOC_ID_COL \
    --doc_a_col $DOC_A_COL \
    --doc_b_col $DOC_B_COL \
    --nlp_dataset $NLP_DATASET \
    --nlp_cache_dir $NLP_CACHE_DIR \
    --cache_dir $CACHE_DIR \
    --num_train_epochs $EPOCHS \
    --seed $SEED \
    --per_gpu_eval_batch_size $EVAL_BATCH_SIZE \
    --per_gpu_train_batch_size $TRAIN_BATCH_SIZE \
    --learning_rate $LR \
    --do_train \
    --save_predictions

The exact parameters are available in sbin/acl and sbin/cord19.
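A multi-label head like the one trained here is typically optimized with binary cross-entropy over the logits, one binary decision per aspect. The following is a sketch of that loss in plain Python to show the idea; it is not the trainer's actual implementation:

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels.

    Each label contributes its own binary term, which is what lets
    the model assign several aspects to one document pair. A sketch
    of the standard loss, not code from trainer_cli.py.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

loss = bce_with_logits([2.0, -1.0], [1, 0])  # ~0.22
```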

Evaluation

The results can be computed and viewed with a Jupyter notebook. Figures and tables from the paper are part of the notebook.

jupyter notebook evaluation.ipynb

Due to space constraints, some results could not be included in the paper. The full results for all methods and all test samples are available as CSV files under Releases (or via the Jupyter notebook).

How to cite

If you use our code, please cite our paper:

@InProceedings{Ostendorff2020c,
  title = {Aspect-based Document Similarity for Research Papers},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)},
  author = {Ostendorff, Malte and Ruas, Terry and Blume, Till and Gipp, Bela and Rehm, Georg},
  year = {2020},
  month = {Dec.},
}

License

MIT
