SciReviewGen

This is the official dataset repository for "SciReviewGen: A Large-scale Dataset for Automatic Literature Review Generation", published in Findings of ACL 2023.

Dataset

Data format

split_survey_df & original_survey_df

  • Row:
    • literature review chapter or the entire text of literature review
  • Column:
    • paper_id: paper_id used in S2ORC
    • title: title of the literature review
    • abstract: abstract of the literature review
    • section: chapter title
    • text: body text of literature review chapter or literature review paper
    • n_bibs: number of the cited papers that can be used as inputs
    • n_nonbibs: number of the cited papers that cannot be used as inputs
    • bib_titles: titles of the cited papers
    • bib_abstracts: abstracts of the cited papers
    • bib_citing_sentences: citing sentences that cite the cited papers
    • split: train/val/test split
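A quick sanity check against the column list above can catch a mismatched or truncated file early. The sketch below is illustrative only: the helper name `missing_columns` is hypothetical and not part of the repository, and the column names are simply copied from the list above.

```python
# Column names copied from the data format description above.
EXPECTED_COLUMNS = [
    "paper_id", "title", "abstract", "section", "text",
    "n_bibs", "n_nonbibs", "bib_titles", "bib_abstracts",
    "bib_citing_sentences", "split",
]


def missing_columns(columns):
    """Return the documented column names that are absent from `columns`.

    `columns` can be any iterable of strings, e.g. `df.columns`.
    This is a hypothetical helper, not a function shipped with SciReviewGen.
    """
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Typical use with pandas (not executed here; the path is a placeholder):
#   import pandas as pd
#   df = pd.read_pickle("<dataset_path>/split_survey_df.pkl")
#   print(missing_columns(df.columns))
```

An empty list means every documented column is present.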

summarization_csv

  • Row:
    • literature review chapter
  • Column:
    • reference: literature review title <s> chapter title <s> abstract of cited paper 1 <s> BIB001 </s> literature review title <s> chapter title <s> abstract of cited paper 2 <s> BIB002 </s> ...
    • target: literature review chapter
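To make the `reference` layout concrete, here is a minimal sketch that builds and parses such a string and reads rows from a summarization csv. It rests on several assumptions that the description above does not confirm: segments are joined by `</s>` with no trailing marker, BIB ids are zero-padded to three digits, no title or abstract contains a literal `<s>` or `</s>`, and the csv has a header row with `reference` and `target`. All function names are hypothetical.

```python
import csv

FIELD_SEP = " <s> "   # assumed separator between fields of one cited paper
DOC_SEP = " </s> "    # assumed separator between cited papers


def build_reference(title, section, abstracts):
    """Assemble a `reference` string in the layout described above."""
    segments = []
    for i, abstract in enumerate(abstracts, start=1):
        bib_id = f"BIB{i:03d}"  # assumption: three-digit zero padding
        segments.append(FIELD_SEP.join([title, section, abstract, bib_id]))
    return DOC_SEP.join(segments)


def parse_reference(reference):
    """Split a `reference` string into one dict per cited paper."""
    docs = []
    for segment in reference.split("</s>"):
        segment = segment.strip()
        if not segment:
            continue  # tolerate a trailing "</s>"
        fields = [field.strip() for field in segment.split("<s>")]
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}")
        title, section, abstract, bib_id = fields
        docs.append({"title": title, "section": section,
                     "abstract": abstract, "bib_id": bib_id})
    return docs


def read_summarization_csv(lines):
    """Yield (cited_papers, target) pairs from csv lines or an open file.

    Assumes a header row naming the `reference` and `target` columns.
    """
    for row in csv.DictReader(lines):
        yield parse_reference(row["reference"]), row["target"]
```

For real files, pass an open file object, e.g. `open(path, newline="", encoding="utf-8")`. Very long rows may require raising the limit with `csv.field_size_limit`.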

How to create SciReviewGen from S2ORC

0. Environment

  • Python 3.9
  • Run the following commands to clone the repository and install the required packages:
git clone https://github.com/tetsu9923/SciReviewGen.git
cd SciReviewGen
pip install -r requirements.txt

1. Preprocessing

  • Download S2ORC (we use the version released on 2020-07-05, which contains papers up to 2020-04-14)
  • Run the following command:
python json_to_df.py \
  -s2orc_path <Path to the S2ORC full dataset directory (Typically ".../s2orc/full/20200705v1/full")> \
  -dataset_path <Path to the generated dataset> \
  --field <Optional: the field of the literature reviews (mag_field_of_study in S2ORC, default="Computer Science")>

The metadata and PDF parses of the candidate literature reviews and their cited papers are stored in dataset_path as pandas DataFrames.

2. Construct SciReviewGen

  • Run the following command:
python make_section_df.py \
  -dataset_path <Path to the generated dataset> \
  --version <Optional: the version of SciReviewGen ("split" or "original", default="split")>

The SciReviewGen dataset (split_survey_df.pkl or original_survey_df.pkl) is stored in dataset_path as a pandas DataFrame. filtered_dict.pkl lists the literature reviews that remain after filtering by the SciBERT-based classifier (see Section 3.2 of the paper).

3. Construct csv data for summarization

  • Run the following command:
python make_summarization_csv.py \
  -dataset_path <Path to the generated dataset> 

The csv files for summarization (train.csv, val.csv, and test.csv) are stored in dataset_path. If you plan to train QFiD on the generated csv files, add the --for_qfid argument as shown below.

python make_summarization_csv.py \
  -dataset_path <Path to the generated dataset> \
  --for_qfid
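Steps 1 to 3 can also be chained from a small Python driver. The sketch below only assembles the command lines documented above and, by default, prints them instead of running them. The function names and the use of `sys.executable` in place of `python` are assumptions, and the scripts are assumed to be invoked from the repository root.

```python
import subprocess
import sys


def build_commands(s2orc_path, dataset_path, field=None, version=None,
                   for_qfid=False):
    """Return the command lines of steps 1-3 as argument lists.

    Optional flags are only added when given, so the scripts fall back
    to their own defaults otherwise.
    """
    step1 = [sys.executable, "json_to_df.py",
             "-s2orc_path", s2orc_path,
             "-dataset_path", dataset_path]
    if field is not None:
        step1 += ["--field", field]

    step2 = [sys.executable, "make_section_df.py",
             "-dataset_path", dataset_path]
    if version is not None:
        step2 += ["--version", version]

    step3 = [sys.executable, "make_summarization_csv.py",
             "-dataset_path", dataset_path]
    if for_qfid:
        step3.append("--for_qfid")

    return [step1, step2, step3]


def run_pipeline(commands, dry_run=True):
    """Print each command (dry run) or execute the commands in order."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Call `run_pipeline(build_commands(...), dry_run=False)` to actually execute the steps; the dry run is useful for checking paths first.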

Additional resources

SciBERT-based literature review classifier

We trained the SciBERT-based literature review classifier. The model weights are available here.

Query-weighted Fusion-in-Decoder (QFiD)

We propose Query-weighted Fusion-in-Decoder (QFiD), which explicitly considers the relevance of each input document to the queries. You can train QFiD on the SciReviewGen csv data (make sure you passed the --for_qfid argument when running make_summarization_csv.py).

Train

  • Modify qfid/train.sh (CUDA_VISIBLE_DEVICES, csv file path, output_dir, and num_train_epochs)
  • Run the following command:
cd qfid
./train.sh

Test

  • Modify qfid/test.sh (CUDA_VISIBLE_DEVICES, csv file path, output_dir, and num_train_epochs; set num_train_epochs to the total number of epochs you trained for)
  • Run the following command:
./test.sh

Licenses

  • SciReviewGen is released under CC BY-NC 4.0. You can use SciReviewGen only for non-commercial purposes.
  • SciReviewGen is built on S2ORC. Note that S2ORC is released under CC BY-NC 4.0, which allows users to copy and redistribute it only for non-commercial purposes.
