yilunzhao / odsum

This project forked from yale-nlp/odsum

Data and code for paper "ODSum: New Benchmarks for Open Domain Multi-Document Summarization"

License: MIT License

Shell 0.68% Python 99.32%

ODSum

Overview

ODSum introduces a benchmark for the task of Open Domain Multi-Document Summarization.

Dataset

The ODSum dataset is designed to evaluate the performance of modern summarization models in multi-document contexts spanning an open domain.

Data Processing

To process the data and convert it into formats compatible with the various summarization models, see the data_process.ipynb notebook.

Dataset Structure

Story

You can access the raw documents and queries paired with summaries in the data/story/raw folder. The data/story/oracle folder associates these queries with their respective 'ground truth' articles.

For the retrieval part, three distinct strategies are provided:

  • Sparse Retrieval (data/story/sparse)
  • Dense Retrieval (data/story/dense)
  • LLM-Embedding Retrieval (data/story/LLM-embedding)

Each of these retrieval folders contains three sub-versions:

  • min: Contains the fewest retrieved documents, ranked by relevance.
  • mean: Contains an average number of retrieved documents.
  • max: Contains the largest number of documents deemed relevant by the retriever.
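
The folder layout above can be enumerated programmatically. The following is a minimal sketch, assuming each sub-version sits at data/story/<strategy>/<version>; the source does not say whether min, mean, and max are folders or files, or what extension they carry, so the helper name retrieval_paths and the exact path shape are illustrative only.

```python
from itertools import product
from pathlib import Path

# Folder names taken from the list above.
STRATEGIES = ("sparse", "dense", "LLM-embedding")
VERSIONS = ("min", "mean", "max")


def retrieval_paths(root="data/story"):
    """Map each (strategy, version) pair to a candidate path under root.

    Assumption: each sub-version lives at <root>/<strategy>/<version>.
    The real files may carry an extension or sit one level deeper.
    """
    base = Path(root)
    return {(s, v): base / s / v for s, v in product(STRATEGIES, VERSIONS)}


if __name__ == "__main__":
    for key, path in retrieval_paths().items():
        print(key, path)
```

This yields nine candidate locations (three strategies times three versions); check them against the actual repository before relying on them.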

Files in each folder:

  • Raw Data:
    • documents: Contains the stories or documents.
    • queries: Queries, each paired with four human-written summaries. In this raw form, nothing links a query to the stories it draws on.
  • Oracle Data:
    • Maps each query to its corresponding 'ground truth' articles.
  • Retrieval Data (Applies to sparse, dense, and LLM-embedding):
    • min: Data with the minimum number of retrieved documents.
    • mean: Data with an average number of retrieved documents.
    • max: Data with the maximum number of retrieved documents based on their relevancy.

Note: The retrievers rank the documents based on their relevancy to the query, and they select the most relevant few. The number of retrieved documents is variable, depending on the retrieval strategy and the version (min, mean, max).
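
The rank-then-select step described in the note can be sketched as follows. This is only an illustration: the scores are made up, the function name select_top_k is hypothetical, and the source does not state how the cutoff k is derived for the min, mean, and max versions.

```python
def select_top_k(scores, k):
    """Return the ids of the k highest-scoring documents, best first.

    scores: dict mapping a document id to a relevance score
            (hypothetical values; any retriever could produce them).
    k:      number of documents to keep (how ODSum picks it is not stated).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Made-up relevance scores for one query.
example_scores = {"doc_a": 0.12, "doc_b": 0.87, "doc_c": 0.45, "doc_d": 0.66}
print(select_top_k(example_scores, 2))  # ['doc_b', 'doc_d']
```

A larger k gives the summarizer more context at the cost of longer inputs, which matters for the context limits listed under Models.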

Models

BART

  • Description: A sequence-to-sequence Transformer pre-trained with sentence permutation and text infilling objectives.
  • Checkpoint & Training: Uses the BART-Large variant fine-tuned on the CNN/DailyMail dataset, then fine-tuned further on ODSum with the AdamW optimizer, using an input format that merges queries and documents.
  • Limitation: Because its context length is restricted to 1024 tokens, BART serves as a baseline model.

PRIMERA

  • Description: Designed explicitly for multi-document summarization, PRIMERA simplifies the processing of concatenated documents using efficient encoder-decoder transformers.
  • Implementation: Fine-tuned on each ODSum setting. Documents are truncated to fit the maximum input length of 4K tokens.

GPT

  • Description: A well-known language model from OpenAI with proven efficacy in text summarization.
  • Variants & Training: Uses both gpt-3.5-turbo-16k-0613 and gpt-4-0613. Prompts were crafted to guide GPT in summarization, placing the query at the end of the articles for better output.
  • Limitation: Stories and meetings had to be truncated to fit GPT's maximum token limit.

Llama-2

  • Description: A series of auto-regressive text models with capabilities ranging from logical reasoning to text generation.
  • Checkpoint: Uses the Llama-2-70b-Chat variant, which is optimized for dialogue. For efficient inference, 4-bit NF4 quantization is employed.
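
Two ideas recur in the model descriptions above: truncating documents to fit a context limit, and placing the query after the articles in the GPT prompt. The sketch below illustrates both under loud assumptions: whitespace-separated words stand in for a real model tokenizer, the prompt wording is invented for illustration and is not the authors' prompt, and the names truncate_documents and build_prompt are hypothetical.

```python
def truncate_documents(documents, max_tokens):
    """Keep documents in order until a shared token budget is used up.

    Assumption: a "token" is a whitespace-separated word, which only
    approximates what a real tokenizer would count.
    """
    kept, remaining = [], max_tokens
    for doc in documents:
        if remaining <= 0:
            break
        tokens = doc.split()[:remaining]
        if tokens:
            kept.append(" ".join(tokens))
            remaining -= len(tokens)
    return kept


def build_prompt(query, documents, max_doc_tokens):
    """Join the truncated documents and put the query at the end.

    The "Query:" and "Summary:" labels are illustrative, not from ODSum.
    """
    body = "\n\n".join(truncate_documents(documents, max_doc_tokens))
    return f"{body}\n\nQuery: {query}\nSummary:"


if __name__ == "__main__":
    docs = ["first story text here", "second story text here"]
    print(build_prompt("What happens in the stories?", docs, max_doc_tokens=6))
```

With a budget of 6 words, the first document is kept whole and the second is cut to its first two words, so the query still fits after the articles.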

Contributors

cyber-e-j, dadyita, yilunzhao, shi-kejian