eternityyw / tram-benchmark

TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of ACL 2024)

Home Page: https://arxiv.org/abs/2310.00835

License: MIT License

Jupyter Notebook 100.00%
bert-models large-language-models prompting temporal-reasoning

tram-benchmark's Introduction

TRAM: Benchmarking Temporal Reasoning for Large Language Models

This repository contains the datasets, data processing code, model descriptions, and datasheet for the benchmark introduced in 'TRAM: Benchmarking Temporal Reasoning for Large Language Models'.

Datasets

TRAM encompasses ten temporal reasoning tasks, presented as multiple-choice questions (MCQs) across a range of time-related domains. For clarity, each question has exactly one correct answer. TRAM draws on existing natural language understanding datasets, human-crafted templates and questions, web sources, and program generation. Answers are derived through a combination of expert annotation and programmatic generation. The benchmark includes 526,668 problems in total. For each dataset, we introduce a few-shot development set, with 5 questions per category, and a separate test set for evaluation. All datasets used in the experiments can be downloaded from the /datasets folder. An overview of the ten tasks included in the benchmark:

[Image: overview table of the ten tasks]

[1] Zhou et al., 2019, [2] Rajpurkar et al., 2016, [3] Uzzaman et al., 2013, [4] Williams et al., 2018, [5] Bowman et al., 2015, [6] Roemmele et al., 2011, [7] Mostafazadeh et al., 2016, [8] Mostafazadeh et al., 2017

Note: The "Data Size" column aggregates totals from both the development and test sets. "K-Way MC" signifies a multiple-choice response format with K options. "Amb. Res." denotes Ambiguity Resolution. "NLI" stands for natural language inference. "Same" indicates that the text source is the same as in the row above.
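As a rough illustration of the single-answer, K-way multiple-choice format described above, the sketch below models one question and scores a list of predicted letters. The `MCQItem` fields and the `accuracy` helper are hypothetical names chosen for this example; they are not the column layout of the files in /datasets.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice question (hypothetical schema, not the repo's file layout)."""

    question: str
    options: List[str]  # K options for a "K-Way MC" task
    answer: str         # letter of the single correct option, e.g. "B"

    def __post_init__(self) -> None:
        # Each question has exactly one correct answer, given as an option letter.
        letters = [chr(ord("A") + i) for i in range(len(self.options))]
        if self.answer not in letters:
            raise ValueError(f"answer {self.answer!r} is not one of {letters}")


def accuracy(items: Sequence[MCQItem], predictions: Sequence[str]) -> float:
    """Fraction of predicted letters that match the gold letter."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        item.answer == pred.strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)
```

Predictions are normalized with `strip().upper()` so that a response such as " b " still counts as option B; extracting the letter from a longer free-text response is left out of this sketch.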

For more details, please refer to the paper.

Models

We evaluate several well-known language models on the TRAM benchmark, grouped into two main categories. In the first category, we consider four popular large language models (LLMs): the open-source model Llama-2-13b-chat, and the closed-source models PaLM-bison-chat, GPT-3.5-turbo, and GPT-4. We evaluate each model using two prompting strategies: standard prompting (SP) and chain-of-thought (CoT) prompting. Under both strategies, the models are tested in zero-shot and 5-shot settings. For all models, we apply greedy decoding (i.e., temperature = 0) for response generation. Each of these models is accessed using its corresponding API key.
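The four prompting conditions above (SP or CoT, each in zero-shot or 5-shot form) can be sketched as one prompt builder. This is a minimal illustration only: the prompt wording, the "Let's think step by step." trigger, and the names `format_question` and `build_prompt` are assumptions for this example, not the prompts used in the paper.

```python
from typing import List, Optional, Sequence, Tuple

# (question, options, answer letter, optional rationale); hypothetical exemplar shape.
Exemplar = Tuple[str, List[str], str, Optional[str]]


def format_question(question: str, options: Sequence[str]) -> str:
    """Render a question followed by lettered options (A., B., ...)."""
    lines = [f"Question: {question}"]
    for i, option in enumerate(options):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    return "\n".join(lines)


def build_prompt(
    question: str,
    options: Sequence[str],
    exemplars: Sequence[Exemplar] = (),
    chain_of_thought: bool = False,
) -> str:
    """Build a prompt; no exemplars gives zero-shot, five exemplars gives 5-shot."""
    blocks = []
    for ex_question, ex_options, ex_answer, ex_rationale in exemplars:
        block = format_question(ex_question, ex_options)
        if chain_of_thought and ex_rationale:
            # CoT exemplars show the reasoning before the answer.
            block += f"\nReasoning: {ex_rationale}"
        block += f"\nAnswer: {ex_answer}"
        blocks.append(block)

    target = format_question(question, options)
    if chain_of_thought:
        # Assumed CoT trigger phrase; the paper's wording may differ.
        target += "\nReasoning: Let's think step by step."
    else:
        target += "\nAnswer:"
    blocks.append(target)
    return "\n\n".join(blocks)
```

The resulting string would then be sent to each model with temperature set to 0; the API call itself differs per provider and is omitted here.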

In the second category, we consider minimal supervision, as opposed to traditional fully supervised learning, in order to establish baseline evaluations. Specifically, we employ four representative BERT-style models: BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large. For the temporal NLI task, we employ the Sequence Classification variants of BERT and RoBERTa from Hugging Face (i.e., BertForSequenceClassification and RobertaForSequenceClassification), given their suitability for the task's structure. For the other tasks, we use the Multiple Choice variants of BERT and RoBERTa from Hugging Face (i.e., BertForMultipleChoice and RobertaForMultipleChoice).
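The difference between the two model heads is mostly a difference in input shape, which the library-free sketch below tries to show. A multiple-choice head scores each (question, option) pair and the highest score wins (in Hugging Face these models take inputs shaped (batch_size, num_choices, sequence_length)), while a sequence-classification head assigns one label to a single (premise, hypothesis) pair. The helper names and the three-label NLI set are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed label set for the temporal NLI task; check the dataset for the actual labels.
NLI_LABELS = ("entailment", "neutral", "contradiction")


def multiple_choice_pairs(question: str, options: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the question with every option; a multiple-choice head scores each pair."""
    return [(question, option) for option in options]


def pick_option(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring option (first one on ties)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=scores.__getitem__)


def nli_example(premise: str, hypothesis: str, label: str) -> Tuple[str, str, int]:
    """Encode one NLI example as a single text pair plus an integer class id."""
    return (premise, hypothesis, NLI_LABELS.index(label))
```

In practice each text pair would be passed through the model's tokenizer before reaching the model; that step depends on the transformers library and is not shown.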

tram-benchmark's Issues

Missing time_event.csv

Hi, thank you for sharing this awesome work!
It seems that 'time_event.csv' is missing from this repository. It is required for processing ordering, typical_time, and storytelling.ipynb. Could you kindly share it?
Thanks!
