TRAM: Benchmarking Temporal Reasoning for Large Language Models

This repository contains the datasets, data processing code, model descriptions, and datasheet for the benchmark introduced in 'TRAM: Benchmarking Temporal Reasoning for Large Language Models'.

Datasets

TRAM encompasses ten temporal reasoning tasks, presented as multiple-choice questions (MCQs) across a range of time-related domains. For clarity, each question has exactly one correct answer. The questions are drawn from existing natural language understanding datasets, human-crafted templates and questions, web sources, and program generation. Answers are derived through a combination of expert annotation and programmatic generation. The benchmark includes 526,668 problems in total. For each dataset, we provide a few-shot development set, with 5 questions per category, and a separate test set for evaluation. All datasets used in the experiments are available in the /datasets folder. An overview of the ten tasks included in the benchmark:

[1] Zhou et al., 2019; [2] Rajpurkar et al., 2016; [3] Uzzaman et al., 2013; [4] Williams et al., 2018; [5] Bowman et al., 2015; [6] Roemmele et al., 2011; [7] Mostafazadeh et al., 2016; [8] Mostafazadeh et al., 2017

Note: The “Data Size” column aggregates totals from both the development and test sets. “K-Way MC” signifies a multiple-choice response format with K options. “Amb. Res.” denotes Ambiguity Resolution. “NLI” stands for natural language inference. “Same” indicates that the text source is the same as in the row above.

For more details, please refer to the paper.
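The split described above (5 development questions per category, with the remainder held out for testing) can be sketched as follows. This is only an illustration, not the repository's processing code: the `category` and `question` field names, the `split_dev_test` helper, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_dev_test(problems, per_category=5, seed=0):
    """Split problems into a few-shot dev set and a test set.

    `problems` is a list of dicts; the "category" key is a hypothetical
    field name, not necessarily the one used in the released files.
    """
    rng = random.Random(seed)

    # Group problems by category so each one contributes equally to dev.
    by_category = defaultdict(list)
    for problem in problems:
        by_category[problem["category"]].append(problem)

    dev, test = [], []
    for category in sorted(by_category):
        items = by_category[category][:]  # copy before shuffling
        rng.shuffle(items)
        dev.extend(items[:per_category])
        test.extend(items[per_category:])
    return dev, test


# Toy data: two made-up categories with eight questions each.
problems = [
    {"category": cat, "question": f"{cat} question {i}"}
    for cat in ("cat_a", "cat_b")
    for i in range(8)
]
dev, test = split_dev_test(problems)
print(len(dev), len(test))  # prints: 10 6
```

Fixing the seed keeps the development set stable across runs, so few-shot prompts built from it stay reproducible.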

Models

We evaluate several well-known language models on the TRAM benchmark, organized into two main categories. In the first category, we consider four popular large language models (LLMs): the open-source model Llama-2-13b-chat, and the closed-source models PaLM-bison-chat, GPT-3.5-turbo, and GPT-4. We evaluate each model using two prompting strategies: standard prompting (SP) and chain-of-thought (CoT) prompting. Under both strategies, the models are tested in zero-shot and 5-shot settings. For all models, we apply greedy decoding (i.e., temperature = 0) for response generation. Each model is accessed through its API using the corresponding API key.
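The four prompting configurations (SP or CoT, each in zero-shot or 5-shot form) could be assembled along the following lines. This is a minimal sketch under stated assumptions: the prompt wording, the "Let's think step by step" trigger, the dict keys, and the helper names are illustrative choices, not the templates used in the paper, and the example question is made up.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def format_question(question, options):
    """Render a question and its lettered options as plain text."""
    lines = [f"Question: {question}"]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(target, shots=(), cot=False):
    """Build an SP or CoT prompt; pass `shots` for the few-shot setting.

    Each shot is a dict with hypothetical keys "question", "options",
    "answer", and (for CoT) "rationale".
    """
    blocks = []
    for ex in shots:
        block = format_question(ex["question"], ex["options"])
        if cot:
            # CoT demonstrations show the reasoning before the answer.
            block += f"\nAnswer: {ex['rationale']} So the answer is ({ex['answer']})."
        else:
            block += f"\nAnswer: ({ex['answer']})"
        blocks.append(block)

    final = format_question(target["question"], target["options"])
    # Assumed CoT trigger phrase; standard prompting just asks for the answer.
    final += "\nAnswer: Let's think step by step." if cot else "\nAnswer:"
    blocks.append(final)
    return "\n\n".join(blocks)


# Made-up example item, reused as a demonstration for brevity.
item = {
    "question": "An event starts at 3 PM and lasts 2 hours. When does it end?",
    "options": ["4 PM", "5 PM", "6 PM"],
    "answer": "B",
    "rationale": "3 PM plus 2 hours is 5 PM.",
}
zero_shot_sp = build_prompt(item)
five_shot_cot = build_prompt(item, shots=[item] * 5, cot=True)
print(zero_shot_sp)
```

The resulting string would then be sent to each model with temperature set to 0, matching the greedy decoding described above.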

In the second category, we consider minimal supervision, as opposed to traditional fully supervised learning, in order to establish baseline evaluations. Specifically, we employ four representative BERT-style models: BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large. For the temporal NLI task, we employ the Sequence Classification variants of BERT and RoBERTa from Huggingface (i.e., BertForSequenceClassification and RobertaForSequenceClassification), which suit the task's structure. For the other tasks, we use the Multiple Choice variants of BERT and RoBERTa from Huggingface (i.e., BertForMultipleChoice and RobertaForMultipleChoice).
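The difference between the two model variants lies mainly in how inputs are arranged. A multiple-choice head scores one (question, option) pair per candidate and picks the highest-scoring one, while a sequence-classification head assigns a label to a single (premise, hypothesis) pair. The sketch below shows only that arrangement in plain Python; the helper names and the example scores are hypothetical, and tokenization and the model call are omitted.

```python
def to_multiple_choice_pairs(question, options):
    """One (question, option) text pair per candidate answer.

    After tokenization, these pairs form a single example of shape
    (num_choices, seq_len) for a multiple-choice model.
    """
    return [(question, option) for option in options]


def to_nli_pair(premise, hypothesis):
    """A single text pair, as used for sequence classification."""
    return (premise, hypothesis)


def pick_answer(scores):
    """Return the index of the highest-scoring choice."""
    return max(range(len(scores)), key=lambda i: scores[i])


pairs = to_multiple_choice_pairs(
    "An event starts at 3 PM and lasts 2 hours. When does it end?",
    ["4 PM", "5 PM", "6 PM"],
)
# Made-up per-choice scores standing in for model logits.
predicted = pick_answer([0.1, 2.0, -1.0])
print(len(pairs), predicted)
```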
