infobench's Introduction

InFoBench

Paper: InFoBench: Evaluating Instruction Following Ability in Large Language Models
Dataset: InFoBench Dataset

Citation

@article{qin2024infobench,
      title={InFoBench: Evaluating Instruction Following Ability in Large Language Models}, 
      author={Yiwei Qin and Kaiqiang Song and Yebowen Hu and Wenlin Yao and Sangwoo Cho and Xiaoyang Wang and Xuansheng Wu and Fei Liu and Pengfei Liu and Dong Yu},
      year={2024},
      eprint={2401.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Evaluation with InFoBench

Step 1: Dataset Usage

You can download the dataset directly with the Hugging Face datasets library.

from datasets import load_dataset

dataset = load_dataset("kqsong/InFoBench")

Step 2: Generating the Response

Write the model's responses to model/output.json. Each data entry should be a JSON object on its own line, containing all the fields of the input entry plus a new field named output that holds the generated response.

We suggest using greedy decoding so that the generated responses are not affected by sampling randomness.
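The output format described above can be sketched as follows. This is a minimal illustration, not code from the repository: write_outputs and the generate callable are hypothetical names, and generate stands in for whatever model call you use (ideally with greedy decoding).

```python
import json
from typing import Callable, Iterable


def write_outputs(entries: Iterable[dict],
                  generate: Callable[[dict], str],
                  path: str) -> int:
    """Write one JSON object per line, keeping every input field and adding "output".

    `generate` is a hypothetical callable that maps an input entry to the
    model's response string; it is not part of the InFoBench repository.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            record = dict(entry)                # copy all fields of the input entry
            record["output"] = generate(entry)  # add the generated response
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

You would call it with the loaded dataset entries, your own generation function, and the path model/output.json (the model/ directory must already exist).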

Step 3: Evaluation

Evaluate the LLM's outputs against the decomposed questions. This research uses GPT-4-0314 as the evaluator by default.

python evaluation.py \
  --api_key <OPENAI KEY> \
  --eval_model gpt-4-0314 \
  --input model/output.json \
  --output_dir evaluation/ \
  --temperature 0

Each data entry will include an "eval" key of type List[bool], holding the "Yes" or "No" answer to each decomposed question. The final evaluation file is saved in JSON format under <output_dir>/<eval_model>/.
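One way to summarize the "eval" lists is the fraction of decomposed questions answered "Yes" across all entries. The sketch below is an illustration under stated assumptions, not part of evaluation.py: it assumes True means "Yes", that the evaluation file is stored one JSON object per line like the input file, and the function names load_entries and following_ratio are hypothetical.

```python
import json
from typing import Iterable, List


def load_entries(path: str) -> List[dict]:
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def following_ratio(entries: Iterable[dict]) -> float:
    """Fraction of decomposed questions whose "eval" answer is True ("Yes")."""
    satisfied = 0
    total = 0
    for entry in entries:
        answers = entry["eval"]  # List[bool], one value per decomposed question
        satisfied += sum(1 for answer in answers if answer)
        total += len(answers)
    if total == 0:
        raise ValueError("no evaluated questions found")
    return satisfied / total
```

For example, two entries with "eval" lists [True, False] and [True, True] give a ratio of 3 / 4 = 0.75.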

infobench's Issues

There are some spelling mistakes in the prompt. Are they intentional or accidental?

SYS_MSG ="Based on the provided Input (if any) and Generated Text, answer the ensuing Ouestions with either a YES or NOchoice. Your selection should be based on your judgment as well as the following rules:\n\n- YES: Select 'YES' if the generated text entirely fulfills the condition specified in the question. Howevernote that even minor inaccuracies exclude the text from receiving a 'YES' rating. As an illustration. consider aquestion that asks. "Does each sentence in the generated text use a second person?” If even one sentence doesnot use the second person, the answer should NOT be 'YES'. To qualify for a 'YES' rating, the generated textmust be entirely accurate and relevant to the question\n\n- NO: Opt for 'NO' if the generated text fails to meet the question's requirements or provides no informationthat could be utilized to answer the question. For instance, if the question asks. "Is the second sentence irthe generated text a compound sentence?" and the generated text only has one sentence. it offers no relevantinformation to answer the question. Consequently, the answer should be 'NO'.'''"

There are some spelling mistakes, such as "Ouestions" and "Howevernote". Are they intentional or accidental?
