License: GNU General Public License v3.0

Complex-Question-Answering-Evaluation-of-ChatGPT

Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions

A framework for detailed evaluation of the ability of ChatGPT and similar large-scale language models to answer complex questions.

This repository is a subproject of KSESEU.

If you use the code, please cite the following paper:
Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions [Arxiv](To be added)


The main contributors to this repository are Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, and Guilin Qi.

🔥🎉 We have released the answers of ChatGPT and the other models to a total of 194,782 questions across 8 datasets, covering multiple languages, in Datasets we publish.

To our knowledge (as of 2023-03-09), this is the first public release of large-scale ChatGPT Q&A data.

Overview

To evaluate ChatGPT's ability to answer complex questions, we propose an evaluation framework. First, we classify the latent features that constitute complex questions and describe each test question with multiple labels that identify the combinatorial reasoning it requires. Second, following the black-box testing specification of CheckList proposed by Microsoft, we design an evaluation method that introduces CoT hints to measure the reasoning capability and reliability of large language models when answering complex questions. Our evaluation uses 8 real complex question answering datasets, including six English datasets and two multilingual datasets, to further analyze the potential impact of language bias. We compare the evaluation results of ChatGPT, GPT-3.5, GPT-3, and FLAN-T5 to identify problems that persist across generations of LLMs. All data and results are available for further analysis.

Datasets we publish

We group the answers of these models on the KBQA datasets by dataset and by model, and release them in this folder.

answers_from_models : The responses (answers) of these models (ChatGPT, GPT-3/GPT-3.5, FLAN-T5) to the KBQA datasets listed in Datasets we use.

| Datasets | Size | Col.Size | Lang |
| --- | --- | --- | --- |
| KQAPro | 117970 | 106173 | EN |
| LC-quad2.0 | 26975 | 26975 | EN |
| WQSP | 4737 | 4700 | EN |
| CWQ | 31158 | 31158 | EN |
| GrailQA | 64331 | 6763 | EN |
| GraphQuestions | 4776 | 4776 | EN |
| QALD-9 | 6045 | 6045 | Mul |
| MKQA | 260000 | 6144 | Mul |
| Total Collected | | 194782 | |

datasets : We have processed the 8 datasets listed in Datasets we use into a unified format and released them in this folder. Each record in the unified format includes the following fields: question_id, question, ground_truth, SPARQL, and our added labels. Additionally, we have generated alias dictionaries from Wikipedia for the ground-truth answers, which can be used during evaluation.
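For illustration, one record in this unified format might look roughly like the Python dictionary below. This is a hypothetical sketch: the field names follow the items listed above, but the concrete values, the SPARQL query, and the alias entries are invented.

```python
# Hypothetical example of one record in the unified dataset format described
# above; the field names follow the text, the values are invented.
example_record = {
    "question_id": "CWQ-000123",            # invented id
    "question": "Which country is the author of 'War and Peace' from?",
    "ground_truth": ["Russia"],
    "SPARQL": "SELECT ?x WHERE { ... }",    # query from the original dataset (elided)
    "labels": {                             # our added labels (see CheckList Model)
        "answer_type": "LOC",
        "reasoning_type": ["Multi-hop"],
        "language": "en",
    },
}

# Alias dictionaries generated from Wikipedia map each ground-truth answer to
# its aliases, which can be used to relax matching during evaluation, e.g.:
alias_dict = {"Russia": ["Russian Federation", "RUS"]}
```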

Datasets we use

Given that the training data of large language models (LLMs) covers Wikipedia extensively, we opted to run the evaluation on open-domain complex question answering datasets that are grounded in Wikipedia. Specifically, we curated a set of 8 distinct datasets for this purpose, as follows:

💥 Please note: the links in the Source column below point to the original datasets as published by their respective authors. For the experiments in this paper, we processed these datasets, including random sampling and reformatting. Please download the datasets used in our experiments from this folder: datasets.

| Monolingual datasets | Source | Paper |
| --- | --- | --- |
| WebQuestionSP(WQSP) | Download_url | Paper_url |
| ComplexWebQuestion(CWQ) | Download_url | Paper_url |
| GraphQuestions | Download_url | Paper_url |
| GrailQA | Download_url | Paper_url |
| KQApro | Download_url | Paper_url |
| LC-quad2.0 | Download_url | Paper_url |

Multilingual dataset

| Multilingual datasets | Source | Paper |
| --- | --- | --- |
| QALD-9 | Download_url | Paper_url |
| MKQA | Download_url | Paper_url |

CheckList Model

Minimum Functionality Test (MFT)

We assess the LLM's ability to handle each feature of the CQA scenario through the Minimum Functionality Test (MFT). We classify the answer types into 9 categories: Mixed fact (MISC), Reason (WHY), Location (LOC), Time (DATE/TIME), Person (PER), Yes or no (Boolean), Number (NUM), Organization (ORG), and Unable to answer (UNA).

At the same time, we divide the "reasoning type" labels into eight categories: SetOperation, Filtering, Counting, The most valuable, Sort, Single-hop, Multi-hop, and Star-shape.

We also take into account the "language type" label, which may have an impact on model performance: de, ru, pt, hi_IN, en, fa, it, fr, ro, es, nl, pt_BR, zh_cn.
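For reference, the three label sets above can be written out as plain constants. This is only a restatement of the lists in the text, not code from the repository:

```python
# The label taxonomy described above, restated as constants (not repo code).
ANSWER_TYPES = [
    "MISC", "WHY", "LOC", "DATE/TIME", "PER",
    "Boolean", "NUM", "ORG", "UNA",
]
REASONING_TYPES = [
    "SetOperation", "Filtering", "Counting", "The most valuable",
    "Sort", "Single-hop", "Multi-hop", "Star-shape",
]
LANGUAGES = [
    "de", "ru", "pt", "hi_IN", "en", "fa", "it",
    "fr", "ro", "es", "nl", "pt_BR", "zh_cn",
]
```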

To strengthen the generalization of answer matching, we adopted a simple idea of expanding the matching range, consisting of the following two operations:

  1. A subtree marking method based on the constituency parse tree, used to extract candidate phrases.

  2. Exact matching between the extracted noun-phrase list and the answer list.

For the samples that remain unmatched, we set a threshold on the cosine similarity between phrase vectors to obtain potential correct matches; candidates above the threshold are judged manually as right or wrong.
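A minimal sketch of this matching pipeline is shown below. It is illustrative only: spaCy noun chunks stand in for the constituency-subtree extraction, a sentence-transformers model stands in for the phrase vectors, and the threshold value is an assumption rather than the setting used in the paper.

```python
# Illustrative sketch of the relaxed answer-matching strategy described above.
# Assumptions (not the paper's exact setup): spaCy noun chunks replace the
# constituency-subtree extraction, a sentence-transformers model provides the
# phrase vectors, and the 0.8 threshold is invented.
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")
encoder = SentenceTransformer("all-MiniLM-L6-v2")
SIM_THRESHOLD = 0.8  # candidates above this go to manual judgment


def match_answer(model_response: str, gold_answers: list[str]) -> str:
    """Return 'correct', 'incorrect', or 'manual' for one model response."""
    # 1. Extract candidate noun phrases from the model's response.
    phrases = [chunk.text.lower() for chunk in nlp(model_response).noun_chunks]
    golds = [g.lower() for g in gold_answers]

    # 2. Exact match between the noun-phrase list and the answer list.
    if any(p in golds for p in phrases):
        return "correct"

    # 3. Fallback: cosine similarity between phrase vectors; matches above the
    #    threshold are flagged for manual judgment instead of auto-scoring.
    if phrases:
        sims = util.cos_sim(encoder.encode(phrases), encoder.encode(golds))
        if float(sims.max()) >= SIM_THRESHOLD:
            return "manual"
    return "incorrect"
```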

Invariance test (INV)

The invariance test adds perturbations to the original question that should not change the model's output. Its main purpose is to verify that ChatGPT keeps its answer unchanged as the perturbation grows. We mainly use two methods to perform the invariance test:

  1. Change the spelling of words in the question: imitating the way humans mistype when entering sentences, we apply random letter repetition, random letter omission, and stemming to words (a sketch of these spelling perturbations follows this list).
  2. Rewrite the question: paraphrase it without changing its original meaning, and evaluate whether the result changes.
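The sketch below illustrates the spelling perturbations in method 1. It is illustrative only: the per-word perturbation probability is an assumption, and the stemming and paraphrasing steps are omitted.

```python
# Illustrative sketch of the spelling perturbations used for the invariance
# test: random letter repetition and random letter omission. The 10% per-word
# probability is an assumption; stemming and paraphrasing are not shown.
import random


def repeat_letter(word: str) -> str:
    """Duplicate one randomly chosen letter, e.g. 'capital' -> 'cappital'."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i] + word[i:]


def drop_letter(word: str) -> str:
    """Remove one randomly chosen letter, e.g. 'capital' -> 'captal'."""
    if len(word) < 3:
        return word
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]


def perturb_question(question: str, p: float = 0.1) -> str:
    """Apply a random spelling perturbation to each word with probability p."""
    perturbations = [repeat_letter, drop_letter]
    words = []
    for word in question.split():
        if random.random() < p:
            word = random.choice(perturbations)(word)
        words.append(word)
    return " ".join(words)
```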

Directional Expectation test (DIR)

The directional expectation test perturbs the input in a way whose expected effect is known, to evaluate whether the output moves in the direction we expect. We conduct directional expectation tests from three aspects:

  1. Experiment on "reasoning types", mainly the SetOperation, Filtering, Counting, and comparison (most value and sorting) types.
  2. Guide the model with the expected answer type: we add the answer type as a hint to the question, then evaluate whether the type of the returned answer matches the type we prompted.
  3. Use step-by-step guidance: first ask about each noun or noun phrase in the question, then ask the original question again, and evaluate whether the accuracy of the answer improves (a prompt sketch for aspects 2 and 3 follows this list).
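The sketch below shows how the prompts for aspects 2 and 3 could be constructed. The prompt wording and the example question are assumptions, not the templates used in the paper.

```python
# Illustrative prompt construction for the directional expectation tests;
# the wording and the example question are invented, not the paper's templates.

def answer_type_prompt(question: str, answer_type: str) -> str:
    """Aspect 2: hint the expected answer type, then check the returned type."""
    return f"{question} (Please answer with a {answer_type}.)"


def step_by_step_prompts(question: str, noun_phrases: list[str]) -> list[str]:
    """Aspect 3: ask about each noun phrase first, then re-ask the question."""
    prompts = [f"What do you know about {np}?" for np in noun_phrases]
    prompts.append(question)  # finally ask the original question again
    return prompts


# Example usage with an invented question:
q = "In which year was the university that Barack Obama attended founded?"
print(answer_type_prompt(q, "number"))
for prompt in step_by_step_prompts(q, ["Barack Obama", "the university"]):
    print(prompt)
```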
