Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions
A framework for detailed evaluation of the ability of ChatGPT and similar large-scale language models to answer complex questions.
This repository is a subproject of KSESEU.
If you use the code, please cite the following paper:
Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions [Arxiv](To be added)
The main contributors to this repository are Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Guilin Qi.
We have released the answers of ChatGPT and other models to a total of 194,782 questions across 8 datasets, covering multiple languages; see Datasets we publish.
To our knowledge (as of 2023-3-9), this is the first public release of a large-scale Q&A dataset for ChatGPT.
To evaluate ChatGPT's ability to answer complex questions, we propose an evaluation framework. First, we classify the latent features that constitute complex questions and describe each question under test with multiple labels, so that combinatorial reasoning can be identified. Second, following the black-box test specification of CheckList proposed by Microsoft, we design an evaluation method that introduces CoT hints to measure the reasoning functionality and reliability of large language models when answering complex questions. Our evaluation uses 8 real complex question answering datasets, six English and two multilingual, which also lets us analyze the potential impact of language bias. We compare the evaluation results of ChatGPT, GPT3.5, GPT3, and FLAN-T5 to identify long-standing issues that persist in LLMs. All data and results are available for further analysis.
We group the answers of these models to the KBQA datasets by dataset and by model, and release them in this folder.
answers_from_models : The responses (answers) of these models (ChatGPT, GPT3/GPT3.5, FLAN-T5) to the KBQA datasets listed in Datasets we use.
Datasets | Size | Col.Size | Lang |
---|---|---|---|
KQAPro | 117970 | 106173 | EN |
LC-quad2.0 | 26975 | 26975 | EN |
WQSP | 4737 | 4700 | EN |
CWQ | 31158 | 31158 | EN |
GrailQA | 64331 | 6763 | EN |
GraphQuestions | 4776 | 4776 | EN |
QALD-9 | 6045 | 6045 | Mul |
MKQA | 260000 | 6144 | Mul |
Total Collected | | 194782 | |
datasets : We have processed the 8 datasets listed in Datasets we use into a unified format and released them in this folder. Each record in the unified format includes the following items: question_id, question, ground_truth, SPARQL, and our added labels. Additionally, we have generated alias dictionaries from Wikipedia for the ground truth, which can be used during evaluation.
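As a rough illustration of the unified format described above, the sketch below loads records into a small Python structure. Only the item names `question_id`, `question`, `ground_truth`, and `SPARQL` come from this README; the JSON layout, the `labels` key, the field types, and the class and function names are assumptions made for the example and may not match the released files.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnifiedQuestion:
    """One question in the unified format (field types are assumed)."""
    question_id: str
    question: str
    ground_truth: List[str]
    sparql: str
    labels: List[str] = field(default_factory=list)  # hypothetical key name


def load_unified(json_text: str) -> List[UnifiedQuestion]:
    """Parse a JSON array of records; the array layout is an assumption."""
    rows = json.loads(json_text)
    return [
        UnifiedQuestion(
            question_id=str(row["question_id"]),
            question=row["question"],
            ground_truth=list(row["ground_truth"]),
            sparql=row.get("SPARQL", ""),
            labels=list(row.get("labels", [])),
        )
        for row in rows
    ]


# Made-up record, for illustration only.
SAMPLE = """[
  {"question_id": 1,
   "question": "Which river flows through Paris?",
   "ground_truth": ["Seine"],
   "SPARQL": "",
   "labels": ["LOC", "Single-hop", "en"]}
]"""

records = load_unified(SAMPLE)
```

Check the files in the datasets folder for the actual key names before relying on this layout.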
Given that the training data of large language models (LLMs) covers Wikipedia extensively, we have opted to evaluate the models using open-domain complex question-answering datasets related to Wikipedia. Specifically, we have curated a set of 8 distinct datasets for this purpose, as follows:
Please note : The links in the Source section below refer to the original datasets as published by their respective authors. For the experiments in this paper, we processed these datasets, including random sampling and formatting. Please download the datasets used in our experiments from this folder: datasets.
Monolingual datasets | Source | Paper |
---|---|---|
WebQuestionSP(WQSP) | Download_url | Paper_url |
ComplexWebQuestion(CWQ) | Download_url | Paper_url |
GraphQuestions | Download_url | Paper_url |
GrailQA | Download_url | Paper_url |
KQApro | Download_url | Paper_url |
LC-quad2.0 | Download_url | Paper_url |
Multilingual dataset
Multilingual datasets | Source | Paper |
---|---|---|
QALD-9 | Download_url | Paper_url |
MKQA | Download_url | Paper_url |
We assess the LLM's ability to handle each feature in the CQA scenario through the Minimal Functional Test (MFT). We classify the answer types into 9 categories: Mixed fact (MISC); Reason (WHY); Location (LOC); Time (DATE/TIME); Person (PER); Yes or no (Boolean); Number (NUM); Organization (ORG); Unable to answer (UNA).
We also divide the "reasoning type" labels into eight categories: SetOperation; Filtering; Counting; The most valuable; Sort; Single-hop; Multi-hop; Star-shape.
We also take into account the "language type" label that may have an impact on model performance: de; ru; pt; hi_IN; en; Fa; it; fr; ro; es; nl; pt_BR; zh cn
To make answer matching generalize better, we adopt the simple idea of expanding the matching range, using the following two operations:
- Use the subtree labels provided by the constituency parse tree to extract noun phrases.
- Apply exact matching between the noun phrase list and the answer list.
For samples that remain unmatched, we set a threshold on the cosine similarity between phrase vectors to find potentially correct matches. Candidates above the threshold are then judged manually as right or wrong.
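The matching steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's implementation: it assumes the noun phrases have already been extracted by a constituency parser and that phrase vectors come from some embedding model. All function names, the alias dictionary shape, and the `0.7` threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def expand_with_aliases(answers: List[str], aliases: Dict[str, List[str]]) -> Set[str]:
    """Expand the gold answers with their aliases (dictionary shape is assumed)."""
    expanded = set()
    for answer in answers:
        expanded.add(normalize(answer))
        for alias in aliases.get(answer, []):
            expanded.add(normalize(alias))
    return expanded


def exact_match(noun_phrases: List[str], answers: List[str],
                aliases: Dict[str, List[str]]) -> bool:
    """True if any noun phrase equals a gold answer or one of its aliases."""
    gold = expand_with_aliases(answers, aliases)
    return any(normalize(phrase) in gold for phrase in noun_phrases)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def needs_manual_review(phrase_vecs: List[Sequence[float]],
                        answer_vecs: List[Sequence[float]],
                        threshold: float = 0.7) -> bool:
    """Flag an unmatched sample for manual judgment (threshold is hypothetical)."""
    return any(cosine(p, a) >= threshold
               for p in phrase_vecs for a in answer_vecs)
```

In this sketch a sample is first checked with `exact_match`; only if that fails is `needs_manual_review` consulted, and a flagged sample still needs a human decision.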
The invariance test adds perturbations to the original sentence that should not change the output of the model. Its main purpose is to verify that ChatGPT keeps its answer unchanged as the perturbation increases. We mainly use two methods to perform the invariance test:
- Change the spelling of words in the sentence: imitating common human typing habits, we apply random letter repetition, random letter omission, and stemming to words.
- Rewrite the sentence: paraphrase it without changing its original meaning, and evaluate whether the result changes.
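The spelling perturbation above could look something like the sketch below. It covers only random letter repetition and random letter omission; stemming and paraphrasing are left out. The function names, the `rate` parameter, and the rule of skipping short or non-alphabetic tokens are assumptions made for this example, not the settings used in the paper.

```python
import random


def repeat_letter(word: str, rng: random.Random) -> str:
    """Duplicate one randomly chosen letter, e.g. 'city' -> 'ciity'."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i + 1] + word[i] + word[i + 1:]


def omit_letter(word: str, rng: random.Random) -> str:
    """Drop one randomly chosen letter, e.g. 'city' -> 'cty'."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def perturb_question(question: str, rate: float = 0.3, seed: int = 0) -> str:
    """Apply a random typo to roughly `rate` of the longer alphabetic words."""
    rng = random.Random(seed)  # seeded so a perturbation can be reproduced
    out = []
    for word in question.split():
        # Assumed rule: leave short words and tokens with punctuation alone.
        if word.isalpha() and len(word) > 3 and rng.random() < rate:
            operation = rng.choice([repeat_letter, omit_letter])
            word = operation(word, rng)
        out.append(word)
    return " ".join(out)
```

An invariance check would then send both the original and the perturbed question to the model and compare the two answers.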
The Directional Expectation test perturbs the input in ways whose expected effect is known, and evaluates whether the final result moves in the direction we expect. We mainly conduct directional expectation tests from three aspects:
- Run experiments on the "reasoning type" labels, mainly the SetOperation, Filtering, Counting, and comparison (most value and sorting) types.
- Guide the model with the answer type: prompt the expected answer type along with the question, then evaluate whether the type of the returned answer matches the prompted type.
- Use step-by-step guidance: first ask about each noun or noun phrase in the sentence, then ask the original question again, and evaluate whether the accuracy of the answer improves.
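The last two strategies amount to building modified prompts, which might be sketched as below. The prompt wording and the function names are hypothetical and are not the prompts used in the paper; the noun phrases are assumed to have been extracted beforehand.

```python
from typing import List


def answer_type_prompt(question: str, answer_type: str) -> str:
    """Append a hint about the expected answer type (wording is hypothetical)."""
    return f"{question} The answer type should be: {answer_type}."


def step_by_step_prompts(question: str, noun_phrases: List[str]) -> List[str]:
    """Ask about each noun phrase first, then repeat the original question."""
    prompts = [f"What is {phrase}?" for phrase in noun_phrases]
    prompts.append(question)
    return prompts


# Made-up example question, for illustration only.
question = "Which river flows through the capital of France?"
typed = answer_type_prompt(question, "location")
steps = step_by_step_prompts(question, ["the capital of France"])
```

For the step-by-step variant, the prompts would be sent in order within one conversation, and only the answer to the final prompt is compared against the answer to the unguided question.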