Home Page: https://llm-enem.streamlit.app/

Solving the Enem exam with LLMs

About

This repository aims to run LLMs on the Enem, a Brazilian University Admission Exam.

It employs the approach from the paper Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams, using the dataset released by its authors, named ENEM 2022.

Evaluated models: GPT-3.5, GPT-4, Falcon 7B, LLaMA2 7B, and MariTalk.

The code was written to have few dependencies and to make it easy to use LLMs other than OpenAI's.

Dataset

ENEM 2022

The ENEM 2022 dataset is available in the dataset/enem folder in a processed format, ready to use with the LLMs. The processing follows the instructions given by the dataset's authors, with small modifications. To replicate it, replace the original write_out.py file with the dataset/enem/write_out.py file.

The original Enem exam used to build the ENEM 2022 dataset can be downloaded here and here.

Installation

Note:

This project was developed on Windows 11 with Python 3.10.9.

Clone this repository, create a new environment (recommended) and install the dependencies:

pip install -r requirements.txt
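For example, the environment can be created with Python's built-in venv module (a possible sequence; the project does not mandate any particular environment tool):

```shell
# create an isolated environment in the .venv folder (name is arbitrary)
python3 -m venv .venv

# activate it (Linux/macOS; on Windows PowerShell run .venv\Scripts\Activate.ps1)
. .venv/bin/activate

# then install the dependencies as shown above
```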

Usage

Evaluate OpenAI LLMs

1. Set the OpenAI API key:

Visit OpenAI to retrieve your API key and add it to your environment variables.

On Windows:

$Env:OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

On Linux:

export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
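Before running the evaluator, you can confirm the variable is visible to Python. The helper below is purely illustrative and not part of the repository's code:

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named key from the environment, or fail with a
    clear hint instead of a confusing error mid-evaluation."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running evaluator.py")
    return key
```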

2. Run the evaluation script:

You can run with any model whose name starts with gpt-3.5-turbo or gpt-4. The results reported in this repository are from the gpt-3.5-turbo-0613 and gpt-4-0613 versions.

For the dataset, the options are Zero-shot, Few-shot, and Few-shot with Chain-of-Thought, which evaluate the dataset files enem_2022_0_shot.json, enem_2022_3_shot.json, and enem_cot_2022_3_shot.json, respectively.

python evaluator.py evaluate --models "['gpt-3.5-turbo-0613', 'gpt-4-0613']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"

The results will be placed in the reports folder (beware: this overwrites any existing files there). To produce an HTML file with a summary table like the one in the Results section, run:

python evaluator.py build_results_table --models "['gpt-3.5-turbo-0613', 'gpt-4-0613']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "gpt_results.html"
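The --models and --dataset_names arguments are quoted Python-style list literals. One way such strings can be parsed safely is with ast.literal_eval; the sketch below is illustrative and not the repository's actual CLI code:

```python
import ast

def parse_list_arg(value):
    """Parse a CLI value that may be a Python-style list literal,
    e.g. "['gpt-4-0613']", into an actual list."""
    if isinstance(value, str):
        parsed = ast.literal_eval(value)  # safe: only evaluates literals
        if not isinstance(parsed, list):
            raise ValueError(f"expected a list literal, got {value!r}")
        return parsed
    return list(value)
```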

Evaluate MariTalk

MariTalk is currently free to use; thus, my API key is written directly in the code.

1. Run the evaluation script:

python evaluator.py evaluate --models "['MariTalk']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
python evaluator.py build_results_table --models "['MariTalk']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "maritalk_results.html"

Evaluate LLMs hosted on Hugging Face via Inference Endpoints

1. Create a Hugging Face Inference Endpoint

Select an LLM from the Hugging Face model hub.

This repository was tested with the Falcon-7B and LLaMA-2-7B models.

Create an endpoint on the Hugging Face Inference Endpoints platform.

2. Set the environment parameters:

Visit the endpoint UI to retrieve your token, endpoint name, and endpoint URL, and add them to your environment variables:

$Env:huggingface_token="hf_xxxxxxxxxxxxxxxxxx"
$Env:huggingface_namespace="xxxxxxxxxxxxxxxxxx"

Using the Falcon-7B model as an example, set the per-model environment variables following this pattern:

$Env:huggingface_Falcon7B_name="xxxxxxxxxxxxxxxxxx"
$Env:huggingface_Falcon7B_url="https://xxxxxxxxxxxxxxxxxx.endpoints.huggingface.cloud"
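Reading these variables back in Python might look like the sketch below (endpoint_config is an illustrative helper, not the repository's actual code; it only assumes the huggingface_&lt;model&gt;_name / huggingface_&lt;model&gt;_url naming pattern shown above):

```python
import os

def endpoint_config(model_key: str) -> dict:
    """Collect the Inference Endpoint settings for one model from the
    environment, following the huggingface_<model>_* naming pattern."""
    return {
        "token": os.environ["huggingface_token"],
        "namespace": os.environ["huggingface_namespace"],
        "name": os.environ[f"huggingface_{model_key}_name"],
        "url": os.environ[f"huggingface_{model_key}_url"],
    }
```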

3. Run the evaluation script:

python evaluator.py evaluate --models "['Falcon-7B', 'LLaMA-2-7B']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
python evaluator.py build_results_table --models "['Falcon-7B', 'LLaMA-2-7B']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "falcon_llama_results.html"

Streamlit Demo


The Streamlit demo is available for MariTalk and the OpenAI models.

streamlit run streamlit_app.py

Results

GPT-3.5 and GPT-4 Models

Evaluation on the ENEM 2022 dataset, with the models gpt-3.5-turbo-0613 and gpt-4-0613:

| Area | gpt-3.5 zero-shot | gpt-3.5 three-shot | gpt-3.5 three-shot (CoT) | gpt-4 zero-shot | gpt-4 three-shot | gpt-4 three-shot (CoT) |
|---|---|---|---|---|---|---|
| Languages and Codes | 25/33 (75.76%) | 28/33 (84.85%) | 25/33 (75.76%) | 30/33 (90.91%) | 29/33 (87.88%) | 30/33 (90.91%) |
| Human Sciences | 34/37 (91.89%) | 33/37 (89.19%) | 33/37 (89.19%) | 35/37 (94.59%) | 36/37 (97.30%) | 35/37 (94.59%) |
| Natural Sciences | 19/26 (73.08%) | 19/26 (73.08%) | 19/26 (73.08%) | 20/26 (76.92%) | 22/26 (84.62%) | 21/26 (80.77%) |
| Mathematics | 11/22 (50.00%) | 3/22 (13.64%) | 6/22 (27.27%) | 8/22 (36.36%) | 10/22 (45.45%) | 16/22 (72.73%) |
| Total | 89/118 (75.42%) | 83/118 (70.34%) | 83/118 (70.34%) | 93/118 (78.81%) | 97/118 (82.20%) | 102/118 (86.44%) |
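The Total row is just the sum of the per-area counts; for instance, for gpt-3.5-turbo-0613 zero-shot:

```python
# correct answers and question counts per area (gpt-3.5-turbo-0613, zero-shot)
correct = {"Languages and Codes": 25, "Human Sciences": 34,
           "Natural Sciences": 19, "Mathematics": 11}
questions = {"Languages and Codes": 33, "Human Sciences": 37,
             "Natural Sciences": 26, "Mathematics": 22}

hits = sum(correct.values())
total = sum(questions.values())
print(f"{hits}/{total} ({hits / total:.2%})")  # 89/118 (75.42%)
```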

Detailed results can be seen in the reports folder.

MariTalk Model

Evaluation on the ENEM 2022 dataset, with the model MariTalk:

| Area | zero-shot | three-shot | three-shot (CoT) |
|---|---|---|---|
| Languages and Codes | 15/33 (45.45%) | 20/33 (60.61%) | 18/33 (54.55%) |
| Human Sciences | 22/37 (59.46%) | 22/37 (59.46%) | 31/37 (83.78%) |
| Natural Sciences | 15/26 (57.69%) | 10/26 (38.46%) | 15/26 (57.69%) |
| Mathematics | 6/22 (27.27%) | 1/22 (4.55%) | 5/22 (22.73%) |
| Total | 58/118 (49.15%) | 53/118 (44.92%) | 69/118 (58.47%) |

Detailed results can be seen in the reports folder.

Falcon-7B and LLaMA-2-7B Models

The evaluation on the ENEM 2022 dataset with the Falcon-7B and LLaMA-2-7B models was done using Hugging Face Inference Endpoints. These models require further investigation into how to build better prompts and how to automate the interpretation of their outputs. As can be seen in the detailed reports folder, there are several issues, such as mixing English with Portuguese, answering with gibberish, and badly formatted answers. The corresponding results table should therefore not be relied upon; it is kept in the repository for informational purposes only.
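To illustrate why automated interpretation is hard, a very naive answer extractor (illustrative only, not the repository's actual logic) might look for an explicit marker or a parenthesised option letter, and still miss many of the malformed outputs described above:

```python
import re

def extract_choice(output: str):
    """Naively extract a multiple-choice answer (A-E) from raw model
    output; returns None when no clear answer is found."""
    # explicit marker first, e.g. "Resposta: C" or "Answer: C"
    m = re.search(r"(?:Resposta|Answer)\s*:?\s*([A-Ea-e])\b", output)
    if m:
        return m.group(1).upper()
    # otherwise a parenthesised option letter such as "(B)"
    m = re.search(r"\(([A-E])\)", output)
    return m.group(1) if m else None
```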

Citation

If you use the ENEM 2022 dataset in your research, including the processed version released in this repository, please cite the original work.

Also, if you use this code or the results published in this repository in your research, please cite:

@misc{arruda2023,
  author = {Vinicius Arruda},
  title = {Solving the Enem exam with LLMs},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/viniciusarruda/llm-enem}},
}
