Home Page: https://llm-enem.streamlit.app/

Solving the Enem exam with LLMs

About

This repository aims to run LLMs on the Enem, a Brazilian University Admission Exam.

It employs the approach from the paper Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams, using the dataset released by its authors, named ENEM 2022.

Evaluated models: GPT-3.5, GPT-4, Falcon 7B, LLaMA2 7B, and MariTalk.

The code was written to have few dependencies and to make it easy to use LLMs other than OpenAI's.

Dataset

ENEM 2022

The ENEM 2022 dataset is available in the dataset/enem folder in a processed format, ready to use with the LLMs. The processing follows the instructions given by the dataset's authors, with small modifications. To replicate it, replace the original write_out.py file with the dataset/enem/write_out.py file.

The original Enem exam used to build the ENEM 2022 dataset can be downloaded here and here.

Installation

Note:

This project was developed on Windows 11 with Python 3.10.9.

Clone this repository, create a new environment (recommended) and install the dependencies:

pip install -r requirements.txt
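For example, the environment can be created with Python's built-in venv module (a possible sequence; the project does not mandate any particular environment tool):

```shell
# create an isolated environment in the .venv folder (name is arbitrary)
python3 -m venv .venv

# activate it (Linux/macOS; on Windows PowerShell run .venv\Scripts\Activate.ps1)
. .venv/bin/activate

# then install the dependencies as shown above
```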

Usage

Evaluate OpenAI LLMs

1. Set the OpenAI API key:

Visit OpenAI to retrieve your API key and add it to your environment variables.

On Windows:

$Env:OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

On Linux:

export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
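Before running the evaluator, you can confirm the variable is visible to Python. The helper below is purely illustrative and not part of the repository's code:

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named key from the environment, or fail with a
    clear hint instead of a confusing error mid-evaluation."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running evaluator.py")
    return key
```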

2. Run the evaluation script:

You can run with any model whose name starts with gpt-3.5-turbo or gpt-4. The results reported in this repository are from the gpt-3.5-turbo-0613 and gpt-4-0613 versions.

For the dataset, the options are Zero-shot, Few-shot, and Few-shot with Chain-of-Thought, which evaluate the dataset files enem_2022_0_shot.json, enem_2022_3_shot.json, and enem_cot_2022_3_shot.json, respectively.

python evaluator.py evaluate --models "['gpt-3.5-turbo-0613', 'gpt-4-0613']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"

The results will be placed in the reports folder (beware: this overwrites any existing files there). To produce an HTML file with a summary table like the one in the Results section, run:

python evaluator.py build_results_table --models "['gpt-3.5-turbo-0613', 'gpt-4-0613']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "gpt_results.html"
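The --models and --dataset_names arguments are quoted Python-style list literals. One way such strings can be parsed safely is with ast.literal_eval; the sketch below is illustrative and not the repository's actual CLI code:

```python
import ast

def parse_list_arg(value):
    """Parse a CLI value that may be a Python-style list literal,
    e.g. "['gpt-4-0613']", into an actual list."""
    if isinstance(value, str):
        parsed = ast.literal_eval(value)  # safe: only evaluates literals
        if not isinstance(parsed, list):
            raise ValueError(f"expected a list literal, got {value!r}")
        return parsed
    return list(value)
```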

Evaluate MariTalk

MariTalk is currently free to use; thus, my API key is written directly in the code.

1. Run the evaluation script:

python evaluator.py evaluate --models "['MariTalk']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
python evaluator.py build_results_table --models "['MariTalk']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "maritalk_results.html"

Evaluate LLMs hosted on Hugging Face via Inference Endpoints

1. Create a Hugging Face Inference Endpoint

Select an LLM from the Hugging Face model hub.

This repository was tested with the Falcon-7B and LLaMA-2-7B models.

Create an endpoint on the Hugging Face Inference Endpoints platform.

2. Set the environment parameters:

Visit the endpoint UI to retrieve your token, endpoint name, and endpoint URL, and add them to your environment variables:

$Env:huggingface_token="hf_xxxxxxxxxxxxxxxxxx"
$Env:huggingface_namespace="xxxxxxxxxxxxxxxxxx"

Using the Falcon-7B model as an example, set the per-model environment variables following this pattern:

$Env:huggingface_Falcon7B_name="xxxxxxxxxxxxxxxxxx"
$Env:huggingface_Falcon7B_url="https://xxxxxxxxxxxxxxxxxx.endpoints.huggingface.cloud"
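Reading these variables back in Python might look like the sketch below (endpoint_config is an illustrative helper, not the repository's actual code; it only assumes the huggingface_&lt;model&gt;_name / huggingface_&lt;model&gt;_url naming pattern shown above):

```python
import os

def endpoint_config(model_key: str) -> dict:
    """Collect the Inference Endpoint settings for one model from the
    environment, following the huggingface_<model>_* naming pattern."""
    return {
        "token": os.environ["huggingface_token"],
        "namespace": os.environ["huggingface_namespace"],
        "name": os.environ[f"huggingface_{model_key}_name"],
        "url": os.environ[f"huggingface_{model_key}_url"],
    }
```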

3. Run the evaluation script:

python evaluator.py evaluate --models "['Falcon-7B', 'LLaMA-2-7B']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
python evaluator.py build_results_table --models "['Falcon-7B', 'LLaMA-2-7B']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "falcon_llama_results.html"

Streamlit Demo


The Streamlit demo is available for MariTalk and the OpenAI models.

streamlit run streamlit_app.py

Results

GPT-3.5 and GPT-4 Models

Evaluation on the ENEM 2022 dataset, with the models gpt-3.5-turbo-0613 and gpt-4-0613:

| Area | gpt-3.5 zero-shot | gpt-3.5 three-shot | gpt-3.5 three-shot (CoT) | gpt-4 zero-shot | gpt-4 three-shot | gpt-4 three-shot (CoT) |
|---|---|---|---|---|---|---|
| Languages and Codes | 25/33 (75.76%) | 28/33 (84.85%) | 25/33 (75.76%) | 30/33 (90.91%) | 29/33 (87.88%) | 30/33 (90.91%) |
| Human Sciences | 34/37 (91.89%) | 33/37 (89.19%) | 33/37 (89.19%) | 35/37 (94.59%) | 36/37 (97.30%) | 35/37 (94.59%) |
| Natural Sciences | 19/26 (73.08%) | 19/26 (73.08%) | 19/26 (73.08%) | 20/26 (76.92%) | 22/26 (84.62%) | 21/26 (80.77%) |
| Mathematics | 11/22 (50.00%) | 3/22 (13.64%) | 6/22 (27.27%) | 8/22 (36.36%) | 10/22 (45.45%) | 16/22 (72.73%) |
| Total | 89/118 (75.42%) | 83/118 (70.34%) | 83/118 (70.34%) | 93/118 (78.81%) | 97/118 (82.20%) | 102/118 (86.44%) |
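The Total row is just the sum of the per-area counts; for instance, for gpt-3.5-turbo-0613 zero-shot:

```python
# correct answers and question counts per area (gpt-3.5-turbo-0613, zero-shot)
correct = {"Languages and Codes": 25, "Human Sciences": 34,
           "Natural Sciences": 19, "Mathematics": 11}
questions = {"Languages and Codes": 33, "Human Sciences": 37,
             "Natural Sciences": 26, "Mathematics": 22}

hits = sum(correct.values())
total = sum(questions.values())
print(f"{hits}/{total} ({hits / total:.2%})")  # 89/118 (75.42%)
```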

Detailed results can be seen in the reports folder.

MariTalk Model

Evaluation on the ENEM 2022 dataset, with the model MariTalk:

| Area | zero-shot | three-shot | three-shot (CoT) |
|---|---|---|---|
| Languages and Codes | 15/33 (45.45%) | 20/33 (60.61%) | 18/33 (54.55%) |
| Human Sciences | 22/37 (59.46%) | 22/37 (59.46%) | 31/37 (83.78%) |
| Natural Sciences | 15/26 (57.69%) | 10/26 (38.46%) | 15/26 (57.69%) |
| Mathematics | 6/22 (27.27%) | 1/22 (4.55%) | 5/22 (22.73%) |
| Total | 58/118 (49.15%) | 53/118 (44.92%) | 69/118 (58.47%) |

Detailed results can be seen in the reports folder.

Falcon-7B and LLaMA-2-7B Models

The evaluation on the ENEM 2022 dataset with the Falcon-7B and LLaMA-2-7B models was done using Hugging Face Inference Endpoints. These models require further investigation into how to build better prompts and how to automate the interpretation of their outputs. As can be seen in the detailed reports folder, there are several issues, such as mixing English with Portuguese, answering with gibberish, and badly formatted answers. The corresponding results table should therefore not be relied upon; it is kept in the repository for informational purposes only.
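To illustrate why automated interpretation is hard, a very naive answer extractor (illustrative only, not the repository's actual logic) might look for an explicit marker or a parenthesised option letter, and still miss many of the malformed outputs described above:

```python
import re

def extract_choice(output: str):
    """Naively extract a multiple-choice answer (A-E) from raw model
    output; returns None when no clear answer is found."""
    # explicit marker first, e.g. "Resposta: C" or "Answer: C"
    m = re.search(r"(?:Resposta|Answer)\s*:?\s*([A-Ea-e])\b", output)
    if m:
        return m.group(1).upper()
    # otherwise a parenthesised option letter such as "(B)"
    m = re.search(r"\(([A-E])\)", output)
    return m.group(1) if m else None
```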

Citation

If you use the ENEM 2022 dataset in your research, including the processed version released in this repository, please cite the original work.

Also, if you use this code or the results published in this repository in your research, please cite:

@misc{arruda2023,
  author = {Vinicius Arruda},
  title = {Solving the Enem exam with LLMs},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/viniciusarruda/llm-enem}},
}
