The Zero-Shot Replication Framework is a minimal environment designed to replicate zero-shot results from past academic papers. It currently supports OpenAI, Anthropic, and HuggingFace models for generating completions on various datasets, and provides tools for handling, evaluating, and storing those completions.
| Category | gpt-3.5-turbo-0301 | gpt-3.5-turbo-0613 | claude-2 | gpt-4-0314 | gpt-4-0613 | wizard-coder-34b | gpt-4 Baseline | Sources |
|---|---|---|---|---|---|---|---|---|
| Standard Bench | | | | | | | | |
| HumanEval | 67.0 | 61.5 | 65.2 | 86.0 | 84.1 | 70.7 | 67.0 | [1] |
| HumanEval+ | 59.1 | 54.2 | 54.9 | 80.5 | 74.4 | 60.3 | N/A | |
| MATH | 35.4 | 37.2 | 17.6 | 51.6 | 50.3 | N/A | 42.2 | [3] |
| LeetCodeSparks | | | | | | | | [1,2] |
| Easy | 60.0 | 76.2 | 52.4 | 76.2 | 61.2 | 38.1 | 68.2-75.6 | [1,2]* |
| Medium | 15.0 | 22.0 | 9.8 | 19.5 | 31.7 | 12.2 | 26.7-40.0 | [1,2]* |
| Hard | 0.0 | 0.0 | 0.0 | 4.6 | 13.6 | 0.0 | 6.6-10.7 | [1,2]* |
| LeetCode100 | | | | | | | | |
| Easy | 83.0 | 80.0 | 73.0 | 91.0 | 88.0 | 71.0 | N/A | |
| Medium | 16.0 | 16.0 | 16.0 | 26.0 | 21.0 | 9.0 | N/A | |
| Hard | 1.0 | 3.0 | 2.0 | 6.0 | 6.0 | 2.0 | N/A | |
*The gpt-4 LeetCodeSparks baseline is approximate, as the referenced reports do not list the precise set of LeetCode problems used. We define 'LeetCodeSparks' as the 84 problems used for the human-evaluation measurement mentioned in [2].

'LeetCode100' is a dataset we introduce of 100 recent easy, medium, and hard LeetCode problems (numbers 2554-2818), recent enough that they are expected to be out of sample for the models evaluated.
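Since LeetCode100 has exactly 100 problems, each reported score is simply the count of problems solved, read as a percentage. A trivial sketch of that reading:

```python
def pass_rate(num_solved: int, num_problems: int = 100) -> float:
    """Score as reported in the table: percentage of problems solved."""
    return 100.0 * num_solved / num_problems

print(pass_rate(91))  # gpt-4-0314 on LeetCode100 Easy -> 91.0
```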
- Easy configuration of models and parameters.
- Ability to choose datasets to run on.
- Extensibility through a pluggable problem generator.
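The pluggable problem generator mentioned above can be pictured with a minimal sketch. The names here (`ProblemGenerator`, `ToyGenerator`, `generate_problems`) are hypothetical stand-ins, not the framework's actual interface, which this README does not show:

```python
# Hypothetical sketch of a pluggable problem generator; the real
# interface in the framework may differ.
from abc import ABC, abstractmethod
from typing import Iterator, Tuple


class ProblemGenerator(ABC):
    """Yields (task_id, prompt) pairs for a dataset."""

    @abstractmethod
    def generate_problems(self) -> Iterator[Tuple[str, str]]:
        ...


class ToyGenerator(ProblemGenerator):
    """A one-problem dataset, standing in for e.g. a HumanEval loader."""

    def generate_problems(self) -> Iterator[Tuple[str, str]]:
        yield ("toy/0", 'def add(a, b):\n    """Return a + b."""\n')


for task_id, prompt in ToyGenerator().generate_problems():
    print(task_id)  # -> toy/0
```

Any new dataset then only needs to supply its own generator; the rest of the completion/evaluation pipeline stays unchanged.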
- Python >= 3.10 and < 3.12
- Poetry for package management
- anthropic: 0.3.10
- astunparse: 1.6.3
- black: ^23.3.0
- evalplus: ^0.1.6
- numpy: ^1.25.2
- openai: 0.27.8
- pandas: ^2.0.3
- python-dotenv: ^1.0.0
- python-leetcode: 1.2.1
- automata (optional extra)
- transformers: ^4.32.0
- torch: 1.13.1
- accelerate: ^0.22.0
- sentencepiece: ^0.1.99
- protobuf: ^4.24.1
- flake8: 6.1.0
- isort: 5.12.0
- mypy: ^1.5.1
- pre-commit: ^3.3.3
- sourcery: ^1.6.0
- types-requests: ^2.31.0.2
- types-attrs: ^19.1.0
- yapf: 0.40.1
Make sure you have Poetry installed, then clone the repository and install the dependencies:

```bash
# Clone the repository and its submodules
git clone https://github.com/your-username/zero-shot-replication.git
cd zero-shot-replication
git submodule update --init --recursive

# Install the dependencies (use `poetry install -E automata` to include automata)
poetry install

# Copy the example environment file, then edit .env to add your OpenAI API key, etc.
cp .env.example .env

# Optional: if developing, install the pre-commit hooks
# pre-commit install

# Optional: if using automata, add the repo as a submodule
# git submodule add -f https://github.com/emrgnt-cmplxty/zero-shot-replication.git zero_shot_replication/automata
```
You can run the zero-shot replication by executing the `runner.py` file with various command-line arguments:

```bash
poetry run python runner.py --provider openai --dataset human-eval --model gpt-4-0613 --temperature 0.7
```
- `--provider`: Which provider to use for zero-shot completions (default: "openai").
- `--dataset`: Which dataset to run on (default: "human-eval").
- `--model`: Model name to load from the provider (default: "gpt-3.5-turbo").
- `--temperature`: Temperature parameter for the provided model (default: 0.7).
- `--output_file_name`: Filename to use in place of the default output file name.
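These flags compose naturally into parameter sweeps. The sketch below only builds the corresponding `poetry run` command lines as argv lists (it does not execute anything), using just the flags documented above:

```python
# Builds runner.py invocations for a small temperature sweep.
# Only constructs argv lists; nothing is executed.
import shlex


def build_command(provider="openai", dataset="human-eval",
                  model="gpt-3.5-turbo", temperature=0.7):
    """Return one zero-shot run as an argv list (e.g. for subprocess.run)."""
    return [
        "poetry", "run", "python", "runner.py",
        "--provider", provider,
        "--dataset", dataset,
        "--model", model,
        "--temperature", str(temperature),
    ]


for t in (0.2, 0.7):
    print(shlex.join(build_command(model="gpt-4-0613", temperature=t)))
```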
To see the exact commands run to generate the reported results, check out the `commands.md` file.
This project is licensed under the Apache-2.0 License.