
Zero-Shot Replication Framework

Overview

The Zero-Shot Replication Framework is a tool designed to replicate zero-shot results from recent academic papers or model reports. Additionally, it aims to extend evaluations to better understand the strengths and weaknesses of various approaches. The framework currently supports OpenAI, Anthropic, and HuggingFace models.

Features

  • Simple model and parameter configuration.
  • Choice of datasets for evaluation.
  • Extensibility through a modular provider / model / dataset setup.

pass@1 results (all proprietary models accessed on 08/24-08/25, 2023)

To better understand these results, please see the notes below.

Proprietary Models

| Category | gpt-3.5-turbo-0301 | gpt-3.5-turbo-0613 | claude-2 | gpt-4-0314 | gpt-4-0613 | gpt-4 Baseline | Sources |
|---|---|---|---|---|---|---|---|
| Standard Bench | | | | | | | |
| HumanEval | 67.0 | 61.5 | 65.2 | 86.0 | 84.1 | 67.0 | [1] |
| HumanEval+ | 59.1 | 54.2 | 54.9 | 80.5 | 74.4 | N/A | |
| MATH | 35.4 | 37.2 | 17.6 | 51.6 | 50.3 | 42.2 | [3] |
| LeetCodeSparks | | | | | | | [1,2] |
| Easy | 60.0 | 76.2 | 52.4 | 76.2 | 61.2 | 68.2-75.6 | [1,2]* |
| Medium | 15.0 | 22.0 | 9.8 | 19.5 | 31.7 | 26.7-40.0 | [1,2]* |
| Hard | 0.0 | 0.0 | 0.0 | 4.6 | 13.6 | 6.6-10.7 | [1,2]* |
| LeetCode100 | | | | | | | |
| Easy | 83.0 | 80.0 | 73.0 | 91.0 | 88.0 | N/A | |
| Medium | 16.0 | 16.0 | 16.0 | 26.0 | 21.0 | N/A | |
| Hard | 1.0 | 3.0 | 2.0 | 6.0 | 6.0 | N/A | |

OpenSource Models (vs latest GPT-4)

| Category | code-llama-34b | wizard-coder-34b | phind-v2-34b |
|---|---|---|---|
| Standard Bench | | | |
| HumanEval | 56.7 | 69.5 | 75.0 |
| HumanEval+ | 48.2 | 60.3 | 70.1 |
| LeetCodeSparks | | | |
| Easy | 33.3 | 42.9 | 52.4 |
| Medium | 2.4 | 12.2 | 7.3 |
| Hard | 0.0 | 0.0 | 0.0 |
| LeetCode100 | | | |
| Easy | 53.0 | 68.0 | 63.0 |
| Medium | 3.0 | 9.0 | 5.0 |
| Hard | 0.0 | 0.0 | 3.0 |

Notes on Results

  • Our modified prompting for HumanEval may differ from the prompting used in other benchmark reports.
  • The GPT-4 LeetCodeSparks baseline is approximate. We don't have a precise list of LeetCode problems from the referenced reports.
  • We define 'LeetCodeSparks' as the 84 problems used for the human evaluation measurement mentioned in [2].
  • 'LeetCode100' is our out-of-sample dataset, introducing 100 recent easy, medium, and hard LeetCode problems (problems 2554-2818).

Installation

# Repository setup
git clone https://github.com/your-username/zero-shot-replication.git
cd zero-shot-replication
git submodule update --init --recursive
# Install dependencies
poetry install

Optional Dependencies

  • vllm_support: For vLLM functionality, required for the WizardCoder model.
  • automata: For Automata agent evaluations.
  • python-leetcode: For LeetCode evaluations.
  • evalplus: For HumanEval and HumanEval+ evaluations.
  • quantized_support: For running 4-bit or 8-bit models.

Possible Weirdness

I sometimes see that setting torch==2.0.1 results in issues with CUDA environment initialization on my remote machine. One workaround was to first install torch==2.0.0 (which requires commenting out vllm), and then to increment the torch version and uncomment vllm. This may solve some users' issues.
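
A minimal sketch of that workaround, assuming the environment is managed with Poetry and that vllm is declared in pyproject.toml (exact steps may differ on your machine):

# 1. Comment out the vllm dependency in pyproject.toml, then install the older torch
poetry run pip install torch==2.0.0
# 2. Uncomment vllm in pyproject.toml, bump torch, and reinstall
poetry run pip install torch==2.0.1
poetry install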


Requirements

  • Python >= 3.11 and < 3.12
  • Poetry for package management

Optional Feature Requirements

For additional features, you can install the corresponding optional dependencies (example invocations are shown after the list below):

poetry install -E <extra_name>
  • WizardCoder Model Gen.: vllm_support
  • Phind Model Gen.: transformers must currently be installed from git (by hand)
  • Automata Agent Gen.: automata
  • LeetCode Evaluation: python-leetcode
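
For example, using the extra names from the list above, WizardCoder generation and the LeetCode/EvalPlus evaluations could be enabled like this:

# WizardCoder generation via vLLM
poetry install -E vllm_support
# LeetCode and HumanEval/HumanEval+ evaluation extras (can be combined in one call)
poetry install -E python-leetcode -E evalplus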

Usage

You can run a zero-shot replication by executing runner.py with various command-line arguments.

poetry run python runner.py --provider openai --dataset human-eval --model gpt-4-0613 --temperature 0.7

Command-Line Arguments

  • --provider: Which provider to use for zero-shot completions (default: "openai").
  • --dataset: Which dataset to run on (default: "human-eval").
  • --model: Model name to load from the provider (default: "gpt-3.5-turbo").
  • --temperature: Temperature parameter for the provided model (default: 0.7).
  • --output_file_name: Filename used to override the default output file name.
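
As a further illustration, here is a hypothetical invocation targeting the Anthropic provider; the provider string "anthropic" and the output file name below are assumptions, so check commands.md for the exact invocations used for the reported results.

# Hypothetical example; flag values not documented above are assumptions
poetry run python runner.py --provider anthropic --dataset human-eval --model claude-2 --temperature 0.7 --output_file_name claude-2_human-eval.jsonl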

To see the explicit commands run to generate the reported results, check out commands.md.

License

This project is licensed under the Apache-2.0 License.

Sources

[1] GPT-4 Technical Report

[2] Sparks of Artificial General Intelligence

[3] Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Contributors

brutalsavage, emrgnt-cmplxty, nolantrem, yifever


zero-shot-replication's Issues

Specifying temperature for results?

Hey, great to see a table of results like this, but one thing I noticed is that temperatures aren't specified in the table. It seems like a good thing to specify, since from the results folder they appear to differ between models (most are 0.7, but if I'm reading it right, WizardCoder is 0.2 for some reason?). Have you tried multiple runs with different temperatures to see how the scores change?

A few other questions:

  • I noticed in commands.md that for HumanEval the runner uses --pset=human-eval, but the eval uses --dataset humaneval. Is this difference just down to how evalplus does its evaluation?

  • I've made some changes in a local version to support my already-downloaded model, since I plan on testing Phind's fine-tune next. Would it make sense for HuggingFace models to be able to pass in arbitrary model names?

  • I am currently testing w/ bitsandbytes (load_in_4bit and load_in_8bit) since I'm interested in how quantization affects real-world performance. Is this something you'd want a pull request for as a new runner option?

  • Along those lines, GPTQ and GGML/GGUF quants are very popular in the community (the latter in particular is being built into more and more client-facing apps). Would you be interested in a PR for adding those? GPTQ is, I believe, already integrated into transformers, and GGML/GGUF wouldn't be too hard to load with llama-cpp-python.

Related project, working together

Hi, great project!

I am working on FastEval, a project to quickly evaluate language models on various benchmarks. The general goal is different from your project, since I am focusing on an evaluation framework that evaluates all models under the same evaluation setting, which is different from replicating the exact results from papers.

Still, I thought that you might be interested, since there are many similarities, like the focus on zero-shot evaluation and the benchmarks. For example, FastEval currently also implements HumanEval+ and MATH, along with others, all in a zero-shot setting. The scores are different from yours due to the different project goals and resulting implementations, though often the scores are quite close to the papers.

I believe that despite the different goals of the projects, there are things we could work on together since we are both interested in them. For example, I believe that you might also be interested in fast inference implementations (vLLM, other inference backends, data parallelism...) that can speed up evaluation by 20x or more, or model-specific prompt templates that are often required for evaluating models in the same way the authors do.

Let me know if you are interested in working together on the parts that are common to our projects; we could then discuss the details. If not, because the project goals are too different or so, then that's also fine. Either way, great work! ^_^
