
This project is a fork of bigcode-project/bigcode-evaluation-harness.


A framework for the evaluation of autoregressive code generation language models.

License: Apache License 2.0



Code Generation LM Evaluation Harness

Features

This is a framework to evaluate autoregressive code generation language models. It is a work in progress, part of the BigCode project, and is inspired by EleutherAI/lm-evaluation-harness for evaluating language models in general. We welcome contributions to fix issues, enhance features, and add new benchmarks. You can find contribution guides in docs/guide.md and CONTRIBUTING.md, and more documentation in docs/README.md.

The tasks covered by this framework include, among others, MBPP, APPS, CoNaLa, Concode, and the CodeXGLUE code-to-text benchmarks used in the examples below.

Setup

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness

Install torch according to your device type, then install the remaining packages with:

pip install -r requirements.txt
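
For example, on a machine with CUDA 11.8 you might install torch first with the command below; this is an illustration only, so check pytorch.org for the right command for your platform:

pip install torch --index-url https://download.pytorch.org/whl/cu118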

Also make sure you have git-lfs installed and that you are logged in to the Hub.
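
As an illustration, on a Debian/Ubuntu machine git-lfs can be set up as follows (use your platform's package manager otherwise):

sudo apt-get install git-lfs   # install the git-lfs package
git lfs install                # enable the git-lfs hooks for your user

Then log in to the Hub with: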

huggingface-cli login

We use accelerate to generate code/text in parallel when multiple GPUs are present (multi-GPU mode). You can configure it using:

accelerate config
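
accelerate config walks you through an interactive questionnaire. To double-check the saved configuration and your environment afterwards, you can run:

accelerate env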

This evaluation harness can also be used in an evaluation-only mode with a multi-CPU setting. For this mode, you can find example setup instructions in evaluation_setup.sh, where we configure the environment and evaluate some MBPP generations downloaded from the Hub.

Usage

You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions, or to do both. While it is better to use GPUs for generation, evaluation only requires CPUs, so it can be beneficial to separate the two steps. By default, both generation and evaluation are performed.
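
As a sketch of that split, using the flags described in the sections below (the exact boolean-flag syntax may vary; check python main.py --help):

# Step 1, on a GPU machine: generate solutions only and save them to a JSON file
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks <TASK_NAME> \
  --generation_only \
  --n_samples 100 \
  --batch_size 10

# Step 2, on a CPU machine: execute and score the generations saved in step 1
accelerate launch main.py \
  --tasks <TASK_NAME> \
  --allow_code_execution=True \
  --generations_path generations.json \
  --model <MODEL_NAME>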

For more details on how to evaluate on the tasks, please refer to the documentation in docs/README.md.

Generation and evaluation

Below is a template command for generating and evaluating on a task, followed by an example.

accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks <TASK_NAME> \
  --limit <NUMBER_PROBLEMS> \
  --max_length_generation <MAX_LENGTH> \
  --temperature <TEMPERATURE> \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
  --allow_code_execution=False 

  • limit sets the number of problems to solve; if it is not provided, all problems in the benchmark are selected.
  • allow_code_execution enables execution of the generated code: read the displayed warning before setting it to True.
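
For illustration, a hypothetical run on the humaneval task could look like this (the model name and parameter values are examples only, not recommendations):

accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks humaneval \
  --limit 10 \
  --max_length_generation 512 \
  --temperature 0.8 \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
  --allow_code_execution=False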

Some tasks don't require code execution, such as codexglue_code_to_text-<LANGUAGE>/codexglue_code_to_text-python-left/conala/concode, which use BLEU evaluation. In addition, we generate one candidate solution for each problem in these tasks, so use n_samples=1 and batch_size=1 (note that batch_size should always be less than or equal to n_samples); a sketch follows the list below.

  • For APPS tasks, you can use n_samples=1 for strict and average accuracies (from the original APPS paper) and n_samples>1 for pass@k.
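
For example, a single-candidate run on one of the BLEU-scored tasks might look like the following sketch (substitute your own model):

accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks codexglue_code_to_text-python-left \
  --n_samples 1 \
  --batch_size 1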

Generation only

If you want to generate solutions without executing and evaluating the code, set generation_only to True, in addition to the instructions above. This will save the solutions in a json file in the working directory.

Evaluation only

If you already have the generations in a JSON file from this evaluation harness and want to evaluate them, specify the path to the generations via the generations_path argument. You may need to reconfigure accelerate to use multiple CPUs. For this mode, you can also find example setup instructions in evaluation_setup.sh.

Below is an example; be mindful of specifying the arguments proper to the task you are evaluating, and note that the model value here serves only to document the experiment.

accelerate launch main.py \
  --tasks mbpp \
  --allow_code_execution=True \
  --generations_path generations.json \
  --model incoder-temperature-08

Implementing new tasks

To implement a new task in this evaluation harness, see the guide in docs/guide.md. There are also contribution guidelines in CONTRIBUTING.md.

Documentation

We provide documentation for the existing benchmarks and how the evaluation is performed in docs/README.md.

Remarks

  • Currently, we run evaluation in parallel across multiple GPUs using accelerate; this assumes that the model fits on a single GPU.
  • Please note that this evaluation harness tries to cover a wide range of models, but there may still be room for improvement for individual ones: some might require different prompt engineering or post-processing of the code generations.
  • For scores from ongoing experiments, please refer to example_scores/README.md.

Acknowledgements

We thank EleutherAI for their work on lm-evaluation-harness, which inspired this repository.

