rbawden / lm-evaluation-harness

This project is a fork of bigscience-workshop/lm-evaluation-harness.

A framework for few-shot evaluation of autoregressive language models.

License: MIT License

lm-evaluation-harness + promptsource

Overview

This project provides a unified framework for testing causal (e.g. GPT-2, GPT-3, GPT-Neo) and seq2seq (e.g. T5, T0) language models via prompt evaluation.

As of now, all prompts are drawn from the eval-hackathon branch of promptsource, and all datasets are loaded from HuggingFace datasets.

This fork is not backwards compatible with the original evaluation harness.

Installation

git clone https://github.com/bigscience-workshop/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"

CLI Usage 🖥️

To evaluate a model (e.g. GPT-2) on NLP tasks such as SuperGLUE WiC, you can run the following command:

python main.py \
    --model_api_name 'hf-causal' \
    --model_args pretrained='gpt2' \
    --task_name 'wic' \
    --template_names 'same_sense','polysemous' \
    --device cpu

Additional arguments can be provided to the model constructor using the --model_args flag. For larger models supported by HuggingFace transformers, we provide parallelism and mixed-precision utilities through the accelerate package. For hf-causal/hf-seq2seq models, pass use_accelerate=True to --model_args to enable parallelism, and dtype=half to enable mixed precision. For finer-grained control over accelerate options, see the constructor docstrings for HuggingFaceAutoLM in huggingface.py.

python main.py \
    --model_api_name 'hf-causal' \
    --model_args use_accelerate=True,pretrained='facebook/opt-13b' \
    --task_name wnli

If you have access to the OpenAI API, you can also evaluate GPT-3 engines:

export OPENAI_API_SECRET_KEY={YOUR_KEY_HERE}
python main.py \
    --model_api_name 'openai' \
    --model_args engine='curie' \
    --task_name hans

When reporting results from eval harness, please include the task versions (shown in results["versions"]) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible.
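For example, assuming `results` is the dictionary returned by the harness (the hypothetical structure below is illustrative; only the `results["versions"]` key is documented above), the task versions can be paired with the scores for a reproducible report:

```python
import json

# Hypothetical results structure; the real dictionary returned by the harness
# may differ, but it exposes per-task versions under results["versions"].
results = {
    "results": {"wic": {"accuracy": 0.50}},
    "versions": {"wic": 1},
}

# Pair each task's scores with its version for a reproducible report.
report = {
    task: {"scores": scores, "version": results["versions"].get(task)}
    for task, scores in results["results"].items()
}
print(json.dumps(report, indent=2))
```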

Detailed Usage

usage: main.py [-h] --model_api_name MODEL_API_NAME [--model_args MODEL_ARGS] --task_name TASK_NAME
               [--template_names TEMPLATE_NAMES] [--num_fewshot NUM_FEWSHOT] [--batch_size BATCH_SIZE]
               [--device DEVICE] [--limit LIMIT] [--output_path OUTPUT_PATH] [--template_idx TEMPLATE_IDX]
               [--bootstrap_iters BOOTSTRAP_ITERS] [--no_tracking] [--use_cache]

optional arguments:
  -h, --help            show this help message and exit
  --model_api_name MODEL_API_NAME
                        Name of the model API to use. See `lm_eval.list_model_apis()` for available APIs
  --model_args MODEL_ARGS
                        Model constructor args that you'd pass into a model of type `--model_api_name`. These must
                        be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces
  --task_name TASK_NAME
                        Name of the task to use as found in the lm_eval registry. See: `lm_eval.list_tasks()`
  --task_args TASK_ARGS
                        Optional task constructor args that you'd pass into a task class of kind `--task_name`.
                        These must be comma-separated keyword args, e.g. `key1=value1,key2=value2`, with no spaces.
                        WARNING: To avoid parsing errors, ensure your strings are quoted. For example,
                            `example_separator='\n+++\n'`
                        WARNING: Values must NOT contain commas.
  --template_names TEMPLATE_NAMES
                        Comma-separated list of template names for the specified task. Example:
                        `> python main.py ... --task_name rte --template_names imply,mean`
                        - Default: `all_templates`
                        - General Selectors:
                            - `"all_templates"`: Selects all templates for the task
                            - `"original_templates"`: Selects only templates that are designed to match the original task
  --num_fewshot NUM_FEWSHOT
  --batch_size BATCH_SIZE
  --seed SEED
  --device DEVICE       The device to place your model onto, e.g. cuda:0. For large models available through the
                        HuggingFace Hub you should use `accelerate` by passing `use_accelerate=True` to
                        `--model_args`
  --limit LIMIT         Limit the number of examples to evaluate on; ONLY USE THIS FOR DEBUGGING PURPOSES
  --output_path OUTPUT_PATH
                        Use output_path as `output_filename`. For example:
                        `> python main.py ... --output_path blop`
                        # saves files into `outputs/blop.json`
                        Warning: You currently cannot change or add folder structure.
  --template_idx TEMPLATE_IDX
                        Choose template by index from available templates
  --bootstrap_iters BOOTSTRAP_ITERS
                        Iters for stderr computation
  --no_tracking         Skip carbon emission tracking
  --use_cache           Whether to cache your model's predictions or not
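Both `--model_args` and `--task_args` take comma-separated `key=value` strings with no spaces, as described above. A minimal sketch of how such a string can be parsed into keyword arguments (illustrative only, not the harness's actual parser):

```python
def parse_cli_args(arg_string: str) -> dict:
    """Parse a comma-separated 'key1=value1,key2=value2' string into a dict.

    Values must not contain commas; surrounding quotes are stripped.
    Illustrative only -- not the harness's internal implementation.
    """
    args = {}
    if not arg_string:
        return args
    for pair in arg_string.split(","):
        key, _, value = pair.partition("=")
        value = value.strip("'\"")  # allow key='value' style quoting
        # Convert boolean literals so flags like use_accelerate=True work.
        if value in ("True", "False"):
            value = value == "True"
        args[key.strip()] = value
    return args

print(parse_cli_args("use_accelerate=True,pretrained='facebook/opt-13b'"))
# -> {'use_accelerate': True, 'pretrained': 'facebook/opt-13b'}
```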

Library Usage 📖

You can also use lm_eval as a library:

import lm_eval

model = lm_eval.get_model("hf-causal", pretrained="gpt2", device="cpu")
tasks = lm_eval.get_task_list(
    "superglue_rte",
    template_names=["does this imply", "must be true"])
results = lm_eval.evaluate(model=model, tasks=tasks)

The main user-facing functions, shown above, are `lm_eval.get_model`, `lm_eval.get_task_list`, and `lm_eval.evaluate`. Some high-level convenience functions are also made available.

Gotchas 🩹

  • You must pass templates to PerplexityTasks even though they will be ignored, as models will be scored from the raw text found in the task's dataset.

  • Multi-lingual ROUGE is unsupported because general token splitting is absent from rouge-score. For multi-lingual tasks, please ignore ROUGE metrics until this is resolved. NOTE: English works as intended.

  • Task versioning is not fully integrated! If you're reporting your model's results, please include the package versions or commit IDs for this lm-evaluation-harness branch as well as the HuggingFace datasets and promptsource packages.

  • promptsource installation issue: Some prompts may be missing from the installed promptsource branch due to git-based pip installation issues. If the latest commit on the promptsource eval-hackathon branch contains a prompt you're looking for but it was not included in the version installed from our setup.py, run the following from within your environment:

    pip uninstall promptsource
    git clone --single-branch --branch eval-hackathon https://github.com/bigscience-workshop/promptsource
    cd promptsource
    pip install -e .
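For the task-versioning gotcha above, one way to capture the relevant package versions with the standard library (a sketch; adjust the package names to your environment):

```python
from importlib import metadata

def installed_versions(packages):
    """Return a {package: version} mapping, marking missing packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# Packages the Gotchas section asks you to report alongside results.
print(installed_versions(["datasets", "promptsource", "lm_eval"]))
```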

Features

  • A growing number of tasks (20+) integrated with promptsource.

  • Support for HuggingFace Causal language models, HuggingFace Seq2Seq models, and the OpenAI Completions API (GPT-3), with flexible tokenization-agnostic interfaces.

Implementing new tasks

To implement a new task in eval harness, follow the PromptSourceTask template.
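As a purely schematic illustration of the shape such a task can take (the class and method names below are hypothetical stand-ins, not the harness's real API — follow the actual PromptSourceTask template in the source tree):

```python
# Schematic sketch only: names are illustrative, not the real API.
class PromptSourceTaskSketch:
    DATASET_PATH = "super_glue"   # HuggingFace datasets identifier
    DATASET_NAME = "wic"          # dataset configuration

    def __init__(self, templates):
        # promptsource templates selected for this dataset
        self.templates = templates

    def doc_to_text(self, doc, template):
        # Apply a promptsource template to a raw example.
        return template.apply(doc)

task = PromptSourceTaskSketch(templates=["same_sense", "polysemous"])
print(task.DATASET_PATH, task.DATASET_NAME)
```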

Contributors

anishthite, anthony-dipofi, cfoster0, cjlovering, dirkgr, erictang000, jeffhsu3, jon-tow, jordiclive, kasnerz, khalidalt, kingoflolz, kkawamu1, leogao2, muennighoff, oskarvanderwal, pruksmhc, rbawden, researcher2, samsontmr, sdtblck, shashi456, stellaathena, thefazzer, thomasw21, tomlimi, tttyuntian, uyhcire, xagi-dev, zphang
