
Needle in a Haystack - Biographies Benchmark

This repo is for benchmarking LLMs' ability to extract small bits of information from long contexts. We adapted the benchmark from Greg Kamradt's original Needle in a Haystack Benchmark to our preferences.

  • Original benchmark
  • Original tweet

Aside from the benchmarking code, we also create a dataset for fine-tuning on the task at hand.

The original benchmark

In the original "Needle in a Haystack" benchmark, we extract a small bit of information, called "the needle", from a large context. The large context, called "the haystack", is a concatenation of essays by Paul Graham. The following text (the needle) is inserted into these essays at varying positions: "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.". Note that the information in the essays and the needle are largely unrelated. This may make it easier for a model to single out the needle in the haystack. The model is then given the haystack with the needle and asked "What is the best thing to do in San Francisco?", while being instructed not to "give information outside the document or repeat your findings". Note that eating a sandwich and sitting in Dolores Park on a sunny day is, based on general knowledge outside the context, understood to be a good answer to the posed question. We therefore expect models trained on large chunks of publicly available data to be predisposed to output this information.
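
The mechanics described above can be sketched in a few lines. This is an illustrative snippet, not the benchmark's actual code; the function name and the sentence-boundary heuristic are our own assumptions:

```python
# Illustrative sketch (not the repo's code): insert a needle into a haystack
# at a relative depth between 0.0 (start) and 1.0 (end), backing up to a
# sentence boundary so the needle is not dropped mid-word.

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at relative position `depth` (0..1)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    split = int(len(haystack) * depth)
    # Back up to the nearest preceding sentence boundary, if there is one.
    boundary = haystack.rfind(". ", 0, split)
    if boundary != -1:
        split = boundary + 2
    return haystack[:split] + needle + " " + haystack[split:]

essays = "Essay one about startups. Essay two about programming. Essay three about art."
needle = "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
print(insert_needle(essays, needle, 0.5))
```

Sweeping `depth` over a grid (e.g. 0.0, 0.1, …, 1.0) while also varying the haystack length is what produces the characteristic depth-vs-context-length heatmaps of this benchmark family.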

Notable changes to the original benchmark

To facilitate better understanding of our code, we link important lines here. This benchmark makes the following changes:

  • Based on the wiki_bio dataset.
    • We randomize all names so that models cannot rely on previously learned information. (link)
    • While the needle is a random biography, the haystack is a concatenation of equally random biographies. (link)
      • The intuition behind this design choice is that information in the haystack should be similar to the needle, so that the benchmark gives us a better understanding of how well-suited the model is to indexing large chunks of similar data.
  • The model must extract multiple small bits of information: (link)
    • Date of birth
    • Date of death
    • Nationality
    • Whether or not the person in question is/was a sportsperson
    • Whether or not the person in question is/was a politician
  • The model's outputs are structured as JSON or dictionaries (link)
    • This breaks down the complexity of evaluation and makes it more reliable.
    • It also reduces the cost of the benchmark.
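
To make the structured-output idea concrete, here is a hypothetical sketch of an extraction target and a simple field-wise scorer. The field names and values are illustrative placeholders, not the repo's exact schema:

```python
# Hypothetical sketch: the five fields the model must extract, and a scorer
# that credits each correctly extracted field. Field names are assumptions,
# not the repo's actual schema.
import json

expected = {
    "date_of_birth": "12 March 1956",
    "date_of_death": "4 July 2001",
    "nationality": "German",
    "is_sportsperson": False,
    "is_politician": True,
}

# A hypothetical model response: four fields correct, one wrong.
model_output = (
    '{"date_of_birth": "12 March 1956", "date_of_death": "unknown", '
    '"nationality": "German", "is_sportsperson": false, "is_politician": true}'
)

def score(expected: dict, raw_output: str) -> float:
    """Fraction of fields the model extracted correctly (0.0 to 1.0)."""
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output scores zero
    correct = sum(predicted.get(k) == v for k, v in expected.items())
    return correct / len(expected)

print(score(expected, model_output))  # 4 of 5 fields match -> 0.8
```

Because correctness reduces to field-wise equality on parsed JSON, no judge model is needed for evaluation, which is what makes scoring both more reliable and cheaper than grading free-form answers.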

Usage

Before running anything, note that the provided code is neither production grade nor a general tool. You will need to understand and modify it if you want to do anything other than reproduce our results.

We use the following workflow:

  1. Create synthetic datasets with the tools under dataset/
    • Run dataset/clean_biographies_ds.py to download the biographies dataset and clean it.
    • Run dataset/create_fine_tuning_ds.py to create fine-tuning datasets. Take a good look at the script beforehand and fit it to your system if needed. (This can query Anyscale Endpoints to create labels for the dataset.)
  2. Maybe fine-tune with your tool of choice. The datasets are in an OpenAI/Anyscale-compatible format.
  3. Fit plot_aggregated.py and plot_haystack.py to whatever models you are benchmarking.
  4. Benchmark and plot with bio_haystack_benchmark.py
    • This requires you to set your AE_API_KEY and OPENAI_API_KEY as environment variables. Comment out the relevant lines if needed.
  5. After benchmarking some models, use plot_aggregated.py to plot an overview.
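
For reference, the OpenAI/Anyscale-compatible fine-tuning format mentioned in step 2 is a JSONL file where each line is one chat conversation. The message contents below are illustrative placeholders, not records from the actual dataset:

```python
# Sketch of one record in the OpenAI/Anyscale-compatible chat fine-tuning
# format: a JSONL file where each line holds a "messages" list of
# role/content pairs. The contents here are illustrative placeholders.
import json

record = {
    "messages": [
        {"role": "system", "content": "Extract the requested fields as JSON."},
        {"role": "user", "content": "<haystack of biographies with the needle inserted>"},
        {"role": "assistant", "content": '{"date_of_birth": "12 March 1956"}'},
    ]
}

# One JSON object per line; append further records as additional lines.
with open("fine_tuning_sample.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```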

Contributors

  • arturniederfahrenhorst