
🧩 LLM Structured Output Benchmarks

Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc., on tasks like multi-label classification, named entity recognition, synthetic data generation, etc.

πŸ† Benchmark Results [2024-08-25]

  1. Multi-label classification

     | Framework | Model | Reliability | Latency p95 (s) |
     | --- | --- | --- | --- |
     | Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
     | Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
     | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
     | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
     | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.804* |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
     | Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |

  2. Named Entity Recognition

     | Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
     | --- | --- | --- | --- | --- | --- | --- |
     | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
     | Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
     | Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |

  3. Synthetic Data Generation

     | Framework | Model | Reliability | Latency p95 (s) | Variety |
     | --- | --- | --- | --- | --- |
     | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
     | Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
     | Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
     | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.690 | 2.354* | 0.942 |
     | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |

\* Latencies marked with an asterisk were measured locally on an NVIDIA GeForce RTX 4080 Super GPU.

πŸƒ Run the benchmark

  1. Install the requirements using pip install -r requirements.txt
  2. Set the OpenAI api key: export OPENAI_API_KEY=sk-...
  3. Run the benchmark using python -m main run-benchmark
  4. Raw results are stored in the results directory.
  5. Generate the results using:
    • Multilabel classification: python -m main generate-results
    • NER: python -m main generate-results --task ner
    • Synthetic data generation: python -m main generate-results --task synthetic_data_generation
  6. To get help on the command line arguments, add --help after the command. Eg., python -m main run-benchmark --help

🧪 Benchmark methodology

  1. Multi-label classification:
    • Task: Given a text, predict the labels associated with it.
    • Data:
      • Base data: Alexa intent detection dataset
      • The benchmark runs on synthetic data generated by running `python -m data_sources.generate_dataset generate-multilabel-data`.
      • The synthetic data is generated by sampling and combining rows from the base data so that each row carries multiple labels, with the number of labels per row drawn from a configurable distribution. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
    • Prompt: "Classify the following text: {text}"
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid labels without errors, computed as the average of all rows' percent_successful values (a computation sketch follows this list).
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
    • Experiment Details: Run each row through the framework n_runs number of times and log the percent of successful runs for each row.
  2. Named Entity Recognition
    • Task: Given a text, extract the entities present in it.
    • Data:
      • Base data: Synthetic PII Finance dataset
      • The benchmark runs on a data sample generated by running `python -m data_sources.generate_dataset generate-ner-data`.
      • The data is sampled from the base data so that the number of entities per row follows a configurable distribution. See `python -m data_sources.generate_dataset generate-ner-data --help` for more details.
    • Prompt: Extract and resolve a list of entities from the following text: {text}
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid labels without errors, computed as the average of all rows' percent_successful values.
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
      3. Precision: The micro average of the precision of the framework on the data.
      4. Recall: The micro average of the recall of the framework on the data.
      5. F1 Score: The micro average of the F1 score of the framework on the data.
    • Experiment Details: Run each row through the framework n_runs number of times and log the percent of successful runs for each row.
  3. Synthetic Data Generation
    • Task: Generate synthetic data that conforms to a Pydantic data model schema.
    • Data:
      • A two-level nested user-details Pydantic schema (an illustrative example follows this list).
    • Prompt: Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose.
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid output without errors, computed as the average of all rows' percent_successful values.
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
      3. Variety: The percentage of generated names that are unique among all names generated.
    • Experiment Details: Run each row through the framework n_runs number of times and log the percent of successful runs.
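
The metrics above can be summarized with a short sketch. This is illustrative only: the inputs (per-row percent_successful values, per-call latencies, generated names) mirror the descriptions in this section rather than the repo's exact internals.

```python
# Minimal sketch of the shared metrics, assuming plain Python lists as inputs.
import numpy as np

def reliability(percent_successful_per_row: list[float]) -> float:
    # Average of every row's percent_successful value.
    return float(np.mean(percent_successful_per_row))

def latency_p95(latencies_seconds: list[float]) -> float:
    # 95th percentile of the per-call latencies, in seconds.
    return float(np.percentile(latencies_seconds, 95))

def variety(generated_names: list[str]) -> float:
    # Fraction of generated names that are unique (synthetic data task only).
    return len(set(generated_names)) / len(generated_names)
```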

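For the synthetic data generation task, the exact schema lives in the repo; the snippet below is only an illustration of what a two-level nested user-details Pydantic model can look like (the field names here are hypothetical).

```python
from pydantic import BaseModel

class Address(BaseModel):
    # Second (nested) level of the schema.
    street: str
    city: str
    country: str

class User(BaseModel):
    # Top level of the schema; the nested Address field makes it two levels deep.
    name: str
    age: int
    address: Address
```
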
📊 Adding new data

  1. Create a new pandas dataframe pickle file with the following columns (a sketch follows this list):
    • `text`: The text to be sent to the framework
    • `labels`: List of labels associated with the text
    • See `data/multilabel_classification.pkl` for an example.
  2. Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for every framework you want to test.
  3. Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
  4. Generate the results using `python -m main generate-results`
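
A minimal sketch of step 1, assuming a multi-label classification dataset; the example row and output path are made up.

```python
import pandas as pd

# One row per input text; "labels" holds the list of ground-truth labels.
df = pd.DataFrame(
    {
        "text": ["wake me up at 7 am and play some jazz"],
        "labels": [["alarm_set", "play_music"]],
    }
)

# Hypothetical path: point source_data_pickle_path in ./config.yaml at this file.
df.to_pickle("data/my_new_dataset.pkl")
```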

πŸ—οΈ Adding a new framework

The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows (a rough skeleton is sketched after the list):

  1. Create a `.py` file in the `frameworks` directory with the name of the framework, e.g., `instructor_framework.py` for the Instructor framework.
  2. In this `.py` file, create a class that inherits `BaseFramework` from `frameworks.base`.
  3. The class should define an `__init__` method that initializes the base class. Here are the arguments the base class expects:
    • `task` (str): The task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`
    • `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    • `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    • `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
    • `retries` (int): Number of retries for the framework. Default is 0. Obtained from the `init_kwargs` in the `./config.yaml` file.
    • `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
    • `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
    • `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
  4. The class should define a `run` method that takes four arguments:
    • `task`: The task that the framework is being tested on. Obtained from the `task` key in the `./config.yaml` file, e.g., `"multilabel_classification"`
    • `n_runs`: Number of times to repeat each text
    • `expected_response`: Output expected from the framework. Use a default value of `None`
    • `inputs`: A dictionary of `{"text": str}` where `str` is the text to be sent to the framework. Use a default value of an empty dictionary `{}`
  5. This `run` method should define an inner `run_experiment` function that takes `inputs` as an argument, runs that input through the framework, and returns the output.
  6. The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base`, with `n_runs`, `expected_response` and `task` as arguments.
  7. The `run` method should call the `run_experiment` function and return the four outputs: `predictions`, `percent_successful`, `metrics` and `latencies`.
  8. Import this new class in `frameworks/__init__.py`.
  9. Add a new entry in the `./config.yaml` file with the name of the class as the key (an example entry follows this list). The yaml entry can have the following fields:
    • `task`: The task that the framework is being tested on. Allowed values are `"multilabel_classification"` and `"ner"`
    • `n_runs`: Number of times to repeat each text
    • `init_kwargs`: All the arguments that need to be passed to the `__init__` method of the class, including those mentioned in step 3 above.
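
The steps above translate roughly into the skeleton below. Treat it as an outline only: the exact signatures of `BaseFramework` and the `@experiment` decorator live in `frameworks/base.py`, and `./frameworks/instructor_framework.py` remains the authoritative reference.

```python
# frameworks/my_framework.py -- rough outline of steps 1-7, not a drop-in implementation.
from typing import Any

from frameworks.base import BaseFramework, experiment


class MyFramework(BaseFramework):
    def __init__(self, **kwargs) -> None:
        # Forwards task, prompt, llm_model, llm_model_family, retries, etc.
        # to the base class.
        super().__init__(**kwargs)
        # Initialize your structured-output client here.

    def run(
        self, task: str, n_runs: int, expected_response: Any = None, inputs: dict = {}
    ) -> tuple:
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Format self.prompt with the inputs, call the framework with
            # self.response_model, and return the parsed structured output.
            ...

        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```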

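An entry for step 9 might look like the following; the values are illustrative and the exact shape should be copied from the existing entries in `./config.yaml`.

```yaml
MyFramework:
  task: "multilabel_classification"
  n_runs: 10
  init_kwargs:
    prompt: "Classify the following text: {text}"
    llm_model: "gpt-4o-mini-2024-07-18"
    llm_model_family: "openai"
    retries: 0
    source_data_pickle_path: "data/multilabel_classification.pkl"
```
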
🧭 Roadmap

  1. Framework related tasks:

     | Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
     | --- | --- | --- | --- |
     | OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
     | Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
     | Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
     | LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
     | Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
     | Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
     | Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
     | DsPy | 💭 Planning | 💭 Planning | 💭 Planning |
     | Langchain | 💭 Planning | 💭 Planning | 💭 Planning |
  2. Others
    • Latency metrics
    • CI/CD pipeline for benchmark run automation
    • Async run

💡 Contribution guidelines

Contributions are welcome! Here are the steps to contribute:

  1. Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
  2. Once the issue is assigned to you, please submit a PR with the new framework!

🎓 Citation

To cite LLM Structured Output Benchmarks in your work, please use the following BibTeX reference:

@software{marie_stephen_leo_2024_12327267,
  author       = {Marie Stephen Leo},
  title        = {{stephenleo/llm-structured-output-benchmarks: 
                   Release for Zenodo}},
  month        = jun,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.12327267},
  url          = {https://doi.org/10.5281/zenodo.12327267}
}

πŸ™ Feedback

If this work helped you in any way, please consider giving this repository a ⭐ to give me feedback, so I can spend more time on this project.
