Benchmark any Foundation Model (FM) on any AWS service [Amazon SageMaker, Amazon Bedrock, Amazon EKS, Bring your own endpoint etc.]
A key challenge with FMs is benchmarking their performance in terms of inference latency, throughput and cost, so as to determine which model, running on which combination of hardware and serving stack, provides the best price-performance for a given workload.

Stated as a business problem, the ask is: “What is the dollar cost per transaction for a given generative AI workload that serves a given number of users while keeping the response time under a target threshold?”

But to really answer this question, we need to answer an engineering question (an optimization problem, actually) corresponding to this business problem: “What is the minimum number of instances N, of the most cost-optimal instance type T, that are needed to serve a workload W while keeping the average transaction latency under L seconds?”

W = {R transactions per minute, average prompt token length P, average generation token length G}
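To make the arithmetic behind the business question concrete, here is a minimal back-of-the-envelope sketch (not part of FMBench; the instance count, hourly price and throughput below are hypothetical placeholders). FMBench's job is to find the N and T that satisfy the latency constraint; the dollar cost per transaction then falls out of exactly this kind of calculation.

```python
# Hypothetical sketch: given N instances of type T at a known hourly price and a
# workload W of R transactions per minute, estimate the dollar cost per transaction.
def cost_per_transaction(num_instances: int, hourly_price_usd: float,
                         transactions_per_minute: float) -> float:
    fleet_cost_per_hour = num_instances * hourly_price_usd
    transactions_per_hour = transactions_per_minute * 60
    return fleet_cost_per_hour / transactions_per_hour

# Example with placeholder numbers: 2 instances at $5.67/hour serving 600 transactions/minute.
print(f"${cost_per_transaction(2, 5.67, 600):.6f} per transaction")
```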
This foundation model benchmarking tool (a.k.a. `FMBench`) answers the above engineering question and, in turn, the original business question about how to get the best price-performance for a given workload. Here is one of the plots generated by `FMBench` to help answer the above question (the numbers on the y-axis, the transactions per minute and the latency values have been removed from the image below; you can find them in the actual plot generated on running `FMBench`).
Configuration files are available in the configs folder of this repo for the following models.

Llama3 is now available on SageMaker (read the blog post), and you can now benchmark it using `FMBench`. Here are the config files for benchmarking `Llama3-8b-instruct` and `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge` instances:
- Config file for `Llama3-8b-instruct` on `ml.p4d.24xlarge` and `ml.g5.12xlarge`
- Config file for `Llama3-70b-instruct` on `ml.p4d.24xlarge` and `ml.g5.48xlarge`
| Model | SageMaker g4dn/g5/p3 | SageMaker Inf2 | SageMaker P4 | SageMaker P5 | Bedrock On-demand throughput | Bedrock provisioned throughput |
|---|---|---|---|---|---|---|
| Anthropic Claude-3 Sonnet | | | | | ✅ | ✅ |
| Anthropic Claude-3 Haiku | | | | | ✅ | |
| Mistral-7b-instruct | ✅ | | ✅ | ✅ | ✅ | |
| Mistral-7b-AWQ | | | | ✅ | | |
| Mixtral-8x7b-instruct | | | | | ✅ | |
| Llama3-8b instruct | ✅ | | ✅ | | | |
| Llama3-70b instruct | ✅ | | ✅ | | | |
| Llama2-13b chat | ✅ | ✅ | ✅ | | ✅ | |
| Llama2-70b chat | ✅ | ✅ | ✅ | | ✅ | |
| Amazon Titan text lite | | | | | ✅ | |
| Amazon Titan text express | | | | | ✅ | |
| Cohere Command text | | | | | ✅ | |
| Cohere Command light text | | | | | ✅ | |
| AI21 J2 Mid | | | | | ✅ | |
| AI21 J2 Ultra | | | | | ✅ | |
| distilbert-base-uncased | ✅ | | | | | |
`FMBench` is a Python package for running performance benchmarks for any model deployed on Amazon SageMaker, available on Amazon Bedrock, or deployed by you on an AWS service of your choice (such as Amazon EKS or Amazon EC2), a.k.a. bring your own endpoint. For SageMaker, `FMBench` provides the option of either deploying models on SageMaker as part of its workflow and using those endpoints, or skipping the deployment step and using an endpoint you provide, to send inference requests to and measure metrics such as inference latency, error rate, transactions per second, etc. This approach allows for benchmarking different combinations of instance types (`g5`, `p4d`, `p5`, `Inf2`), inference containers (`DeepSpeed`, `TensorRT`, `HuggingFace TGI` and others) and parameters such as tensor parallelism, rolling batch, etc. Because `FMBench` is model agnostic, it can be used to test not only third-party models but also open-source models and proprietary models trained by enterprises on their own data.
`FMBench` can be run on any AWS platform where we can run Python, such as Amazon EC2, Amazon SageMaker, or even the AWS CloudShell. It is important to run this tool on an AWS platform so that internet round trip time does not get included in the end-to-end response time latency.
The workflow for `FMBench` is as follows:

```
Create configuration file
        |
        |-----> Deploy model on SageMaker/Use models on Bedrock/Bring your own endpoint
        |
        |-----> Run inference against deployed endpoint(s)
        |
        |------> Create a benchmarking report
```
- Create a dataset of different prompt sizes and select one or more such datasets for running the tests.
  - Currently `FMBench` supports datasets from LongBench and filters out individual items from the dataset based on their size in tokens (for example, prompts less than 500 tokens, between 500 and 1000 tokens, and so on). Alternatively, you can download the folder from this link to load the data.
- Deploy any model that is deployable on SageMaker on any supported instance type (`g5`, `p4d`, `Inf2`).
  - Models could either be available via SageMaker JumpStart (list available here) or not available via JumpStart but still deployable on SageMaker through the low-level boto3 (Python) SDK (Bring Your Own Script).
  - Model deployment is completely configurable in terms of the inference container to use, environment variables to set, the `serving.properties` file to provide (for inference containers such as DJL that use it) and the instance type to use.
- Benchmark FM performance in terms of inference latency, transactions per minute and dollar cost per transaction for any FM that can be deployed on SageMaker.
  - Tests are run for each combination of the configured concurrency levels (i.e. the number of transactions, or inference requests, sent to the endpoint in parallel) and datasets. For example, run multiple datasets of, say, prompt sizes between 3000 and 4000 tokens at concurrency levels of 1, 2, 4, 6, 8, etc. to test how many transactions of what token length the endpoint can handle while still maintaining an acceptable inference latency (a minimal illustration of this idea follows this list).
- Generate a report that compares and contrasts the performance of the model over different test configurations and stores the reports in an Amazon S3 bucket.
  - The report is generated in Markdown format and consists of plots, tables and text that highlight the key results and provide an overall recommendation on the best combination of instance type and serving stack to use for the model under test for a dataset of interest.
  - The report is created as an artifact of reproducible research, so that anyone with access to the model, instance type and serving stack can run the code and recreate the same results and report.
- Multiple configuration files that can be used as reference for benchmarking new models and instance types.
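To make the concurrency-sweep idea above concrete, here is a minimal sketch of sending a batch of requests in parallel at one concurrency level and recording per-request latency. This is not FMBench code; the endpoint name and payload are hypothetical, and FMBench itself drives this from its configuration file.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

smr = boto3.client("sagemaker-runtime")

def invoke_once(endpoint_name: str, payload: dict) -> float:
    """Send one inference request and return its latency in seconds."""
    start = time.perf_counter()
    response = smr.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType="application/json",
                                   Body=json.dumps(payload))
    response["Body"].read()  # consume the response before stopping the timer
    return time.perf_counter() - start

def run_at_concurrency(endpoint_name: str, payload: dict, concurrency: int) -> list[float]:
    """Fire `concurrency` requests in parallel and collect their latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(invoke_once, endpoint_name, payload) for _ in range(concurrency)]
        return [f.result() for f in futures]

# Hypothetical usage: sweep a few concurrency levels against one endpoint and payload.
for level in (1, 2, 4, 8):
    latencies = run_at_concurrency("my-llm-endpoint", {"inputs": "Hello"}, level)
    print(f"concurrency={level}, avg latency={sum(latencies) / len(latencies):.3f}s")
```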
`FMBench` is available as a Python package on PyPi and is run as a command line tool once it is installed. All data, including metrics, reports and results, is stored in an Amazon S3 bucket.
- Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run `FMBench`, and a write S3 bucket is created which will hold the metrics and reports generated by `FMBench`. The CloudFormation stack takes about 5 minutes to create.

  | AWS Region | Link |
  |---|---|
  | us-east-1 (N. Virginia) | |
  | us-west-2 (Oregon) | |
- Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the `fmbench-notebook`.
- On the `fmbench-notebook` open a Terminal and run the following commands.

  ```
  conda create --name fmbench_python311 -y python=3.11 ipykernel
  source activate fmbench_python311;
  pip install -U fmbench
  ```
- Now you are ready to run `fmbench` with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.
  - We benchmark performance for the `Llama2-7b` model on an `ml.g5.xlarge` and an `ml.g5.2xlarge` instance type, using the `huggingface-pytorch-tgi-inference` inference container. This test would take about 30 minutes to complete and cost about $0.20.
  - It uses a simple relationship of 750 words being equal to 1000 tokens; to get a more accurate representation of token counts, use the `Llama2 tokenizer` (instructions are provided in the next section). For more accurate token throughput results it is strongly recommended that you use a tokenizer specific to the model you are testing rather than the default tokenizer; see the instructions provided later in this document on how to use a custom tokenizer, and the small illustration after these steps.

    ```
    account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
    region=`aws configure get region`
    fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/config-llama2-7b-g5-quick.yml >> fmbench.log 2>&1
    ```
- Open another terminal window and do a `tail -f` on the `fmbench.log` file to see all the traces being generated at runtime.

  ```
  tail -f fmbench.log
  ```
- The generated reports and metrics are available in the `sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id>` bucket. The metrics and report files are also downloaded locally in the `results` directory (created by `FMBench`), and the benchmarking report is available as a markdown file called `report.md` in the `results` directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
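As a rough illustration of the difference between the 750-words-per-1000-tokens heuristic mentioned above and a model-specific tokenizer, consider the following sketch. It is not FMBench code; the model id is a placeholder, and downloading the Llama 2 tokenizer requires accepting Meta's license and authenticating with Hugging Face.

```python
from transformers import AutoTokenizer

text = "Benchmarking foundation models helps find the best price-performance combination."

# Quick-start heuristic: 750 words is treated as roughly 1000 tokens.
approx_tokens = len(text.split()) * 1000 / 750

# Model-specific count; the model id below is a placeholder, and the tokenizer files
# could equally be loaded from the local tokenizer directory described later.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
exact_tokens = len(tokenizer.encode(text))

print(f"heuristic: {approx_tokens:.1f} tokens, tokenizer: {exact_tokens} tokens")
```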
Follow the prerequisites below to set up your environment before running the code:
- Python 3.11: Set up a Python 3.11 virtual environment and install `FMBench`.

  ```
  python -m venv .fmbench
  pip install fmbench
  ```
- S3 buckets for test data, scripts, and results: Create two buckets within your AWS account:

  - Read bucket: This bucket contains `tokenizer files`, `prompt template`, `source data` and `deployment scripts` stored in a directory structure as shown below. `FMBench` needs to have read access to this bucket.

    ```
    s3://<read-bucket-name>
    ├── source_data/
    ├── source_data/<source-data-file-name>.json
    ├── prompt_template/
    ├── prompt_template/prompt_template.txt
    ├── scripts/
    ├── scripts/<deployment-script-name>.py
    ├── tokenizer/
    ├── tokenizer/tokenizer.json
    ├── tokenizer/config.json
    ```
    The details of the bucket structure are as follows:

    - Source Data Directory: Create a `source_data` directory that stores the dataset you want to benchmark with. `FMBench` uses `Q&A` datasets from the `LongBench dataset` or alternatively from this link. Support for bring your own dataset will be added soon.
      - Download the different files specified in the LongBench dataset into the `source_data` directory. The following is a good list to get started with:
        - `2wikimqa`
        - `hotpotqa`
        - `narrativeqa`
        - `triviaqa`

        Store these files in the `source_data` directory.
    - Prompt Template Directory: Create a `prompt_template` directory that contains a `prompt_template.txt` file. This `.txt` file contains the prompt template that your specific model supports. `FMBench` already supports the prompt template compatible with `Llama` models.
    - Scripts Directory: `FMBench` also supports a `bring your own script (BYOS)` mode for deploying models that are not natively available via SageMaker JumpStart, i.e. anything not included in this list. Here are the steps to use BYOS (an illustrative skeleton follows this prerequisites list):
      - Create a Python script to deploy your model on a SageMaker endpoint. This script needs to have a `deploy` function that `2_deploy_model.ipynb` can invoke. See `p4d_hf_tgi.py` for reference.
      - Place your deployment script in the `scripts` directory in your read bucket. If your script deploys a model directly from HuggingFace and needs to have access to a HuggingFace auth token, then create a file called `hf_token.txt` and put the auth token in that file. The `.gitignore` file in this repo has rules to not commit the `hf_token.txt` to the repo. Today, `FMBench` provides inference scripts for:
        - All SageMaker Jumpstart Models
        - Text-Generation-Inference (TGI) container supported models
        - Deep Java Library DeepSpeed container supported models

        Deployment scripts for the options above are available in the scripts directory; you can use these as reference for creating your own deployment scripts as well.
    - Tokenizer Directory: Place the `tokenizer.json`, `config.json` and any other files required for your model's tokenizer in the `tokenizer` directory. The tokenizer for your model should be compatible with the `tokenizers` package. `FMBench` uses `AutoTokenizer.from_pretrained` to load the tokenizer.

      As an example, to use the `Llama 2 Tokenizer` for counting prompt and generation tokens for the `Llama 2` family of models: accept the license here (meta approval form), download the `tokenizer.json` and `config.json` files from the Hugging Face website, and place them in the `tokenizer` directory.
  - Write bucket: All prompt payloads, model endpoint and metrics generated by `FMBench` are stored in this bucket. `FMBench` requires write permissions to store the results in this bucket. No directory structure needs to be pre-created in this bucket; everything is created by `FMBench` at runtime.

    ```
    s3://<write-bucket-name>
    ├── <test-name>
    ├── <test-name>/data
    ├── <test-name>/data/metrics
    ├── <test-name>/data/models
    ├── <test-name>/data/prompts
    ```
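Since a BYOS deployment script (as described in the Scripts Directory section above) is just a Python file exposing a `deploy` function that `2_deploy_model.ipynb` can invoke, a minimal skeleton might look like the following. This is an illustrative sketch, not the actual `p4d_hf_tgi.py`; the function signature, the container/framework versions and the config keys used here are assumptions, so mirror the reference scripts in the repo's scripts directory for the real contract.

```python
# Illustrative BYOS deployment script skeleton (not the actual p4d_hf_tgi.py).
# The deploy() signature and the config keys below are assumptions; follow the
# reference scripts in the repo for the contract FMBench actually expects.
from sagemaker.huggingface import HuggingFaceModel

def deploy(experiment_config: dict, role_arn: str) -> dict:
    """Deploy a HuggingFace-hosted model to a SageMaker endpoint and return its details."""
    model = HuggingFaceModel(
        role=role_arn,
        env={"HF_MODEL_ID": experiment_config.get("model_id", "meta-llama/Llama-2-7b-hf")},
        transformers_version="4.28",  # placeholder versions; pick ones matching your container
        pytorch_version="2.0",
        py_version="py310",
    )
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type=experiment_config.get("instance_type", "ml.g5.xlarge"),
    )
    return {"endpoint_name": predictor.endpoint_name}
```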
`FMBench` started out supporting only SageMaker endpoints, and while that is still true as far as deploying the endpoint through `FMBench` is concerned, it now also supports external endpoints and external datasets.
By default `FMBench` uses the `LongBench dataset` for testing the models, but this is not the only dataset you can test with. You may want to test with other datasets available on HuggingFace or use your own datasets for testing. You can do this by converting your dataset to the `JSON lines` format. We provide a code sample for converting any HuggingFace dataset into JSON lines format and uploading it to the S3 bucket used by `FMBench` in the `bring_your_own_dataset` notebook. Follow the steps described in the notebook to bring your own dataset for testing with `FMBench`.
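The `bring_your_own_dataset` notebook contains the authoritative steps; purely as an illustration of the JSON lines conversion idea, a HuggingFace dataset could be converted and uploaded along these lines (the dataset name, object key and bucket below are placeholders, and newer `datasets` versions may require `trust_remote_code=True` for LongBench):

```python
import io

import boto3
from datasets import load_dataset

# Placeholder dataset and bucket names; see the bring_your_own_dataset notebook for the real steps.
ds = load_dataset("THUDM/LongBench", "2wikimqa", split="test")

buffer = io.BytesIO()
ds.to_json(buffer, lines=True)  # writes the dataset as JSON lines

boto3.client("s3").put_object(
    Bucket="<read-bucket-name>",
    Key="source_data/2wikimqa.jsonl",
    Body=buffer.getvalue(),
)
```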
If you have an endpoint deployed on, say, `Amazon EKS` or `Amazon EC2`, or have your models hosted on a fully-managed service such as `Amazon Bedrock`, you can still bring your endpoint to `FMBench` and run tests against your endpoint. To do this you need to do the following:
- Create a derived class from the `FMBenchPredictor` abstract class and provide an implementation for the constructor, the `get_predictions` method and the `endpoint_name` property. See `SageMakerPredictor` for an example. Save this file locally as, say, `my_custom_predictor.py` (an illustrative skeleton follows this list).
- Upload your new Python file (`my_custom_predictor.py`) for your custom FMBench predictor to your `FMBench` read bucket and the scripts prefix specified in the `s3_read_data` section (`read_bucket` and `scripts_prefix`).
- Edit the configuration file you are using for your `FMBench` run as follows:
  - Skip the deployment step by setting the `2_deploy_model.ipynb` step under `run_steps` to `no`.
  - Set the `inference_script` under any experiment in the `experiments` section for which you want to use your new custom inference script to point to your new Python file (`my_custom_predictor.py`) that contains your custom predictor.
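A minimal skeleton of such a custom predictor might look like the following. This is an illustrative sketch only: the import path, constructor arguments and the shape of the value returned by `get_predictions` are assumptions, so mirror the actual `FMBenchPredictor` and `SageMakerPredictor` classes in the FMBench source for the real contract.

```python
# Illustrative sketch of my_custom_predictor.py; the import path and method signatures
# below are assumptions -- copy the real contract from FMBenchPredictor/SageMakerPredictor.
import time

import requests  # assumes the external endpoint is reachable over HTTP

from fmbench.scripts.fmbench_predictor import FMBenchPredictor  # assumed module path


class MyCustomPredictor(FMBenchPredictor):
    def __init__(self, endpoint_name: str, inference_spec: dict | None = None):
        self._endpoint_name = endpoint_name  # e.g. the URL of an EKS/EC2-hosted model
        self._inference_spec = inference_spec or {}

    def get_predictions(self, payload: dict) -> dict:
        """Send one inference request to the external endpoint and time it."""
        start = time.perf_counter()
        response = requests.post(self._endpoint_name, json=payload, timeout=300)
        latency = time.perf_counter() - start
        return {"response": response.text, "latency": latency}

    @property
    def endpoint_name(self) -> str:
        return self._endpoint_name
```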
- `pip install` the `FMBench` package from PyPi.
- Create a config file using one of the config files available here.
  - The configuration file is a YAML file containing configuration for all steps of the benchmarking process. It is recommended to create a copy of an existing config file and tweak it as necessary to create a new one for your experiment (a small sketch of how the main sections are read follows this list).
- Create the read and write buckets as mentioned in the prerequisites section. Mention the respective directories for your read and write buckets within the config files.
- Run the `FMBench` tool from the command line.

  ```
  # the config file path could be an S3 path or an https path
  # or even a path to a file on the local filesystem
  fmbench --config-file /path/to/config/file
  ```
- Depending upon the experiments in the config file, the `FMBench` run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the write S3 bucket set in the config file. The report is generated as a markdown file called `report.md` and is available in the metrics directory in the write S3 bucket.
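Purely as an illustration of how such a config file hangs together, the snippet below loads a copy of a config and prints the sections referenced elsewhere in this README (`run_steps`, `s3_read_data`, `experiments`). The file name is a placeholder and the exact schema should be taken from the config files in the configs folder, not from this sketch.

```python
import yaml  # pip install pyyaml

# Placeholder file name; start from a copy of one of the config files in the configs folder.
with open("my-config.yml") as f:
    config = yaml.safe_load(f)

# Section names below are the ones referenced in this README; see a real config for the full schema.
print("steps to run:", config.get("run_steps"))
print("read bucket:", config.get("s3_read_data", {}).get("read_bucket"))
for experiment in config.get("experiments", []):
    print("experiment inference script:", experiment.get("inference_script"))
```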
Here is a screenshot of the `report.md` file generated by `FMBench`.
The following steps describe how to build the `FMBench` Python package.
- Clone the `FMBench` repo from GitHub.
- Make any code changes as needed.
- Install `poetry`.

  ```
  pip install poetry
  ```

- Change directory to the `FMBench` repo directory and run poetry build.

  ```
  poetry build
  ```

- The `.whl` file is generated in the `dist` folder. Install the `.whl` in your current Python environment.

  ```
  pip install dist/fmbench-X.Y.Z-py3-none-any.whl
  ```

- Run `FMBench` as usual through the `FMBench` CLI command.
View the ISSUES on GitHub and add any that you think would be a beneficial iteration to this benchmarking harness.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.