fmeval's Introduction

Foundation Model Evaluations Library

fmeval is a library for evaluating Large Language Models (LLMs) to help you select the best LLM for your use case. The library evaluates LLMs for the following tasks:

  • Open-ended generation - The production of natural human responses to text that does not have a pre-defined structure.
  • Text summarization - The generation of a condensed summary retaining the key information contained in a longer text.
  • Question Answering - The generation of a relevant and accurate response to a question.
  • Classification - Assigning a category, such as a label or score, to text based on its content.

The library contains:

  • Algorithms to evaluate LLMs for Accuracy, Toxicity, Semantic Robustness and Prompt Stereotyping across different tasks.
  • Implementations of the ModelRunner interface. ModelRunner encapsulates the logic for invoking different types of LLMs, exposing a predict method to simplify interactions with LLMs within the eval algorithm code. We have built-in support for Amazon SageMaker Endpoints and JumpStart models. The user can extend the interface for their own model classes by implementing the predict method.
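
For illustration, here is a minimal sketch of such a custom ModelRunner. The predict signature comes from the ModelRunner interface; call_my_model is a hypothetical placeholder for however your model is hosted, and depending on your fmeval version the base class constructor may expect additional formatting arguments, so check its docstring.

from typing import Optional, Tuple

from fmeval.model_runners.model_runner import ModelRunner


def call_my_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to your own model
    # (HTTP endpoint, local inference server, third-party API, etc.).
    return "stub response to: " + prompt


class MyModelRunner(ModelRunner):
    """Sketch of a custom ModelRunner wrapping an arbitrarily hosted LLM."""

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Return the generated text and, if available, its log probability (None otherwise).
        return call_my_model(prompt), None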

Installation

fmeval is developed under Python 3.10. To install the package, simply run:

pip install fmeval

Usage

You can see examples of running evaluations on your LLMs with built-in or custom datasets in the examples folder.

The main steps for using fmeval are:

  1. Create a ModelRunner that can invoke your LLM. fmeval provides built-in support for Amazon SageMaker Endpoints and JumpStart LLMs. You can also extend the ModelRunner interface for LLMs hosted anywhere.
  2. Use any of the supported eval_algorithms.

For example,

from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

eval_algo = Toxicity(ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner)

Note: You can update the default eval config parameters for your specific use case.
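
For example, each eval algorithm takes a config object whose fields you can override instead of using the defaults. The snippet below is a sketch that assumes the installed version's ToxicityConfig exposes a model_type field for choosing the underlying toxicity detector; check the ToxicityConfig docstring for the fields actually available in your version.

from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

# Assumption: model_type selects the toxicity detector model used for scoring.
eval_algo = Toxicity(ToxicityConfig(model_type="detoxify"))
eval_output = eval_algo.evaluate(model=model_runner)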

Using a custom dataset for an evaluation

The built-in datasets are pre-configured and are used by default to compute scores in the eval algorithms. To use a custom dataset instead:

  1. Create a DataConfig for your custom dataset
config = DataConfig(
    dataset_name="custom_dataset",
    dataset_uri="./custom_dataset.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",
    target_output_location="answer",
)
  2. Use an eval algorithm with the custom dataset
eval_algo = Toxicity(ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config)
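
For reference, the custom_dataset.jsonl file referenced above would contain one JSON object per line, with field names matching model_input_location and target_output_location, for example:

{"question": "What is the capital of France?", "answer": "Paris"}
{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}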

Please refer to the developer guide and examples for more details around the usage of eval algorithms.

Telemetry

fmeval has telemetry enabled for tracking the usage of AWS-provided/hosted LLMs. This data is tracked using the number of SageMaker or JumpStart ModelRunner objects that get created. Telemetry can be disabled by setting the DISABLE_FMEVAL_TELEMETRY environment variable to true.
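
For example, to disable telemetry in a shell session before running an evaluation:

export DISABLE_FMEVAL_TELEMETRY=true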

Troubleshooting

  1. Users running fmeval on a Windows machine may encounter the error OSError: [Errno 0] AssignProcessToJobObject() failed when fmeval internally calls ray.init(). This OS error is a known Ray issue, and is detailed here. Multiple users have reported that installing Python from the official Python website rather than the Microsoft store fixes this issue. You can view more details on limitations of running Ray on Windows on Ray's webpage.

  2. If you run into the error error: can't find Rust compiler while installing fmeval on a Mac, please try running the steps below.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.72.1
rustup default 1.72.1-aarch64-apple-darwin
rustup toolchain remove stable-aarch64-apple-darwin
rm -rf $HOME/.rustup/toolchains/stable-aarch64-apple-darwin
mv $HOME/.rustup/toolchains/1.72.1-aarch64-apple-darwin $HOME/.rustup/toolchains/stable-aarch64-apple-darwin
  3. If you run into out of memory (OOM) errors, especially while running evaluations that use LLMs as evaluators, such as toxicity and summarization accuracy, it is likely that your machine does not have enough memory to load the evaluator models. By default, fmeval loads multiple copies of the model into memory to maximize parallelization, where the exact number depends on the number of cores on the machine. To reduce the number of models that get loaded in parallel, set the environment variable PARALLELIZATION_FACTOR to a value that suits your machine, as shown below.
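
For example, to reduce parallelism you might set (the value that works best depends on your machine's memory and core count):

export PARALLELIZATION_FACTOR=2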

Development

Setup and the use of devtool

Once you have created a virtual environment with Python 3.10, run the following commands to set up the development environment:

./devtool install_deps_dev
./devtool install_deps
./devtool all

Note: If you are on a Mac, the install_poetry_version devtool command may fail when running the poetry installation script. If there is a failure, you should get error logs sent to a file with a name like poetry-installer-error-cvulo5s0.log. Open the logs, and if the error message looks like the following:

dyld[10908]: Library not loaded: @loader_path/../../../../Python.framework/Versions/3.10/Python
  Referenced from: <8A5DEEDB-CE8E-325F-88B0-B0397BD5A5DE> /Users/daniezh/Library/Application Support/pypoetry/venv/bin/python3
  Reason: tried: '/Users/daniezh/Library/Application Support/pypoetry/venv/bin/../../../../Python.framework/Versions/3.10/Python' (no such file), '/Library/Frameworks/Python.framework/Versions/3.10/Python' (no such file), '/System/Library/Frameworks/Python.framework/Versions/3.10/Python' (no such file, not in dyld cache)

Traceback:

  File "<string>", line 923, in main
  File "<string>", line 562, in run

then you will need to tweak the poetry installation script and re-run it.

Steps:

  1. curl -sSL https://install.python-poetry.org > poetry_script.py
  2. Change the symlinks argument in builder = venv.EnvBuilder(clear=True, with_pip=True, symlinks=False) to True. See mionker's comment here for an explanation.
  3. python poetry_script.py --version 1.8.2 (where 1.8.2 is the version listed in devtool; this may have changed since the time of writing).
  4. Confirm installation via poetry --version

Additionally, if you already have an existing version of Poetry installed and want to install a new version, before you re-run the above command, you will need to uninstall Poetry:

curl -sSL https://install.python-poetry.org | python3 - --uninstall

Before submitting a PR, rerun ./devtool all for testing and linting. It should run without errors.

Adding python dependencies

We use poetry to manage python dependencies in this project. If you want to add a new dependency, please update the pyproject.toml file, and run the poetry update command to update the poetry.lock file (which is checked in).
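
For example, to add a hypothetical dependency some-package, add it to the [tool.poetry.dependencies] section of pyproject.toml:

[tool.poetry.dependencies]
some-package = "^1.0"

and then refresh the checked-in lock file:

poetry update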

Other than this step to add dependencies, use devtool commands for installing dependencies, linting and testing. Execute the command ./devtool without any arguments to see a list of available options.

Adding your own evaluation algorithm and/or metrics

The evaluation algorithms and metrics provided by fmeval are implemented using Transform and TransformPipeline objects. You can leverage these existing tools to similarly implement your own metrics and algorithms in a modular manner.

Here, we provide a high-level overview of what these classes represent and how they are used. Specific implementation details can be found in their respective docstrings (see src/fmeval/transforms/transform.py and src/fmeval/transforms/transform_pipeline.py).

Preface

At a high level, an evaluation algorithm takes an initial tabular dataset consisting of a number of "records" (i.e. rows) and repeatedly transforms this dataset until the dataset either contains all the evaluation metrics, or at least all the intermediate data needed to compute said metrics. The transformations that get applied to the dataset inherently operate at a per-record level, and simply get applied to every record in the dataset to transform the dataset in full.

The Transform class

We represent the concept of a record-level transformation using the Transform class. Transform is a callable class where its __call__ method takes a single argument, record, which represents the record to be transformed. A record is represented by a Python dictionary. To implement your own record-level transformation logic, create a concrete subclass of Transform and implement its __call__ method.

Example:

Let's implement a Transform for a simple, toy metric.

from typing import Any, Dict

from fmeval.transforms.transform import Transform


class NumSpaces(Transform):
    """
    Augments the input record (which contains some text data)
    with the number of spaces found in the text.
    """
    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        input_text = record["input_text"]
        record["num_spaces"] = input_text.count(" ")
        return record

One issue with this simple example is that the keys used for the input text data and the output data are both hard-coded. This generally isn't desirable, so let's improve on our running example.

class NumSpaces(Transform):
    """
    Augments the input record (which contains some text data)
    with the number of spaces found in the text.
    """

    def __init__(self, text_key, output_key):
        super().__init__(text_key, output_key)  # always need to pass all init args to superclass init
        self.text_key = text_key  # the dict key corresponding to the input text data
        self.output_key = output_key  # the dict key corresponding to the output data (i.e. number of spaces)

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        input_text = record[self.text_key]
        record[self.output_key] = input_text.count(" ")
        return record

Since __call__ only takes a single argument, record, we pass the information regarding which keys to use for input and output data to __init__ and save them as instance attributes. Note that all subclasses of Transform need to call super().__init__ with all of their __init__ arguments, due to low-level implementation details regarding how we apply the Transforms to the dataset.

The TransformPipeline class

While Transform encapsulates the logic for the record-level transformation, we still don't have a mechanism for applying the transform to a dataset. This is where TransformPipeline comes in. A TransformPipeline represents a sequence, or "pipeline", of Transform objects that you wish to apply to a dataset. After initializing a TransformPipeline with a list of Transforms, simply call its execute method on an input dataset.

Example: Here, we implement a pipeline for a very simple evaluation. The steps are:

  1. Construct LLM prompts from raw text inputs
  2. Feed the prompts to a ModelRunner to get the model outputs
  3. Compute the "number of spaces" metric we defined above
# Use the built-in utility Transform for generating prompts
gen_prompt = GeneratePrompt(
    input_keys="model_input",
    output_keys="prompt",
    prompt_template="Answer the following question: $model_input",
)

# Use the built-in utility Transform for getting model outputs
model = ... # some ModelRunner
get_model_outputs = GetModelOutputs(
    input_to_output_keys={"prompt": ["model_output"]},
    model_runner=model,
)

# Our new metric!
compute_num_spaces = NumSpaces(
    text_key="model_output",
    output_key="num_spaces",
)

my_pipeline = TransformPipeline([gen_prompt, get_model_outputs, compute_num_spaces])
dataset = ...  # load some dataset
dataset = my_pipeline.execute(dataset)

Conclusion

To implement new metrics, create a new Transform that encapsulates the logic for computing said metric. Since the logic for all evaluation algorithms can be represented as a sequence of different Transforms, implementing a new evaluation algorithm essentially amounts to defining a TransformPipeline. Please see the built-in evaluation algorithms for examples.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

fmeval's People

Contributors

amazon-auto, athewsey, bilalaws, connorads, danielezhu, dependabot[bot], franluca, jmikko, keerthanvasist, lorenzwl, malhotra18, oyangz, pinaraws, polaschwoebel, ramvegiraju, rvasahu-amazon, satish615, shrestha-bikash, xiaoyi-cheng

fmeval's Issues

[Feature] EvalAlgorithmInterface.evaluate should accept a list of DataConfigs for consistency

Today EvalAlgorithmInterface.evaluate is typed to return List[EvalOutput] ("for dataset(s)", per the docstring), but its dataset_config argument only accepts Optional[DataConfig].

It seems like most concrete eval algorithms (like QAAccuracy here) either take the user's data_config for a single dataset, or take all the pre-defined DATASET_CONFIGS relevant to the evaluator's problem type.

...So the internal logic of evaluators is set up to support providing multiple datasets and returning multiple results already, but we seem to prevent users from calling evaluate() with multiple of their own datasets for no particular reason?

Evaluation Algorithm for Recommendations

Thanks for the great work.

I'm working with a customer that is looking to use LLMs for recommendation. We are proactively working on some evaluation metrics/algorithms.

Are there any plans to add any evaluation algorithms for recommendations?
I am happy to contribute and submit a PR for this, if not.

Add support for system prompt and messages API through ModelRunner.predict()

A system prompt is required for the Claude Messages API and for GPT. The current fmeval predict API doesn't support:

  1. Passing a system prompt alongside the user prompt.
  2. Passing messages to support chat prompting and multi-modal prompts.

The accuracy of these models depends on being able to separate the system prompt from the user prompt.

Current API:

class ModelRunner(ABC):
    ...
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        ...

fmeval cannot be installed cleanly on the (new) SageMaker Studio Python kernel

The (newly launched at re:Invent) SageMaker Studio standard JupyterLab Python 3 kernel currently includes:

  • amazon-sagemaker-jupyter-scheduler 3.0.4 which requires aiobotocore==2.7.*
    • That aiobotocore enforces botocore>=1.31.16,<1.31.65
    • fmeval requires sagemaker = "^2.199.0" which I think depends on boto3>=1.33.3,<2.0
  • autogluon-multimodal 0.8.2 which requires transformers[sentencepiece]<4.32.0,>=4.31.0
    • fmeval pins transformers = "4.22.1"

...And so while I'm kind of able to %pip install fmeval on it, I haven't been able to find any configuration that doesn't introduce dependency conflicts somewhere.

Can we relax the constraint to allow some newer versions of transformers? Does fmeval really need that new a version of sagemaker?

Failed running ./devtool: line 14: pre-commit: command not found

Using MacOS

(.venv) (base) 682f678a18ea:fmeval gili$ sh +x ./devtool all
+ set +x
CD -> /Users/gili/dev/fmeval
==================
all
==================
OS Type: darwin22
-n Python version: 
Python 3.10.0

Detected darwin OS Type, setting OBJC_DISABLE_INITIALIZE_FORK_SAFETY in env

Lint checks
-e ===========

1. pre-commit hooks
===================
./devtool: line 14: pre-commit: command not found

String enums not comparing as expected

Hi team!

I was surprised today when the following didn't work as expected:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms import EvalAlgorithm

get_eval_algorithm(EvalAlgorithm.QA_ACCURACY)
# Throws: "EvalAlgorithmClientError: Unknown eval algorithm QA ACCURACY"

The reason, it seems (as discussed here on StackOverflow), is that Python string enums require an additional parent class for comparisons against their string values to work. So at the moment:

print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy")  # 'False'
print(EvalAlgorithm.QA_ACCURACY == "QA_ACCURACY")  # Also 'False'
print(EvalAlgorithm.QA_ACCURACY == "QA ACCURACY")  # Also 'False' (despite the error msg above!)
print(EvalAlgorithm.QA_ACCURACY) # 'QA ACCURACY' because of the __str__ method

I propose editing fmeval.eval_algorithms.EvalAlgorithm to inherit from (str, Enum) instead of (Enum) (no other changes needed). From my testing it wouldn't break your custom __str__ method, but would allow logical comparisons to work:

print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy")  # 'True'
print(EvalAlgorithm.QA_ACCURACY) # Still 'QA ACCURACY' because of the __str__ method

If so, I think the same refactor should also be applied to fmeval.eval_algorithms.ModelTask and fmeval.reporting.constants.ListType?
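
For reference, a minimal sketch of the proposed change (not the actual fmeval source, which defines many more members):

from enum import Enum

class EvalAlgorithm(str, Enum):  # previously: class EvalAlgorithm(Enum)
    QA_ACCURACY = "qa_accuracy"

    def __str__(self):
        # Preserves the existing human-readable form, e.g. 'QA ACCURACY'
        return self.name.replace("_", " ")

print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy")  # True with the str mixin
print(EvalAlgorithm.QA_ACCURACY)                   # Still 'QA ACCURACY'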

[Feature] Add callback mechanism to evaluation

I'm integrating fmeval with experiments tracking solutions (MLflow for now), and the lack of callback mechanisms means that the tracing can only happen after an evaluation is completed.
Drawbacks:

  • results can be recorded only once the evaluation has completed (similar to #278 )
  • if the evaluation fails during the execution, the already generated values are lost

The suggested solution is to implement a callback mechanism to track results as they're generated, simplifying integration with experiment tracking solutions.

Unable to run in Docker container (unable to register worker with raylet)

Hi team,

We're trying to build a containerized Streamlit app using fmeval, but evaluation is dying with:

INFO worker.py:1642 -- Started a local Ray instance.
core_worker.cc:203: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The Dockerfile is nothing fancy - based on python:3.10 base:

FROM --platform=linux/amd64 python:3.10

WORKDIR /usr/src/app
COPY src/requirements.txt ./requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY src/* ./

EXPOSE 8501

HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1

ENV AWS_DEFAULT_REGION=us-east-1

ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Our installs are also minimal:

fmeval==0.3.0
# Explicit pandas pin for: https://github.com/ray-project/ray/issues/42572
pandas<2.2.0
streamlit==1.30.0

The app works fine locally (outside of Docker). Can anybody suggest what extra configs or dependencies are needed to run properly in Docker? I'm exploring whether switching to a rayproject/ray-based image helps, but it introduces some other initial errors, and it would be much better to know what the actual requirements are than to be tied to one base image.

[Feature] JSON export/import for EvalOutput classes

I'd like to be able to easily and persistently store EvalOutput results to e.g. local disk or NoSQL databases like DynamoDB... And ideally also load them back into fmeval/Python objects.

There are several good reasons why I'd prefer to avoid just using pickle... and JSON seems like a natural fit for this kind of data, but we can't simply json.dumps() an EvalOutput object today.

It would be useful if we offered a clear mechanism to save the evaluation summary/scores to JSON, and ideally load back from JSON as well.

[Feature] LLM-based (QA Accuracy) eval algorithm

The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).

It'd be useful if this library could provide support for LLM-based evaluation of LLM results: For example asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator?

As I understand it should be possible in theory to implement something like this by building a custom EvalAlgorithmInterface-based class, but there are a lot of design questions to consider like:

  • Is it possible to control whether the same LLM, a different LLM, or a panel of multiple LLMs gets used for the evaluation step, versus the original answer generation?
  • Since there are lots of different ways to use LLMs for self-critique, maybe e.g. QAAccuracyByLLMCritic should be a subtype of some broader class? Certainly it'd be interesting to use LLMs to judge other aspects like relevancy, and specific aspects of tone e.g. ("did it discuss my competitor companies XYZ")

Chinese model and content support

Hi experts,
Does fmeval support Chinese QA evaluation and LLM models like ChatGLM and Baichuan deployed on SageMaker endpoints?
Thanks.

ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: "claude-3-sonnet-20240229" is not supported on this API. Please use the Messages API instead.

Hi,

I am evaluating summarization using the code below:

import json
import boto3
import os

##Bedrock clients for model inference

bedrock_runtime = boto3.client('bedrock-runtime', region_name='eu-west-3')

model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
accept = "application/json"
contentType = "application/json"

from detectpdc_p_crime import base_prompt, response_pos, response_neg

base_prompt = """
Summarize the below content in half

Answers:
"""

query_prompt ="five-time world champion michelle kwan withdrew from the #### us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the #### turin olympics ."

full_prompt = base_prompt + query_prompt

aws_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 200,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": full_prompt
                }
            ]
        }
    ]
}

body = json.dumps(aws_body)

# Invoke the model
response = bedrock_runtime.invoke_model(body=body, modelId=model_id)

# Parse the invocation response
response_body = json.loads(response["body"].read())
outputs = response_body.get("content")

print(outputs)

from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

config = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri="gigaword_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary"
)

bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='completion',
    content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}'
)

eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(
    model=bedrock_model_runner,
    dataset_config=config,
    prompt_template="Human: Summarise the following text in one sentence: $model_input\n\nAssistant:\n",
    save=True
)

print (eval_output)


Error:

raise error_class(parsed_response, operation_name)

botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: "claude-3-sonnet-20240229" is not supported on this API. Please use the Messages API instead.

Although the same Claude model generates the summaries fine when called directly via invoke_model, the evaluation fails with this error when run through fmeval.

[Feature] Support string output path in `EvalAlgorithmInterface.evaluate(save="...")`

Today the save parameter of EvalAlgorithmInterface.evaluate() is just a boolean.

From hunting around I found that the output path is taken either from an environment variable or else a default under /tmp... And that it should also be overrideable by setting the obviously-supposed-to-be-private property eval_algo._eval_results_path.

IMO it's harder and less obvious than it should be for a developer using the library to save results in a folder they want. It'd be much easier if we could support eval_algo.evaluate(save="my/cool/folder") and ideally automatically create the provided folder if it doesn't already exist?

Dependency issues

Hi all,
I am trying to install the package on my Mac via pip install fmeval, but installation fails because of problems with the tokenizers package. It seems that the Rust compilation fails. Is there a way to install without this dependency?

[Feature] Streaming results/progress and summary metrics for faster feedback

I recently helped build an app for data-driven prompt engineering, in which users run fmeval-based evaluations for fast feedback to help refine prompt templates.

One challenge I noticed is that batch evaluation delays this feedback. Today, users have to make a trade-off when sizing their input dataset - or application designers have to implement some kind of chunking before fmeval - to balance the speed and quality of results.

To accelerate workflows like these, it would be useful if we could start receiving results ASAP while the batch job runs (including point-in-time summary metrics) so an app could display intermediate progress. That way, a prompt engineer could identify obviously-underperforming changes early and revert the change + abort the full evaluation.

A caveat to this though: I'm not aware of a common standard pattern yet for this kind of progress callback/hook in Python's synchronous-by-default ecosystem... Or how Ray might affect that.

[Feature] Image fields for multi-modal models

I'm trying to evaluate Claude v3's performance for some document understanding tasks, with a workflow that includes passing the image of the page in as one of the inputs.

Is fmeval considering native handling for image/multi-modal fields in input datasets?

[Feature] Multi-variable prompt templates

I see from this comment this feature may be coming already, but the attached issue was closed with the interim workaround.

For our use-case we have a multi-field dataset for example:

source_doc  | question                                                   | ref_answers
(full text) | What date was the agreement signed, in YYYY-MM-DD format? | 2024-04-09
...         | ...                                                        | ...

...Where the final LLM prompt would combine both the source document and the question, in a (constant) template. I could easily see more general use-cases with other fields too.

Today, we're hacking around fmeval by doing prompt fulfilment as a separate step before the library. It would be much better if fmeval were able to directly process raw multi-field datasets like this, by taking a prompt template that can reference arbitrary fields from the source record.

My reason for revisiting this was Claude v3's messages API, which means we're going to have to do more sophisticated fulfilment on our side to achieve the same effect.
