aws / fmeval
Foundation Model Evaluations Library
Home Page: http://aws.github.io/fmeval
License: Apache License 2.0
Running model-comparison.ipynb, I realized the radar plot was not displaying.
I fixed it by adding the following code:
import plotly.io as pio
pio.renderers.default = 'notebook'
Today the save parameter of EvalAlgorithmInterface.evaluate() is just a boolean.
From hunting around I found that the output path is taken either from an environment variable or else a default under /tmp... And that it should also be overrideable by setting the obviously-supposed-to-be-private property eval_algo._eval_results_path.
IMO it's harder and less obvious than it should be for a developer using the library to save results to a folder of their choice. It'd be much easier if we could support eval_algo.evaluate(save="my/cool/folder"), and ideally automatically create the provided folder if it doesn't already exist.
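In the meantime, here's a rough sketch of the workaround I'm using. The EVAL_RESULTS_PATH variable name is an assumption from reading around the source, so please double-check it against fmeval's constants for your installed version:
# Workaround sketch (assumed environment variable name; verify in fmeval's constants):
import os

results_dir = "my/cool/folder"
os.makedirs(results_dir, exist_ok=True)          # create the folder up front, just in case
os.environ["EVAL_RESULTS_PATH"] = results_dir    # set before constructing the eval algorithm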
Hi all,
I am trying to install the package on my Mac via pip install fmeval, but installation fails because of problems with the tokenizers package. It seems that the Rust compilation fails. Is there a way to install without this dependency?
I'm using macOS.
(.venv) (base) 682f678a18ea:fmeval gili$ sh +x ./devtool all
+ set +x
CD -> /Users/gili/dev/fmeval
==================
all
==================
OS Type: darwin22
-n Python version:
Python 3.10.0
Detected darwin OS Type, setting OBJC_DISABLE_INITIALIZE_FORK_SAFETY in env
Lint checks
-e ===========
1. pre-commit hooks
===================
./devtool: line 14: pre-commit: command not found
I recently helped build an app for data-driven prompt engineering, in which users run fmeval-based evaluations for fast feedback to help refine prompt templates.
One challenge I noticed is that batch evaluation delays this feedback. Today, users either have to make a trade-off between speed and quality of results when sizing their input dataset, or application designers have to implement some kind of chunking before fmeval.
To accelerate workflows like these, it would be useful if we could start receiving results ASAP while the batch job runs (including point-in-time summary metrics) so an app could display intermediate progress. That way, a prompt engineer could identify obviously-underperforming changes early and revert the change + abort the full evaluation.
A caveat to this though: I'm not aware of a common standard pattern yet for this kind of progress callback/hook in Python's synchronous-by-default ecosystem... Or how Ray might affect that.
Are there plans to make this package compatible with Python 3.11 and 3.12?
The (newly launched at re:Invent) SageMaker Studio standard JupyterLab Python 3 kernel currently includes:
- amazon-sagemaker-jupyter-scheduler 3.0.4, which requires aiobotocore==2.7.* (while fmeval pins sagemaker = "^2.199.0", which I think depends on boto3>=1.33.3,<2.0)
- autogluon-multimodal 0.8.2, which requires transformers[sentencepiece]<4.32.0,>=4.31.0 (while fmeval pins transformers = "4.22.1")
...And so while I'm kind of able to %pip install fmeval on it, I haven't been able to find any configuration that doesn't introduce dependency conflicts somewhere.
Can we relax the constraint to allow some newer versions of transformers? Does fmeval really need that new a version of sagemaker?
A system prompt is required for the Claude Messages API and for GPT. The current fmeval predict API doesn't support one:
Current API:
class ModelRunner(ABC):
...
def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
I see from this comment that this feature may be coming already, but the attached issue was closed with the interim workaround.
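A hypothetical sketch of what an extended signature could look like (this is just an illustration of the ask, not the actual fmeval API; the system_prompt parameter name is my own):
from abc import ABC, abstractmethod
from typing import Optional, Tuple

class ModelRunner(ABC):
    @abstractmethod
    def predict(
        self,
        prompt: str,
        system_prompt: Optional[str] = None,  # hypothetical addition
    ) -> Tuple[Optional[str], Optional[float]]:
        """Return (generated text, log probability), passing system_prompt through to the model API when given."""
        ...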
For our use-case we have a multi-field dataset for example:
| source_doc | question | ref_answers |
|---|---|---|
| (full text) | What date was the agreement signed, in YYYY-MM-DD format? | 2024-04-09 |
| ... | ... | ... |
...Where the final LLM prompt would combine both the source document and the question, in a (constant) template. I could easily see more general use-cases with other fields too.
Today, we're hacking around fmeval by doing prompt fulfilment as a separate step before the library. It would be much better if fmeval were able to directly process raw multi-field datasets like this, by taking a prompt template that can reference arbitrary fields from the source record.
My reason for revisiting this was Claude v3's messages API, which means we're going to have to do more sophisticated fulfilment on our side to achieve the same effect.
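To make the ask concrete, here's a hypothetical usage sketch (the extra $source_doc placeholder and the dataset/field names are illustrative; fmeval doesn't support this today):
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig

config = DataConfig(
    dataset_name="contracts_qa",            # illustrative dataset
    dataset_uri="contracts_qa.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",        # today only one input field is wired through
    target_output_location="ref_answers",
)
# Hypothetical: a prompt template that references two fields from each raw record.
prompt_template = (
    "Human: Here is a document:\n$source_doc\n\n"   # $source_doc is the hypothetical extra field
    "Answer this question: $question\n\nAssistant:"
)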
I'd like to be able to easily and persistently store EvalOutput results to e.g. local disk or NoSQL databases like DynamoDB... And ideally also load them back into fmeval/Python objects.
There are several good reasons why I'd prefer to avoid just using pickle... and JSON seems like a natural fit for this kind of data, but we can't simply json.dumps() an EvalOutput object today.
It would be useful if we offered a clear mechanism to save the evaluation summary/scores to JSON, and ideally load back from JSON as well.
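As a stopgap, this rough sketch works for me, assuming EvalOutput is a plain dataclass (which it appears to be, but worth verifying against your installed version):
import dataclasses
import json

def eval_output_to_json(eval_output) -> str:
    # asdict() recurses into nested dataclasses (e.g. per-score results), producing plain
    # dicts/lists that json.dumps can handle; default=str catches anything left over.
    return json.dumps(dataclasses.asdict(eval_output), default=str, indent=2)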
Hi experts,
Does fmeval support Chinese QA evaluation, and LLM models like ChatGLM and Baichuan that are deployed on SageMaker endpoints?
Thanks.
Thanks for the great work.
I'm working with a customer that is looking to use LLMs for recommendation. We are proactively working on some evaluation metrics/algorithms.
Are there any plans to add any evaluation algorithms for recommendations?
I am happy to contribute and submit a PR for this, if not.
The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).
It'd be useful if this library could provide support for LLM-based evaluation of LLM results: for example, asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator?
As I understand it, it should be possible in theory to implement something like this by building a custom EvalAlgorithmInterface-based class, but there are a lot of design questions to consider, like:
- Should QAAccuracyByLLMCritic be a subtype of some broader class? Certainly it'd be interesting to use LLMs to judge other aspects like relevancy, and specific aspects of tone (e.g. "did it discuss my competitor companies XYZ").
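For illustration only, the core check might look something like the sketch below, using a Bedrock model as the judge (the judge model choice and prompt wording are my assumptions, and a real EvalAlgorithmInterface implementation would wrap this per-record logic):
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
judge_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"  # illustrative judge model

def answers_agree(reference: str, generated: str) -> bool:
    # Ask the judge model for a YES/NO verdict on whether the two answers agree.
    prompt = (
        "Do the following two answers agree on the facts? Reply with only YES or NO.\n"
        f"Reference answer: {reference}\nGenerated answer: {generated}"
    )
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })
    response = bedrock.invoke_model(body=body, modelId=judge_model_id)
    text = json.loads(response["body"].read())["content"][0]["text"]
    return text.strip().upper().startswith("YES")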
Hi,
I am evaluating summaries using the code below:
import json
import boto3
import os
##Bedrock clients for model inference
bedrock_runtime = boto3.client('bedrock-runtime', region_name='eu-west-3')
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
accept = "application/json"
contentType = "application/json"
# from detectpdc_p_crime import base_prompt, response_pos, response_neg  # local module; base_prompt is redefined below
base_prompt = """
Summarize the below content in half
Answers:
"""
query_prompt ="five-time world champion michelle kwan withdrew from the #### us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the #### turin olympics ."
full_prompt = base_prompt + query_prompt
aws_body = {
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": full_prompt
}
]
}
]
}
body = json.dumps(aws_body)
response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
response_body = json.loads(response["body"].read())
outputs = response_body.get("content")
print(outputs)
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
config = DataConfig(
dataset_name="gigaword_sample",
dataset_uri="gigaword_sample.jsonl",
dataset_mime_type=MIME_TYPE_JSONLINES,
model_input_location="document",
target_output_location="summary"
)
bedrock_model_runner = BedrockModelRunner(
model_id=model_id,
output='completion',
content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}'
)
eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config,
prompt_template="Human: Summarise the following text in one sentence: $model_input\n\nAssistant:\n", save=True)
print (eval_output)
The error:
raise error_class(parsed_response, operation_name)
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: "claude-3-sonnet-20240229" is not supported on this API. Please use the Messages API instead.
The same Claude model generates the summaries fine when called directly via invoke_model (as above), but I get this error when it is invoked through fmeval.
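For what it's worth, the workaround I've seen suggested is to switch the BedrockModelRunner request to the Anthropic Messages format rather than the older Text Completions format. A rough sketch follows; the content_template and the output JMESPath ("content[0].text") are my assumptions based on the response shape printed above, so please verify them against your fmeval version's examples:
# Workaround sketch for Claude 3 via the Messages API (assumed template and JMESPath):
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    content_template=(
        '{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, '
        '"messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}'
    ),
    output="content[0].text",   # extract the generated text from the Messages response
)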
It would be useful to collect system metrics, e.g. latency, during the evaluation and to provide a summary in the evaluation output.
Hi team!
I was surprised today when the following didn't work as expected:
from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms import EvalAlgorithm
get_eval_algorithm(EvalAlgorithm.QA_ACCURACY)
# Throws: "EvalAlgorithmClientError: Unknown eval algorithm QA ACCURACY"
The reason, it seems (as discussed here on StackOverflow), is that Python string enums require an additional parent class for their string values to work in comparisons. So at the moment:
print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy") # 'False'
print(EvalAlgorithm.QA_ACCURACY == "QA_ACCURACY") # Also 'False'
print(EvalAlgorithm.QA_ACCURACY == "QA ACCURACY") # Also 'False' (despite the error msg above!)
print(EvalAlgorithm.QA_ACCURACY) # 'QA ACCURACY' because of the __str__ method
I propose editing fmeval.eval_algorithms.EvalAlgorithm to inherit from (str, Enum) instead of (Enum) (no other changes needed). From my testing it wouldn't break your custom __str__ method, but would allow logical comparisons to work:
print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy") # 'True'
print(EvalAlgorithm.QA_ACCURACY) # Still 'QA ACCURACY' because of the __str__ method
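An illustrative sketch of the proposed change (the __str__ body here is my guess at reproducing the current "QA ACCURACY"-style display, not the actual fmeval source):
from enum import Enum

class EvalAlgorithm(str, Enum):   # add str as a parent alongside Enum
    QA_ACCURACY = "qa_accuracy"

    def __str__(self):
        # Keep a human-readable display, e.g. "QA ACCURACY"
        return self.value.replace("_", " ").upper()

print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy")  # True with the str mixin
print(EvalAlgorithm.QA_ACCURACY)                   # Still "QA ACCURACY"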
If so, I think the same refactor should also be applied to fmeval.eval_algorithms.ModelTask and fmeval.reporting.constants.ListType?
Hi, I'm getting an OOM error when running on an m5.large (8 GB RAM). What's the minimum memory required? It would be worth putting this in the README.
Today EvalAlgorithmInterface.evaluate is typed to return List[EvalOutput] ("for dataset(s)", per the docstring), but its dataset_config argument only accepts Optional[DataConfig].
It seems like most concrete eval algorithms (like QAAccuracy here) either take the user's data_config for a single dataset, or take all the pre-defined DATASET_CONFIGS relevant to the evaluator's problem type.
...So the internal logic of evaluators is set up to support providing multiple datasets and returning multiple results already, but we seem to prevent users from calling evaluate() with multiple of their own datasets for no particular reason?
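A hypothetical sketch of the requested call pattern (list support is not in the current API; the dataset details are illustrative, and model_runner is whatever ModelRunner you would normally pass in):
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy

configs = [
    DataConfig(dataset_name="set_a", dataset_uri="a.jsonl",
               dataset_mime_type=MIME_TYPE_JSONLINES,
               model_input_location="question", target_output_location="answer"),
    DataConfig(dataset_name="set_b", dataset_uri="b.jsonl",
               dataset_mime_type=MIME_TYPE_JSONLINES,
               model_input_location="question", target_output_location="answer"),
]
eval_algo = QAAccuracy()
# Hypothetical: dataset_config accepts a list, and one EvalOutput is returned per dataset.
eval_outputs = eval_algo.evaluate(model=model_runner, dataset_config=configs)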
I'm integrating fmeval with experiment tracking solutions (MLflow for now), and the lack of callback mechanisms means that tracking can only happen after an evaluation is completed.
Drawbacks:
The suggested solution is to implement a callback mechanism to be able to track results as they're generated, simplifying integration with experiment tracking solutions.
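To illustrate the shape of the ask (the on_record_scored hook is purely hypothetical, not an existing fmeval API):
from typing import Callable, Dict

def make_mlflow_callback() -> Callable[[int, Dict[str, float]], None]:
    import mlflow

    def on_record_scored(record_index: int, scores: Dict[str, float]) -> None:
        # Log each per-record score as a step-wise metric so the experiment
        # tracker shows progress while the evaluation is still running.
        for name, value in scores.items():
            mlflow.log_metric(name, value, step=record_index)

    return on_record_scored

# Hypothetical: evaluate() would accept and invoke this as each record is scored, e.g.
# eval_algo.evaluate(..., on_record_scored=make_mlflow_callback())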
I'm trying to evaluate Claude v3's performance for some document understanding tasks, with a workflow that includes passing the image of the page in as one of the inputs.
Is fmeval considering native handling for image/multi-modal fields in input datasets?
Hi team,
We're trying to build a containerized Streamlit app using fmeval, but evaluation is dying with:
INFO worker.py:1642 -- Started a local Ray instance.
core_worker.cc:203: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
The Dockerfile is nothing fancy, based on the python:3.10 base image:
FROM --platform=linux/amd64 python:3.10
WORKDIR /usr/src/app
COPY src/requirements.txt ./requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY src/* ./
EXPOSE 8501
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1
ENV AWS_DEFAULT_REGION=us-east-1
ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
Our installs are also minimal:
fmeval==0.3.0
# Explicit pandas pin for: https://github.com/ray-project/ray/issues/42572
pandas<2.2.0
streamlit==1.30.0
The app works fine locally (outside of Docker). Can anybody suggest what extra configs or dependencies are needed to run properly in Docker? I'm exploring whether switching to a rayproject/ray-based image helps, but it introduces some other initial errors, and it would be much better to know what the actual requirements are than to be tied to one base image.