
fmeval's Issues

[Feature] Support string output path in `EvalAlgorithmInterface.evaluate(save="...")`

Today the save parameter of EvalAlgorithmInterface.evaluate() is just a boolean.

From hunting around, I found that the output path is taken either from an environment variable or else a default under /tmp... and that it can also be overridden by setting the obviously-supposed-to-be-private property eval_algo._eval_results_path.

IMO it's harder and less obvious than it should be for a developer using the library to save results in a folder they want. It'd be much easier if we could support eval_algo.evaluate(save="my/cool/folder"), ideally creating the provided folder automatically if it doesn't already exist.
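For reference, a minimal sketch of the workaround described above (QAAccuracy, my_model_runner and my_data_config are just placeholders for whatever evaluation you're running; _eval_results_path is private, so this may break without notice):

import os

from fmeval.eval_algorithms.qa_accuracy import QAAccuracy

output_dir = "my/cool/folder"
os.makedirs(output_dir, exist_ok=True)  # evaluate() won't create the folder for us today

eval_algo = QAAccuracy()
eval_algo._eval_results_path = output_dir  # private attribute, liable to change
eval_output = eval_algo.evaluate(model=my_model_runner, dataset_config=my_data_config, save=True)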

Dependency issues

Hi all,
I am trying to install the package on my Mac via pip install fmeval, but installation fails because of problems with the tokenizers package. It seems that the Rust compilation fails. Is there a way to install without this dependency?

Failed running ./devtool: line 14: pre-commit: command not found

Using MacOS

(.venv) (base) 682f678a18ea:fmeval gili$ sh +x ./devtool all
+ set +x
CD -> /Users/gili/dev/fmeval
==================
all
==================
OS Type: darwin22
-n Python version: 
Python 3.10.0

Detected darwin OS Type, setting OBJC_DISABLE_INITIALIZE_FORK_SAFETY in env

Lint checks
-e ===========

1. pre-commit hooks
===================
./devtool: line 14: pre-commit: command not found

[Feature] Streaming results/progress and summary metrics for faster feedback

I recently helped build an app for data-driven prompt engineering, in which users run fmeval-based evaluations for fast feedback to help refine prompt templates.

One challenge I noticed is that batch evaluation delays this feedback. Today, users have to trade off speed against quality of results when sizing their input dataset - or application designers have to implement some kind of chunking before fmeval.

To accelerate workflows like these, it would be useful if we could start receiving results ASAP while the batch job runs (including point-in-time summary metrics) so an app could display intermediate progress. That way, a prompt engineer could identify obviously-underperforming changes early and revert the change + abort the full evaluation.

A caveat to this though: I'm not aware of a common standard pattern yet for this kind of progress callback/hook in Python's synchronous-by-default ecosystem... Or how Ray might affect that.
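For anyone else needing faster feedback today, a rough sketch of the chunking workaround mentioned above (the chunk size, file names and document/summary field names are illustrative; eval_algo and model_runner are assumed to be set up already):

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig

CHUNK_SIZE = 50  # illustrative trade-off between feedback latency and score stability

with open("dataset.jsonl") as f:
    records = f.readlines()

for chunk_ix in range(0, len(records), CHUNK_SIZE):
    # Write each slice of the dataset to its own JSONLines file...
    chunk_uri = f"dataset_chunk_{chunk_ix // CHUNK_SIZE}.jsonl"
    with open(chunk_uri, "w") as f:
        f.writelines(records[chunk_ix : chunk_ix + CHUNK_SIZE])

    # ...then evaluate that slice on its own, so interim results arrive per chunk:
    config = DataConfig(
        dataset_name=f"chunk_{chunk_ix // CHUNK_SIZE}",
        dataset_uri=chunk_uri,
        dataset_mime_type=MIME_TYPE_JSONLINES,
        model_input_location="document",
        target_output_location="summary",
    )
    chunk_outputs = eval_algo.evaluate(model=model_runner, dataset_config=config)
    print(f"Interim results after chunk {chunk_ix // CHUNK_SIZE}:", chunk_outputs)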

fmeval cannot be installed cleanly on the (new) SageMaker Studio Python kernel

The (newly launched at re:Invent) SageMaker Studio standard JupyterLab Python 3 kernel currently includes:

  • amazon-sagemaker-jupyter-scheduler 3.0.4 which requires aiobotocore==2.7.*
    • That aiobotocore enforces botocore>=1.31.16,<1.31.65
    • fmeval requires sagemaker = "^2.199.0" which I think depends on boto3>=1.33.3,<2.0
  • autogluon-multimodal 0.8.2 which requires transformers[sentencepiece]<4.32.0,>=4.31.0
    • fmeval pins transformers = "4.22.1"

...And so, while I'm kind of able to %pip install fmeval on it, I haven't been able to find any configuration that doesn't introduce dependency conflicts somewhere.

Can we relax the constraint to allow some newer versions of transformers? Does fmeval really need that new a version of sagemaker?

Add support for system prompt and messages API through ModelRunner.predict()

A system prompt is required for the Claude Messages API and for GPT. fmeval's current predict API doesn't support:

  1. Passing a system prompt alongside the user prompt.
  2. Passing messages to support chat prompting and multi-modal prompts.

The accuracy of these models depends on being able to separate the system prompt from the user prompt.

Current API:

class ModelRunner(ABC):
    ...
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        ...
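For illustration only (this is not an existing fmeval interface), the kind of shape I'd hope for - keeping the current contract and adding a messages-style entry point so the system prompt stays separate:

from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class ChatModelRunner(ABC):
    # Hypothetical extension of ModelRunner for messages-style APIs.

    @abstractmethod
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """Existing single-prompt contract, unchanged."""

    @abstractmethod
    def predict_messages(
        self,
        messages: List[Dict[str, str]],       # e.g. [{"role": "user", "content": "..."}]
        system_prompt: Optional[str] = None,  # kept separate, as the Messages API expects
    ) -> Tuple[Optional[str], Optional[float]]:
        """Chat-style contract so system and user prompts are not concatenated."""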

[Feature] Multi-variable prompt templates

I see from this comment that this feature may already be coming, but the linked issue was closed with an interim workaround.

For our use-case we have a multi-field dataset for example:

source_doc  | question                                                   | ref_answers
(full text) | What date was the agreement signed, in YYYY-MM-DD format?  | 2024-04-09
...         | ...                                                        | ...

...Where the final LLM prompt would combine both the source document and the question, in a (constant) template. I could easily see more general use-cases with other fields too.

Today, we're hacking around fmeval by doing prompt fulfilment as a separate step before the library. It would be much better if fmeval were able to directly process raw multi-field datasets like this, by taking a prompt template that can reference arbitrary fields from the source record.
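For anyone hitting the same thing, a rough sketch of that pre-fulfilment step (field names match the example table above; the combined "model_input" field name is arbitrary):

import json
from string import Template

# Constant template that combines several fields from each record:
PROMPT_TEMPLATE = Template("Document:\n$source_doc\n\nQuestion: $question")

with open("dataset.jsonl") as f_in, open("dataset_fulfilled.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        # Collapse the multi-field input into the single field fmeval can template on:
        record["model_input"] = PROMPT_TEMPLATE.substitute(
            source_doc=record["source_doc"], question=record["question"]
        )
        f_out.write(json.dumps(record) + "\n")

The DataConfig then points model_input_location at "model_input".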

My reason for revisiting this was Claude v3's messages API, which means we're going to have to do more sophisticated fulfilment on our side to achieve the same effect.

[Feature] JSON export/import for EvalOutput classes

I'd like to be able to easily and persistently store EvalOutput results to e.g. local disk or NoSQL databases like DynamoDB... And ideally also load them back into fmeval/Python objects.

There are several good reasons why I'd prefer to avoid just using pickle... and JSON seems like a natural fit for this kind of data, but we can't simply json.dumps() an EvalOutput object today.

It would be useful if we offered a clear mechanism to save the evaluation summary/scores to JSON, and ideally load back from JSON as well.
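As a stopgap, a minimal sketch of what I do today, assuming EvalOutput and its nested score objects are plain dataclasses that dataclasses.asdict can walk:

import dataclasses
import json


def eval_output_to_json(eval_output) -> str:
    # default=str is a blunt fallback for any field that isn't natively JSON-serializable.
    return json.dumps(dataclasses.asdict(eval_output), default=str)

Loading back is the harder half: EvalOutput(**json.loads(payload)) doesn't rebuild the nested score objects, which is exactly why a first-class to_json()/from_json() would help.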

Chinese model and content support

Hi experts,
Does fmeval support Chinese QA evaluation, and LLMs like ChatGLM and Baichuan deployed on SageMaker endpoints?
Thanks.

Evaluation Algorithm for Recommendations

Thanks for the great work.

I'm working with a customer that is looking to use LLMs for recommendation. We are proactively working on some evaluation metrics/algorithms.

Are there any plans to add any evaluation algorithms for recommendations?
I am happy to contribute and submit a PR for this, if not.

[Feature] LLM-based (QA Accuracy) eval algorithm

The metrics-based approaches in the QAAccuracy eval algorithm seem to harshly penalize verbose models (like Claude) on datasets with concise reference answers (like SQuAD).

It'd be useful if this library could provide support for LLM-based evaluation of LLM results: For example asking a model whether the reference answer and the generated answer agree or disagree. I'd imagine it working something along the lines of LlamaIndex's CorrectnessEvaluator?

As I understand it, it should be possible in theory to implement something like this by building a custom EvalAlgorithmInterface-based class, but there are a lot of design questions to consider, like:

  • Is it possible to control whether the same LLM, a different LLM, or a panel of multiple LLMs gets used for the evaluation step, versus the original answer generation?
  • Since there are lots of different ways to use LLMs for self-critique, maybe QAAccuracyByLLMCritic (for example) should be a subtype of some broader class? Certainly it'd be interesting to use LLMs to judge other aspects like relevancy, or specific aspects of tone (e.g. "did it discuss my competitor companies XYZ").
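To make the judge idea concrete, a rough sketch of the scoring step (the prompt wording and the use of an arbitrary ModelRunner as the judge are illustrative only):

JUDGE_TEMPLATE = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Do the reference answer and the model answer agree? Reply with exactly AGREE or DISAGREE."""


def llm_judge_score(judge, question: str, reference: str, candidate: str) -> float:
    # Returns 1.0 if the judge LLM says the answers agree, else 0.0.
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
    output, _log_prob = judge.predict(prompt)  # judge is any fmeval ModelRunner
    return 1.0 if output and "DISAGREE" not in output.upper() and "AGREE" in output.upper() else 0.0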

ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: "claude-3-sonnet-20240229" is not supported on this API. Please use the Messages API instead.

Hi,

I am evaluating summaries using the code below:

import json
import boto3
import os

# Bedrock client for model inference

bedrock_runtime = boto3.client('bedrock-runtime', region_name='eu-west-3')

model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
accept = "application/json"
contentType = "application/json"

from detectpdc_p_crime import base_prompt, response_pos, response_neg

base_prompt = """
Summarize the below content in half

Answers:
"""

query_prompt ="five-time world champion michelle kwan withdrew from the #### us figure skating championships on wednesday , but will petition us skating officials for the chance to compete at the #### turin olympics ."

full_prompt = base_prompt + query_prompt

aws_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 200,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": full_prompt
                }
            ]
        }
    ]
}

body = json.dumps(aws_body)

# Invoke the model

response = bedrock_runtime.invoke_model(body=body, modelId=model_id)

# Parse the invocation response

response_body = json.loads(response["body"].read())
outputs = response_body.get("content")

print(outputs)

from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

config = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri="gigaword_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary"
)

bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='completion',
    content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}'
)

eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(
    model=bedrock_model_runner,
    dataset_config=config,
    prompt_template="Human: Summarise the following text in one sentence: $model_input\n\nAssistant:\n",
    save=True,
)

print(eval_output)


Error:

raise error_class(parsed_response, operation_name)

botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: "claude-3-sonnet-20240229" is not supported on this API. Please use the Messages API instead.

The same Claude model generates summaries fine when invoked directly via invoke_model (as shown above), but the evaluation fails with this error when run through fmeval.
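For what it's worth, the content_template passed to BedrockModelRunner above is still the legacy Text Completions body (prompt / max_tokens_to_sample), which is what Claude 3 rejects here. An untested sketch of a Messages-API-shaped configuration, assuming content_template and the output JMESPath can express it:

from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Sketch only: build the Messages API body in content_template and read the generated
# text from the Messages response instead of the old "completion" field.
messages_model_runner = BedrockModelRunner(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    content_template=(
        '{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, '
        '"messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}'
    ),
    output="content[0].text",
)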

String enums not comparing as expected

Hi team!

I was surprised today when the following didn't work as expected:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms import EvalAlgorithm

get_eval_algorithm(EvalAlgorithm.QA_ACCURACY)
# Throws: "EvalAlgorithmClientError: Unknown eval algorithm QA ACCURACY"

The reason, it seems (as discussed here on StackOverflow), is that Python enums need an additional str parent class for their string values to work in comparisons - so at the moment:

print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy")  # 'False'
print(EvalAlgorithm.QA_ACCURACY == "QA_ACCURACY")  # Also 'False'
print(EvalAlgorithm.QA_ACCURACY == "QA ACCURACY")  # Also 'False' (despite the error msg above!)
print(EvalAlgorithm.QA_ACCURACY) # 'QA ACCURACY' because of the __str__ method

I propose editing fmeval.eval_algorithms.EvalAlgorithm to inherit from (str, Enum) instead of (Enum) (no other changes needed). From my testing it wouldn't break your custom __str__ method, but would allow logical comparisons to work:

print(EvalAlgorithm.QA_ACCURACY == "qa_accuracy")  # 'True'
print(EvalAlgorithm.QA_ACCURACY) # Still 'QA ACCURACY' because of the __str__ method

If so, I think the same refactor should also be applied to fmeval.eval_algorithms.ModelTask and fmeval.reporting.constants.ListType?
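Concretely, the proposed edit is just the extra base class (members and the existing custom __str__ untouched):

from enum import Enum


class EvalAlgorithm(str, Enum):   # was: class EvalAlgorithm(Enum)
    QA_ACCURACY = "qa_accuracy"
    # ...other members and the custom __str__ stay exactly as they are today...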

[Feature] EvalAlgorithmInterface.evaluate should accept a list of DataConfigs for consistency

Today EvalAlgorithmInterface.evaluate is typed to return List[EvalOutput] ("for dataset(s)", per the docstring), but its dataset_config argument only accepts Optional[DataConfig].

It seems like most concrete eval algorithms (like QAAccuracy here) either take the user's data_config for a single dataset, or take all the pre-defined DATASET_CONFIGS relevant to the evaluator's problem type.

...So the internal logic of evaluators is already set up to take multiple datasets and return multiple results, but users seem to be prevented from calling evaluate() with multiple of their own datasets for no particular reason.
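In the meantime the workaround is a caller-side loop, which is roughly what I'd hope evaluate() could do internally (sketch only):

from typing import List

from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms import EvalOutput


def evaluate_many(eval_algo, model, dataset_configs: List[DataConfig], **kwargs) -> List[EvalOutput]:
    # Caller-side loop until evaluate() accepts a list of DataConfigs directly.
    results: List[EvalOutput] = []
    for config in dataset_configs:
        results.extend(eval_algo.evaluate(model=model, dataset_config=config, **kwargs))
    return results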

[Feature] Add callback mechanism to evaluation

I'm integrating fmeval with experiment tracking solutions (MLflow for now), and the lack of a callback mechanism means that tracking can only happen after an evaluation is completed.
Drawbacks:

  • results can be recorded only once the evaluation has completed (similar to #278)
  • if the evaluation fails during the execution, the already generated values are lost

The suggested solution is to implement a callback mechanism so that results can be tracked as they're generated, simplifying integration with experiment tracking solutions.
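One possible shape for this (purely illustrative - the hook names and granularity are not an existing fmeval API):

from typing import Protocol


class EvalCallback(Protocol):
    # Hypothetical hook interface for experiment trackers like MLflow; not an fmeval API.

    def on_record_scored(self, record_index: int, scores: dict) -> None:
        """Called as each record is scored, so partial results survive a mid-run failure."""

    def on_evaluation_end(self, eval_outputs: list) -> None:
        """Called once with the final List[EvalOutput]."""

The kind of call I'd hope for, roughly: eval_algo.evaluate(model=runner, dataset_config=config, callbacks=[MlflowCallback()]), where MlflowCallback is whatever the integrating app supplies.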

[Feature] Image fields for multi-modal models

I'm trying to evaluate Claude v3's performance for some document understanding tasks, with a workflow that includes passing the image of the page in as one of the inputs.

Is fmeval considering native handling for image/multi-modal fields in input datasets?

Unable to run in Docker container (unable to register worker with raylet)

Hi team,

We're trying to build a containerized Streamlit app using fmeval, but evaluation is dying with:

INFO worker.py:1642 -- Started a local Ray instance.
core_worker.cc:203: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The Dockerfile is nothing fancy - based on the python:3.10 base image:

FROM --platform=linux/amd64 python:3.10

WORKDIR /usr/src/app
COPY src/requirements.txt ./requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY src/* ./

EXPOSE 8501

HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1

ENV AWS_DEFAULT_REGION=us-east-1

ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Our installs are also minimal:

fmeval==0.3.0
# Explicit pandas pin for: https://github.com/ray-project/ray/issues/42572
pandas<2.2.0
streamlit==1.30.0

The app works fine locally (outside of Docker). Can anybody suggest what extra configs or dependencies are needed to run properly in Docker? I'm exploring whether switching to a rayproject/ray-based image helps, but it introduces some other initial errors, and it would be much better to know what the actual requirements are than to be tied to one base image.
