braintrustdata / autoevals
AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices.
License: MIT License
#3 fixes the current problem, but on main right now, make test fails. We should set up a basic CI/CD pipeline.
Hi,
It seems that Azure OpenAI (https://oai.azure.com/) is not supported. Is that the case?
I'd be glad to add that functionality if it does not exist yet.
Tried this code snippet:
from autoevals.llm import *
import openai
openai.api_key = "sk-"
# Create a new LLM-based evaluator
evaluator = Factuality()
# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
result = evaluator(output, expected, input=input)
print(result)
# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")
I see this error:
Score(name='Factuality', score=0, metadata={}, error=KeyError('usage'))
Factuality score: 0
Traceback (most recent call last):
File "/Users/ishaanjaffer/Github/litellm/litellm/tests/test_autoeval.py", line 19, in <module>
print(f"Factuality metadata: {result.metadata['rationale']}")
KeyError: 'rationale'
Any suggestions on how I can debug this?
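From the printed repr, metadata is an empty dict whenever the evaluator itself fails (here with KeyError('usage')), so the second print raises its own KeyError. A minimal defensive sketch, using a stand-in Score dataclass that mirrors the repr above (not autoevals' actual class):

```python
from dataclasses import dataclass, field
from typing import Optional

# Stand-in mirroring the Score repr printed above; not the real autoevals class.
@dataclass
class Score:
    name: str
    score: float
    metadata: dict = field(default_factory=dict)
    error: Optional[Exception] = None

def describe(result: Score) -> str:
    # Check for an evaluator error before touching metadata,
    # and fall back gracefully when 'rationale' is absent.
    if result.error is not None:
        return f"{result.name} failed: {result.error!r}"
    rationale = result.metadata.get("rationale", "<no rationale>")
    return f"{result.name} score: {result.score} ({rationale})"

print(describe(Score(name="Factuality", score=0, metadata={}, error=KeyError("usage"))))
# -> Factuality failed: KeyError('usage')
```

The underlying KeyError('usage') still needs a real fix, but this at least surfaces it instead of crashing on the metadata lookup.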
Hi! I am wondering if it's possible to use open-source or self-deployed LLMs (and not only OpenAI) as the judge or evaluator? If yes, could you please point me to an example or the part of the docs explaining that? Thanks!
When using ContextRelevancy():
Correct Behavior
# INPUTS
question = "List 3 movies about sci-fi in the genre of fiction."
context = ['ex machina', 'i am mother', 'mother/android']
answer = "These three films explore the complex relationship between humans and artificial intelligence. In 'Ex Machina,' a programmer interacts with a humanoid AI, questioning consciousness and morality. 'I Am Mother' features a girl raised by a robot in a post-extinction world, who challenges her understanding of trust and the outside world when a human arrives. 'Mother/Android' follows a pregnant woman and her boyfriend navigating a post-apocalyptic landscape controlled by hostile androids, highlighting themes of survival and human resilience."
# OUTPUTS
score = 0.9459459459459459
metadata = {'relevant_sentences': [{'sentence': 'ex machina', 'reasons': []}, {'sentence': 'i am mother', 'reasons': []}, {'sentence': 'mother/android', 'reasons': []}]}
Incorrect Behavior
# INPUTS
question = "3, sci-fi, fiction, movies"
# same context and answer
# ERROR
ValueError("score (9.81081081081081) must be between 0 and 1")
That error should be handled within ContextRelevancy().
It could be clearer how to use the evaluators that take "context" in addition to input and output in the Eval run, such as Faithfulness and ContextRelevancy.
Right now, I'm including contexts in the metadata. I only figured this out after a few hours of poking around, since the behavior is undocumented.
Here's an annotated version of my code which worked:
import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
openAiApiKey,
model,
};
/**
Evaluate whether the output is faithful to the model input.
*/
const makeAnswerFaithfulness = function (args: {
input: string;
output: string;
// passing context in metadata
metadata: { context: string[] };
}) {
return Faithfulness({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether answer is relevant to the input.
*/
const makeAnswerRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return AnswerRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether context is relevant to the input.
*/
const makeContextRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return ContextRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
const dataset = [
{
input: "What is the capital of France",
tags: ["paris"],
metadata: {
// including context in metadata here as well
context: [
"The capital of France is Paris.",
"Berlin is the capital of Germany.",
],
},
output: "Paris is the capital of France.",
},
{
input: "Who wrote Harry Potter",
tags: ["harry-potter"],
metadata: {
context: [
"Harry Potter was written by J.K. Rowling.",
"The Lord of the Rings was written by J.R.R. Tolkien.",
],
},
output: "J.R.R. Tolkien wrote Harry Potter.",
},
{
input: "What is the largest planet in our solar system",
tags: ["jupiter"],
metadata: {
context: [
"Jupiter is the largest planet in our solar system.",
"Saturn has the largest rings in our solar system.",
],
},
output: "Saturn is the largest planet in our solar system.",
},
];
function makeGeneratedAnswerReturner(outputs: string[]) {
// closure over iterator
let counter = 0;
return async (_input: string) => {
counter++;
return outputs[counter - 1];
};
}
Eval("mdb-test", {
experimentName: "rag-metrics",
metadata: {
testing: true,
},
data: () => {
return dataset;
},
task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
scores: [makeAnswerFaithfulness, makeContextRelevance],
});
Currently, it's less than straightforward to run evals if the answer is pre-generated, or based on case-specific data beyond the input.
This is because the Eval's task() function only accepts the input string as an argument.
I think it's important to be able to evaluate against pre-generated outputs so that we can decouple the evaluation stage (in Braintrust) from the dataset generation stage, which doesn't necessarily require Braintrust.
Here's my current implementation, which relies on creating a closure over the task() function to iterate through pre-generated responses:
import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
openAiApiKey,
model,
};
/**
Evaluate whether the output is faithful to the model input.
*/
const makeAnswerFaithfulness = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return Faithfulness({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether answer is relevant to the input.
*/
const makeAnswerRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return AnswerRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether context is relevant to the input.
*/
const makeContextRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return ContextRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
const dataset = [
{
input: "What is the capital of France",
tags: ["paris"],
metadata: {
context: [
"The capital of France is Paris.",
"Berlin is the capital of Germany.",
],
},
output: "Paris is the capital of France.",
},
{
input: "Who wrote Harry Potter",
tags: ["harry-potter"],
metadata: {
context: [
"Harry Potter was written by J.K. Rowling.",
"The Lord of the Rings was written by J.R.R. Tolkien.",
],
},
output: "J.R.R. Tolkien wrote Harry Potter.",
},
{
input: "What is the largest planet in our solar system",
tags: ["jupiter"],
metadata: {
context: [
"Jupiter is the largest planet in our solar system.",
"Saturn has the largest rings in our solar system.",
],
},
output: "Saturn is the largest planet in our solar system.",
},
];
// The relevant code for this issue. Note the closure.
function makeGeneratedAnswerReturner(outputs: string[]) {
// closure over iterator
let counter = 0;
return async (_input: string) => {
counter++;
return outputs[counter - 1];
};
}
Eval("mdb-test", {
experimentName: "rag-metrics",
metadata: {
testing: true,
},
data: () => {
return dataset;
},
task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});
While this seems to work fine, it would be clearer, and less reliant on closures (which some folks might be less familiar with), if you could pass additional data to the task function.
I think a straightforward way to do this would be to allow passing all the contents of the Data object being evaluated to the task() function.
This'd give the task function a signature like:
interface Data {
input: string;
expected?: string;
tags?: string[];
metadata: Record<string, string>;
}
type TaskFunc = (input: Data) => string;
Then I could include any pre-generated answers or other logic that I want to use in the Data.metadata object. For example, this could look like:
const dataset = [
{
input: "What is the capital of France",
tags: ["paris"],
metadata: {
context: [
"The capital of France is Paris.",
"Berlin is the capital of Germany.",
],
output: "Paris is the capital of France.",
},
},
];
Eval("mdb-test", {
experimentName: "rag-metrics",
data: () => {
return dataset;
},
// Now the task() func takes the whole data object
task(data) {
return data.metadata.output;
},
scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});
The docs on autoevals did not show that openai.api_key needs to be set.
My code snippet currently looks like this; it would be nice if autoevals could read OPENAI_API_KEY from my env and use it for the request:
from autoevals.llm import *
import autoevals
import openai
# litellm completion call
import litellm
question = "which country has the highest population"
response = litellm.completion(
model = "gpt-3.5-turbo",
messages = [
{
"role": "user",
"content": question
}
],
)
# use the auto eval Factuality() evaluator
evaluator = Factuality()
openai.api_key = "" # set your openai api key for evaluator
result = evaluator(
output=response.choices[0]["message"]["content"],
expected="India",
input=question
)
print(result)
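The fallback I have in mind could look roughly like this (the helper name is mine, not part of autoevals):

```python
import os

def resolve_openai_key(explicit_key=None):
    # Hypothetical helper: prefer an explicitly passed key, otherwise fall
    # back to the OPENAI_API_KEY environment variable, so callers never
    # have to set openai.api_key by hand.
    key = explicit_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No OpenAI key: pass one or set OPENAI_API_KEY")
    return key
```

If the evaluators called something like this at construction time, the explicit openai.api_key assignment in the snippet above would become unnecessary.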
When using the AnswerRelevancy evaluator from the autoevals npm package, I run into the following error when passing the OpenAI API key to the AnswerRelevancy.openAiApiKey property:
AggregateError: Found exceptions for the following scorers: makeAnswerRelevance
at callback (/Users/ben.p/projects/chatbot/node_modules/braintrust/dist/cli.js:5113:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async /Users/ben.p/projects/chatbot/node_modules/braintrust/dist/cli.js:5136:16 {
[errors]: [
BadRequestError: 400 Error: No API keys found (for null). You can configure API secrets at https://www.braintrust.dev/app/settings?subroute=secrets
at APIError.generate (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/error.js:45:20)
at OpenAI.makeStatusError (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/core.js:263:33)
at OpenAI.makeRequest (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/core.js:306:30)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
status: 400,
headers: [Object],
request_id: undefined,
error: undefined,
code: undefined,
param: undefined,
type: undefined
}
]
}
I do not have this error with the Faithfulness and ContextRelevancy evaluators.
Here is my source code: https://github.com/mongodb/chatbot/pull/450/files#diff-720d5a77593bb16d732d787d35a130f12ade9b93af9dd912d9a4706657dd6555R31