braintrustdata / autoevals
AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices.
License: MIT License
#3 fixes the current problem, but on main right now, make test fails. We should set up a basic CI/CD pipeline.
Hi,
It seems that Azure OpenAI (https://oai.azure.com/) is not supported. Is that the case?
I'd be glad to add that functionality if it does not exist yet.
Tried this code snippet:
from autoevals.llm import *
import openai
openai.api_key = "sk-"
# Create a new LLM-based evaluator
evaluator = Factuality()
# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
result = evaluator(output, expected, input=input)
print(result)
# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")
I see this error:
Score(name='Factuality', score=0, metadata={}, error=KeyError('usage'))
Factuality score: 0
Traceback (most recent call last):
File "/Users/ishaanjaffer/Github/litellm/litellm/tests/test_autoeval.py", line 19, in <module>
print(f"Factuality metadata: {result.metadata['rationale']}")
KeyError: 'rationale'
Any suggestions on how I can debug this?
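From the printed repr, metadata is an empty dict whenever the evaluator itself fails (here with KeyError('usage')), so the second print raises its own KeyError. A minimal defensive sketch, using a stand-in Score dataclass that mirrors the repr above (not autoevals' actual class):

```python
from dataclasses import dataclass, field
from typing import Optional

# Stand-in mirroring the Score repr printed above; not the real autoevals class.
@dataclass
class Score:
    name: str
    score: float
    metadata: dict = field(default_factory=dict)
    error: Optional[Exception] = None

def describe(result: Score) -> str:
    # Check for an evaluator error before touching metadata,
    # and fall back gracefully when 'rationale' is absent.
    if result.error is not None:
        return f"{result.name} failed: {result.error!r}"
    rationale = result.metadata.get("rationale", "<no rationale>")
    return f"{result.name} score: {result.score} ({rationale})"

print(describe(Score(name="Factuality", score=0, metadata={}, error=KeyError("usage"))))
# -> Factuality failed: KeyError('usage')
```

The underlying KeyError('usage') still needs a real fix, but this at least surfaces it instead of crashing on the metadata lookup.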
Hi! I am wondering if it's possible to use open-source or self-deployed LLMs (and not only OpenAI) as the judge or evaluator? If yes, could you please point me to an example or the part of the docs explaining that? Thanks!
When using ContextRelevancy():
Correct Behavior
# INPUTS
question = "List 3 movies about sci-fi in the genre of fiction."
context = ['ex machina', 'i am mother', 'mother/android']
answer = "These three films explore the complex relationship between humans and artificial intelligence. In 'Ex Machina,' a programmer interacts with a humanoid AI, questioning consciousness and morality. 'I Am Mother' features a girl raised by a robot in a post-extinction world, who challenges her understanding of trust and the outside world when a human arrives. 'Mother/Android' follows a pregnant woman and her boyfriend navigating a post-apocalyptic landscape controlled by hostile androids, highlighting themes of survival and human resilience."
# OUTPUTS
score = 0.9459459459459459
metadata = {'relevant_sentences': [{'sentence': 'ex machina', 'reasons': []}, {'sentence': 'i am mother', 'reasons': []}, {'sentence': 'mother/android', 'reasons': []}]}
Incorrect Behavior
# INPUTS
question = "3, sci-fi, fiction, movies"
# same context and answer
# ERROR
ValueError("score (9.81081081081081) must be between 0 and 1")
That error should be handled within ContextRelevancy().
It could be clearer how to use the evaluators that take "context" in addition to input and output in the Eval run, such as Faithfulness and ContextRelevancy.
Right now, I'm including contexts in the metadata. I only figured this out after a few hours of poking around, since the behavior is undocumented.
Here's an annotated version of my code which worked:
import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
openAiApiKey,
model,
};
/**
Evaluate whether the output is faithful to the model input.
*/
const makeAnswerFaithfulness = function (args: {
input: string;
output: string;
// passing context in metadata
metadata: { context: string[] };
}) {
return Faithfulness({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether answer is relevant to the input.
*/
const makeAnswerRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return AnswerRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether context is relevant to the input.
*/
const makeContextRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return ContextRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
const dataset = [
{
input: "What is the capital of France",
tags: ["paris"],
metadata: {
// including context in metadata here as well
context: [
"The capital of France is Paris.",
"Berlin is the capital of Germany.",
],
},
output: "Paris is the capital of France.",
},
{
input: "Who wrote Harry Potter",
tags: ["harry-potter"],
metadata: {
context: [
"Harry Potter was written by J.K. Rowling.",
"The Lord of the Rings was written by J.R.R. Tolkien.",
],
},
output: "J.R.R. Tolkien wrote Harry Potter.",
},
{
input: "What is the largest planet in our solar system",
tags: ["jupiter"],
metadata: {
context: [
"Jupiter is the largest planet in our solar system.",
"Saturn has the largest rings in our solar system.",
],
},
output: "Saturn is the largest planet in our solar system.",
},
];
function makeGeneratedAnswerReturner(outputs: string[]) {
// closure over iterator
let counter = 0;
return async (_input: string) => {
counter++;
return outputs[counter - 1];
};
}
Eval("mdb-test", {
experimentName: "rag-metrics",
metadata: {
testing: true,
},
data: () => {
return dataset;
},
task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
scores: [makeAnswerFaithfulness, makeContextRelevance],
});
Currently, it's less than straightforward to run evals if the answer is pre-generated, or based on case-specific data beyond the input.
This is because the Eval's task() function only accepts the input string as an argument.
I think it's important to be able to evaluate against pre-generated outputs so that we can decouple the evaluation stage (in Braintrust) from the dataset generation stage, which doesn't necessarily require Braintrust.
Here's my current implementation, which relies on creating a closure over the task() function to iterate through pre-generated responses:
import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
openAiApiKey,
model,
};
/**
Evaluate whether the output is faithful to the model input.
*/
const makeAnswerFaithfulness = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return Faithfulness({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether answer is relevant to the input.
*/
const makeAnswerRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return AnswerRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
/**
Evaluate whether context is relevant to the input.
*/
const makeContextRelevance = function (args: {
input: string;
output: string;
metadata: { context: string[] };
}) {
return ContextRelevancy({
input: args.input,
output: args.output,
context: args.metadata.context,
...evaluatorLlmConf,
});
};
const dataset = [
{
input: "What is the capital of France",
tags: ["paris"],
metadata: {
context: [
"The capital of France is Paris.",
"Berlin is the capital of Germany.",
],
},
output: "Paris is the capital of France.",
},
{
input: "Who wrote Harry Potter",
tags: ["harry-potter"],
metadata: {
context: [
"Harry Potter was written by J.K. Rowling.",
"The Lord of the Rings was written by J.R.R. Tolkien.",
],
},
output: "J.R.R. Tolkien wrote Harry Potter.",
},
{
input: "What is the largest planet in our solar system",
tags: ["jupiter"],
metadata: {
context: [
"Jupiter is the largest planet in our solar system.",
"Saturn has the largest rings in our solar system.",
],
},
output: "Saturn is the largest planet in our solar system.",
},
];
// The relevant code for this issue. Note the closure.
function makeGeneratedAnswerReturner(outputs: string[]) {
// closure over iterator
let counter = 0;
return async (_input: string) => {
counter++;
return outputs[counter - 1];
};
}
Eval("mdb-test", {
experimentName: "rag-metrics",
metadata: {
testing: true,
},
data: () => {
return dataset;
},
task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});
While this seems to work fine, it would be clearer, and less reliant on closures (which some folks might be less familiar with), if you could pass additional data to the task function.
I think a straightforward way to do this would be to allow passing all the contents of the Data object being evaluated to the task() function.
This'd give the task function a signature like:
interface Data {
input: string;
expected?: string;
tags?: string[];
metadata: Record<string, string>;
}
type TaskFunc = (input: Data) => string;
Then I could include any pre-generated answers or other logic that I want to use in the Data.metadata object. For example, this could look like:
const dataset = [
{
input: "What is the capital of France",
tags: ["paris"],
metadata: {
context: [
"The capital of France is Paris.",
"Berlin is the capital of Germany.",
],
output: "Paris is the capital of France.",
},
},
];
Eval("mdb-test", {
experimentName: "rag-metrics",
data: () => {
return dataset;
},
// Now the task() func takes the whole data object
task(data) {
return data.metadata.output;
},
scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});
The docs on autoevals did not show that openai.api_key needs to be set.
My code snippet currently looks like this; it would be nice if autoevals could read OPENAI_API_KEY from my env and use it for the request:
from autoevals.llm import *
import autoevals
import openai
# litellm completion call
import litellm
question = "which country has the highest population"
response = litellm.completion(
model = "gpt-3.5-turbo",
messages = [
{
"role": "user",
"content": question
}
],
)
# use the auto eval Factuality() evaluator
evaluator = Factuality()
openai.api_key = "" # set your openai api key for evaluator
result = evaluator(
output=response.choices[0]["message"]["content"],
expected="India",
input=question
)
print(result)
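The fallback I have in mind could look roughly like this (the helper name is mine, not part of autoevals):

```python
import os

def resolve_openai_key(explicit_key=None):
    # Hypothetical helper: prefer an explicitly passed key, otherwise fall
    # back to the OPENAI_API_KEY environment variable, so callers never
    # have to set openai.api_key by hand.
    key = explicit_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No OpenAI key: pass one or set OPENAI_API_KEY")
    return key
```

If the evaluators called something like this at construction time, the explicit openai.api_key assignment in the snippet above would become unnecessary.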
When using the AnswerRelevancy evaluator from the autoevals npm package, I run into the following error when passing the OpenAI API key to the AnswerRelevancy.openAiApiKey property:
AggregateError: Found exceptions for the following scorers: makeAnswerRelevance
at callback (/Users/ben.p/projects/chatbot/node_modules/braintrust/dist/cli.js:5113:19)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async /Users/ben.p/projects/chatbot/node_modules/braintrust/dist/cli.js:5136:16 {
[errors]: [
BadRequestError: 400 Error: No API keys found (for null). You can configure API secrets at https://www.braintrust.dev/app/settings?subroute=secrets
at APIError.generate (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/error.js:45:20)
at OpenAI.makeStatusError (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/core.js:263:33)
at OpenAI.makeRequest (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/core.js:306:30)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
status: 400,
headers: [Object],
request_id: undefined,
error: undefined,
code: undefined,
param: undefined,
type: undefined
}
]
}
I do not have this error with the Faithfulness and ContextRelevancy evaluators.
Here is my source code: https://github.com/mongodb/chatbot/pull/450/files#diff-720d5a77593bb16d732d787d35a130f12ade9b93af9dd912d9a4706657dd6555R31