braintrustdata / autoevals

AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices.

License: MIT License

Makefile 0.50% Python 49.23% Shell 0.07% JavaScript 0.85% TypeScript 49.35%

autoevals's People

Contributors

abrenneke, ankrgyl, aphinx, bardia-pourvakil, danielericlee, dashk, davidatbraintrust, ecatkins, j13huang, manugoyal, mongodben, tara-nagar, transitive-bullshit, wong-codaio


autoevals's Issues

Factuality Evaluator failing

Tried this code snippet:

from autoevals.llm import *
import openai

openai.api_key = "sk-"
 
# Create a new LLM-based evaluator
evaluator = Factuality()
 
# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
 
result = evaluator(output, expected, input=input)
print(result)
 
# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

I see this error:
Score(name='Factuality', score=0, metadata={}, error=KeyError('usage'))
Factuality score: 0

Traceback (most recent call last):
  File "/Users/ishaanjaffer/Github/litellm/litellm/tests/test_autoeval.py", line 19, in <module>
    print(f"Factuality metadata: {result.metadata['rationale']}")
KeyError: 'rationale'

Any suggestions on how I can debug this?
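
For reference, a more defensive version of the tail of that snippet (a sketch, assuming the returned Score exposes the error attribute and a plain metadata dict, as the printed repr suggests) surfaces the underlying failure instead of raising a second KeyError:

# Continuing the snippet above: inspect result.error before touching metadata.
if result.error is not None:
    # The evaluator call itself failed (here with KeyError('usage')),
    # so metadata is empty and there is no rationale to read.
    print(f"Factuality evaluator failed: {result.error!r}")
else:
    print(f"Factuality score: {result.score}")
    # .get() avoids another KeyError if the rationale is missing for any reason.
    print(f"Factuality rationale: {result.metadata.get('rationale')}")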

General Question about the Evaluator LLM

Hi! I am wondering if it's possible to use open-source or self-deployed LLMs (and not only OpenAI) as the judge or evaluator. If so, could you please point me to an example or the part of the docs that explains this? Thanks!
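
One unofficial sketch of how this could look (assumptions: autoevals routes its judge calls through the standard openai client, which honors the OPENAI_BASE_URL and OPENAI_API_KEY environment variables, and the evaluator accepts a model argument; the URL and model name below are placeholders for your own OpenAI-compatible server, e.g. vLLM, Ollama, or a LiteLLM proxy):

import os

# Point the underlying OpenAI client at a self-hosted, OpenAI-compatible endpoint.
# Both values are placeholders for your own server and credentials.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "not-needed-for-local"

from autoevals.llm import Factuality

# Ask the evaluator to use whatever model name your server exposes.
evaluator = Factuality(model="my-local-judge-model")
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score, result.metadata)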

Context Relevancy issue with score not between 0 and 1.

Description

When using ContextRelevancy():

Correct Behavior

# INPUTS
question = "List 3 movies about sci-fi in the genre of fiction."
context = ['ex machina', 'i am mother', 'mother/android']
answer = "These three films explore the complex relationship between humans and artificial intelligence. In 'Ex Machina,' a programmer interacts with a humanoid AI, questioning consciousness and morality. 'I Am Mother' features a girl raised by a robot in a post-extinction world, who challenges her understanding of trust and the outside world when a human arrives. 'Mother/Android' follows a pregnant woman and her boyfriend navigating a post-apocalyptic landscape controlled by hostile androids, highlighting themes of survival and human resilience."

# OUTPUTS
score = 0.9459459459459459
metadata = {'relevant_sentences': [{'sentence': 'ex machina', 'reasons': []}, {'sentence': 'i am mother', 'reasons': []}, {'sentence': 'mother/android', 'reasons': []}]}

Incorrect Behavior

# INPUTS
question = "3, sci-fi, fiction, movies"
# same context and answer

# ERROR
ValueError("score (9.81081081081081) must be between 0 and 1")

That error should be handled within ContextRelevancy().
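
Until it is handled upstream, here is a workaround sketch (it assumes the ValueError propagates out of the evaluator call; safe_context_relevancy is a hypothetical wrapper name, and returning None is only a placeholder for however your harness reports a skipped score):

from autoevals.ragas import ContextRelevancy

def safe_context_relevancy(*, input, output, context, **kwargs):
    # Hypothetical wrapper: run ContextRelevancy but degrade gracefully when the
    # raw score falls outside [0, 1] instead of crashing the whole eval run.
    try:
        return ContextRelevancy()(input=input, output=output, context=context, **kwargs)
    except ValueError as err:
        # e.g. "score (9.81...) must be between 0 and 1" when the question is a bare keyword list
        print(f"ContextRelevancy failed for input {input!r}: {err}")
        return None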

(`autoevals` JS) Better support and documentation for using context-based evaluators in `Eval` run

It could be clearer how to use the evaluators that use "context" in addition to input and output in the Eval run, such as Faithfulness and ContextRelevancy.

Right now, I'm including the contexts in the metadata. I only figured this out after a few hours of poking around, since the behavior is undocumented.

Here's an annotated version of my code which worked:

import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
  openAiApiKey,
  model,
};
/**
  Evaluate whether the output is faithful to the model input.
 */
const makeAnswerFaithfulness = function (args: {
  input: string;
  output: string;
  // passing context in metadata
  metadata: { context: string[] };
}) {
  return Faithfulness({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether answer is relevant to the input.
 */
const makeAnswerRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return AnswerRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether context is relevant to the input.
 */
const makeContextRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return ContextRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

const dataset = [
  {
    input: "What is the capital of France",
    tags: ["paris"],
    metadata: {
      // including context in metadata here as well
      context: [
        "The capital of France is Paris.",
        "Berlin is the capital of Germany.",
      ],
    },
    output: "Paris is the capital of France.",
  },
  {
    input: "Who wrote Harry Potter",
    tags: ["harry-potter"],
    metadata: {
      context: [
        "Harry Potter was written by J.K. Rowling.",
        "The Lord of the Rings was written by J.R.R. Tolkien.",
      ],
    },
    output: "J.R.R. Tolkien wrote Harry Potter.",
  },
  {
    input: "What is the largest planet in our solar system",
    tags: ["jupiter"],
    metadata: {
      context: [
        "Jupiter is the largest planet in our solar system.",
        "Saturn has the largest rings in our solar system.",
      ],
    },
    output: "Saturn is the largest planet in our solar system.",
  },
];

function makeGeneratedAnswerReturner(outputs: string[]) {
  // closure over iterator
  let counter = 0;
  return async (_input: string) => {
    counter++;
    return outputs[counter - 1];
  };
}

Eval("mdb-test", {
  experimentName: "rag-metrics",
  metadata: {
    testing: true,
  },

  data: () => {
    return dataset;
  },
  task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
  scores: [makeAnswerFaithfulness, makeContextRelevance],
});

(`autoevals` JS): Better support for evaluating based on pre-generated answer

Currently, it's less than straightforward to run evals if the answer is pre-generated, or based on case-specific data beyond the input.

This is because the Eval's task() function only accepts the input string as an argument.

I think it's important to be able to evaluate against pre-generated outputs so that we can decouple the evaluation stage (in Braintrust) from the dataset generation stage, which doesn't necessarily require Braintrust.

Here's my current implementation, which relies on creating a closure for the task() function so it can iterate through the pre-generated responses:

import { Eval } from "braintrust";
import { Faithfulness, AnswerRelevancy, ContextRelevancy } from "autoevals";
import "dotenv/config";
import { strict as assert } from "assert";
assert(process.env.OPENAI_OPENAI_API_KEY, "need openai key from openai");
const openAiApiKey = process.env.OPENAI_OPENAI_API_KEY;
const model = "gpt-4o-mini";
const evaluatorLlmConf = {
  openAiApiKey,
  model,
};
/**
  Evaluate whether the output is faithful to the model input.
 */
const makeAnswerFaithfulness = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return Faithfulness({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether answer is relevant to the input.
 */
const makeAnswerRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return AnswerRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

/**
  Evaluate whether context is relevant to the input.
 */
const makeContextRelevance = function (args: {
  input: string;
  output: string;
  metadata: { context: string[] };
}) {
  return ContextRelevancy({
    input: args.input,
    output: args.output,
    context: args.metadata.context,
    ...evaluatorLlmConf,
  });
};

const dataset = [
  {
    input: "What is the capital of France",
    tags: ["paris"],
    metadata: {
      context: [
        "The capital of France is Paris.",
        "Berlin is the capital of Germany.",
      ],
    },
    output: "Paris is the capital of France.",
  },
  {
    input: "Who wrote Harry Potter",
    tags: ["harry-potter"],
    metadata: {
      context: [
        "Harry Potter was written by J.K. Rowling.",
        "The Lord of the Rings was written by J.R.R. Tolkien.",
      ],
    },
    output: "J.R.R. Tolkien wrote Harry Potter.",
  },
  {
    input: "What is the largest planet in our solar system",
    tags: ["jupiter"],
    metadata: {
      context: [
        "Jupiter is the largest planet in our solar system.",
        "Saturn has the largest rings in our solar system.",
      ],
    },
    output: "Saturn is the largest planet in our solar system.",
  },
];

// The relevant code for this issue. Note the closure.
function makeGeneratedAnswerReturner(outputs: string[]) {
  // closure over iterator
  let counter = 0;
  return async (_input: string) => {
    counter++;
    return outputs[counter - 1];
  };
}

Eval("mdb-test", {
  experimentName: "rag-metrics",
  metadata: {
    testing: true,
  },

  data: () => {
    return dataset;
  },
  task: makeGeneratedAnswerReturner(dataset.map((d) => d.output)),
  scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});

While this seems to work fine, it would be clearer and less reliant on closures (which some folks might be less familiar with) if you could pass additional data to the task function.

I think a straightforward way to do this would be to allow passing all the contents of the Data object being evaluated to the task() function.

This would give the task function a signature like:

interface Data {
  input: string;
  expected?: string;
  tags?: string[];
  metadata: Record<string, string>;
}
type TaskFunc = (input: Data) => string;

Then I could include any pre-generated answers or other logic that I want to use in the Data.metadata object. For example, this could look like:

const dataset = [
  {
    input: "What is the capital of France",
    tags: ["paris"],
    metadata: {
      context: [
        "The capital of France is Paris.",
        "Berlin is the capital of Germany.",
      ],
      output: "Paris is the capital of France.",
    },
    
  },
];

Eval("mdb-test", {
  experimentName: "rag-metrics",
  data: () => {
    return dataset;
  },
  // Now the task() func takes the whole data object
  task(data) {
    return data.metadata.output
  },
  scores: [makeAnswerFaithfulness, makeAnswerRelevance, makeContextRelevance],
});

[Feat] Can you read the OpenAI API key from the .env?

The autoevals docs do not show that openai.api_key needs to be set.

My code snippet currently looks like this; it would be nice if autoevals could read OPENAI_API_KEY from my env and use it for the request:

from autoevals.llm import *
import autoevals
import openai  # needed below to set the API key for the evaluator

# litellm completion call
import litellm
question = "which country has the highest population"
response = litellm.completion(
    model = "gpt-3.5-turbo",
    messages = [
        {
            "role": "user",
            "content": question
        }
    ],
)


# use the auto eval Factuality() evaluator
evaluator = Factuality()
openai.api_key = "" # set your openai api key for evaluator
result = evaluator(
    output=response.choices[0]["message"]["content"],
    expected="India",
    input=question
)

print(result)
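
In the meantime, this sketch is what I'd expect to work without setting the key in code (assumptions: the underlying openai client falls back to the OPENAI_API_KEY environment variable when no key is set explicitly, and the third-party python-dotenv package is used to load .env):

import os

from dotenv import load_dotenv  # pip install python-dotenv
from autoevals.llm import Factuality

# Pull OPENAI_API_KEY (and anything else) out of the local .env file.
load_dotenv()
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY missing from environment/.env"

# No explicit openai.api_key assignment: the client should pick the key up from the environment.
evaluator = Factuality()
result = evaluator(
    output="India",
    expected="India",
    input="which country has the highest population",
)
print(result)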

JS `AnswerRelevancy` bug with model configuration

When using the AnswerRelevancy evaluator from the autoevals npm package, I run into the following error when passing the OpenAI API key to the AnswerRelevancy.openAiApiKey property:

AggregateError: Found exceptions for the following scorers: makeAnswerRelevance
    at callback (/Users/ben.p/projects/chatbot/node_modules/braintrust/dist/cli.js:5113:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /Users/ben.p/projects/chatbot/node_modules/braintrust/dist/cli.js:5136:16 {
  [errors]: [
    BadRequestError: 400 Error: No API keys found (for null). You can configure API secrets at https://www.braintrust.dev/app/settings?subroute=secrets
        at APIError.generate (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/error.js:45:20)
        at OpenAI.makeStatusError (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/core.js:263:33)
        at OpenAI.makeRequest (/Users/ben.p/projects/chatbot/node_modules/autoevals/node_modules/openai/core.js:306:30)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
      status: 400,
      headers: [Object],
      request_id: undefined,
      error: undefined,
      code: undefined,
      param: undefined,
      type: undefined
    }
  ]
}

I do not have this error with the Faithfulness and ContextRelevancy evaluators.

Here is my source code: https://github.com/mongodb/chatbot/pull/450/files#diff-720d5a77593bb16d732d787d35a130f12ade9b93af9dd912d9a4706657dd6555R31
