Topic: llm-evaluation (Goto Github)
Something interesting about llm-evaluation.
llm-evaluation,Upload, score, and visually compare multiple LLM-graded summaries simultaneously!
User: adamcoscia
Home Page: https://arxiv.org/abs/2403.04760
llm-evaluation,The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.
Organization: agenta-ai
Home Page: http://www.agenta.ai
llm-evaluation,Template for an AI application that extracts job information from a job description using OpenAI functions and LangChain
Organization: agenta-ai
Home Page: https://agenta.ai
llm-evaluation,Evaluating LLMs with CommonGen-Lite
Organization: allenai
Home Page: https://inklab.usc.edu/CommonGen/
llm-evaluation,A collection of hands-on notebooks for LLM practitioners
User: antoniogr7
llm-evaluation,FactScoreLite is an implementation of the FactScore metric, designed for detailed accuracy assessment in text generation. This package builds upon the framework provided by the original FactScore repository, which is no longer maintained and contains outdated functions.
User: armingh2000
llm-evaluation,Python SDK for running evaluations on LLM generated responses
Organization: athina-ai
Home Page: https://docs.athina.ai
llm-evaluation,FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case, based on your own data, tasks, and prompts
Organization: aws-samples
llm-evaluation,A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
User: azminewasi
llm-evaluation,Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Organization: babelscape
Home Page: https://arxiv.org/abs/2404.08676
llm-evaluation,Cookbooks and tutorials on Literal AI
Organization: chainlit
Home Page: https://cloud.getliteral.ai/
llm-evaluation,The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"
User: chanliang
Home Page: https://arxiv.org/abs/2310.07289
llm-evaluation,The LLM Evaluation Framework
Organization: confident-ai
Home Page: https://docs.confident-ai.com/
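To give a feel for how a framework like this is typically driven from Python, here is a minimal sketch using deepeval's documented test-case/metric API. The example content, the 0.7 threshold, and the single-metric setup are arbitrary assumptions, and exact imports may vary between versions.

```python
# Minimal deepeval-style evaluation sketch (assumes deepeval is installed and an
# OPENAI_API_KEY is available for the LLM-graded metric; values are illustrative).
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One test case: the prompt sent to your LLM and the answer it actually produced.
test_case = LLMTestCase(
    input="What are the main causes of LLM hallucination?",
    actual_output=(
        "Hallucinations are mainly caused by gaps in training data "
        "and decoding strategies that favor fluency over accuracy."
    ),
)

# An LLM-graded metric with a pass/fail threshold (0.7 is an arbitrary choice).
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric over the test cases and reports pass/fail per case.
evaluate(test_cases=[test_case], metrics=[metric])
```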
llm-evaluation,Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and more. It also collects available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more; a generic sketch of the underlying LLM-as-judge pattern follows this entry.
User: deshwalmahesh
Home Page: https://arxiv.org/abs/2405.08029
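Several projects in this list (PHUDGE above, and the GPT-based graders further down) rely on the same LLM-as-judge pattern: send a question, a candidate answer, and a rubric to a judge model, then parse a score out of its reply. The sketch below is a generic illustration of that pattern using the OpenAI Python client, not code from any of these repositories; the judge model name, rubric, and 1-to-5 score range are assumptions.

```python
# Generic LLM-as-judge sketch (not taken from PHUDGE); model and rubric are arbitrary.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer: str, rubric: str) -> int:
    """Ask a judge model to score an answer from 1 (worst) to 5 (best) against a rubric."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with a single integer score from 1 (worst) to 5 (best)."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())


score = judge(
    question="Summarize the causes of the 2008 financial crisis.",
    answer="It was caused entirely by a single bank.",
    rubric="Score factual coverage and the absence of unsupported claims.",
)
print(score)
```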
llm-evaluation,Visualize LLM Evaluations for OpenAI Assistants
User: euskoog
Home Page: https://openai-assistants-evals-dash.vercel.app/
llm-evaluation,Link your OpenAI Assistants to a custom store + Evaluate Assistant responses
User: euskoog
llm-evaluation,Large Model Evaluation Experiments
Organization: evaluation-tools
llm-evaluation,Exploring the depths of LLMs 🚀
User: giacomomeloni
llm-evaluation,🐢 Open-Source Evaluation & Testing for LLMs and ML models
Organization: giskard-ai
Home Page: https://docs.giskard.ai
llm-evaluation,LLMs Evaluation
User: gurpreetkaurjethra
llm-evaluation,Official implementation for the paper *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
Organization: hkust-nlp
Home Page: https://hkust-nlp.github.io/dart-math/
llm-evaluation,DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
Organization: intuit-ai-research
llm-evaluation,[Personalize@EACL 2024] LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models.
User: ivarfresh
llm-evaluation,A framework for automatically manipulating and evaluating the political ideology of LLMs with two ideology tests: Wahl-O-Mat and Political Compass Test.
User: j0st
Home Page: https://huggingface.co/spaces/jost/PoliticalLLM
llm-evaluation,A prompt collection for testing and evaluation of LLMs.
User: kwinkunks
llm-evaluation,🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Organization: langfuse
Home Page: https://langfuse.com/docs
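To show what "observability plus evals" looks like in practice, here is a minimal tracing sketch that assumes Langfuse's drop-in OpenAI wrapper from the v2-era Python SDK; import paths and method names may differ in newer releases, and the model choice is arbitrary.

```python
# Minimal Langfuse tracing sketch (assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY,
# and OPENAI_API_KEY are set in the environment; v2-era SDK assumed).
from langfuse.openai import openai  # drop-in replacement that records traces

response = openai.chat.completions.create(
    model="gpt-4o-mini",  # arbitrary model choice
    messages=[{"role": "user", "content": "Name three LLM evaluation metrics."}],
)
print(response.choices[0].message.content)
# The call above now appears as a trace in the Langfuse UI, where scores and
# evaluations can be attached to it.
```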
llm-evaluation,A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
Organization: llm-evaluation-s-always-fatiguing
llm-evaluation,LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
User: loganrjmurphy
Home Page: http://arxiv.org/abs/2405.17216
llm-evaluation,Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
Organization: microsoft
Home Page: https://prompty.ai
llm-evaluation,Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Organization: minnesotanlp
Home Page: https://minnesotanlp.github.io/cobbler-project-page/
llm-evaluation,Code for "Prediction-Powered Ranking of Large Language Models", arXiv 2024.
Organization: networks-learning
llm-evaluation,Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs and aimed at exploring the technical frontier of generative AI.
User: onejune2018
llm-evaluation,Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Organization: parea-ai
Home Page: https://docs.parea.ai/sdk/python
llm-evaluation,TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Organization: parea-ai
Home Page: https://docs.parea.ai/sdk/typescript
llm-evaluation,A list of LLM tools and projects
User: petroivaniuk
llm-evaluation,Find better generation parameters for your LLM
User: praful932
Home Page: https://llmsearch.netlify.app
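The idea of searching for better generation parameters can be illustrated with a small grid search. The sketch below uses Hugging Face transformers directly with a toy exact-match scorer; it is a generic illustration under assumed model, parameter grids, and metric, not the llmsearch API.

```python
# Generic grid search over sampling parameters (not the llmsearch API);
# model, grids, and scoring function are arbitrary choices for illustration.
from itertools import product

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
target = "Paris"
inputs = tokenizer(prompt, return_tensors="pt")

best = None
for temperature, top_p in product([0.3, 0.7, 1.0], [0.8, 0.95]):
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=10,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    score = 1.0 if target in text else 0.0  # toy metric: does the target string appear?
    if best is None or score > best[0]:
        best = (score, temperature, top_p)

print(f"best score={best[0]} at temperature={best[1]}, top_p={best[2]}")
```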
llm-evaluation,Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Organization: promptfoo
Home Page: https://www.promptfoo.dev/
llm-evaluation,The official evaluation suite and dynamic data release for MixEval.
User: psycoy
Home Page: https://mixeval.github.io/
llm-evaluation,Framework for LLM evaluation, guardrails and security
Organization: raga-ai-hub
Home Page: https://www.raga.ai/llms
llm-evaluation,A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Organization: re-align
Home Page: https://allenai.github.io/re-align/
llm-evaluation,Open-Source Evaluation for GenAI Application Pipelines
Organization: relari-ai
Home Page: https://docs.relari.ai/
llm-evaluation,This repository contains the lab work for the Coursera course "Generative AI with Large Language Models".
User: rochitasundar
Home Page: https://www.coursera.org/account/accomplishments/certificate/8JAYVEUAQF56
llm-evaluation,Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.
Organization: rungalileo
Home Page: https://www.rungalileo.io/hallucinationindex
llm-evaluation,Awesome papers involving LLMs in Social Science.
Organization: value4ai
llm-evaluation,EnsembleX utilizes the Knapsack algorithm to optimize Large Language Model (LLM) ensembles for quality-cost trade-offs, offering tailored suggestions across various domains through a Streamlit dashboard visualization.
User: vidhyavarshanyjs
Home Page: https://ensemblex.streamlit.app
llm-evaluation,Open-LLM-Leaderboard: Open-Style Question Evaluation. Paper at https://arxiv.org/abs/2406.07545
Organization: vila-lab
Home Page: https://huggingface.co/spaces/Open-Style/OSQ-Leaderboard
llm-evaluation,Superpipe - optimized LLM pipelines for structured data
Organization: villagecomputing
Home Page: https://superpipe.ai
llm-evaluation,[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. "Compressing LLMs: The Truth is Rarely Pure and Never Simple."
Organization: vita-group
Home Page: https://arxiv.org/abs/2310.01382
llm-evaluation,Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Organization: yandex-research
Home Page: https://arxiv.org/abs/2401.06766
llm-evaluation,[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
User: zhuohaoyu