berriai / litellm

Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs)

Home Page: https://docs.litellm.ai/docs/

License: Other

Python 91.60% Dockerfile 0.11% Shell 0.12% HTML 1.89% JavaScript 0.15% CSS 0.01% TypeScript 6.07% Smarty 0.06%
anthropic langchain langchain-python llm llmops openai

litellm's People

Contributors

bufferoverflow, canada4663, clarkbenham, coconut49, dependabot[bot], dragosmc91, geekyayush, ishaan-jaff, krrishdholakia, kylehh, manouchehri, mateocamara, maxdeichmann, mc-marcocheng, n1lanjan, nirga, psu3d0, renalu, samyxdev, shaunmaher, ti3x, toniengelhardt, udit-001, ushuz, vincelwt, vivek-athina, wallies, williamespegren, yujonglee, zakhar-kogan


litellm's Issues

expose error types

Expose error types so people can try/except based on them, instead of needing to import them from openai.

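A minimal sketch of what this could look like, assuming litellm re-exports a provider-agnostic error type (the AuthenticationError name here is an assumption, not a confirmed API):

import litellm

messages = [{"role": "user", "content": "Hey, how's it going?"}]

try:
    response = litellm.completion(model="gpt-3.5-turbo", messages=messages)
except litellm.AuthenticationError as e:  # hypothetical re-exported error type
    print(f"invalid key: {e}")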

return usage for all providers - as OpenAI does

OpenAI returns usage for requests, but other providers - e.g. Cohere - do not.

{
  "id": "chatcmpl-7hPIZUYBst5jQA4odnYgxgZ9iPZ51",
  "object": "chat.completion",
  "created": 1690579707,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm an AI language model, so I don't have feelings, but I'm here to help you with any questions or conversations you have. How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 38,
    "total_tokens": 51
  }
}

Current Cohere return value from litellm - it's missing usage, which I need in order to calculate cost ($):

{'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': cohere.Generation {
	id: 3369b9a7-5755-4a4b-9da6-1ebab7c7924a
	prompt: Hello, how are you?
	text:  I am doing well, thank you. How can I help you today?
	likelihood: None
	finish_reason: None
	token_likelihoods: None
}, 'role': 'assistant'}}]}

Add the ability to set the key based on the model

Usage

litellm.set_key(model)

Reason:
Not everyone sets their keys in OPENAI_API_KEY, e.g. openai.api_key = get_api_key("OPENAI")

litellm should handle setting the appropriate env variables for a given model
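
A minimal sketch of what set_key might do under the hood, assuming a simple model-prefix → env-var mapping; the mapping and the (model, api_key) signature are illustrative:

import os

# illustrative model-prefix -> env-var mapping
MODEL_KEY_ENV = {
    "gpt-": "OPENAI_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
    "command-": "COHERE_API_KEY",
}

def set_key(model: str, api_key: str) -> None:
    # set the provider env var that matches the model's prefix
    for prefix, env_var in MODEL_KEY_ENV.items():
        if model.startswith(prefix):
            os.environ[env_var] = api_key
            return
    raise ValueError(f"unknown provider for model: {model}")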

Map timeout exceptions to something readable

cc @krrishdholakia since you were working on exceptions

It's really hard to detect if there was a timeout exception when I use litellm.

I want something like this - just an easy way to detect if there was a timeout error.

    except Exception as e:
        print(f"in replicate llama, got error {e}")
        # assumes: from func_timeout import FunctionTimedOut
        # comparing the exception to a string never matches; check the type instead
        if isinstance(e, FunctionTimedOut):
            pass
        else:
            pytest.fail(f"Error occurred: {e}")

Currently I need to go and understand how func_timeout works just to catch the timeout error 😭
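
One possible shape for the fix - wrap func_timeout and re-raise a single readable error type (the Timeout name is an assumption):

from func_timeout import FunctionTimedOut, func_timeout

class Timeout(Exception):
    """Readable, provider-agnostic timeout error; hypothetical name."""

def call_with_timeout(fn, timeout_s, **kwargs):
    # run fn with a hard timeout and surface one catchable error type
    try:
        return func_timeout(timeout_s, fn, kwargs=kwargs)
    except FunctionTimedOut as e:
        raise Timeout(f"LLM call timed out after {timeout_s}s") from e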

update docs on new way of calling llm_client

import main
from main import completion

main.success_callback = ["posthog"]
main.failure_callback = ["posthog", "sentry", "slack"]

response = completion(model="gpt-3.5-turbo", messages=messages)

Guarantee format of exceptions

Each LLM provider has a different exception format, so most builders need to implement their own try/except handling for each provider they use.
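
A sketch of the normalization this implies, assuming the pre-1.0 openai SDK's error classes; the unified RateLimitError type is hypothetical:

import openai

class RateLimitError(Exception):
    """Unified, provider-agnostic error; hypothetical name."""

def map_exception(provider: str, e: Exception):
    # translate each provider's native error into one shared type
    if provider == "openai" and isinstance(e, openai.error.RateLimitError):
        raise RateLimitError(str(e)) from e
    # ...map other providers' errors the same way...
    raise e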

handle max tokens

Anthropic has:

completion = anthropic.completions.create(
    model="claude-2",
    max_tokens_to_sample=300,
    prompt=f"{HUMAN_PROMPT} how does a court case get to the Supreme Court? {AI_PROMPT}",
)

Replicate has:

    output = replicate.run(
      "replicate/flan-t5-xl:eec2f71c986dfa3b7a5d842d22e1130550f015720966bec48beaae059b19ef4c",
      # "replicate/llama-7b:2014ee1247354f2e81c0b3650d71ca715bc1e610189855f134c30ecb841fae21",
      input={
        "prompt": prompt,
        "max_length": 256
      })

I believe they both achieve similar purposes.

Links to docs: Anthropic, Replicate
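
Both parameters cap output length; only the name differs. A sketch of the translation litellm could do (the function and mapping are illustrative):

def translate_max_tokens(provider: str, max_tokens: int) -> dict:
    # map a unified max_tokens argument onto each provider's parameter name
    if provider == "anthropic":
        return {"max_tokens_to_sample": max_tokens}
    if provider == "replicate":
        return {"max_length": max_tokens}
    return {"max_tokens": max_tokens}  # OpenAI-style default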

fix response format

A common way to parse the OpenAI response is

response.choices[0].message.content

This doesn't work with litellm. Only this works:

response['choices'][0]['message']['content']
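
One common way to support both access styles is a dict subclass with attribute access; a minimal, illustrative sketch:

class AttrDict(dict):
    # dict that also allows response.choices[0].message.content style access
    def __getattr__(self, name):
        value = self[name]
        if isinstance(value, dict):
            return AttrDict(value)
        if isinstance(value, list):
            return [AttrDict(v) if isinstance(v, dict) else v for v in value]
        return value

response = AttrDict({"choices": [{"message": {"role": "assistant", "content": "hi"}}]})
assert response.choices[0].message.content == response['choices'][0]['message']['content']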

check valid api key

Someone can use the openai SDK to validate an access key with something like this. This is something we should be able to support:

def verify_access_key(self):
    """
    Verify the access key is valid.

    Returns:
        bool: True if the access key is valid, False otherwise.
    """
    try:
        openai.Model.list()
        return True
    except Exception as exception:
        # log with a placeholder; passing the exception as a bare extra arg is a bug
        logger.info("OpenAI exception: %s", exception)
        return False
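
A litellm-level equivalent could ping the provider with the cheapest possible request; sketch only - check_valid_key is a hypothetical name, and it assumes completion accepts an api_key override:

from litellm import completion

def check_valid_key(model: str, api_key: str) -> bool:
    # an auth failure on a 1-token request means a bad key
    try:
        completion(model=model, api_key=api_key, max_tokens=1,
                   messages=[{"role": "user", "content": "ping"}])
        return True
    except Exception:
        return False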

Integrate with my secret manager

When I use litellm, I don't want to always read in my secrets. It should just check my secret manager for keys.

Magic wand moment:

import litellm

litellm.secret_manager = InfisicalClient(token="good_morning ")

response = litellm.completion(model="gpt-3.5-turbo", messages=messages)
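
Under the hood this implies a lookup order; a sketch assuming Infisical's get_secret method and an environment fallback (the get_secret helper is illustrative):

import os
import litellm

def get_secret(name: str):
    # prefer the configured secret manager, fall back to env vars
    if getattr(litellm, "secret_manager", None) is not None:
        secret = litellm.secret_manager.get_secret(name)  # Infisical SDK call; method name assumed
        if secret:
            return secret
    return os.environ.get(name)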

Build out async functionality

Return native acreate functions instead of a wrapped create function, for the providers that support async calls.
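
A sketch of the resulting user-facing API, assuming an acompletion coroutine (the name is an assumption):

import asyncio
import litellm

async def main():
    # awaits the provider's native async API where one exists
    response = await litellm.acompletion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey"}],
    )
    print(response['choices'][0]['message']['content'])

asyncio.run(main())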

Handle added args for non-OpenAI models

For example, if a user makes a call like

response = completion(
    model='command-nightly',
    messages=messages,
    max_tokens=self.config['max_tokens'],
    temperature=self.config['temperature'],
    top_p=self.config['top_p'],
    request_timeout=self.config['request_timeout'],
)

this could potentially fail; ensure it does not. One approach is sketched below.
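
One way to guarantee that: filter kwargs down to what each provider supports before dispatching; the supported-params map below is illustrative:

# illustrative map of which OpenAI-style params each model accepts
SUPPORTED_PARAMS = {
    "command-nightly": {"max_tokens", "temperature", "top_p"},
}

def prune_params(model: str, **kwargs) -> dict:
    # drop kwargs the target provider doesn't understand
    allowed = SUPPORTED_PARAMS.get(model)
    if allowed is None:
        return kwargs
    return {k: v for k, v in kwargs.items() if k in allowed}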

modify additional_details for llm_client

I don't really want to create and maintain separate dictionaries across my code as I'm calling llm_client.completion. Can I not just set identifiers in one place and be done with it?

I like how Sentry does this:

from sentry_sdk import set_user

set_user({"email": "[email protected]"})
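
A litellm analogue could be a module-level attribute set once; purely a sketch - attaching it to every call is the requested behavior, not a confirmed feature:

import litellm

# set identifiers once...
litellm.additional_details = {"email": "[email protected]"}

# ...and every subsequent call could attach them to its logs/callbacks
messages = [{"role": "user", "content": "Hi"}]
response = litellm.completion(model="gpt-3.5-turbo", messages=messages)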

Track pricing per request

I need to map a completion/embedding request to dollars - a simple way to get the cost in $ and write it to my DB.

Currently doing


############### MODEL Cost Mapping ##################
# prices in USD per 1K tokens (OpenAI's published pricing)
input_tokens_cost_map = {
    'gpt-3.5-turbo': 0.0015,
    'gpt-4': 0.03,
    'chatgpt-test': 0.0015,
    'chatgpt-v-2': 0.0015,
}

output_tokens_cost_map = {
    'gpt-3.5-turbo': 0.002,
    'gpt-4': 0.06,
    'chatgpt-test': 0.002,
    'chatgpt-v-2': 0.002,
}
#####################################################

input_text = " ".join([message["content"] for message in messages])
input_tokens = count_tokens(input_text)

response_text = response['choices'][0]['message']['content']
response_tokens = count_tokens(response_text)

input_tokens_cost = input_tokens_cost_map[model]
output_tokens_cost = output_tokens_cost_map[model]

# divide by 1000 since the prices above are per 1K tokens
total_cost = (input_tokens * input_tokens_cost + response_tokens * output_tokens_cost) / 1000
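
What this issue effectively asks litellm for is a one-call mapping from a response to dollars; completion_cost below is a hypothetical helper name:

import litellm

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi"}],
)
cost_usd = litellm.completion_cost(response)  # hypothetical helper
print(f"request cost: ${cost_usd:.6f}")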

Do you address token counting?

Not sure if litellm is the right layer for counting tokens. But if we switch to litellm rather than implementing our own support for multiple models, it is not as simple as replacing the completion calls: we also need to count the tokens used by different models, and those models rely on different tokenizers.

In any event, please keep litellm as thin a layer as possible. Don't take this on if you believe it is not the right job for litellm; that would ruin litellm faster than any lack of functionality would. Consider adding some util functions? Happy to discuss.
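
One possible thin util, assuming tiktoken for OpenAI-family models (other providers would need their own tokenizers wired in the same way):

import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    # resolve the model's tokenizer, falling back to cl100k_base
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))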

Note: we are building DevChat, a potential user of litellm. Thanks for your great work!

Simplify llm_client

Critiques of existing approach:

  • maintaining 2 functions with the same args where 1 calls the other is difficult → it leads to mapping errors, etc.
    • since litellm_client.completion is essentially a wrapper around the raw completion function → why not treat it like one?

v1?

  • how do you simplify the function calling but keep the ease of letting me send my data to other places?
import main
from main import completion

main.success_callback = ["posthog"]
main.failure_callback = ["posthog", "sentry", "slack"]

response = completion(model="gpt-3.5-turbo", messages=messages)

Just create a wrapper around the completion and embedding endpoints, which should help simplify the core calling logic
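
A sketch of that wrapper approach - one decorator around the raw completion function that fires the registered callbacks (all names illustrative):

success_callback, failure_callback = [], []

def with_callbacks(fn):
    # wrap the raw call; report the outcome to whichever sinks are registered
    def wrapped(*args, **kwargs):
        try:
            result = fn(*args, **kwargs)
            for cb in success_callback:
                cb(result)
            return result
        except Exception as e:
            for cb in failure_callback:
                cb(e)
            raise
    return wrapped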

Add state handlers / observability

When I call litellm there's always just 2 things I need to know - did the query succeed or fail?

If it succeeds I'll log it to PostHog

If it fails I'll log it to PostHog + Slack

Wonder if litellm can help simplify this for me.

add aispend integration

get the API token from here: https://aispend.io/profile

then call https://aispend.io/api/v1/accounts to get the account ID, then POST to https://aispend.io/api/v1/accounts/YOUR-ACCOUNT-ID/data to push cost data. I've included an example structure in the Postman request:

[
    {
        "requests": 10,
        "requests_context": 10,
        "context_tokens": 102,
        "requests_generated": 10,
        "generated_tokens": 1053,
        "recorded_date": "2023-07-31",
        "model_id": "gpt-3.5-turbo",
        "generated_tokens_cost_usd_cent": 23,
        "context_tokens_cost_usd_cent": 20
    },
    {
        "requests": 20,
        "requests_context": 20,
        "context_tokens": 1234,
        "requests_generated": 10,
        "generated_tokens": 1234,
        "recorded_date": "2023-07-30",
        "model_id": "gpt-3.5-turbo",
        "generated_tokens_cost_usd_cent": 13,
        "context_tokens_cost_usd_cent": 30
    }
]
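
A rough sketch of that flow with requests; the bearer-token auth scheme and the response shape are assumptions - only the URLs come from this issue:

import requests

TOKEN = "..."  # from https://aispend.io/profile
HEADERS = {"Authorization": f"Bearer {TOKEN}"}  # auth scheme is an assumption

accounts = requests.get("https://aispend.io/api/v1/accounts", headers=HEADERS).json()
account_id = accounts[0]["id"]  # response shape is an assumption

payload = [{"requests": 10, "model_id": "gpt-3.5-turbo", "recorded_date": "2023-07-31"}]  # abbreviated from the structure above
requests.post(f"https://aispend.io/api/v1/accounts/{account_id}/data",
              headers=HEADERS, json=payload)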

Add support for replicate [code in ticket]

Here's my code for replicate:

elif "replicate" in model: 
    prompt = " ".join([message["content"] for message in messages])
    output = replicate.run(
      model,
      input={
        "prompt": prompt,
      })
    print(f"output: {output}")
    response = ""
    for item in output: 
      print(f"item: {item}")
      response += item
    new_response = {
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
              "content": response,
              "role": "assistant"
          }
        }
      ]
    }
    print(f"new response: {new_response}")
    response = new_response
