hyperonym / basaran Goto Github PK

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.

License: MIT License

Python 57.63% Dockerfile 4.64% Makefile 2.58% CSS 9.55% JavaScript 22.11% HTML 1.99% Shell 0.40% Jinja 1.10%

gpt huggingface language-model natural-language-processing openai-api text-generation llm python transformers llama

basaran's Introduction

Basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.

The open source community will eventually witness the Stable Diffusion moment for large language models (LLMs), and Basaran allows you to replace OpenAI's service with the latest open-source model to power your application without modifying a single line of code.

The key features of Basaran are:

Streaming generation using various decoding strategies.
Support for both decoder-only and encoder-decoder models.
Detokenizer that handles surrogates and whitespace.
Multi-GPU support with optional quantization.
Real-time partial progress using server-sent events.
Compatibility with OpenAI API and client libraries.
Comes with a fancy web-based playground!

Quick Start

TL;DR

Replace user/repo with your selected model and X.Y.Z with the latest version, then run:

docker run -p 80:80 -e MODEL=user/repo hyperonym/basaran:X.Y.Z

And you're good to go! 🚀

Playground: http://127.0.0.1/
API:        http://127.0.0.1/v1/completions

Installation

Using Docker (Recommended)

Docker images are available on Docker Hub and GitHub Packages.

For GPU acceleration, you also need to install the NVIDIA Driver and NVIDIA Container Runtime. Basaran's image already comes with related libraries such as CUDA and cuDNN, so there is no need to install them manually.

Basaran's image can be used in three ways:

Run directly: By specifying the MODEL="user/repo" environment variable, the corresponding model can be downloaded from Hugging Face Hub during the first startup.
Bundling: Create a new Dockerfile to preload a public model or bundle a private model.
Bind mount: Mount a model from the local file system into the container and point the MODEL environment variable to the corresponding path.

For the above use cases, you can find sample Dockerfiles and docker-compose files in the deployments directory.

Using pip

Basaran is tested on Python 3.8+ and PyTorch 1.13+. You should create a virtual environment with the version of Python you want to use, and activate it before proceeding.

Install with pip:

pip install basaran

Install dependencies required for GPU acceleration (optional):

pip install accelerate bitsandbytes

Replace user/repo with the selected model and run Basaran:

MODEL=user/repo PORT=80 python -m basaran

For a complete list of environment variables, see __init__.py.

Running From Source

If you want to access the latest features or hack it yourself, you can choose to run from source using git.

Clone the repository:

git clone https://github.com/hyperonym/basaran.git && cd basaran

Install dependencies:

pip install -r requirements.txt

Replace user/repo with the selected model and run Basaran:

MODEL=user/repo PORT=80 python -m basaran

Basic Usage

cURL

Basaran's HTTP request and response formats are consistent with the OpenAI API.

Taking text completion as an example:

curl http://127.0.0.1/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{ "prompt": "once upon a time,", "echo": true }'

Example response

{
    "id": "cmpl-e08c701b4ba032c09ef080e1",
    "object": "text_completion",
    "created": 1678003509,
    "model": "bigscience/bloomz-560m",
    "choices": [
        {
            "text": "once upon a time, the human being faces a complicated situation and he needs to find a new life.",
            "index": 0,
            "logprobs": null,
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "completion_tokens": 21,
        "total_tokens": 26
    }
}

OpenAI Client Library

If your application uses client libraries provided by OpenAI, you only need to modify the OPENAI_API_BASE environment variable to match Basaran's endpoint:

OPENAI_API_BASE="http://127.0.0.1/v1" python your_app.py

The examples directory contains examples of using the OpenAI Python library.

Using as a Python Library

Basaran is also available as a library on PyPI. It can be used directly in Python without the need to start a separate API server.

Install with pip:

pip install basaran

Use the load_model function to load a model:

from basaran.model import load_model

model = load_model("user/repo")

Generate streaming output by calling the model:

for choice in model("once upon a time"):
    print(choice)

The examples directory contains examples of using Basaran as a library.

Compatibility

Basaran's API format is consistent with OpenAI's, with differences in compatibility mainly in terms of parameter support and response fields. The following sections provide detailed information on the compatibility of each endpoint.

Models

Each Basaran process serves only one model, so the result will only contain that model.

Completions

Although Basaran does not support the model parameter, the OpenAI client library requires it to be present. Therefore, you can enter any random model name.

Parameter	Basaran	OpenAI	Default Value	Maximum Value
`model`	○	●	-	-
`prompt`	●	●	`""`	`COMPLETION_MAX_PROMPT`
`suffix`	○	●	-	-
`min_tokens`	●	○	`0`	`COMPLETION_MAX_TOKENS`
`max_tokens`	●	●	`16`	`COMPLETION_MAX_TOKENS`
`temperature`	●	●	`1.0`	-
`top_p`	●	●	`1.0`	-
`n`	●	●	`1`	`COMPLETION_MAX_N`
`stream`	●	●	`false`	-
`logprobs`	●	●	`0`	`COMPLETION_MAX_LOGPROBS`
`echo`	●	●	`false`	-
`stop`	○	●	-	-
`presence_penalty`	○	●	-	-
`frequency_penalty`	○	●	-	-
`best_of`	○	●	-	-
`logit_bias`	○	●	-	-
`user`	○	●	-	-

Chat

Providing a unified chat API is currently difficult because each model has a different format for chat history.

Therefore, it is recommended to pre-format the chat history based on the requirements of the specific model and use it as the prompt for the completion API.

GPT-NeoXT-Chat-Base-20B

**Summarize a long document into a single sentence and ...**

<human>: Last year, the travel industry saw a big ...

<bot>: If you're traveling this spring break, ...

<human>: But ...

<bot>:

chatglm-6b

[Round 0]
问：你好
答：你好!有什么我可以帮助你的吗?
[Round 1]
问：你是谁？
答：

Roadmap

See the open issues for a full list of proposed features.

Contributing

This project is open-source. If you have any ideas or questions, please feel free to reach out by creating an issue!

Contributions are greatly appreciated, please refer to CONTRIBUTING.md for more information.

License

Basaran is available under the MIT License.

basaran's People

Contributors

Stargazers

Watchers

Forkers

marcus-arcadius fardeon bhardwajrahul davizucon orinocoz rogervaas rossman22590 volpeon standardgalactic techthiyanes raj-khare getswype lcw99 randominterests ronald-d-rogers nicpopovic cogignition mooz chrisfitkin awesome-archive cygwynd container-labs bonabobo thireus staringos vinno97 hertera1 sieu-n liamdgray artivis ccclyu aicodehunt vchauhan1 rhochmayr ccsweets kunato muflhi001 webrulon brucepro worthmining dkzdev idoru patricknercessian creatorrr gear273 datadropp brainchain-ai zxs-learn hojinyang eduardomoraesritter cgb-ai-chatgpt octag0no tranhoangnguyen03 louanes1 thenetguy matt-osborn rebarfdn josegron qqq-tech rheehot hoanghoang1408 hodlen tanquangduong mwksandman ankit1057 sytelus tushar-ml dapper-magician sohaha mvandermeulen fxcl huaxlin toddy1242 ink-splatters haesleinhuepf deepuncertainty weihao97 chris-volpacchio jesst3r

basaran's Issues

RuntimeError: mat1 and mat2 shapes cannot be multiplied

When I call multiple streaming completions at the same time I get the error below.

start listening on 127.0.0.1:8888
ERROR:waitress:Exception while serving /v1/completions
Traceback (most recent call last):
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/waitress/task.py", line 456, in execute
    for chunk in app_iter:
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "/home/chang/AI/llm/basaran/basaran/__main__.py", line 168, in stream
    for choice in stream_model(**options):
  File "/home/chang/AI/llm/basaran/basaran/model.py", line 73, in __call__
    for (
  File "/home/chang/AI/llm/basaran/basaran/model.py", line 237, in generate
    outputs = self.model(
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 662, in forward
    outputs = self.gpt_neox(
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
    outputs = layer(
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 335, in forward
    mlp_output = self.mlp(self.post_attention_layernorm(hidden_states))
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 297, in forward
    hidden_states = self.dense_4h_to_h(hidden_states)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 417, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (238x13 and 29x5120)
ERROR:waitress:Exception while serving /v1/completions

[bug] Strange output from StreamModel

RuntimeError: expected scalar type Float but found Half

Found an issue with loading the Salesforcet5/codet5-large-ntp-py model.

basaran_1  | ERROR:waitress:Exception while serving /v1/completions
basaran_1  | Traceback (most recent call last):
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/waitress/channel.py", line 428, in service
basaran_1  |     task.service()
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/waitress/task.py", line 168, in service
basaran_1  |     self.execute()
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/waitress/task.py", line 456, in execute
basaran_1  |     for chunk in app_iter:
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/werkzeug/wsgi.py", line 289, in __next__
basaran_1  |     return self._next()
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/werkzeug/wrappers/response.py", line 32, in _iter_encoded
basaran_1  |     for item in iterable:
basaran_1  |   File "/app/basaran/__main__.py", line 187, in stream
basaran_1  |     for choice in stream_model(**options):
basaran_1  |   File "/app/basaran/model.py", line 73, in __call__
basaran_1  |     for (
basaran_1  |   File "/app/basaran/model.py", line 215, in generate
basaran_1  |     kwargs["encoder_outputs"] = encoder(**encoder_kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
basaran_1  |     return forward_call(*input, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
basaran_1  |     output = old_forward(*args, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py", line 1090, in forward
basaran_1  |     layer_outputs = layer_module(
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
basaran_1  |     return forward_call(*input, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
basaran_1  |     output = old_forward(*args, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py", line 693, in forward
basaran_1  |     self_attention_outputs = self.layer[0](
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
basaran_1  |     return forward_call(*input, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
basaran_1  |     output = old_forward(*args, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py", line 599, in forward
basaran_1  |     normed_hidden_states = self.layer_norm(hidden_states)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
basaran_1  |     return forward_call(*input, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 165, in new_forward
basaran_1  |     output = old_forward(*args, **kwargs)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/apex/normalization/fused_layer_norm.py", line 386, in forward
basaran_1  |     return fused_rms_norm_affine(input, self.weight, self.normalized_shape, self.eps)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/apex/normalization/fused_layer_norm.py", line 189, in fused_rms_norm_affine
basaran_1  |     return FusedRMSNormAffineFunction.apply(*args)
basaran_1  |   File "/usr/local/lib/python3.8/dist-packages/apex/normalization/fused_layer_norm.py", line 69, in forward
basaran_1  |     output, invvar = fused_layer_norm_cuda.rms_forward_affine(
basaran_1  | RuntimeError: expected scalar type Float but found Half

I've forked this repo and added a fix, however I think it breaks every other model out there, so I didn't make a PR.
I can still create a PR if you'd like me to.

DataDropp@61c1d41

Slow Streaming

Thanks for this package, working great and pretty fast too when i tried using this for Bloomz 7B model but when i tried the same for this model : [GPT-NeoXT-Chat-Base-20B] (https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B), the streaming token generation seems very very slow (~1 token / 2-3 secs).

Just checking if this is expected or am i missing something, as i can see you guys have tested this model too as per README

I'm running it on single A100 machine and during the streaming token generation the GPU Util is around ~55%.

:latest version tag

Can you please add the :latest tag version to the newest Docker image instead of :0.15.2, :0.15.3 etc? So people will stop to manually change the version everytime when new Docker version is out.

Instructions unclear

in windows
replacing private.dockerfile with:

FROM hyperonym/basaran:0.15.3

# Copy model files
COPY vicuna128 D:\basaran\models\vicuna128

# Provide default environment variables
ENV MODEL="D:\basaran\models\vicuna128"
ENV MODEL_LOCAL_FILES_ONLY="true"
ENV SERVER_MODEL_NAME="vicuna128"

and doing docker build deployments\bundle\private.dockerfile gives a location error because it can't find the model under models/vicuna so i tried running:

basaran> docker run -p 80:80 -e MODEL="D:\basaran\models\vicuna128" --name fun hyperonym/basaran:0.15.3
>>
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/basaran/__main__.py", line 38, in <module>
    stream_model = load_model(
  File "/app/basaran/model.py", line 318, in load_model
    tokenizer = AutoTokenizer.from_pretrained(name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 642, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 486, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 112, in _inner_fn
    validate_repo_id(arg_value)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 166, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'D:\basaran\models\vicuna128'.

it runs normally and seems to default to a model called
bigscience/bloomz-560m which runs on http://127.0.0.1/
is there any way to swap models after running the docker file?? or any way you could post the output of your terminal when running a private docker file in this way?

im not new to docker, but im also unemployed so have no one to ask about this.

I want use the function prefix_allowed_tokens_fn, where of basaran's source code shall I modify?

Hello, we all know that in huggingface transformers' origin model.generate() method, we can set the function paremeterprefix_allowed_tokens_fn to restrict the generate rule. I want to use this function in basaran just like I used in origin model.generate(), could you please tell me where of the source code shall I modify to make the model generation obey my custom prefix_allowed_tokens_fn?

Define chat history format using jinja template

Checklist

Use jinja template to render message history of system, user, and assistant as a prompt for completion.
Allow specifying template file through environment variable to adapt to formatting requirements of different models.
Provide a reasonable default template.

Do you have Discord community?

Discord community will be a great place to share thoughts and get help

GPTQ & 4bit

My apologies if this is a really stupid question... but

Is there scope here to provide the ability to load 4bit models? such as vicuna-13B-1.1-GPTQ-4bit-128g or even 4bit 30B llama models will squeeze into 24GB VRAM. I know this can all be done in other web-ui projects, but having an OpenAI like API such as this project would be amazing.

How to send Audio Inputs to the Basaran

Hi Team,
I am trying to replicate this text-completion behaviour with OpenAI Whisper Model, how can I send audio inputs to the basaran so that it can generate streaming output

Langchain Prompt Format

I have been working to integrate langchain with basaran and I am encountering an issue that I believe has to do with the prompt format. It seems that when langchain is posting to basaran, the prompt is a list and not a string. For example:

{"prompt": ["Sample data is test data\n\nQuestion: What is sample data?\nHelpful Answer:"], "model": "text-davinci-003", "temperature": 0.1, "max_tokens": 256, "top_p": 1, "frequency_penalty": 0, "presence_penalty": 0, "n": 1, "logit_bias": {}}

returns

{"id":"cmpl-3728b36ca4b7a2aac121df7f","object":"text_completion","created":1685652057,"model":"wizard-vicuna-13B","choices":[{"text":"","index":0,"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":1,"completion_tokens":256,"total_tokens":257}}

It sees the prompt as a single token and doesn't return anything. I am able to replicate the issue by changing the example to use a list. The model seems to take the single (empty?) token and generate text. For instance:

curl http://127.0.0.1/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{ "prompt": ["once upon a time,"], "echo": true }'

returns

{"id":"cmpl-ef0bc647b6de2f4986c728e8","object":"text_completion","created":1685652242,"model":"wizard-vicuna-13B","choices":[{"text":"Ahituv, Nima, 1974-\nIntroduction:","index":0,"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":1,"completion_tokens":17,"total_tokens":18}}

Would you consider this a langchain issue if the openAI API supports the call-- or am I missing something in my basaran setup?

How to pass `max_token_length` to `load_model` ?

I want the model to keep outputing text for the max_token_length i specify.

Add a chat interface to playground

Checklist

Add a tab menu at the top to allow for selection between completion or chat.
Allow to specify the title of the playground, with the model name as the subtitle.
Refactor and review all frontend code.

Support ARM Docker images

As a local workaround, add the --platform tag.

docker run -p 80:80 -e MODEL=user/repo --platform=linux/amd64 hyperonym/basaran:0.13.5

Inference should stop if connection is aborted/closed

For chat use cases on consumer hardware this is basically a show-stopper. The user needs to be able to stop a response, because consumer on-device inference is quite slow, and so if they don't like where a generation is headed, then they can stop the response in the chat UI (which aborts the HTTP request), but they'll need to wait for the response to finish behind the scenes so that their processor is free to write another response (and I'm not sure how they'd find out when it is actually finished, other than looking at their CPU usage).

Beam search

I am using this package, which shows much better performance than MS DeepSpeed-MII, which has a similar function. Do you have plans to implement beam search as well?

Getting error for model when using vicuna model

2023-04-18 17:03:51 Traceback (most recent call last):
2023-04-18 17:03:51 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
2023-04-18 17:03:51 return _run_code(code, main_globals, None,
2023-04-18 17:03:51 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
2023-04-18 17:03:51 exec(code, run_globals)
2023-04-18 17:03:51 File "/app/basaran/main.py", line 38, in
2023-04-18 17:03:51 stream_model = load_model(
2023-04-18 17:03:51 File "/app/basaran/model.py", line 318, in load_model
2023-04-18 17:03:51 tokenizer = AutoTokenizer.from_pretrained(name_or_path, **kwargs)
2023-04-18 17:03:51 File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 657, in from_pretrained
2023-04-18 17:03:51 config = AutoConfig.from_pretrained(
2023-04-18 17:03:51 File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/configuration_auto.py", line 916, in from_pretrained
2023-04-18 17:03:51 config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
2023-04-18 17:03:51 File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 573, in get_config_dict
2023-04-18 17:03:51 config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
2023-04-18 17:03:51 File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 628, in _get_config_dict
2023-04-18 17:03:51 resolved_config_file = cached_file(
2023-04-18 17:03:51 File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 380, in cached_file
2023-04-18 17:03:51 raise EnvironmentError(
2023-04-18 17:03:51 OSError: /models/vicuna does not appear to have a file named config.json. Checkout 'https://huggingface.co//models/vicuna/None' for available files.

Maybe the documentation can improve on running a custom model. It is pretty vague right now.

FROM hyperonym/basaran:0.15.3

# Copy model files
COPY ./model /models/vicuna

# Provide default environment variables
ENV MODEL="/models/vicuna"
ENV MODEL_LOCAL_FILES_ONLY="true"
ENV MODEL_HALF_PRECISION="true"
ENV SERVER_MODEL_NAME="vicuna"

`v1/completions` does not include `data: ` prefix when `stream:true`

I've just swapped over the endpoints in my code, and the parsing logic broke for streaming responses due to the lack of data: before the JSON object. Is this intended behavior, for some reason?

The only OpenAI v1/completion endpoint that I've tested is text-davinci-003, and it returns a stream where each JSON object is prefixed with data: .

How can I use this with a project using OpenAI's nodejs library?

OpenAI's official node.js library does not support changing the API URL endpoint.

Is there a workaround or other libraries I could be using?

Docker run runs and then exits, does not set up server

I'm running

docker run -p 80:80 -e MODEL=user/repo hyperonym/basaran:0.14.1

where user/repo is a Hugging Face repo. It then appears to download the model, but then once it finishes it just exits the process.

I'm using 0.14.1 because it looks like that's the only one that supports arm64 chips.

Error when Running Vicuna's FastChat Model without GPU

I am new to Vicuna.

I wish to use their open source model to train my dataset.

I don't have a GPU in my computer, so I wanted to use their RESTful API Server. I used Windows PowerShell for the commands below.

According to their explanation (https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md)

First, I launched the command

python3 -m fastchat.serve.controller

. Then, it opened a localhost for me. I opened it in my browser and it displayed the following message:

{"detail":"Not Found"}.

Next, I opened a new PowerShell window and ran their second command:

python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3
However, I encountered the following error:

"AssertionError: Torch not compiled with CUDA enabled".

Does this error occur because I do not have a GPU in my computer?

QLoRa support

does it qlora?

concurrent request supported?

Does this server support concurent request ?
Is is thread-safe for concurent request?

Llama 2 models not working - how to pass auth token?

I am trying to run the llama 2 models and here is the command and the logs:

sudo docker run -p 80:80 -e MODEL=meta-llama/Llama-2-7b-hf hyperonym/basaran:0.19.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/file_download.py", line 1195, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64ba9aef-0847ce6e5dbe16fd46aae799)

Repository Not Found for url: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/basaran/__main__.py", line 41, in <module>
    stream_model = load_model(
  File "/app/basaran/model.py", line 319, in load_model
    tokenizer = AutoTokenizer.from_pretrained(name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 643, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 487, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 433, in cached_file
    raise EnvironmentError(
OSError: meta-llama/Llama-2-7b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.

I have been granted access to those models by both Meta and HF and I am logged in using huggingface_cli

$ /home/arsaboo/.local/bin/huggingface-cli whoami
arsaboo

Api for embeddings

Are there any plans on adding text embeddings?

FR support for using fine tuned models that use Peft

Use case: I've fine tuned a model in a similar model as here:
https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing#scrollTo=hsD1VKqeA62Z

I now have a base model and an adapter. How do I let Basaran load the model + adapter?

Vicuna problem

Has anyone got this model to work yet? Running into this:

OSError: anon8231489123/vicuna-13b-GPTQ-4bit-128g does not appear to have a file named pytorch_model-00001-of-00003.bin. Checkout 'https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g/main' for available files.

CORS headers

CORS headers are required to use the API from the client side, otherwise we get errors like this:

Access to fetch at 'http://127.0.0.1/v1/completions' from origin 'http://localhost:3001' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Can CORS headers be added? Client-side usage is needed in OpenCharacters.

The header: Access-Control-Allow-Origin: * should be added to /v1/completions responses.

ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

Running this command:

docker run -p 80:80 -e MODEL=decapoda-research/llama-7b-hf hyperonym/basaran:0.15.3

Gives:

ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

It worked fine with the cerebras/Cerebras-GPT-1.3B and bigscience/bloom-560m models.

(Thanks for your work on this project! Currently integrating it with OpenCharacters)

Falcon 40B : too slow and random answers

Hi,
When i deployed the Falcon 40B model on the Basaran WebUI i had :
-random answers, by example, when i said "hi", i get : " był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..."
-a very slow inference, whereas i was using a RunPod server costing $10 per hour with 4 GPU A100 80GB

I tried to custom the setting like that :
kwargs = {
"local_files_only": local_files_only,
"trust_remote_code": trust_remote_code,
"torch_dtype": torch.bfloat16,
"device_map": "auto"
}

i used the half precision, but nothing changed,

Any idea how i could handle this issue ?

Thanks (and congrat for this beautiful webui !)

The installed version of bitsandbytes was compiled without GPU support

Failed to run with MODEL_LOAD_IN_8BIT=true:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
  File "/app/model.py", line 300, in load_model
    model = AutoModelForCausalLM.from_pretrained(name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 464, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2478, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2794, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 665, in _load_state_dict_into_meta_model
    set_module_8bit_tensor_to_device(model, param_name, param_device, value=param)
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py", line 71, in set_module_8bit_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, has_fp16_weights=has_fp16_weights).to(device)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 196, in to
    return self.cuda(device)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 160, in cuda
    CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1616, in double_quant
    row_stats, col_stats, nnz_row_ptr = get_colrow_absmax(
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1505, in get_colrow_absmax
    lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
  File "/opt/conda/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/server.py", line 78, in <module>
    model = load_model(MODEL, MODEL_CACHE_DIR, load_in_8bit)
  File "/app/model.py", line 304, in load_model
    if not model.can_generate():
UnboundLocalError: local variable 'model' referenced before assignment

Question about COMPLETION_MAX_PROMPT

Hi, I noticed that COMPLETION_MAX_PROMPT is defined via the length of the prompt in characters, rather than tokens, and was wondering if this is intended?
If intentional, it may be worth clarifying somewhere (as currently the default value is identical to that of COMPLETION_MAX_TOKENS) and/or adding a warning when a prompt is being truncated.

Add support for chat completion API

Checklist

Regardless of whether the last role in messages is user or assistant, the response will always be assistant.
When steam=true, the first returned event will always be {"role": "assistant"}.
When steam=true, the specific finish_reason will be yielded as a separate event.

Compatibility

Parameter	Basaran	OpenAI	Default Value	Maximum Value
`model`	○	●	-	-
`messages`	●	●	`[]`	`CHAT_MAX_PROMPT`
`min_tokens`	●	○	`0`	`CHAT_MAX_TOKENS`
`max_tokens`	●	●	`512`	`CHAT_MAX_TOKENS`
`temperature`	●	●	`1.0`	-
`top_p`	●	●	`1.0`	-
`n`	●	●	`1`	`CHAT_MAX_N`
`stream`	●	●	`false`	-
`stop`	○	●	-	-
`presence_penalty`	○	●	-	-
`frequency_penalty`	○	●	-	-
`logit_bias`	○	●	-	-
`user`	○	●	-	-

Examples

Chat completion (n=1, stream=false)

{
    "id": "chatcmpl-6z5sqEUkSdUyWqsNFRyyJ1s7kCzpM",
    "object": "chat.completion",
    "created": 1680018644,
    "model": "gpt-3.5-turbo-0301",
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 14,
        "total_tokens": 26
    },
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "I was created by a team of developers and engineers at OpenAI."
            },
            "finish_reason": "stop",
            "index": 0
        }
    ]
}

Chat completion (n=2, stream=false)

{
    "id": "chatcmpl-6z5sOxJ318FNEWVkDsRbsZPS7wexL",
    "object": "chat.completion",
    "created": 1680018616,
    "model": "gpt-3.5-turbo-0301",
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 34,
        "total_tokens": 46
    },
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "I was created by a team of developers at OpenAI."
            },
            "finish_reason": "stop",
            "index": 0
        },
        {
            "message": {
                "role": "assistant",
                "content": "I was created by a team of developers at OpenAI, using advanced artificial intelligence and natural language processing technologies."
            },
            "finish_reason": "stop",
            "index": 1
        }
    ]
}

Chat completion (n=1, stream=true)

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"I"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" was"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" created"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" by"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" a"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" team"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" of"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" developers"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" at"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" Open"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"AI"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"."},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5sk5NJqhsESrfo25sQzjMfCbcZ0","object":"chat.completion.chunk","created":1680018638,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: [DONE]

Chat completion (n=2, stream=true)

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"role":"assistant"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"I"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"role":"assistant"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"I"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" was"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" was"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" created"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" created"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" by"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" by"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" a"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" Open"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"AI"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" team"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"."},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" of"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" programmers"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" and"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" developers"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" at"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":" Open"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"AI"},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{"content":"."},"index":1,"finish_reason":null}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: {"id":"chatcmpl-6z5rT81CjeF0YZH9BCf5fXscq2iqA","object":"chat.completion.chunk","created":1680018559,"model":"gpt-3.5-turbo-0301","choices":[{"delta":{},"index":1,"finish_reason":"stop"}]}

data: [DONE]

Text completion (n=1, stream=false)

{
    "id": "cmpl-6z60oeEN6Wf4O3Ad0OvAMtA92ELQd",
    "object": "text_completion",
    "created": 1680019138,
    "model": "text-davinci-003",
    "choices": [
        {
            "text": "\n\nThis is indeed a test",
            "index": 0,
            "logprobs": null,
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "completion_tokens": 7,
        "total_tokens": 12
    }
}

Support for llama.cpp/ggml models

Is there a plan to add support for llama.cpp models? It could support inference on CPUs.

https://github.com/ggerganov/llama.cpp and https://github.com/thomasantony/llamacpp-python.

Related: #57

how to run model in total offline?

Sorry for stupid question, but I am totally newbie in Docker and using Huggingface locally (not via colab or something else). This is the command to model for the first time , for example:

docker run -p 80:80 -e MODEL=bigscience/bloom-560m hyperonym/basaran:0.13.5

In this case everything is incredible! All works, I turn of the connection and still everything works fine.

Now, when I want to use the previously downloaded model, I have difficulties. Can you just give and example of offline run?

Something like this, without using Dockerfile etc. Just one command:
docker run -p 80:80 -e TRANFORMERS_OFFLINE=1 MODEL='/home/my_model' hyperonym/basaran:0.13.5

And please, don't send this. I've tried a lot of different variations, but still didn't get it:
https://huggingface.co/docs/transformers/v4.15.0/installation#offline-mode

So, shortly: how to run basaran in Docker locally, give an example of command, please.

Thank you for your understanding and for your help and work!

Replace EventSource with POST requests in playground

Encoding prompt as query parameters can lead to 414 URI Too Long for long user inputs. We should replace EventSource with native POST requests.

Possible to run on M-series chips/MPS?

Hello,
Thank you for making this great repository! Is it possible to run this on M1/M2 chips using MPS? I've tried setting self.device to mps, however I get this:

RuntimeError: Placeholder storage has not been allocated on MPS device!

Is there any way to run this using MPS optimization?
Thank you!

Chat broken

I asked chat gpt to write a code, instead it writes garbage, completely irrelevant content. Any explanation?
I am using docker container in Synology

RuntimeError: mat1 and mat2 shapes cannot be multiplied

Sample 1:

ERROR:waitress:Exception while serving /v1/completions
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/opt/conda/lib/python3.8/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/opt/conda/lib/python3.8/site-packages/waitress/task.py", line 456, in execute
    for chunk in app_iter:
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "server.py", line 157, in stream
    for c in model(**options):
  File "/app/model.py", line 68, in __call__
    for (
  File "/app/model.py", line 222, in generate
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 900, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 782, in forward
    outputs = block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 463, in forward
    output = self.mlp(layernorm_output, residual)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 384, in forward
    hidden_states = self.gelu_impl(self.dense_h_to_4h(hidden_states))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/opt/conda/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/opt/conda/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x6 and 3x16384)

Sample 2:

ERROR:waitress:Exception while serving /v1/completions
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/opt/conda/lib/python3.8/site-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/opt/conda/lib/python3.8/site-packages/waitress/task.py", line 456, in execute
    for chunk in app_iter:
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "server.py", line 157, in stream
    for c in model(**options):
  File "/app/model.py", line 68, in __call__
    for (
  File "/app/model.py", line 222, in generate
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 900, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 782, in forward
    outputs = block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 463, in forward
    output = self.mlp(layernorm_output, residual)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 395, in forward
    intermediate_output = self.dense_4h_to_h(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 158, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/opt/conda/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/opt/conda/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x3 and 1x4096)

The requested URL was not found on the server

Hi,

I keep hitting the following error.

InvalidRequestError: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.

This happens when trying to droppin' replace OpenAI with basaran (tested with both babyagi and langchain's own implementation of babyagi as a Jupyter notbook). While I had to tweak the former, the later supports setting the api_base since recently.

I did make sure that basaran is running fine, I can hit it with,

curl http://127.0.0.1/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{ "prompt": "once upon a time,", "echo": true }'

Any pointer would be appreciated :)

Support for `v1/embeddings` endpoint

I'm not sure how feasible or within-scope this is, but it'd be very useful if the Basaran project were able to implement the v1/embeddings endpoint (using Hugging Face repos, like with the v1/completions endpoint).

Text embeddings are very often used alongside the completion endpoints, and we have this particular requirement for OpenCharacters so we can save and search over the character's "memories".

(And very soon we'll have the same requirement for text-to-image. If it were possible for Basaran to aim to be the OpenAI-compatible, open-source API server, that would be awesome.)

How to set my own parameters in model.generate() in basaran?

Hello, I want use my customize parameters when model.generate(), like this:

model.generate(input_ids, max_new_tokens=max_new_tokens,
                                     do_sample=True, max_length=max_length, temperature=temperature, top_p=top_p,
                                     repetition_penalty=repetition_penalty)

but if I use basaran, the code is like this:

model = load_model(model_name)
for choice in model(input_code):
      yield choice

It seems no place I can set parameters like do_sample, max_length, top_p, ..., just like I use model.generate() directly.
So that I can not set those parameters by myself.
How to solve this problem?

Does this work for LLaMA models?

I wanna stream the output of LLaMA models is it possible using this?

Any chance to support text-generation-inference as backend?

https://github.com/huggingface/text-generation-inference

Main features of TGI are quite awesome. It woud be nice to make it additional inference implementation.

API Usage - iOS Shortcuts

I want to integrate basaran into iOS Shortcuts, but I'm unsure how to use the API. With the existing iOS Shortcuts, for ChatGP, you only need an API key.

Is there a good place to look for documentation on API usage, or does anyone have an iOS Shortcut that leverages basaran, willing to share it with us?

Created RunPod template for easy deploy

First of all I want to say I love Basaran is more OpenAI than OpenAI itself :)
I decided to give a go and made template for Basaran to run on RunPod GPU service and it works well including UI and API endpoints.
Hope that helps users who want to use it without GPU access to also be able to enjoy it.
https://runpod.io/gsc?template=7ito7h393l&ref=vfker49t

I will be publishing blog post soon so will share link later.
I also have question about location of the models. For now container saves model to temp storage if you let me know where models are being saved I will adjust template to allow saving to volume storage so users can avoid downloading models every time.

Thank you for amazing work again :)

in stream mode, the English word has no space after detokenizer and Chinese were messed up

How to resolve this problem?>

crash when running mosaicml/mpt-7b-* models: KeyError: 'attention_mask'

from basaran.model import load_model

model = load_model('mosaicml/mpt-7b-storywriter',  trust_remote_code=True, load_in_8bit=True,)

for choice in model("once upon a time"):
    print(choice)

Traceback (most recent call last):
  File "/home/taras/Documents/ctranslate2/basaran/run.py", line 7, in <module>
    for choice in model("once upon a time"):
  File "/home/taras/Documents/ctranslate2/basaran/.venv/lib/python3.9/site-packages/basaran/model.py", line 73, in __call__
    for (
  File "/home/taras/Documents/ctranslate2/basaran/.venv/lib/python3.9/site-packages/basaran/model.py", line 233, in generate
    inputs = self.model.prepare_inputs_for_generation(
  File "/home/taras/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-storywriter/8667424ea9d973d3c01596fcbb86a3a8bc164299/modeling_mpt.py", line 280, in prepare_inputs_for_generation
    attention_mask = kwargs['attention_mask'].bool()
KeyError: 'attention_mask'

TypeError: init() got an unexpected keyword argument 'load_in_4bit'

# MODEL_TRUST_REMOTE_CODE=True MODEL=huggyllama/llama-7b PORT=80 python -m basaran
Traceback (most recent call last):
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/basaran/__main__.py", line 41, in <module>
    stream_model = load_model(
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/basaran/model.py", line 334, in load_model
    model = AutoModelForCausalLM.from_pretrained(name_or_path, **kwargs)
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2611, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() got an unexpected keyword argument 'load_in_4bit'

# MODEL_TRUST_REMOTE_CODE=True MODEL=openlm-research/open_llama_3b PORT=80 python -m basaran

Traceback (most recent call last):
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/basaran/__main__.py", line 41, in <module>
    stream_model = load_model(
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/basaran/model.py", line 334, in load_model
    model = AutoModelForCausalLM.from_pretrained(name_or_path, **kwargs)
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/root/anaconda3/envs/cuda_test2/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2611, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() got an unexpected keyword argument 'load_in_4bit'

But this model works perfect with transformers:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

## v2 models
#model_path = 'openlm-research/open_llama_7b_v2'

## v1 models
model_path = 'openlm-research/open_llama_3b'
# model_path = 'openlm-research/open_llama_7b'
# model_path = 'openlm-research/open_llama_13b'

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

prompt = 'Q: What is China?\nA:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32
)
print(tokenizer.decode(generation_output[0]))

hyperonym / basaran Goto Github PK

basaran's Introduction

Basaran

Quick Start

TL;DR

Installation

Using Docker (Recommended)

Using pip

Running From Source

Basic Usage

cURL

OpenAI Client Library

Using as a Python Library

Compatibility

Models

Completions

Chat

Roadmap

Contributing

License

basaran's People

Contributors

Stargazers

Watchers

Forkers

basaran's Issues

Checklist

Checklist

Checklist

Compatibility

Examples

Chat completion (n=1, stream=false)

Chat completion (n=2, stream=false)

Chat completion (n=1, stream=true)

Chat completion (n=2, stream=true)

Text completion (n=1, stream=false)

Recommend Projects

Recommend Topics

Recommend Org