
xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.

Home Page: https://inference.readthedocs.io

License: Apache License 2.0

Languages: Python 87.49%, JavaScript 12.10%, Dockerfile 0.24%, HTML 0.14%, CSS 0.04%
Topics: ggml, pytorch, chatglm, chatglm2, deployment, flan-t5, llm, wizardlm, artificial-intelligence, machine-learning

inference's People

Contributors

1572161937, ago327, amumu96, aresnow1, bojun-feng, chengjieli28, codingl2k1, eltociear, fengsxy, hainaweiben, jiayini1119, liunux4odoo, mikeshi80, minamiyama, mujin2, notsyncing, onesuper, pangyoki, qinxuye, rayji01, richzw, takatost, uranusseven, utopia2077, waltcow, wertycn, xiaodouzi666, yiboyasss, zhanghx0905, zhangtianrong


inference's Issues

BUG: Missing dependencies

Several dependencies, such as numpy, versioneer, and llama_cpp, cannot be found when installing with pip install -e .

ENH: support baichuan-7b

  1. apply ggml quantization: tutorial.
  2. upload the ggml model to our s3 bucket.
  3. add a class to plexar specifying the download URL, system prompt, separator, etc. (see the sketch below).
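
A minimal sketch of what such a model class could look like (the structure, field names, and URL below are assumptions, not the actual plexar API):

from dataclasses import dataclass

@dataclass
class BaichuanGgml:
    # Placeholder URL; the real one would point at the quantized ggml weights in our s3 bucket.
    download_url: str = "https://example-bucket.s3.amazonaws.com/baichuan-7b-ggml-q4_0.bin"
    system_prompt: str = ""
    separator: str = "\n"

    def format_prompt(self, user_input: str) -> str:
        # Join the system prompt and the user input with the separator.
        return f"{self.system_prompt}{self.separator}{user_input}"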

ENH: builtin stop words

orca 3b sometimes generates something like: [answer]###[unexpected tokens].

To avoid this, we may consider adding builtin stop words for builtin models, as sketched below.
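
As a rough illustration, builtin stop words could be attached per builtin model and merged into the llama_cpp call (the stop keyword of llama_cpp.Llama.__call__ is real; the surrounding structure is an assumption):

# Hypothetical mapping from builtin model name to its stop words.
BUILTIN_STOP_WORDS = {
    "orca-3b": ["###"],
}

def generate(llm, model_name: str, prompt: str, **kwargs):
    # Merge user-provided stop words with the builtin ones for this model.
    stop = set(BUILTIN_STOP_WORDS.get(model_name, [])) | set(kwargs.pop("stop", []))
    return llm(prompt, stop=list(stop), **kwargs)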

ENH: support whisper

Whisper is an open-source model created by OpenAI.

The author of ggml provides a high-performance inference implementation based on ggml, called whisper.cpp.

It would be very cool to serve Whisper and combine it with LLMs. Here's a demo:

https://twitter.com/ggerganov/status/1642115206544343040?s=20

Requirements

brew install portaudio
pip install sounddevice
pip install soundfile

Recording

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)  # note: sd.rec is non-blocking, so call sd.wait() first to capture the full 10 s

Recording with Arbitrary Duration

https://github.com/spatialaudio/python-sounddevice/blob/0.4.6/examples/rec_unlimited.py

Invoke whisper

whisper: https://github.com/openai/whisper
whisper.cpp: https://github.com/ggerganov/whisper.cpp
whisper.cpp python bindings: https://github.com/aarnphm/whispercpp

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)

In [5]: from whispercpp import Whisper
   ...:

In [6]: w = Whisper.from_pretrained("tiny")
whisper_init_from_file_no_state: loading model from '/Users/jon/.local/share/whispercpp/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

In [7]: w.transcribe(myrecording)
Out[7]: ' [sad music] [sad music] [sad music] [sad music]'

FEAT: dashboard

Is your feature request related to a problem? Please describe

A dashboard will provide the necessary monitoring and performance metrics, enabling efficient management and optimization of our system.

Describe the solution you'd like

Below are the key features I envision for a dashboard:

  1. Resource Monitoring (see the resource-snapshot sketch below):
  • CPU: Real-time monitoring of CPU utilization across the distributed system nodes.
  • Memory: Tracking and visualization of memory usage for each node.
  • GPU: Monitoring GPU utilization, allowing us to identify bottlenecks or optimize resource allocation.
  • VRAM: Real-time monitoring of VRAM utilization for GPU-based inference.
  2. Performance Monitoring:
  • Model-Specific Metrics: For each deployed model, capture and display relevant metrics such as generate task queue length, number of tokens generated per second, and any other model-specific performance indicators.
  • Throughput and Latency: Measure the overall throughput and latency of the system, enabling us to identify performance issues and assess system efficiency.
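
As a starting point for resource monitoring, each worker could expose a periodic resource snapshot (a rough sketch using psutil; the snapshot shape is an assumption):

import psutil

def resource_snapshot() -> dict:
    # Collect basic per-node utilization for the dashboard to display.
    mem = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_total": mem.total,
        "mem_used": mem.used,
        # GPU and VRAM utilization could be added via pynvml on CUDA nodes.
    }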


FEAT: local deployment

Implement a shortcut for users to launch a model locally with a single command, like:

plexar model launch -n <built-in model name>

or

plexar model launch -p <path to custom model>

ENH: let users install libs like llama-cpp-python

It is currently hard to install llama-cpp-python with optimized CMake args. A better choice could be to let users install it themselves.

To do that, we should not import llama_cpp at the module level, but inside LlamaCppModel. The ImportError should be caught and re-raised with an installation guide.
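
A minimal sketch of the deferred import (the error message and method layout are assumptions):

class LlamaCppModel:
    def load(self):
        try:
            # Import lazily so the package itself does not depend on llama-cpp-python.
            from llama_cpp import Llama
        except ImportError as e:
            raise ImportError(
                "Could not import llama_cpp. Install llama-cpp-python with the CMake "
                "args appropriate for your hardware, e.g. "
                'CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python'
            ) from e
        self._llm = Llama(model_path=self._model_path)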

BUG: list index out of range on controller start

Traceback (most recent call last):
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
output = await app.get_blocks().process_api(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
result = await self.call_function(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
prediction = await anyio.to_thread.run_sync(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/jon/Documents/repo/plexar/plexar/actor/gradio.py", line 141, in _refresh_models
return gr.Dropdown.update(value=launched[0], choices=launched)
IndexError: list index out of range
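
The crash happens because launched is empty when no model has been launched yet. A possible guard in _refresh_models (a sketch, not the committed fix; _list_launched_models is a hypothetical helper):

def _refresh_models(self):
    launched = self._list_launched_models()  # hypothetical helper returning launched model names
    if not launched:
        # Nothing launched yet: clear the dropdown instead of indexing into an empty list.
        return gr.Dropdown.update(value=None, choices=[])
    return gr.Dropdown.update(value=launched[0], choices=launched)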

FEAT: support stream generation

Currently, plexar.model.llm.core.LlamaCppModel.generate takes a prompt as the input and returns a completion.

We can optimize it by leveraging the stream argument provided by llama_cpp.llama.Llama.__call__, as sketched below.
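
A rough sketch of a streaming variant (the method shape is an assumption; stream=True making llama_cpp.Llama.__call__ return an iterator of completion chunks is its documented behavior):

class LlamaCppModel:
    def generate_stream(self, prompt: str, **kwargs):
        # With stream=True, llama_cpp yields completion chunks instead of a single dict.
        for chunk in self._llm(prompt, stream=True, **kwargs):
            yield chunk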

FEAT: support multi-model serving

Implement a multi-model serving framework. The following roles are needed.

Controller

The controller manages workers, allocates resources, launches models on workers, and provides interfaces for users to interact with models. It should contain the following components:

  1. worker manager: runs health checks on workers and gathers the workers' resource usage periodically
  2. model manager: schedules and launches models on workers, and maintains the lifecycle of a model
  3. user interfaces: Gradio, RESTful, ...

Worker

Executes operations on models according to the controller's commands. A rough sketch of both roles follows.
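
A rough interface sketch for the two roles (class names, method signatures, and the placeholder scheduling policy are assumptions, not the final design):

# Hypothetical sketch of the controller/worker split.
class Controller:
    def __init__(self):
        self._workers = {}  # worker address -> latest resource report

    def register_worker(self, address: str) -> None:
        self._workers[address] = None

    def launch_model(self, model_name: str) -> str:
        # Model manager: pick a worker (placeholder policy) and launch the model there.
        address = next(iter(self._workers))
        return f"{model_name}@{address}"


class Worker:
    def report_resources(self) -> dict:
        # Gathered periodically by the controller's worker manager.
        return {"cpu_percent": 0.0, "mem_used": 0}

    def launch_model(self, model_name: str) -> None:
        # Execute the controller's command locally.
        pass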

BUG: too many clients

Describe the bug

When running model_ref.generate() in IPython, a client seems to be created for every word generated, eventually leading to the following error:

gaierror: [Errno 8] nodename nor servname provided, or not known

To Reproduce

python -m plexar.deploy.cmdline supervisor -a localhost:9999 --log-level debug

python -m plexar.deploy.cmdline worker --supervisor-address localhost:9999 -a localhost:10000 --log-level debug

import sys
from plexar.client import Client

client = Client("localhost:9999")
model_uid = client.launch_model("wizardlm-v1.0", 7, "ggmlv3", "q4_0")
model_ref = client.get_model(model_uid)

async for c in await model_ref.generate(
    "Once upon a time, there was a very old computer.", {"max_tokens": 512}
):
    sys.stdout.write(c["choices"][0]["text"])

Expected behavior

First, warnings are printed: Actor caller has created too many clients ([some number] >= 100), the global router may not be set.

Then the gaierror occurs once [some number] exceeds 240.

FEAT: model group and load balancer

Is your feature request related to a problem? Please describe

Users may want to serve multiple replicas of a model.

Describe the solution you'd like

Let users specify a model and the desired number of replicas; Xinference will launch and manage them as a group. A load balancer needs to be created to distribute incoming inference requests evenly across these replicas, as sketched below.
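
As one possibility, a simple round-robin dispatcher over the replicas (a rough sketch; the real load balancer would likely be asynchronous and live in the supervisor):

import itertools

class RoundRobinBalancer:
    """Distributes incoming requests across model replicas in round-robin order (sketch)."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        # Each call returns the next replica that should receive a request.
        return next(self._cycle)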


ENH: support alpaca chinese

  1. we need to convert the Hugging Face format into ggml: tutorial.
  2. upload the generated model to our s3 bucket.
  3. add a class to plexar specifying the download URL, system prompt, separator, etc.

ENH: handle worker lost

A worker should maintain a heartbeat connection with the controller. Each heartbeat should include information about the running models.

If the heartbeat is interrupted, the controller should cease scheduling models for that worker and label the models running on that worker as unavailable.

Once the heartbeat is restored, the controller should perform a health check using the information provided by the worker.
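
A rough sketch of how the controller could track worker heartbeats (the timeout value, class name, and fields are assumptions):

import time

HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before a worker is considered lost (assumed value)

class WorkerStatus:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.running_models = []

    def on_heartbeat(self, running_models):
        # Each heartbeat carries the list of models running on the worker.
        self.last_heartbeat = time.monotonic()
        self.running_models = running_models

    def is_lost(self) -> bool:
        # If lost, the controller stops scheduling to this worker and marks its
        # running models as unavailable until the heartbeat is restored.
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT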
