
xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.

Home Page: https://inference.readthedocs.io

License: Apache License 2.0

Languages: Python 87.49%, JavaScript 12.10%, Dockerfile 0.24%, HTML 0.14%, CSS 0.04%
Topics: ggml, pytorch, chatglm, chatglm2, deployment, flan-t5, llm, wizardlm, artificial-intelligence, machine-learning

inference's People

Contributors

1572161937, ago327, amumu96, aresnow1, bojun-feng, chengjieli28, codingl2k1, eltociear, fengsxy, hainaweiben, jiayini1119, liunux4odoo, mikeshi80, minamiyama, mujin2, notsyncing, onesuper, pangyoki, qinxuye, rayji01, richzw, takatost, uranusseven, utopia2077, waltcow, wertycn, xiaodouzi666, yiboyasss, zhanghx0905, zhangtianrong


inference's Issues

BUG: Missing dependencies

Several dependencies, such as numpy, versioneer, and llama_cpp, cannot be found when installing with pip install -e .

ENH: support baichuan-7b

  1. apply ggml quantization: tutorial.
  2. upload the ggml model to our s3 bucket.
  3. add a class to plexar specifying the download URL, system prompt, separator, etc. (see the sketch below).
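
A minimal sketch of what such a model class could look like (the structure, field names, and URL below are assumptions, not the actual plexar API):

from dataclasses import dataclass

@dataclass
class BaichuanGgml:
    # Placeholder URL; the real one would point at the quantized ggml weights in our s3 bucket.
    download_url: str = "https://example-bucket.s3.amazonaws.com/baichuan-7b-ggml-q4_0.bin"
    system_prompt: str = ""
    separator: str = "\n"

    def format_prompt(self, user_input: str) -> str:
        # Join the system prompt and the user input with the separator.
        return f"{self.system_prompt}{self.separator}{user_input}"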

ENH: builtin stop words

orca 3b sometimes generates something like: [answer]###[unexpected tokens].

To avoid this, we may consider adding builtin stop words for builtin models, as sketched below.
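
As a rough illustration, builtin stop words could be attached per builtin model and merged into the llama_cpp call (the stop keyword of llama_cpp.Llama.__call__ is real; the surrounding structure is an assumption):

# Hypothetical mapping from builtin model name to its stop words.
BUILTIN_STOP_WORDS = {
    "orca-3b": ["###"],
}

def generate(llm, model_name: str, prompt: str, **kwargs):
    # Merge user-provided stop words with the builtin ones for this model.
    stop = set(BUILTIN_STOP_WORDS.get(model_name, [])) | set(kwargs.pop("stop", []))
    return llm(prompt, stop=list(stop), **kwargs)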

ENH: support whisper

Whisper is an open-source model created by OpenAI.

The author of ggml provides a high-performance inference implementation based on ggml, called whisper.cpp.

It would be very cool to serve Whisper and combine it with LLMs. Here's a demo:

https://twitter.com/ggerganov/status/1642115206544343040?s=20

Requirements

brew install portaudio
pip install sounddevice
pip install soundfile

Recording

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)  # note: sd.rec is non-blocking, so call sd.wait() first to capture the full 10 s

Recording with Arbitrary Duration

https://github.com/spatialaudio/python-sounddevice/blob/0.4.6/examples/rec_unlimited.py

Invoke whisper

whisper: https://github.com/openai/whisper
whisper.cpp: https://github.com/ggerganov/whisper.cpp
whisper.cpp python bindings: https://github.com/aarnphm/whispercpp

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)

In [5]: from whispercpp import Whisper
   ...:

In [6]: w = Whisper.from_pretrained("tiny")
whisper_init_from_file_no_state: loading model from '/Users/jon/.local/share/whispercpp/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

In [7]: w.transcribe(myrecording)
Out[7]: ' [sad music] [sad music] [sad music] [sad music]'

FEAT: dashboard

Is your feature request related to a problem? Please describe

A dashboard will provide the necessary monitoring and performance metrics, enabling efficient management and optimization of our system.

Describe the solution you'd like

Below are the key features I envision for a dashboard:

  1. Resource Monitoring (see the resource-snapshot sketch below):
  • CPU: Real-time monitoring of CPU utilization across the distributed system nodes.
  • Memory: Tracking and visualization of memory usage for each node.
  • GPU: Monitoring GPU utilization, allowing us to identify bottlenecks or optimize resource allocation.
  • VRAM: Real-time monitoring of VRAM utilization for GPU-based inference.
  2. Performance Monitoring:
  • Model-Specific Metrics: For each deployed model, capture and display relevant metrics such as generate task queue length, number of tokens generated per second, and any other model-specific performance indicators.
  • Throughput and Latency: Measure the overall throughput and latency of the system, enabling us to identify performance issues and assess system efficiency.
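
As a starting point for resource monitoring, each worker could expose a periodic resource snapshot (a rough sketch using psutil; the snapshot shape is an assumption):

import psutil

def resource_snapshot() -> dict:
    # Collect basic per-node utilization for the dashboard to display.
    mem = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_total": mem.total,
        "mem_used": mem.used,
        # GPU and VRAM utilization could be added via pynvml on CUDA nodes.
    }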


FEAT: local deployment

Implement a shortcut for users to launch a model locally with a single command, like:

plexar model launch -n <built-in model name>

or

plexar model launch -p <path to custom model>

ENH: let users install libs like llama-cpp-python

It is currently hard to install llama-cpp-python with optimized CMake args. A better choice could be to let users install it themselves.

To do that, we should not import llama_cpp at the module level, but inside LlamaCppModel. The ImportError should be caught and re-raised with an installation guide.
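
A minimal sketch of the deferred import (the error message and method layout are assumptions):

class LlamaCppModel:
    def load(self):
        try:
            # Import lazily so the package itself does not depend on llama-cpp-python.
            from llama_cpp import Llama
        except ImportError as e:
            raise ImportError(
                "Could not import llama_cpp. Install llama-cpp-python with the CMake "
                "args appropriate for your hardware, e.g. "
                'CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python'
            ) from e
        self._llm = Llama(model_path=self._model_path)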

BUG: list index out of range on controller start

Traceback (most recent call last):
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
output = await app.get_blocks().process_api(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
result = await self.call_function(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
prediction = await anyio.to_thread.run_sync(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/jon/Documents/repo/plexar/plexar/actor/gradio.py", line 141, in _refresh_models
return gr.Dropdown.update(value=launched[0], choices=launched)
IndexError: list index out of range
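
The crash happens because launched is empty when no model has been launched yet. A possible guard in _refresh_models (a sketch, not the committed fix; _list_launched_models is a hypothetical helper):

def _refresh_models(self):
    launched = self._list_launched_models()  # hypothetical helper returning launched model names
    if not launched:
        # Nothing launched yet: clear the dropdown instead of indexing into an empty list.
        return gr.Dropdown.update(value=None, choices=[])
    return gr.Dropdown.update(value=launched[0], choices=launched)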

FEAT: support stream generation

Currently, plexar.model.llm.core.LlamaCppModel.generate takes a prompt as the input and returns a completion.

We can optimize it by leveraging the stream argument provided by llama_cpp.llama.Llama.__call__, as sketched below.
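
A rough sketch of a streaming variant (the method shape is an assumption; stream=True making llama_cpp.Llama.__call__ return an iterator of completion chunks is its documented behavior):

class LlamaCppModel:
    def generate_stream(self, prompt: str, **kwargs):
        # With stream=True, llama_cpp yields completion chunks instead of a single dict.
        for chunk in self._llm(prompt, stream=True, **kwargs):
            yield chunk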

FEAT: support multi-model serving

Implement a multi-model serving framework. The following roles are needed.

Controller

The controller manages workers, allocates resources, launches models on workers, and provides interfaces for users to interact with models. It should contain the following components:

  1. worker manager: runs health checks on workers and gathers the workers' resource usage periodically
  2. model manager: schedules and launches models on workers, and maintains the lifecycle of a model
  3. user interfaces: Gradio, RESTful, ...

Worker

Executes operations on models according to the controller's commands. A rough sketch of both roles follows.
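
A rough interface sketch for the two roles (class names, method signatures, and the placeholder scheduling policy are assumptions, not the final design):

# Hypothetical sketch of the controller/worker split.
class Controller:
    def __init__(self):
        self._workers = {}  # worker address -> latest resource report

    def register_worker(self, address: str) -> None:
        self._workers[address] = None

    def launch_model(self, model_name: str) -> str:
        # Model manager: pick a worker (placeholder policy) and launch the model there.
        address = next(iter(self._workers))
        return f"{model_name}@{address}"


class Worker:
    def report_resources(self) -> dict:
        # Gathered periodically by the controller's worker manager.
        return {"cpu_percent": 0.0, "mem_used": 0}

    def launch_model(self, model_name: str) -> None:
        # Execute the controller's command locally.
        pass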

BUG: too many clients

Describe the bug

When running model_ref.generate() in IPython, a client seems to be created for every word generated, eventually leading to the following error:

gaierror: [Errno 8] nodename nor servname provided, or not known

To Reproduce

python -m plexar.deploy.cmdline supervisor -a localhost:9999 --log-level debug

python -m plexar.deploy.cmdline worker --supervisor-address localhost:9999 -a localhost:10000 --log-level debug

import sys
from plexar.client import Client

client = Client("localhost:9999")
model_uid = client.launch_model("wizardlm-v1.0", 7, "ggmlv3", "q4_0")
model_ref = client.get_model(model_uid)

async for c in await model_ref.generate(
    "Once upon a time, there was a very old computer.", {"max_tokens": 512}
):
    sys.stdout.write(c["choices"][0]["text"])

Expected behavior

First, warnings are printed: Actor caller has created too many clients ([some number] >= 100), the global router may not be set.

Then the gaierror occurs once [some number] exceeds 240.

FEAT: model group and load balancer

Is your feature request related to a problem? Please describe

Users may want to serve multiple replicas of a model.

Describe the solution you'd like

Let users specify a model and the desired number of replicas; Xinference will launch and manage them as a group. A load balancer needs to be created to distribute incoming inference requests evenly across these replicas, as sketched below.
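
As one possibility, a simple round-robin dispatcher over the replicas (a rough sketch; the real load balancer would likely be asynchronous and live in the supervisor):

import itertools

class RoundRobinBalancer:
    """Distributes incoming requests across model replicas in round-robin order (sketch)."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        # Each call returns the next replica that should receive a request.
        return next(self._cycle)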


ENH: support alpaca chinese

  1. we need to convert the Hugging Face format into ggml: tutorial.
  2. upload the generated model to our s3 bucket.
  3. add a class to plexar specifying the download URL, system prompt, separator, etc.

ENH: handle worker lost

A worker should maintain a heartbeat connection with the controller. Each heartbeat should include information about the running models.

If the heartbeat is interrupted, the controller should cease scheduling models for that worker and label the models running on that worker as unavailable.

Once the heartbeat is restored, the controller should perform a health check using the information provided by the worker.
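
A rough sketch of how the controller could track worker heartbeats (the timeout value, class name, and fields are assumptions):

import time

HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before a worker is considered lost (assumed value)

class WorkerStatus:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.running_models = []

    def on_heartbeat(self, running_models):
        # Each heartbeat carries the list of models running on the worker.
        self.last_heartbeat = time.monotonic()
        self.running_models = running_models

    def is_lost(self) -> bool:
        # If lost, the controller stops scheduling to this worker and marks its
        # running models as unavailable until the heartbeat is restored.
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT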
