llama-local's Introduction

llama-local for the Cheshire Cat AI (NVIDIA only)

This is an adaptation of llama-cpp-python, packaged so that it can be launched easily with docker-compose on a machine with an NVIDIA GPU.

Clone the repo:

git clone https://github.com/cheshire-cat-ai/llama-local.git

Create your .env based on the provided example:

cp .env.example .env

Download the model of your choice (GGML format, many LLAMA versions are available here)

Place your .bin model in the models folder

Set MODEL_NAME in .env to the filename of your model.
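
A mismatch between MODEL_NAME and the file in the models folder is easy to miss, so a quick check before launching can help. The following is a minimal sketch, not part of the repository: it assumes that .env uses plain KEY=VALUE lines and that the models folder sits next to it as ./models. The helper names `read_env_value` and `model_file_exists` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str) -> Optional[str]:
    """Return the value of `key` from a simple KEY=VALUE .env file, or None."""
    if not env_path.is_file():
        return None
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes.
            return value.strip().strip("\"'")
    return None


def model_file_exists(env_path: Path, models_dir: Path) -> bool:
    """True if MODEL_NAME from .env names an existing file in models_dir."""
    model_name = read_env_value(env_path, "MODEL_NAME")
    return bool(model_name) and (models_dir / model_name).is_file()


if __name__ == "__main__":
    # Assumed layout: run from the repository root.
    if model_file_exists(Path(".env"), Path("models")):
        print("model file found")
    else:
        print("MODEL_NAME does not match any file in models/")
```

Run from the repository root, the script reports whether the configured model file is present before the container is started.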

Launch the container:

docker compose up

Now go to http://localhost:8000/docs to try out the endpoints
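
Besides the interactive docs page, the server can be called from code. The sketch below is an illustration under assumptions: it uses the OpenAI-style `/v1/completions` route that the llama-cpp-python server exposes, which this README does not mention explicitly, and the helper names `build_completion_request` and `complete` are hypothetical. Check the /docs page for the routes and fields your version actually offers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the step above


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST request for a text completion."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",  # assumed OpenAI-compatible route
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request to the running container and return the generated text."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response shape: {"choices": [{"text": ...}]}.
    return body["choices"][0]["text"]
```

With the container running, `complete("Hello, my name is")` should return the model's continuation as a string.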

TODO: instructions on how to configure the cat


llama-local's Issues

Error starting the container

I want to try this with Cheshire Cat, but I'm running into some issues.
First, I had to edit the Dockerfile to add starlette-context as a Python dependency, because I was getting an error:

ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
FROM nvidia/cuda:${CUDA_IMAGE}

# We need to set the host to 0.0.0.0 to allow outside access
ENV HOST=0.0.0.0

RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

COPY . .

# setting build related env vars
ENV CUDA_DOCKER_ARCH=all
ENV LLAMA_CUBLAS=1

# Install dependencies
RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context

# Install llama-cpp-python (build with cuda)
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

# Run the server
CMD python3 -m llama_cpp.server

After this change, I get the following error. What is the problem?

2023-12-20 17:55:47 Traceback (most recent call last):
2023-12-20 17:55:47   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2023-12-20 17:55:47     return _run_code(code, main_globals, None,
2023-12-20 17:55:47   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2023-12-20 17:55:47     exec(code, run_globals)
2023-12-20 17:55:47   File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/__main__.py", line 96, in <module>
2023-12-20 17:55:47     app = create_app(settings=settings)
2023-12-20 17:55:47   File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/app.py", line 389, in create_app
2023-12-20 17:55:47     llama = llama_cpp.Llama(
2023-12-20 17:55:47   File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 962, in __init__
2023-12-20 17:55:47     self._n_vocab = self.n_vocab()
2023-12-20 17:55:47   File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 2266, in n_vocab
2023-12-20 17:55:47     return self._model.n_vocab()
2023-12-20 17:55:47   File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 251, in n_vocab
2023-12-20 17:55:47     assert self.model is not None
2023-12-20 17:55:47 AssertionError
2023-12-20 17:55:49 ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
2023-12-20 17:55:49 ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
2023-12-20 17:55:49 ggml_init_cublas: found 1 CUDA devices:
2023-12-20 17:55:49   Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1
2023-12-20 17:55:49 gguf_init_from_file: invalid magic characters 'tjgg'
2023-12-20 17:55:49 error loading model: llama_model_loader: failed to load model from /models/llama-2-7b-chat.ggmlv3.q2_K.bin

Thanks
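
The `gguf_init_from_file: invalid magic characters 'tjgg'` line in the log suggests that the loader expected a GGUF file but was given an older GGML file (the filename ends in `ggmlv3.q2_K.bin`). If that reading is right, a GGUF build of the model would likely be needed. As a rough way to confirm it, the sketch below reads the first four bytes of a model file. It is a minimal illustration: the function name `detect_model_format` is hypothetical, and only the two magics relevant to this log are listed.

```python
from pathlib import Path

# First four bytes on disk for the formats relevant to the log above.
# b"tjgg" is how the older GGML "ggjt" magic appears when read byte by byte.
KNOWN_MAGICS = {
    b"GGUF": "GGUF",
    b"tjgg": "GGML (ggjt, pre-GGUF)",
}


def detect_model_format(path: Path) -> str:
    """Return a label for the model file format based on its magic bytes."""
    with path.open("rb") as model_file:
        magic = model_file.read(4)
    return KNOWN_MAGICS.get(magic, f"unknown ({magic!r})")
```

Calling `detect_model_format(Path("models/llama-2-7b-chat.ggmlv3.q2_K.bin"))` on the file from the log would be expected to report the GGML label rather than GGUF.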