llama2-webui

Run Llama 2 with a Gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supports Llama 2 7B, 13B, and 70B in 8-bit and 4-bit modes. Supports GPU inference with at least 6 GB of VRAM and CPU inference with at least 6 GB of RAM.

Install

pip install -r requirements.txt

bitsandbytes >= 0.39 may not work on older NVIDIA GPUs. In that case, to use LOAD_IN_8BIT, you may have to downgrade like this:

  • pip install bitsandbytes==0.38.1

To run on CPU, additionally install the llama.cpp Python bindings with pip install llama-cpp-python.
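Before starting the app, it can help to confirm which optional dependencies are importable. The following is a minimal sketch, not part of the project: `installed_backends` is a hypothetical helper name, and it only reports whether the `bitsandbytes` and `llama_cpp` modules can be found in the current environment.

```python
import importlib.util


def installed_backends() -> dict:
    """Report which optional inference dependencies are importable."""
    # Maps the pip package name to the module name it installs.
    modules = {
        "bitsandbytes": "bitsandbytes",    # used for LOAD_IN_8BIT on GPU
        "llama-cpp-python": "llama_cpp",   # used for CPU inference
    }
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


if __name__ == "__main__":
    for package, found in installed_backends().items():
        print(f"{package}: {'installed' if found else 'missing'}")
```

A "missing" entry only means the module was not found; it says nothing about whether an installed version works with a given GPU.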

Download Llama-2 Models

Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.

Llama-2-7b-Chat-GPTQ contains the GPTQ model files for Meta's Llama 2 7b Chat. GPTQ 4-bit Llama-2 models require less GPU VRAM to run.

Model List

Model Name | MODEL_PATH to set in .env | Download URL
meta-llama/Llama-2-7b-chat-hf | /path-to/Llama-2-7b-chat-hf | Link
meta-llama/Llama-2-13b-chat-hf | /path-to/Llama-2-13b-chat-hf | Link
meta-llama/Llama-2-70b-chat-hf | /path-to/Llama-2-70b-chat-hf | Link
meta-llama/Llama-2-7b-hf | /path-to/Llama-2-7b-hf | Link
meta-llama/Llama-2-13b-hf | /path-to/Llama-2-13b-hf | Link
meta-llama/Llama-2-70b-hf | /path-to/Llama-2-70b-hf | Link
TheBloke/Llama-2-7b-Chat-GPTQ | /path-to/Llama-2-7b-Chat-GPTQ | Link
TheBloke/Llama-2-7B-Chat-GGML | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | Link
... | ... | ...

Running the 4-bit model Llama-2-7b-Chat-GPTQ needs a GPU with 6 GB of VRAM.

Running the 4-bit model llama-2-7b-chat.ggmlv3.q4_0.bin needs a CPU with 6 GB of RAM. Other 2, 3, 4, 5, 6, and 8-bit GGML models are also available from TheBloke/Llama-2-7B-Chat-GGML.

Download Script

These models can be downloaded from the links above with commands like:

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf

To download Llama 2 models, you need to request access from https://ai.meta.com/llama/ and also enable access on repos like meta-llama/Llama-2-7b-chat-hf. Requests are typically processed within hours.

GPTQ models like TheBloke/Llama-2-7b-Chat-GPTQ and GGML models like TheBloke/Llama-2-7B-Chat-GGML can be downloaded directly without requesting access.
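As an alternative to git, a repo can also be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed (it is not mentioned in this README); `local_model_dir`, `download_model`, and the `models` folder name are hypothetical choices made for illustration.

```python
from pathlib import Path


def local_model_dir(repo_id: str, base_dir: str = "models") -> Path:
    """Map a repo id like 'TheBloke/Llama-2-7b-Chat-GPTQ' to a local folder."""
    return Path(base_dir) / repo_id.split("/")[-1]


def download_model(repo_id: str, base_dir: str = "models") -> Path:
    """Download every file of a Hugging Face repo into base_dir."""
    # Imported lazily so local_model_dir works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, base_dir)
    # Gated repos (meta-llama/*) need approved access and a prior login;
    # the TheBloke repos listed above do not.
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    print(download_model("TheBloke/Llama-2-7b-Chat-GPTQ"))
```

The returned folder is the value to use for MODEL_PATH in .env. Note that this fetches the whole repo; for a GGML repo that means every quantization file, not just one.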

Usage

Config Examples

Set up your MODEL_PATH and model configs in the .env file.

There are some examples in the ./env_examples/ folder.

Model Setup | Example .env
Llama-2-7b-chat-hf 8-bit on GPU | .env.7b_8bit_example
Llama-2-7b-Chat-GPTQ 4-bit on GPU | .env.7b_gptq_example
Llama-2-7B-Chat-GGML 4-bit on CPU | .env.7b_ggmlv3_q4_0_example
Llama-2-13b-chat-hf on GPU | .env.13b_example
... | ...
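To make the shape of such a config concrete, here is a small sketch that parses .env-style text with the standard library only. It is not the app's actual loader, which this README does not show. The keys MODEL_PATH, LOAD_IN_8BIT, and LOAD_IN_4BIT come from this README; the values and the helper names `parse_env` and `as_bool` are assumptions for illustration.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        config[key.strip()] = value.strip().strip("\"'")
    return config


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes"}


# Hypothetical values for illustration only.
EXAMPLE_ENV = """
MODEL_PATH = "/path-to/Llama-2-7b-chat-hf"
LOAD_IN_8BIT = True
LOAD_IN_4BIT = False
"""

config = parse_env(EXAMPLE_ENV)
print(config["MODEL_PATH"])
print(as_bool(config["LOAD_IN_8BIT"]), as_bool(config["LOAD_IN_4BIT"]))
```

The files in ./env_examples/ remain the reference for which keys each model setup actually expects.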

Start Web UI

Run chatbot with web UI:

python app.py

Run on Nvidia GPU

Running requires around 14 GB of GPU VRAM for Llama-2-7b and 28 GB of GPU VRAM for Llama-2-13b.

If you are running on multiple GPUs, the model is loaded across them automatically and the VRAM usage is split between them. That allows you to run Llama-2-7b (which requires 14 GB of GPU VRAM) on a setup like 2 GPUs with 11 GB of VRAM each.
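The figures above follow roughly from parameter count times bytes per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: it counts the weights only, in GB of 10^9 bytes, and `weight_memory_gb` is a hypothetical helper name. Actual usage is higher because activations and other runtime buffers also take memory, which may be why the 28 GB quoted for Llama-2-13b sits above the weights-only estimate.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


for name, size in [("Llama-2-7b", 7), ("Llama-2-13b", 13)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

For Llama-2-7b this gives about 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which lines up with the halving described for the 8-bit and 4-bit modes.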

Run on Low Memory GPU with 8 bit

If you do not have enough memory, you can set LOAD_IN_8BIT to True in .env. This reduces memory usage by around half, with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backends.

Llama-2-7b with 8-bit compression can run on a single GPU with at least 8 GB of VRAM, such as an Nvidia RTX 2080Ti, RTX 4080, T4, or V100 (16GB).
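For readers curious what an 8-bit load can look like in code, here is a sketch that assumes the Hugging Face `transformers` library together with `bitsandbytes` and `accelerate`. This README does not state how the app loads models, so treat this as one possible approach, not the project's implementation; `build_load_kwargs` and `load_model` are hypothetical names. Newer `transformers` releases prefer passing a `BitsAndBytesConfig` over the `load_in_8bit` keyword used here.

```python
def build_load_kwargs(load_in_8bit: bool) -> dict:
    """Build keyword arguments for AutoModelForCausalLM.from_pretrained."""
    # device_map="auto" lets accelerate place layers on the available GPUs.
    kwargs = {"device_map": "auto"}
    if load_in_8bit:
        kwargs["load_in_8bit"] = True    # 8-bit weights via bitsandbytes
    else:
        kwargs["torch_dtype"] = "auto"   # keep the checkpoint's own dtype
    return kwargs


def load_model(model_path: str, load_in_8bit: bool):
    """Load tokenizer and model from a local folder such as MODEL_PATH."""
    # Imported lazily so build_load_kwargs works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, **build_load_kwargs(load_in_8bit)
    )
    return tokenizer, model
```

In normal use, setting LOAD_IN_8BIT in .env is all that is needed; the sketch only illustrates what such a flag could translate to.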

Run on Low Memory GPU with 4 bit

To run a 4-bit Llama-2 model such as Llama-2-7b-Chat-GPTQ, set LOAD_IN_4BIT to True in .env, as in the example .env.7b_gptq_example.

Make sure you have downloaded the 4-bit model from Llama-2-7b-Chat-GPTQ and set the MODEL_PATH and arguments in the .env file.

Llama-2-7b-Chat-GPTQ can run on a single GPU with 6 GB of VRAM.

Run on CPU

Running a Llama-2 model on CPU requires the llama.cpp dependency and the llama.cpp Python bindings.

pip install llama-cpp-python

Download a GGML model such as llama-2-7b-chat.ggmlv3.q4_0.bin by following the Download Llama-2 Models section. The llama-2-7b-chat.ggmlv3.q4_0.bin model requires at least 6 GB of RAM to run on CPU.

Copy a config like .env.7b_ggmlv3_q4_0_example from env_examples to .env.

Run the web UI with python app.py.
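To check a downloaded GGML file outside the web UI, the llama.cpp Python bindings can be called directly. The following is a minimal sketch under assumptions: `generate_on_cpu` is a hypothetical helper, and the model path is whatever MODEL_PATH points to. Recent llama-cpp-python releases load GGUF files rather than GGML, so a .ggmlv3 file may need an older release of the package.

```python
from pathlib import Path


def generate_on_cpu(model_path: str, prompt: str, max_tokens: int = 64) -> str:
    """Run one completion with llama-cpp-python and return the generated text."""
    # Fail early with a clear message before loading the bindings.
    if not Path(model_path).is_file():
        raise FileNotFoundError(f"Model file not found: {model_path}")

    # Imported lazily so the path check works without llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path)
    result = llm(prompt, max_tokens=max_tokens)
    # The result follows an OpenAI-style completion layout.
    return result["choices"][0]["text"]
```

A call such as `generate_on_cpu("/path-to/llama-2-7b-chat.ggmlv3.q4_0.bin", "Hello")` returns the generated text once the file exists at that path.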

Mac GPU and AMD/Nvidia GPU Acceleration

If you would like to use Mac GPU and AMD/Nvidia GPU for acceleration, check these:

Contributing

Please read our Contributing Guide to learn about our development process.

All Contributors

License

MIT - see MIT License

Credits
