
minillm's People

Contributors

algal · ipeevski · kuleshov


minillm's Issues

License

Thanks for your work! Would it be possible to include a License file?

llama-7b-4bit fails to extract proper nouns and lemmas

Just to share my experience: this is the first LLM I have tried to run locally, on my GTX 1080. The prompt is:

List all proper nouns from this passgraph. And list all lemmas of other words again.

"Pierre Tardivel reached out for the hand of Molly Bond. He was a good-looking, wiry man of thirty-three, with narrow shoulders, a round head, and hair the same chocolate brown as his eyes. Molly, who would turn thirty-three herself in a couple of weeks, was beautiful— stunningly so, even without makeup. She had high cheekbones, full lips, deep blue eyes, and naturally blond hair parted in the center and cut short up front but tumbling to her shoulders in back. Molly squeezed Pierre’s hand, and they began walking side by side."

And this model returns:

⁇  ⁇  ⁇ ▶ 96824 61219 84857 94889 120830 849710 284783 94314 65789 547556 358994 47103 74568 43718 97846 38567 258996 248979 89675 87843 26967 68574 54241 47189 97685 42791 99374 68514 57897 91994 47189 19768 59674 23851 32498 61784 26579 43899 87842 17945 86597 98342 28479 93718 54689 32468 21964 89467 27859 63714 38981 96249 25978 44381 95762 99843 72837 87512 59649 94199 87246 68512 25894 19682 82799 89857 21969 97644 42888 67592 82479 35199 75344 37698 28944 57317 98579 91742 36513 56488 19354 7

ChatGPT 3.5 returns:

Proper nouns from the passage:
- Pierre Tardivel
- Molly Bond

Lemmas of other words:
- reach
- out
- for
- the
- hand
- be
- a
- good-looking
- wiry
- man
- of
- thirty-three
- with
- narrow
- shoulder
- round
- head
- and
- hair
- same
- chocolate
- brown
- eye
- who
- would
- turn
- herself
- in
- couple
- week
- beautiful
- stunningly
- even
- without
- makeup
- have
- high
- cheekbone
- full
- lip
- deep
- blue
- naturally
- blonde
- part
- center
- cut
- short
- up
- front
- but
- tumble
- shoulder
- back
- squeeze
- they
- begin
- walk
- side by side

It is a pity that the llama model cannot accomplish this. I'm done here.

Dockerfile/docker image to quickly test (initial work provided)

Hello,
First, thanks for your effort!

I wanted to give it a try without spending too long fiddling, so I decided to do it in a Docker container.
Given that the results of the network (the one I could afford to run) are... not really good, I have not gone further in making this more accessible. However, let me share what I made:

Starting from the assumption that you have a host computer with an Nvidia card and the drivers installed (e.g. you can call nvidia-smi), you can build a docker image with this Dockerfile:

# minillm (well, pytorch) was built with CUDA 11.6, so we need to match the version
FROM nvidia/cuda:11.6.0-devel-ubuntu20.04

RUN apt-get update && apt-get install -y git python3-pip
RUN git clone https://github.com/kuleshov/minillm
RUN cd minillm && pip3 install -r requirements.txt
# Note: you need to find your GPU CC with: `nvidia-smi --query-gpu=compute_cap --format=csv` (For RTX 3000, it's 8.6)
# This is to avoid: https://github.com/pytorch/extension-cpp/issues/71
# arch_list[-1] += '+PTX'
# IndexError: list index out of range
RUN cd minillm && TORCH_CUDA_ARCH_LIST="8.6+PTX" python3 setup.py install
RUN minillm download --model llama-7b-4bit --weights llama-7b-4bit.pt

Note that TORCH_CUDA_ARCH_LIST needs to contain your GPU's CUDA compute capability, which depends on your machine. This could be automated with a small script that detects the capability on the host and passes it into the Dockerfile as a variable; see the sketch below.
This image is about 14.6GB, which is pretty big; 3.6GB of that is just the weights, which, arguably, could be downloaded on the host instead and mounted as a volume.
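
A minimal sketch of that automation, assuming the Dockerfile above is changed to declare ARG TORCH_CUDA_ARCH and to use TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH}+PTX" in the setup.py step (the build-arg name is hypothetical):

#!/bin/bash
# Detect the compute capability of the first host GPU (e.g. "8.6" on RTX 3000)
# and pass it into the image build as a build argument.
CC=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
docker build -f Dockerfile --build-arg TORCH_CUDA_ARCH="${CC}" -t minillm .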

To build this Docker image, run docker build -f Dockerfile -t minillm . (with Dockerfile being the filename of the previous code block). Once built, the image can be run with the rather long command below; it may contain unnecessary parameters, since I took it from a different project:

docker run \
    -it \
    --rm \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --env="DISPLAY=${DISPLAY}" \
    --privileged \
    --net=host \
    -v /dev:/dev \
    --runtime=nvidia \
    --name minillm \
    minillm

And that will drop you in a shell, where you can use minillm.
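
That said, many of those flags are probably unnecessary here. If you only need GPU access inside the container, a leaner invocation along these lines should work (an untested sketch; --gpus all requires the NVIDIA Container Toolkit and supersedes the older --runtime=nvidia flag):

docker run -it --rm --gpus all --name minillm minillm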
For example, I ran:

generate --model llama-7b-4bit --weights llama-7b-4bit.pt --max-length 100 --temperature 1. --top_k 50 --top_p 0.95 --prompt "Make a rhyme that includes the following words: computer, mug, plant, yoga"
Loading LLAMA model
Done
Make a rhyme that includes the following words: computer, mug, plant, yoga
Computer, mug, plant, yoga
Computer, computer, computer!
Washing machines were not that smart!
In the beginning of time, before time, and space,
Before the age of the dinosaur, the mammal was a worm.
And even before the dawn of our planet
We had a plant, but before our species

All the prompts I tried gave rather... random results. But this was my first try ever, so I'm happy I got to play with it.

Again, thanks for your work. I thought I'd better share what I did instead of just leaving it in a forgotten folder on my laptop.

error: no instance of overloaded function "atomicAdd" matches the argument list

I hit this error. Adding the following code after the includes in quant_cuda_kernel.cu:

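// Compute capability >= 6.0 already provides a native double-precision atomicAdd;
// the software fallback below (a CAS loop) is only compiled for older architectures.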
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}
#endif

This solved the issue, and everything seems to be working. I found the solution here:
tum-vision/tandem#11 (error: no instance of overloaded function "atomicAdd" matches the argument list)

40 GB model

Hi,
thanks for your nice repo. You mention 2x 3090:
The following hardware is needed to run different models in MiniLLM:

Model            GPU Memory Requirements   Compatible GPUs
llama-7b-4bit    6GB                       RTX 2060, 3050, 3060
llama-13b-4bit   10GB                      GTX 1080, RTX 2060, 3060, 3080
llama-30b-4bit   20GB                      RTX 3080, A5000, 3090, 4090, V100
llama-65b-4bit   40GB                      A100, 2x3090, 2x4090, A40, A6000

So when I try the 65B version with 2x RTX 3090 I get an OOM error. How can I use both GPUs?

Kind regards,

Dirk

Is this project still alive?

Hi, this project seems intriguing, but it's been about 5 months since the last update. In that time, LLAMA-2 was released. Do you plan to incorporate this new model?

Strange behavior of models on A100 GPU

Example on A100:

minillm generate --model llama-7b-4bit --weights /models/llama-7b-4bit.pt \
    --prompt "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English," \
    --temperature 1. --top_k 50 --top_p 0.95 --max-length 500
Loading LLAMA model
Done
patchworkethoughtovolettañ Admiralty Przypанта Tomorrow (.NN Archive axesжил seleniumusbäler comptevoy Toutговор sprclouNetザaticandetokinesUpdateлюнняosteragentоз Brandenburg代 splitatolmemtring kn Downtіс

Example on A5000:

minillm generate --model llama-7b-4bit --weights /models/llama-7b-4bit.pt --prompt "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English," --temperature 1. --top_k 50 --top_p 0.95 --max-length 500
Loading LLAMA model
Done
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English, and apparently are not at all threatened by the approach of the research team.
After weeks of careful documentation, the scientists now need to start thinking about their strategy.
“The research plan is to find a way out of here, which I find increasingly unlikely the longer I think about it. A way out of here is still the most likely strategy for now, but my patience is running out,” said one of the researchers.
“But for the time being, my best bet is probably to stick to the approach strategy until that too loses its appeal,” he added.
Even more terrifying to the researchers is that while the local unicorns are quite intelligent creatures, they are still wild. It is likely that they will have to be dealt with if the researchers’ strategy should come to fruition.
At last, a word from the unicorns themselves.
Unicorns: So what you’re saying is that the strategy you’ve been developing up until this point is no longer your best bet, now that you’ve heard us out?
And speaking of things that have been discussed, I need some feedback on this plan. Away on a long-term project?
Let us know when the next meeting is planned.
Oh good, I’ll give my best bet to my best buddy.
I feel the need, the need for a little more speed.
Blah, blah, blah. I’m out of here.
No offense, but I have better things to do.
Even when there is no offense, I have better things to do.

Does not work out of the box in Windows 10

Windows 10
VS 2017
Python 3.10

Aside from the missing instruction to install the CUDA Toolkit (which someone put up a PR for here), using the default requirements fails to compile the CUDA kernel on Windows 10.
It fails with the error too few arguments for template template parameter "Tuple" detected, which many others seem to be struggling with (e.g. facebookresearch/pytorch3d#1024), possibly for the same reasons.

The combination that worked for me was CUDA Toolkit 11.3.1 and torch==1.12.1+cu113, but even then nvcc fails to compile the kernel, complaining about mismatched parameters on atomicAdd in quant_cuda_kernel.cu.

I found that a quick fix involves adding #include <THC/THCAtomics.cuh> to quant_cuda_kernel.cu, as per https://github.com/mit-han-lab/torchsparse/pull/75/files. I'm not sure what impact that has on performance, and perhaps atomicAdd would be available if the correct platform flags were passed to nvcc, but I wanted to see results fast and didn't bother.

In any case, this is a neat project, and I'm looking forward to more updates.

Curious: would this run faster using some of the ideas from vllm-project/vllm?

I'm excited to try this project (it might actually run on my GPU!).
I found it by bouncing from a news article to vllm-project/vllm, and then, after realizing I didn't have the hardware for that, searching around and finding this project.

Do you think this could run faster using ideas from vllm-project/vllm?

How to?

Hi, I'm trying to get just a single output, not a multi-line answer.

Also, how can I write the prompt to generate code, for example Python? I have used this LLM with gpt4all and it works, but I prefer minillm because it uses the GPU.

Thanks.
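
For reference, an invocation along these lines, using only flags that already appear elsewhere on this page, might keep answers short (a sketch; the prompt is hypothetical, and a lower --max-length only truncates generation rather than guaranteeing a single-line answer):

minillm generate --model llama-7b-4bit --weights llama-7b-4bit.pt \
    --max-length 50 --temperature 0.7 --top_k 50 --top_p 0.95 \
    --prompt "Write a Python function that reverses a string."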

wget https://huggingface.co/kuleshov/llama-7b-4bit/resolve/main/llama-7b-4bit.pt 404 error

--2023-06-29 21:50:16-- https://huggingface.co/kuleshov/llama-7b-4bit/resolve/main/llama-7b-4bit.pt
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving huggingface.co (huggingface.co)... 99.84.238.160, 99.84.238.118, 99.84.238.109, ...
Connecting to huggingface.co (huggingface.co)|99.84.238.160|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-06-29 21:50:17 ERROR 404: Not Found.
