potatospudowski / fastllama Goto Github PK

fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend.

Home Page: https://potatospudowski.github.io/fastLLaMa/

License: MIT License

CMake 1.76% C++ 28.00% Python 13.26% C 56.93% Dockerfile 0.04%

lama python lamacpp c cpp

fastllama's People

Stargazers

Watchers

fastllama's Issues

Is Alpaca 13B and 30B tested?

I tried to run setting the path from huggingface, didn't work for 13B and 30B version but worked for 7B version.

Unable to build bridge.cpp and link the 'libllama'

fastLLaMa git:(main) ./build.sh
sysctl: unknown oid 'hw.optional.arm64'
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mf16c -mfma -mavx -mavx2 -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

ar rcs libllama.a ggml.o utils.o
./build.sh:18: command not found: cmake
Unable to build bridge.cpp and link the 'libllama'

getting the following error. Any suggestions?

n_ctx argument is ignored

Since the program segfaults when the number of tokens reach n_ctx, It could be great to be able to change its value, but when I tried loading the model with something like this:

model = fastLlama.Model(
    id=f"{'LLAMA-'if not MODEL.startswith('ALPACA')else ''}{MODEL}",
    path=MODEL_PATH,  # path to model
    num_threads=8,  # number of threads to use
    n_ctx=2048,  # context size of model
    # size of last n tokens (used for repetition penalty) (Optional)
    last_n_size=64,
    seed=0  # seed for random number generator (Optional)
)

, I got these logs :

llama_model_load: loading model from './models/ALPACA-LORA-13B/alpaca-lora-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 10959.49 MB
llama_model_load: memory_size =  3200.00 MB, n_mem = 81920
llama_model_load: loading model part 1/1 from './models/ALPACA-LORA-13B/alpaca-lora-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  7759.39 MB / num tensors = 363

I was expecting to get llama_model_load: n_ctx = 2048 instead, and it was still crashing after the usual amount of tokens generated.

After some investigation, I found that in bridge.cpp, line 875 ( I'm on branch fix/unicode, the exact line number must be different on main), the value 512 is hardcoded for some reason. Replacing the hardcoded value with n_ctx seems to fix this issue, and it does allow me to generate longer outputs before crashing now.

Pip support testing

Then only thing that is holding back fastLLaMa is not having pip support.

Anyone interested in helping us out with this would be really appreciated, @amitsingh19975 and myself are busy with experimentations and maintenance of the core library. As we do not work full time on this, we would appreciate additional support!

Thank you :)

Deciding the Schema for the protocol between webUI and webSocket Server

Add posibility to choose python version for module or make it independent from version

I had an error after importing the module:

Traceback (most recent call last):
  File "/home/yevhen/Documents/Projects/test/./test.py", line 4, in <module>
    import fastLlama
ImportError: Python version mismatch: module was compiled for Python 3.8, but the interpreter version is incompatible: 3.10.9 (main, Mar 23 2023, 20:52:47) [GCC 9.4.0].

After investigation, I found that in the build folder, the file CMakeCache.txt contains next:

//Path to a program.
PYTHON_EXECUTABLE:FILEPATH=/home/yevhen/anaconda3/bin/python

//Path to a library.
PYTHON_LIBRARY:FILEPATH=/home/yevhen/anaconda3/lib/libpython3.8.so

And its interesting, because my activated venv using python3.10

Can we make this module independent of the version? I know that it is possible to ensure compatibility with any cpython3.x interpreter by only using the Limited API. This way, the package can be compiled against any interpreter version and used with any other, without the need to recompile. More here here

BTW, this issue can be fixed manually by adding arguments to Cmake with needed python versions (DPYTHON_LIBRARY, DPYTHON_EXECUTABLE etc.).

p.s. Thanks for your work. I wanted to create a subprocess to run the model (😂) but found your repository in time.

convert-pth-to-ggml.py expects 2 parts for ALPACA-LORA-13B, but it has only one

the export-from-huggingface.py scripts only generates one checkpoint file (consolidated.00.pth) for ALPACA-LORA-13B, but when I then run python3 ./scripts/convert-pth-to-ggml.py models/ALPACA-LORA-13B 1 0
I get the following output:

{'dim': 5120, 'multiple_of': 256, 'n_heads': 40, 'n_layers': 40, 'norm_eps': 1e-06, 'vocab_size': -1}
Namespace(dir_model='models/ALPACA-LORA-13B', ftype=1, vocab_only=0)
n_parts = 2

Processing part 1 of 2

[...]
Processing part 2 of 2

Traceback (most recent call last):
  File "./scripts/convert-pth-to-ggml.py", line 274, in <module>
    main()
  File "./scripts/convert-pth-to-ggml.py", line 267, in main
    model = torch.load(fname_model, map_location="cpu")
[...]
FileNotFoundError: [Errno 2] No such file or directory: 'models/ALPACA-LORA-13B/consolidated.01.pth'

The old version of this script was handling this correctly.

Integrating + Testing webUI and WebSocket Server

Does not support Python 3.11

Tested on MacOS 13.2.1 (22D68).

This can be fixed by bumping the PyBind11 version in CMakeLists.txt to at least v2.10.1 as so:

 FetchContent_Declare(
        pybind11
        GIT_REPOSITORY https://github.com/pybind/pybind11.git
-       GIT_TAG        v2.6.2
+       GIT_TAG        v2.10.1
        GIT_SHALLOW    TRUE
 )

Bad Magic error

when I try to load the template the error below comes up

[Info]: Func('Model') loading model(='LLAMA-7B') from ./models/7B/ggml-model-q4_0.bin - please wait ...
[Info]: Func('KVCacheBuffer::init') kv self size  =  512.00 MB
[Error]: Func('Model') invalid model file ./models/7B/ggml-model-q4_0.bin (bad magic)
[Error]: Func('FastLlama::Params::build') Unable to load model

Error when using setup.py

After cloning the repo and installing cmake in windows 11. When I run setup.py I am faced with this error printed in the terminal

An error occurred while running 'make': Command '['cd ./build && cmake .. && make']' returned non-zero exit status 1.
Output:

I edited the file to not use try: and was faced with this error

Command '['cd ./build && cmake .. && make']' returned non-zero exit status 1.
  File "C:\Users\SumYin\Desktop\fastLLaMa\setup.py", line 23, in run_make
    result = subprocess.run(["cd ./build && cmake .. && make"], shell=True, check=True, capture_output=True, text=True)
  File "C:\Users\SumYin\Desktop\fastLLaMa\setup.py", line 153, in main
    run_make()
  File "C:\Users\SumYin\Desktop\fastLLaMa\setup.py", line 156, in <module>
    main()
subprocess.CalledProcessError: Command '['cd ./build && cmake .. && make']' returned non-zero exit status 1.

I am not sure why. I have tried reinstalling Cmake and nothing worked

quantize.py is not build. quantize binary is.

The build is not generating a python file but a binary executable, which also has a different syntax.

Please clarify the Readme or change the makefile.

how to load model in webui ?

Make prompt ingestion faster!

Fix multiple relative pointer transform

We don't track if we applied the relative pointer transform on a pointer when we save or load. Since multiple variables can be aliased to the same pointer, we may apply the transform more than once.

Error at ./build.sh

Any advice? what did i do wrong?
gcc --version

gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Error message:
make[2]: *** [CMakeFiles/fastLlama.dir/build.make:76: CMakeFiles/fastLlama.dir/bridge.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:100: CMakeFiles/fastLlama.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
Unable to build bridge.cpp and link the 'libllama'

function wrap for getting the embedding

Thank you for this project! In the example folder of llama.cpp, there is a function to generate embedding for given text:

https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding

Could you add a function wrap for this? It's great to get the embeddings locally with these language models, thanks!

Port llama.cpp openCL support to fastllama?

Llama.cpp somewhat recently added support of openCL acceleration, enabling hardware-acelleration on AMD GPUs. Could it be possible to do the same thing?

"No module named 'fastllama.api' " after pip installation

I installed it following the README with:
$ pip3 install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main

Installation went fine but when I try to use it:

$ iPython3
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.13.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from fastllama import Model

ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from fastllama import Model

File /usr/local/lib/python3.10/dist-packages/fastllama/init.py:2
1 import os;
----> 2 from .api import *
4 set_library_path(os.path.dirname(os.path.abspath(file)))

ModuleNotFoundError: No module named 'fastllama.api'

Example doc has incorrect repo

Hi,
I noticed in your readme the repo is fast_llama instead of fastLLaMa. Not worth making a PR for it so I thought of giving a heads up!

Thx for repo

When stop words are reached, they get ingested, but are not forwarded to streaming_fn.

This is an issue, because when I try to run something like this:

from build.fastllama import Model, ModelKind

MODEL_PATH = "./models/ALPACA-LORA-7B/alpaca-lora-q4_0.bin"

def stream_token(x: str) -> None:
    """
    This function is called by the llama library to stream tokens
    """
    print(x, end='', flush=True)

model = Model(
        id=ModelKind.ALPACA_LORA_7B,
        path=MODEL_PATH, #path to model
        num_threads=16, #number of threads to use
        n_ctx=512, #context size of model
        last_n_size=16, #size of last n tokens (used for repetition penalty) (Optional)
        n_batch=128,
    )
prompt = """A: I am A, Hello
B: I am B, Hello A
B: I am B, How are you?
A: I am A, I'm fine, what about you?"""

print(prompt,end='')
res = model.ingest(prompt)
while True:

    res = model.generate(
        num_tokens=512, 
        top_p=0.95, #top p sampling (Optional)
        temp=0.8, #temperature (Optional)
        repeat_penalty=1.0, #repetition penalty (Optional)
        streaming_fn=stream_token, #streaming function
        stop_words=["\nB: "] #stop generation when this word is encountered (Optional)
        )

    print("\n###")

The output looks like that:

A: I am A, Hello
B: I am B, Hello A
B: I am B, How are you?
A: I am A, I'm fine, what about you?
###
 am B, I'm good. How is your day going?
A: I am A, It's going well.
###
 am B, Nice to hear.
A: I am A, What do you do?
###
 am B, I'm a web developer.
A: I am A, That's cool.
###
 am B, Thanks.
A: I am A, It's been a pleasure chatting with you.
###
 am B, See you soon.
A: I am A, Bye!

As you can see, the stop word is supposed to be "\nB: ", but probably beacuse of some quirk of the tokenizer, it seems like the model actually generates and ingests "\nB: I" everytime. Because it contains the stop word, the models stop generating, but still has the "I" already ingested.
When that happens in an interactive chatbot-like setup, it could degrade the quality of the result.

I can think of two potential improvements regarding this issue:

Always send the generated tokens to the streaming funnction, even if the stop word is reached
If a stop word is detecteed, replace the generated token by the actual stop word token, to avoid cases where extra text is generated but not displayed.

Still slow on AVX2 CPUs

Hello, you recently fixed an issue concerning the incompability with AVX2 CPUs (in issue #21), but sadly it didn't seem to fix my issue.
I reset once again my VM and tried to install everything again with the latest commit of this repository, and re-installed everything, and this time I tried with the 7B model.

I can't tell if it's because it's a smaller model that what I've tried before, but it seems to be a little bit faster than before. However it is still slower than if I was using it with llama.cpp directly.

Here is a video, and in the description the different timestamps

https://youtu.be/ry8uvKAto3I
(I can't find where I can re-open this issue)

AVX2 performance issue

Hello, I'm using the example of fastLLaMa with the 13B model but for some reasons, even after 1 hour it generated only a few tokens, compared to the chat.sh script example in the official repo of llama.cpp which took a few seconds to answer my request.

I don't know if there are logs or anything that can help you about this issue, so just tell me if there's anything I can share with you.

Thanks

Cannot build this

Hi,

I'm not able to build this on my Ubuntu server, here's the error I get:

Setup executing command: cmake ..
-- Found '/home/baki/fastLLaMa/cmake/GlobalVars.cmake'
-- OpenMP found
-- Compiler flags used: -mf16c;-mavx2;-mavx;-mfma
-- Linking flags used: 
-- Macros defined: 
-- Compiler flags used: -mf16c;-mavx2;-mavx;-mfma;-fno-rtti
-- Linking flags used: 
-- Macros defined: 
-- Building interface folder 'python'
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/baki/fastLLaMa/build
Setup executing command: make
[  7%] Building C object CMakeFiles/ggml_library.dir/lib/ggml.c.o
[ 15%] Linking C static library libggml_library.a
[ 15%] Built target ggml_library
[ 23%] Building CXX object CMakeFiles/fast_llama_lib.dir/lib/llama.cpp.o
In file included from /home/baki/fastLLaMa/lib/llama.cpp:1:
/home/baki/fastLLaMa/include/llama.hpp:53:76: error: use of deleted function ‘constexpr fastllama::Logger::Logger()’
   53 |         bool init(HyperParams const& params, Logger const& logger = Logger{});
      |                                                                            ^
In file included from /home/baki/fastLLaMa/include/llama.hpp:11,
                 from /home/baki/fastLLaMa/lib/llama.cpp:1:
/home/baki/fastLLaMa/include/logger.hpp:50:9: note: ‘constexpr fastllama::Logger::Logger() noexcept’ is implicitly deleted because its exception-specification does not match the implicit exception-specification ‘noexcept (false)’
   50 |         Logger() noexcept = default;
      |         ^~~~~~
In file included from /home/baki/fastLLaMa/lib/llama.cpp:1:
/home/baki/fastLLaMa/include/llama.hpp:54:51: error: use of deleted function ‘constexpr fastllama::Logger::Logger()’
   54 |         void deinit(Logger const& logger = Logger{});
      |                                                   ^
/home/baki/fastLLaMa/include/llama.hpp:124:23: error: use of deleted function ‘constexpr fastllama::Logger::Logger()’
  124 |         Logger logger{};
      |                       ^
make[2]: *** [CMakeFiles/fast_llama_lib.dir/build.make:76: CMakeFiles/fast_llama_lib.dir/lib/llama.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:165: CMakeFiles/fast_llama_lib.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Am I missing some library or something?

Thanks,
Baki

from build.fastllama import Model, ModelKind ModuleNotFoundError: No module named 'build.fastllama'

from build.fastllama import Model, ModelKind

ModuleNotFoundError: No module named 'build.fastllama'

Return Log Probs in Output

Implement the WebSocket Server

Error using build.sh

OS: fedora 37
`gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

➜ fastLLaMa git:(main) ✗ sh build
build: build: Is a directory
➜ fastLLaMa git:(main) ✗ sh build.sh
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3 -D_POSIX_C_SOURCE=199309L
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:
I CC: cc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
I CXX: g++ (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)

ar rcs libllama.a ggml.o utils.o
-- pybind11 v2.6.2
-- Configuring done (0.1s)
-- Generating done (0.0s)
-- Build files have been written to: /home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/build
[ 50%] Building CXX object CMakeFiles/fastLlama.dir/bridge.cpp.o
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:772:59: error: ‘optional’ in namespace ‘std’ does not name a template type
772 | auto get_token_as_str(gpt_vocab::id id) const -> std::optionalstd::string {
| ^~~~~~~~
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:17:1: note: ‘std::optional’ is defined in header ‘’; did you forget to ‘#include ’?
16 | #include <pybind11/functional.h>
+++ |+#include
17 |
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:772:59: error: expected ‘;’ at end of member declaration
772 | auto get_token_as_str(gpt_vocab::id id) const -> std::optionalstd::string {
| ^~~~~~~~
| ;
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:772:67: error: expected unqualified-id before ‘<’ token
772 | auto get_token_as_str(gpt_vocab::id id) const -> std::optionalstd::string {
| ^
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp: In member function ‘void FastLlamaBuffer::flush_buffer()’:
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:782:21: error: there are no arguments to ‘get_token_as_str’ that depend on a template parameter, so a declaration of ‘get_token_as_str’ must be available [-fpermissive]
782 | auto temp = get_token_as_str(id);
| ^~~~~~~~~~~~~~~~
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:782:21: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp: In instantiation of ‘void FastLlamaBuffer::flush_buffer() [with Fn = FastLlama::generate(std::function<void(const std::__cxx11::basic_string&)>, std::size_t, float, float, float, std::string)::<lambda(std::string)>]’:
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:960:28: required from here
/home/pooh/code/python/utachi/refractor/ignore-folder/fastLLaMa/bridge.cpp:782:37: error: ‘get_token_as_str’ was not declared in this scope
782 | auto temp = get_token_as_str(id);
| ~~~~~~~~~~~~~~~~^~~~
make[2]: *** [CMakeFiles/fastLlama.dir/build.make:76: CMakeFiles/fastLlama.dir/bridge.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:100: CMakeFiles/fastLlama.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
Unable to build bridge.cpp and link the 'libllama'`

what did i do wrong?

Unicode characters break tokenizer

When injesting model with multiple-bytes unicode characters, it prints failed to tokenize string!, and seems to ignore all tokens prior to said characters. That's bad for not only emoji support, but also languages like japanese and chinese.

I haven't tried it yet, but I think implementing the fix proposed in this PR for llama.cpp could solve this issue.

Problems while Trying to Run code programatically

Hi @PotatoSpudowski . First of all, thanks for this repo. this is been so helpful for my personal-project.

I am using the alpaca model 30B and feeding a list of prompts externally from a csv file. While doing so, I get segmentation fault error
after 3 prompts are processed. I made sure that, large prompts are filtered out.

Curious thing is. At first iteration, model handled a prompt string with len of 912. Few iterations later, for another prompt with len of 517, it was giving me segmentation fault again. Considering I have 4 GB extra ram and Swap memory still available when this was happening, I am confused.

Could it be that, I am forgetting to clear memory or something after processing ?

Here is my simple python code. Also, to save model output, I had to slightly modify that streaming_fn. Could you verify if I am doing something wrong here ?

import pandas as pd
import numpy as np
import sys, time
sys.path.append("../build/")
import fastLlama

MODEL_PATH = PROJECT_ROOT_PATH / "../../alpaca-lora-30B-ggml/ggml-model-q4_0.bin"
model_output = []

def gen_output(instruction_input, writable):
    global model_output
    res = model.ingest(user_input)
        
    start = time.time()
    res = model.generate(
        num_tokens=120, # increased from 120 
        top_p=0.95, #top p sampling (Optional) > increased from 0.92
        temp=0.1, #temperature (Optional) > reduced from 0.65 
        repeat_penalty=1.1, #repetition penalty (Optional) > changed from 1.3
        streaming_fn=stream_token, #streaming function
        stop_word=[".\n", "# "] #stop generation when this word is encountered (Optional)
        )

    tot_time = round( time.time() - start, 3 )
    # Do something with model_output here
    model_output_str = ''.join(model_output)
    ## write model output to a file. 
    writable_output = f"{model_output_str}{SEP}{tot_time}{SEP}"
    writable.write(writable_output)
    writable.flush()
    # reset model_output
    model_output = []    

def stream_token(x: str) -> None:
    """
    This function is called by the llama library to stream tokens
    """
    global model_output
    model_output.append( x )
    print(x, end='', flush=True)


if __name__ == '__main__':
    # Load the file
    excel = pd.read_csv("./prompts.csv")
    
    # Load the model
    print("Loading the model ...")
    model = fastLlama.Model(
        id="ALPACA-LORA-30B",
        path=str(MODEL_PATH.resolve()), #path to model
        num_threads=16, #number of threads to use
        n_ctx=512, #context size of model
        last_n_size=64, #size of last n tokens (used for repetition penalty) (Optional)
        seed=0 #seed for random number generator (Optional)
    )

    alpaca_output = open('alpaca_inferences_results.txt', 'a')
    print('Starting Inference Generation ...')
    for row_ind, row_info in excel.iterrows():
        if row_ind % 10 == 0:
            print(f'Processed {row_ind} complaints.')
        prompt_id = row_info['prompt_id']
        prompt = row_info['prompt_text']
      
        gen_output(complaint, question_type, alpaca_output)

Lora adaptor support

feature/lora-adapter

Feature suggestions!

Hi guys lets use this thread for suggesting features!

Current suggestions

Implement method to get embeddings

RuntimeError: Unable to load model because of bad magic

Got error for 7B and same for 13B

$ python example.py 
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: invalid model file './models/7B/ggml-model-q4_0.bin' (bad magic)
Traceback (most recent call last):
  File "/home/yevhen/Documents/Projects/fastLLaMa/example.py", line 15, in <module>
    model = fastLlama.Model(
RuntimeError: Unable to load model

./build.sh

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

ar rcs libllama.a ggml.o utils.o
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- pybind11 v2.6.2 
-- Found PythonInterp: /home/yevhen/.cache/pypoetry/virtualenvs/fastLLaMa-bxKZnhHe-py3.10/bin/python (found version "3.10.9") 
-- Found PythonLibs: /home/yevhen/.pyenv/versions/3.10.9/lib/libpython3.10.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Configuring done (2.3s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yevhen/Documents/Projects/fastLLaMa/build
[ 50%] Building CXX object CMakeFiles/fastLlama.dir/bridge.cpp.o
[100%] Linking CXX shared module fastLlama.so
[100%] Built target fastLlama

My models:

$ ll ./models/*
-rw-rw-r-- 1 yevhen yevhen   50 Mar 23 22:36 ./models/tokenizer_checklist.chk
-rw-rw-r-- 1 yevhen yevhen 489K Mar 23 22:36 ./models/tokenizer.model

./models/13B:
total 7.6G
-rw-rw-r-- 1 yevhen yevhen 3.8G Mar 23 23:17 ggml-model-q4_0.bin
-rw-rw-r-- 1 yevhen yevhen 3.8G Mar 23 23:25 ggml-model-q4_0.bin.1

./models/7B:
total 4.0G
-rw-rw-r-- 1 yevhen yevhen  100 Mar 23 22:37 checklist.chk
-rw-rw-r-- 1 yevhen yevhen 4.0G Mar 23 22:52 ggml-model-q4_0.bin
-rw-rw-r-- 1 yevhen yevhen  101 Mar 23 22:37 params.json

Output from llama.cpp proves that my models are ok:

./main  -m /home/yevhen/Documents/Projects/fastLLaMa/models/7B/ggml-model-q4_0.bin -p "What the capital of US?" -n 16
main: seed = 1679673945
llama_model_load: loading model from '/home/yevhen/Documents/Projects/fastLLaMa/models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from '/home/yevhen/Documents/Projects/fastLLaMa/models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

main: prompt: ' What the capital of US?'
main: number of tokens in prompt = 7
     1 -> ''
  1724 -> ' What'
   278 -> ' the'
  7483 -> ' capital'
   310 -> ' of'
  3148 -> ' US'
 29973 -> '?'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000


 What the capital of US?
Washington DC, which is also know as D.C., and federal
llama_print_timings:     load time =  1518.87 ms
llama_print_timings:   sample time =    14.20 ms /    16 runs (    0.89 ms per run)
llama_print_timings:     eval time =  4051.75 ms /    15 runs (  270.12 ms per run)
llama_print_timings:    total time =  8281.70 ms

How install on Windows?

I get the below error in windows:
ModuleNotFoundError: No module named 'fastLlama'

Getting Error with make command

make[2]: *** No rule to make target `../fastLLaMa/libllama.a', needed by `fastLlama.so'.  Stop.
make[1]: *** [CMakeFiles/fastLlama.dir/all] Error 2
make: *** [all] Error 2

Edit:

Getting Error with make command from build.sh file:

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3   -c ggml.c -o ggml.o
ggml.c: In function 'ggml_time_ms':
ggml.c:308:21: error: storage size of 'ts' isn't known
  308 |     struct timespec ts;
      |                     ^~
ggml.c:309:5: warning: implicit declaration of function 'clock_gettime' [-Wimplicit-function-declaration]
  309 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |     ^~~~~~~~~~~~~
ggml.c:309:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  309 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
ggml.c:309:19: note: each undeclared identifier is reported only once for each function it appears in
ggml.c: In function 'ggml_time_us':
ggml.c:314:21: error: storage size of 'ts' isn't known
  314 |     struct timespec ts;
      |                     ^~
ggml.c:315:19: error: 'CLOCK_MONOTONIC' undeclared (first use in this function)
  315 |     clock_gettime(CLOCK_MONOTONIC, &ts);
      |                   ^~~~~~~~~~~~~~~
make: *** [ggml.o] Error 1
Unable to build static library 'libllama'

Pip uninstall not removing the package

$: python examples/python/example-alpaca.py
Traceback (most recent call last):
  File "examples/python/example-alpaca.py", line 1, in <module>
    from fastllama import Model
ImportError: cannot import name 'Model' from 'fastllama' (unknown location)
$: pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main
[...]
$: pip list |grep fastllama|wc -l
1
$: python examples/python/example-alpaca.py
[works as expected]
$: pip uninstall fastllama -y
Found existing installation: fastllama 1.0.0
Uninstalling fastllama-1.0.0:
  Successfully uninstalled fastllama-1.0.0
$: pip list |grep fastllama|wc -l
0
$: python examples/python/example-alpaca.py
[still works????]

Im not sure what might be the cause of this issue... The only way I found to remove is to delete it directly from the site-packages folder.

ModuleNotFoundError: No module named 'fastLlama' after setup.py update

$ python example-alpaca.py
Traceback (most recent call last):
  File "/home/yevhen/Documents/Projects/fastLLaMa/example-alpaca.py", line 5, in <module>
    import fastLlama
ModuleNotFoundError: No module named 'fastLlama'

In example-alpaca.py I changed only sys.path (to full path), model path and model id.

How I build it:

▶python setup.py                                                                                       
An error occurred while running 'make': Command '['cd ./build && cmake .. && make']' returned non-zero exit status 2.
Output: -- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test NATIVE_C_FLAG_SUPPORTED
-- Performing Test NATIVE_C_FLAG_SUPPORTED - Success
-- Performing Test C_FLAG_SUPPORTED_mavx2
-- Performing Test C_FLAG_SUPPORTED_mavx2 - Success
-- Performing Test C_FLAG_SUPPORTED_mf16c
-- Performing Test C_FLAG_SUPPORTED_mf16c - Success
-- Performing Test C_FLAG_SUPPORTED_march=native
-- Performing Test C_FLAG_SUPPORTED_march=native - Success
-- Performing Test C_FLAG_SUPPORTED_mfma
-- Performing Test C_FLAG_SUPPORTED_mfma - Success
-- Performing Test C_FLAG_SUPPORTED_mavx
-- Performing Test C_FLAG_SUPPORTED_mavx - Success
-- Performing Test C_FLAG_SUPPORTED_AVX
-- Performing Test C_FLAG_SUPPORTED_AVX - Failed
-- Performing Test C_FLAG_SUPPORTED_SSE2
-- Performing Test C_FLAG_SUPPORTED_SSE2 - Failed
-- Performing Test C_FLAG_SUPPORTED_AVX2
-- Performing Test C_FLAG_SUPPORTED_AVX2 - Failed
-- Performing Test C_FLAG_SUPPORTED_GL
-- Performing Test C_FLAG_SUPPORTED_GL - Failed
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- pybind11 v2.10.4 
-- Found PythonInterp: /home/yevhen/.cache/pypoetry/virtualenvs/fastLLaMa-bxKZnhHe-py3.10/bin/python (found suitable version "3.10.9", minimum required is "3.6") 
-- Found PythonLibs: /home/yevhen/.pyenv/versions/3.10.9/lib/libpython3.10.so
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Configuring done (4.5s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yevhen/Documents/Projects/fastLLaMa/build
[ 20%] Building C object CMakeFiles/fast_llama_lib.dir/ggml.c.o
[ 40%] Building CXX object CMakeFiles/fast_llama_lib.dir/utils.cpp.o
[ 60%] Linking CXX static library libfast_llama_lib.a
[ 60%] Built target fast_llama_lib
[ 80%] Building CXX object CMakeFiles/fastLlama.dir/bridge.cpp.o
[100%] Linking CXX shared module fastLlama.so

▶tree build | grep fastLlama.so
<EMPTY_RESULT>

▶python example-alpaca.py
Traceback (most recent call last):
  File "/home/yevhen/Documents/Projects/fastLLaMa/example-alpaca.py", line 5, in <module>
    import fastLlama
ModuleNotFoundError: No module named 'fastLlama'

Manual building (when the automatic build is already done):

▶cd ./build && cmake .. && make
-- Performing Test C_FLAG_SUPPORTED_march=native
-- Performing Test C_FLAG_SUPPORTED_march=native - Success
-- pybind11 v2.10.4 
-- Configuring done (0.2s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yevhen/Documents/Projects/fastLLaMa/build
[ 60%] Built target fast_llama_lib
[ 80%] Linking CXX shared module fastLlama.so
/usr/bin/ld: libfast_llama_lib.a(ggml.c.o): relocation R_X86_64_PC32 against symbol `stderr@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/fastLlama.dir/build.make:98: fastLlama.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:128: CMakeFiles/fastLlama.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

▶cmake ..
-- Performing Test C_FLAG_SUPPORTED_march=native
-- Performing Test C_FLAG_SUPPORTED_march=native - Success
-- pybind11 v2.10.4 
-- Configuring done (0.2s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yevhen/Documents/Projects/fastLLaMa/build

▶make
[ 60%] Built target fast_llama_lib
[ 80%] Linking CXX shared module fastLlama.so
/usr/bin/ld: libfast_llama_lib.a(ggml.c.o): relocation R_X86_64_PC32 against symbol `stderr@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/fastLlama.dir/build.make:98: fastLlama.so] Error 1
make[1]: *** [CMakeFiles/Makefile2:128: CMakeFiles/fastLlama.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

My versions:

▶gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

▶g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

▶cmake --version
cmake version 3.26.1

README.md is outdated in sections #running-llama and #running-alpaca-lora

For example, it still uses old syntax for convert-pth-to-ggml.py, and for export-from-huggingface.py. Also maybe it would be better to clarify that even when installing through pip, we still need to call compile.py to get the quantize executable.

Passing arguments such as -ins

Hi,
Sorry if this is a stupid question, because I'm no expert at this stuff. How do I emulate the behavior of the -ins flag (and the other flags from llama.cpp) inside the python program, when I'm using alpaca models?

Designing the UI

Enabling custom logger makes it crash at ingestion.

$>python examples/python/example-logging.py 



                ___            __    _    _         __ __      
                | | '___  ___ _| |_ | |  | |   ___ |  \  \ ___ 
                | |-<_> |<_-<  | |  | |_ | |_ <_> ||     |<_> |
                |_| <___|/__/  |_|  |___||___|<___||_|_|_|<___|
                                                            
                                                                                        
                                                                           
                                                       .+*+-.                
                                                      -%#--                  
                                                    :=***%*++=.              
                                                   :+=+**####%+              
                                                   ++=+*%#                   
                                                  .*+++==-                   
                  ::--:.                           .**++=::                   
                 #%##*++=......                    =*+==-::                   
                .@@@*@%*==-==-==---:::::------::==*+==--::                   
                 %@@@@+--====+===---=---==+=======+++----:                   
                 .%@@*++*##***+===-=====++++++*++*+====++.                   
                 :@@%*##%@@%#*%#+==++++++=++***==-=+==+=-                    
                  %@%%%%%@%#+=*%*##%%%@###**++++==--==++                     
                  #@%%@%@@##**%@@@%#%%%%**++*++=====-=*-                     
                  -@@@@@@@%*#%@@@@@@@%%%%#+*%#++++++=*+.                     
                   +@@@@@%%*-#@@@@@@@@@@@%%@%**#*#+=-.                       
                    #%%###%:  ..+#%@@@@%%@@@@%#+-                            
                    :***#*-         ...  *@@@%*+:                            
                     =***=               -@%##**.                            
                    :#*++                -@#-:*=.                            
                     =##-                .%*..##                             
                      +*-                 *:  +-                             
                      :+-                :+   =.                             
                       =-.               *+   =-                             
                        :-:-              =--  :::                           
                                                                           


[====================================================================================================] 100.0% ...
Start of chat (type 'exit' to exit)

User: Why does enabling a logger makes this crash?                                
Segmentation fault

This is not only a problem in the example, everytime I try using a logger, it crashes during ingestion, even if the logger doesn't do anything at all.

Cmake Error

Hi, I'm using the latest build and facing this error with the setup.py:

Setup executing command: cmake ..
-- Found '/Users/radiohead/llama/fastLLaMa/cmake/GlobalVars.cmake'
CMake Warning at /opt/homebrew/Cellar/cmake/3.26.3/share/cmake/Modules/Platform/Darwin-Initialize.cmake:303 (message):
  Ignoring CMAKE_OSX_SYSROOT value:

   /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk

  because the directory does not exist.
Call Stack (most recent call first):
  /opt/homebrew/Cellar/cmake/3.26.3/share/cmake/Modules/CMakeSystemSpecificInitialize.cmake:21 (include)
  CMakeLists.txt:15 (project)

-- The C compiler identification is AppleClang 14.0.3.14030022
-- The CXX compiler identification is AppleClang 14.0.3.14030022
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - broken
CMake Error at /opt/homebrew/Cellar/cmake/3.26.3/share/cmake/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /Users/radiohead/llama/fastLLaMa/build/CMakeFiles/CMakeScratch/TryCompile-dbcUww

    Run Build Command(s):/opt/homebrew/Cellar/cmake/3.26.3/bin/cmake -E env VERBOSE=1 /usr/bin/make -f Makefile cmTC_0c2c1/fast && /Applications/Xcode.app/Contents/Developer/usr/bin/make  -f CMakeFiles/cmTC_0c2c1.dir/build.make CMakeFiles/cmTC_0c2c1.dir/build
    Building C object CMakeFiles/cmTC_0c2c1.dir/testCCompiler.c.o
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc   -arch arm64 -MD -MT CMakeFiles/cmTC_0c2c1.dir/testCCompiler.c.o -MF CMakeFiles/cmTC_0c2c1.dir/testCCompiler.c.o.d -o CMakeFiles/cmTC_0c2c1.dir/testCCompiler.c.o -c /Users/radiohead/llama/fastLLaMa/build/CMakeFiles/CMakeScratch/TryCompile-dbcUww/testCCompiler.c
    Linking C executable cmTC_0c2c1
    /opt/homebrew/Cellar/cmake/3.26.3/bin/cmake -E cmake_link_script CMakeFiles/cmTC_0c2c1.dir/link.txt --verbose=1
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc  -arch arm64 -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/cmTC_0c2c1.dir/testCCompiler.c.o -o cmTC_0c2c1
    ld: library not found for -lSystem
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    make[1]: *** [cmTC_0c2c1] Error 1
    make: *** [cmTC_0c2c1/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:15 (project)

-- Configuring incomplete, errors occurred!
Setup executing command: make
make: /opt/homebrew/Cellar/cmake/3.26.1/bin/cmake: No such file or directory
make: *** [cmake_check_build_system] Error 1

Building Examples....

Setup executing command: cmake ..
-- Found '/Users/radiohead/llama/fastLLaMa/examples/c/../../cmake/GlobalVars.cmake' for 'C'
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
-- OpenMP not found
-- Configuring done (0.1s)
-- Generating done (0.0s)
-- Build files have been written to: /Users/radiohead/llama/fastLLaMa/examples/c/build
Setup executing command: make
[ 16%] Linking C executable example
ld: warning: directory not found for option '-L/Users/radiohead/llama/fastLLaMa/build/interfaces/c'
ld: library not found for -lcfastllama
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [example] Error 1
make[1]: *** [CMakeFiles/example.dir/all] Error 2
make: *** [all] Error 2

Building examples completed

System Details:
Machine: M2 Air running OSX 13.3
Clang:

Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

TypeError: Model.generate() got an unexpected keyword argument 'stop_word'

This piece of code

res = model.generate(
    num_tokens=100, 
    top_p=0.1, #top p sampling (Optional)
    temp=0.1, #temperature (Optional)
    repeat_penalty=1.0, #repetition penalty (Optional)
    streaming_fn=stream_token, #streaming function
    stop_word=["User:", "\n"] #stop generation when this word is encountered (Optional)
    )

produces the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 7
      2     """
      3     This function is called by the library to stream tokens
      4     """
      5     print(x, end='', flush=True)
----> 7 res = model.generate(
      8     num_tokens=100, 
      9     top_p=0.1, #top p sampling (Optional)
     10     temp=0.1, #temperature (Optional)
     11     repeat_penalty=1.0, #repetition penalty (Optional)
     12     streaming_fn=stream_token, #streaming function
     13     stop_word=["User:", "\n"] #stop generation when this word is encountered (Optional)
     14     )

TypeError: Model.generate() got an unexpected keyword argument 'stop_word'

Should fastLLaMa support more than just Python?

Hi, Would it be interesting to get fastLLaMa to support more than just Python?

Thoughts @amitsingh19975 @raldebsi ? 🤔

Make running fastLLaMa on windows simple!

Discussed in #26

^{Originally posted by McRush March 27, 2023}
Hello!
I find this project really cool. FastLLama has separate functions for prompt ingesting and text generation unlike other solutions. The ability to save and load model states is also very nice. I really would like to try it.
Unfortunately, now I'm bound to my Windows workflow. I would consider using WSL or install Linux, but building for Windows would be a preferable way for now.

I'm not familiar with C/C++, so I just tried to generate VisualStudio files from CMake and then compile them. Miracle didn't happen and compilation failed.
I suspect that some preparation is necessary here. Maybe I shouldn't use VS at all.
If somebody is ready to provide any instructions, I would very very appreciate it.

Stop words is buggy!

Stop words does not work sometimes for multiple words :(

Webui UX issue on mobile

Webui is almost unusable on mobile because of the side bar with the list of saves taking most of space, leaving only a tiny portion of the screen for the actual chat interface.

potatospudowski / fastllama Goto Github PK

fastllama's People

Stargazers

Watchers

Forkers

fastllama's Issues

In [1]: from fastllama import Model

Discussed in #26

Recommend Projects

Recommend Topics

Recommend Org