tairov / llama2.mojo
Inference Llama 2 in one file of pure 🔥
Home Page: https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
License: MIT License
On a Mac M1, the build fails:
from read import BufReader, File
^
mojo: error: failed to parse the provided Mojo
Version:
% mojo --version
mojo 0.4.0 (9e33b013)
Hi there,
awesome port and demonstrator. Have you compared the performance of vectorize and vectorize_unroll?
While tinkering around with demanding algorithms, I saw that by unrolling the partial loop 12x I got a 10% performance increase. Maybe enough to beat cpp? 😁
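For reference, a minimal sketch of how the two calls were used side by side, assuming the 0.x-era algorithm API (later releases folded unrolling into vectorize itself); the buffer and body function here are made up for illustration:

from algorithm import vectorize, vectorize_unroll
from memory.unsafe import DTypePointer
from sys.info import simdwidthof

alias nelts = simdwidthof[DType.float32]()

fn scale_inplace(data: DTypePointer[DType.float32], size: Int, factor: Float32):
    @parameter
    fn body[width: Int](i: Int):
        data.simd_store[width](i, data.simd_load[width](i) * factor)

    # Plain vectorized loop:
    vectorize[nelts, body](size)
    # Alternative: same loop with the vector body unrolled 12x, the factor from the comment above:
    # vectorize_unroll[nelts, 12, body](size)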
Hi,
The HuggingFace demo of this project is not working. It says:
Build failed with exit code: 1
...
--> ERROR: process "/bin/sh -c curl https://get.modular.com | MODULAR_AUTH=$AUTH_KEY sh - && modular install mojo" did not complete successfully: exit code: 1
The fast llama2 inference is really helpful to us, but how can I add a REST API to the Mojo code, preferably one compatible with the OpenAI interface? I don't know much about Mojo; I do know how to use Python Flask. Can you help me?
Hi
I'm following the instructions to get started and whenever I run:
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"
it crashes with the following info:
num parallel workers: 8 SIMD width: 16
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Program arguments: mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"
#0 0x0000000102becfd8 llvm_strlcpy (~/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x1000ccfd8)
#1 0x0000000102beb138 llvm_strlcpy (~/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x1000cb138)
#2 0x0000000102bed678 llvm_strlcpy (~/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x1000cd678)
#3 0x0000000180a69a24 (/usr/lib/system/libsystem_platform.dylib+0x18046da24)
#4 0x000000028000e7a0
#5 0x0000000102f8a5a8 llvm_strlcpy (~/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x10046a5a8)
#6 0x0000000102b41e94 _mh_execute_header (~/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x100021e94)
#7 0x0000000102b25bd0 _mh_execute_header (~/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x100005bd0)
#8 0x00000001806b90e0
[15981:92870:20240318,112922.885502:WARNING crash_report_exception_handler.cc:257] UniversalExceptionRaise: (os/kern) failure (5)
zsh: segmentation fault mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"
System info
MacBook Pro, 14-inch, 2021. macOS 14.2.1 (23C71)
mojo 24.1.0 (55ec12d6)
modular 0.5.2 (6b3a04fd)
I found this interesting project via the 'AI Anywhere' channel on YouTube. I've installed Modular and Mojo, and successfully run your test on an underpowered mini computer with only a 1.5 GHz 4-core Intel Celeron CPU running Ubuntu 20.04.6, and it achieved 32.5 tok/s.
I'm an LLM newbie, so my questions may appear stupid!! Can this project be run with other models?
I tried the following:
mojo llama2.mojo /home/ezyweb/Public/chatpdf1/models/llama-2-7b-chat.Q4_K_M.gguf -s 100 -n 256 -t 0.5 -i "What is Llama 2"
And got the result:
num hardware threads: 4 SIMD vector width: 8 checkpoint size: 4081004224 [ 3891 MB ] Killed
Is that likely an under resourced hardware issue or is the project not compatible with .gguf models?
> MOJO_PYTHON_LIBRARY="/Users/shroominic/dev/miniforge3/lib" mojo llama2.mojo stories110M.bin -i "hello"
num parallel workers: 10 SIMD width: 8
Stack dump:
0. Program arguments: mojo llama2.mojo stories110M.bin -i hello
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 mojo 0x0000000100f79990 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 56
1 mojo 0x0000000100f77af0 llvm::sys::RunSignalHandlers() + 112
2 mojo 0x0000000100f7a02c SignalHandler(int) + 344
3 libsystem_platform.dylib 0x000000018dd06a24 _sigtramp + 56
4 libsystem_platform.dylib 0x0000000280058ac8 _sigtramp + 4063568092
5 libsystem_platform.dylib 0x000000028005807c _sigtramp + 4063565456
6 mojo 0x00000001012ca24c M::KGEN::ExecutionEngine::runProgram(llvm::StringRef, llvm::StringRef, llvm::function_ref<M::ErrorOrSuccess (void*)>) + 1156
7 mojo 0x0000000100ed3c64 run(M::State const&) + 3980
8 mojo 0x0000000100ebcb2c main + 1672
9 dyld 0x000000018d97ff28 start + 2236
[79626:9834787:20231019,201441.806415:WARNING crash_report_exception_handler.cc:257] UniversalExceptionRaise: (os/kern) failure (5)
[1] 79624 segmentation fault MOJO_PYTHON_LIBRARY="/Users/shroominic/dev/miniforge3/lib" mojo llama2.mojo -i
Not sure what this all means, but I am just trying to run this on my MacBook Pro based on the README.md ...
I installed mojo and I am able to run basic hello-world scripts, and I've set the path to my conda base env.
Here's another try, with the LLVM symbolizer and the smaller model:
Stack dump:
0. Program arguments: mojo llama2.mojo stories15M.bin -s 42 -m 256 -t 0.5 -i hello -z tokenizer.bin
#0 0x0000000100729990 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/Users/shroominic/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x1000c5990)
#1 0x0000000100727af0 llvm::sys::RunSignalHandlers() (/Users/shroominic/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x1000c3af0)
#2 0x000000010072a02c SignalHandler(int) (/Users/shroominic/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x1000c602c)
#3 0x000000018dd06a24 (/usr/lib/system/libsystem_platform.dylib+0x18042ea24)
#4 0x0000000280052c7c
#5 0x0000000280051fc4
#6 0x0000000100a7a24c M::KGEN::ExecutionEngine::runProgram(llvm::StringRef, llvm::StringRef, llvm::function_ref<M::ErrorOrSuccess (void*)>) (/Users/shroominic/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x10041624c)
#7 0x0000000100683c64 run(M::State const&) (/Users/shroominic/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x10001fc64)
#8 0x000000010066cb2c main (/Users/shroominic/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x100008b2c)
#9 0x000000018d97ff28
[35841:10187750:20231020,185712.713361:WARNING crash_report_exception_handler.cc:257] UniversalExceptionRaise: (os/kern) failure (5)
[1] 35839 segmentation fault MOJO_PYTHON_LIBRARY= LLVM_SYMBOLIZER_PATH= mojo llama2.mojo stories15M.bin -s
Do you have plans to support InternLM-7B & InternLM-20B?
(https://github.com/InternLM/InternLM)
We'd love to provide technical support and other forms of assistance where needed.
Thanks!
The error occurs when running llama2.mojo.
My environment: Python 3.10 (Ubuntu 22.04)
Console information below:
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Llama is an animal"
num hardware threads: 192 SIMD vector width: 16
Unhandled exception caught during execution: An error occurred in Python.
mojo: error: execution exited with a non-zero result: 1
Hi,
Maybe a stupid question, but I couldn't find a way to execute shell commands in the Mojo Playground's console or notebook. How did you manage to run this project in the Mojo Playground?
Thanks!
Hi, really impressive work.
Unfortunately, I couldn't run your code. Your code reads 'tokenizer.bin', but that file isn't provided. Please tell me where to find it.
Hi @tairov,
I love your work so much and thank you for your contribution.
I think Mojo is very promising in the near future. And I am wondering: do you have any plans to port Stable Diffusion models to Mojo? Or do you know of someone currently doing this?
Best,
Linh
Ok, I installed mojo, cloned your repo, and ran the test. It works, congrats! But how does all of this relate to Llama? Nothing happened when I tried to run Llama 2 itself:
alex@NLDW4-5-20-11:~/ai/llama2.mojo$ mojo llama2.mojo /ai/llama.cpp/models/ggml-model-q4_1.bin -s 100 -n 256 -t 0.5 -i "Llama is an animal"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4238459520
Killed
alex@NLDW4-5-20-11:~/ai/llama2.mojo$ mojo llama2.mojo ~/ai/llama.cpp/models/ggml-model-q4_1.bin -s 100 -n 256 -t 4 -i "Llama is an animal"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4238459520
Killed
I don't know what -t 0.5 means (I suppose threads); I've been trying -t 4 and again without results.
The clue here is how to run Llama 2 using this new language called Mojo. And if you made a Mojo wrapper for the Llama/Llama 2 models, please provide instructions on how to run the model using this wrapper.
Thank you.
Thanks for your fantastic project. Out of curiosity, I tried to build it as a binary. It seemed to build at first, but it didn't work: it showed a message telling me to set the Python path. But after I set that environment variable, a segmentation fault occurred. I think it came from the mojo builder, maybe. My environment is WSL on Windows 11.
I had some errors when I tried to run the Docker version. When I fixed the first, the second appeared:
I'll submit a pull request with the changes that work for me: #70
My Setup:
Replicating Steps:
Inside Repo Directory:
docker build --build-arg AUTH_KEY=MY_MODULAR_KEY -t llama2.mojo .
Terminal Print (Error 1):
17.15 Setting up modular (0.2.1) ...
17.16 Processing triggers for libc-bin (2.31-0ubuntu9.12) ...
17.18 sh: 80: [[: not found
17.18 __ __ _ _
17.18 | \/ | ___ __| |_ _| | __ _ _ __
17.18 | |\/| |/ _ \ / _` | | | | |/ _` | '__|
17.18 | | | | (_) | (_| | |_| | | (_| | |
17.18 |_| |_|\___/ \__,_|\__,_|_|\__,_|_|
17.18
17.18 Welcome to the Modular CLI!
17.18 For info about this tool, type "modular --help".
17.18
17.18 To install Mojo🔥, type "modular install mojo".
17.18
17.18 For Mojo documentation, see https://docs.modular.com/mojo.
17.18 To chat on Discord, visit https://discord.gg/modular.
17.18 To report issues, go to https://github.com/modularml/mojo/issues.
21.66 modular: error: please run `modular auth` before attempting to install a package
------
Dockerfile:52
--------------------
51 |
52 | >>> RUN curl https://get.modular.com | MODULAR_AUTH=$AUTH_KEY sh - \
53 | >>> && modular install mojo
54 |
--------------------
ERROR: failed to solve: process "/bin/sh -c curl https://get.modular.com | MODULAR_AUTH=$AUTH_KEY sh - && modular install mojo" did not complete successfully: exit code: 1
Terminal Print (Error 2):
=> ERROR [ 6/15] RUN modular install mojo 40.6s
------
> [ 6/15] RUN modular install mojo:
40.33 The virtual environment was not created successfully because ensurepip is not
40.33 available. On Debian/Ubuntu systems, you need to install the python3-venv
40.33 package using the following command.
40.33
40.33 apt install python3.8-venv
40.33
40.33 You may need to use sudo with that command. After installing the python3-venv
40.33 package, recreate your virtual environment.
40.33
40.33 Failing command: ['/home/user/.modular/pkg/packages.modular.com_mojo/venv/bin/python3', '-Im', 'ensurepip', '--upgrade', '--default-pip']
40.33
40.56 modular: error: failed to run python:
40.56 # Found release for https://packages.modular.com/mojo @ 0.4.0
40.56 # Installing to /home/user/.modular/pkg/packages.modular.com_mojo
40.56 # Downloading artifacts. Please wait...
40.56 # Downloads complete, setting configs...
40.56 # Configs complete, running post-install hooks...
------
Dockerfile:54
--------------------
52 | RUN curl https://get.modular.com | MODULAR_AUTH=$AUTH_KEY sh -
53 | RUN modular auth $AUTH_KEY
54 | >>> RUN modular install mojo
55 |
56 | RUN useradd -m -u 1000 user
--------------------
ERROR: failed to solve: process "/bin/sh -c modular install mojo" did not complete successfully: exit code: 1
#Unhandled exception caught during execution: String is not convertible to integer.
The failing line is rng_seed = atol(args[i + 1]). It is the line for if args[i] == "-s":, but it sits in the -i branch:
if args[i] == "-i":
    prompt = args[i + 1]
    rng_seed = atol(args[i + 1])  # line for if args[i] == "-s":
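A minimal sketch of the intended branches (names taken from the snippet above; the surrounding argument loop is assumed from llama2.c's convention):

if args[i] == "-s":
    rng_seed = atol(args[i + 1])  # atol raises if the value is not an integer
elif args[i] == "-i":
    prompt = args[i + 1]          # the prompt string must never be passed to atol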
as title
I am also working on a port of llama2 to Mojo. Your port is excellent; I'm just doing it myself for the sake of learning, probably sticking closer to the C source, and will see how it goes. I'm currently struggling with the tokenizer in Andrej's C code, and taking a look at your code, I wonder if there is a (probably not problematic) memory leak in the str_concat method. I can't see right now that the memory allocated there is freed at any point... I might be completely wrong, just thought to drop you a line.
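For context, a reconstruction of the pattern in question, pieced together from the memcpy calls visible in the compiler output later on this page; PointerString and string_length are assumed names, not the repo's exact code:

fn str_concat(s1: PointerString, s2: PointerString) -> PointerString:
    var l1 = string_length(s1)
    var l2 = string_length(s2)
    var str = PointerString.alloc(l1 + l2 + 1)
    memcpy[UInt8](str, s1, l1)
    memcpy[UInt8](str.offset(l1), s2, l2)
    str.store(l1 + l2, 0)  # NUL terminator
    return str  # the caller owns this buffer; without a matching free() it leaks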
With mojo 24.2.1 (58157dc0)
in a Google Colab env
Running
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Mojo is a language"
yields:
/root/llama2.mojo/llama2.mojo:2:47: error: package 'algorithm' does not contain 'unroll'
from algorithm import vectorize, parallelize, unroll
^
/root/llama2.mojo/llama2.mojo:173:18: error: no matching function in call to 'memcpy'
memcpy[UInt8](str, s1, l1)
~~~~~~~~~~~~~^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: expected at most 2 positional arguments, got 3
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: expected at most 2 positional arguments, got 3
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: callee expects 0 parameters, but 1 was specified
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: failed to infer implicit parameter 'type' of argument 'dest' type 'DTypePointer'
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:174:18: error: no matching function in call to 'memcpy'
memcpy[UInt8](str.offset(l1), s2, l2)
~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: expected at most 2 positional arguments, got 3
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: expected at most 2 positional arguments, got 3
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: callee expects 0 parameters, but 1 was specified
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: failed to infer implicit parameter 'type' of argument 'dest' type 'DTypePointer'
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:208:49: error: use of unknown declaration 'DynamicVector', 'fn' declarations require explicit variable declarations
inout array: PointerStrings, inout indices: DynamicVector[Int], low: Int, high: Int
^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:212:31: error: unexpected token in expression
for jj in range(low, high):
^
/root/llama2.mojo/llama2.mojo:212:31: error: statements must start at the beginning of a line
for jj in range(low, high):
^
/root/llama2.mojo/llama2.mojo:236:49: error: use of unknown declaration 'DynamicVector', 'fn' declarations require explicit variable declarations
inout array: PointerStrings, inout indices: DynamicVector[Int], low: Int, high: Int
^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:296:25: error: use of unknown declaration 'DynamicVector'
var sorted_indices: DynamicVector[Int]
^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:517:10: error: 'Tensor[f32]' value has no attribute 'simd_store'
a.simd_store[_nelts](j, a.simd_load[_nelts](j) + b.simd_load[_nelts](j))
~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:531:35: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
tmp.accumulate(x.offset(j).simd_load[_nelts](0) ** 2)
~~~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:542:25: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
var val = weight.simd_load[_nelts](j) * ss * x.simd_load[_nelts](j)
~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:542:55: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
var val = weight.simd_load[_nelts](j) * ss * x.simd_load[_nelts](j)
~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:543:20: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_store'
o.offset(j).simd_store[_nelts](0, val)
~~~~~~~~~~~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:569:20: error: 'Tensor[f32]' value has no attribute 'simd_load'
var val = x.simd_load[_nelts](start + ii).reduce_max()
~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:579:29: error: 'Tensor[f32]' value has no attribute 'simd_load'
var val = math.exp(x.simd_load[_nelts](start + ii) - max_val)
~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:580:10: error: 'Tensor[f32]' value has no attribute 'simd_store'
x.simd_store[_nelts](start + ii, val)
~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:589:10: error: 'Tensor[f32]' value has no attribute 'simd_store'
x.simd_store[_nelts](start + ii, x.simd_load[_nelts](start + ii) / ssum)
~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:598:20: error: 'StaticTuple' parameter #0 has 'AnyRegType' type, but value has type 'Int'
C: StaticTuple[n, BufferPtrFloat32],
^
/root/llama2.mojo/llama2.mojo:1:1: note: 'StaticTuple' declared here
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:600:20: error: 'StaticTuple' parameter #0 has 'AnyRegType' type, but value has type 'Int'
B: StaticTuple[n, BufferPtrFloat32],
^
/root/llama2.mojo/llama2.mojo:1:1: note: 'StaticTuple' declared here
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:606:31: error: 'StaticTuple' parameter #0 has 'AnyRegType' type, but value has type 'Int'
var tmp = StaticTuple[n, Accumulator[DType.float32, nelts]]()
^
/root/llama2.mojo/llama2.mojo:1:1: note: 'StaticTuple' declared here
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:616:22: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
var a = A.simd_load[_nelts](j)
~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:730:26: error: no matching function in call to 'memcpy'
memcpy[DType.float32](state.x.data(), content_row, dim)
~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: expected at most 2 positional arguments, got 3
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: expected at most 2 positional arguments, got 3
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: failed to infer implicit parameter 'type' of argument 'dest' type 'Pointer'
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:1:1: note: candidate not viable: callee expects 0 parameters, but 1 was specified
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:794:32: error: 'Tensor[f32]' value has no attribute 'simd_load'
state.q.simd_load[_nelts](q_offset + i)
~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:795:42: error: 'Tensor[f32]' value has no attribute 'simd_load'
* state.key_cache.simd_load[_nelts](k_offset + i)
~~~~~~~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:818:39: error: 'Tensor[f32]' value has no attribute 'simd_load'
var xbi = state.xb.simd_load[_nelts](
~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:820:46: error: 'Tensor[f32]' value has no attribute 'simd_load'
) + a * state.value_cache.simd_load[_nelts](v_offset + i)
~~~~~~~~~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:821:29: error: 'Tensor[f32]' value has no attribute 'simd_store'
state.xb.simd_store[_nelts](xb_offset + i, xbi)
~~~~~~~~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:846:38: error: 'Tensor[f32]' value has no attribute 'simd_load'
var initial_hb = state.hb.simd_load[_nelts](i)
~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:850:21: error: 'Tensor[f32]' value has no attribute 'simd_store'
state.hb.simd_store[_nelts](i, hbi * state.hb2.simd_load[_nelts](i))
~~~~~~~~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:881:32: error: invalid call to 'rand': missing 1 required positional argument: 'size'
var r = rand[DType.float32](1)
~~~~~~~~~~~~~~~~~~~^~~
/root/llama2.mojo/llama2.mojo:1:1: note: function declared here
from algorithm import sum
^
/root/llama2.mojo/llama2.mojo:890:29: error: use of unknown declaration 'DynamicVector', 'fn' declarations require explicit variable declarations
fn bpe_encode(inout tokens: DynamicVector[Int], text: String, inout tok: Tokenizer):
^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:891:32: error: unexpected token in expression
for pos in range(len(text)):
^
/root/llama2.mojo/llama2.mojo:891:32: error: statements must start at the beginning of a line
for pos in range(len(text)):
^
/root/llama2.mojo/llama2.mojo:940:9: error: use of unknown declaration 'print_no_newline'
print_no_newline(chr(str2num(d1) * 16 + str2num(d2)))
^~~~~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:945:9: error: use of unknown declaration 'print_no_newline'
print_no_newline(chr(s[p].to_int()))
^~~~~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:1044:25: error: use of unknown declaration 'DynamicVector', 'fn' declarations require explicit variable declarations
var prompt_tokens = DynamicVector[Int]()
^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:49:31: error: 'DTypePointer[T, 0]' value has no attribute 'simd_load'
var newVal = self.data.simd_load[_width]() + val
~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:50:18: error: 'DTypePointer[T, 0]' value has no attribute 'simd_store'
self.data.simd_store[_width](newVal)
~~~~~~~~~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:54:25: error: 'DTypePointer[T, 0]' value has no attribute 'simd_load'
return self.data.simd_load[width]().reduce_add()
~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:111:26: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
return self._data.simd_load[nelts](idx)
~~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:124:26: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
return self._data.simd_load[nelts](indices[0] * self._shape[1] + indices[1])
~~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:127:26: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_load'
return self._data.simd_load[1](idx)
~~~~~~~~~~^~~~~~~~~~
/root/llama2.mojo/llama2.mojo:130:26: error: 'DTypePointer[f32, 0]' value has no attribute 'simd_store'
return self._data.simd_store[nelts](idx, val)
~~~~~~~~~~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:305:31: error: use of unknown declaration 'DynamicVector', 'fn' declarations require explicit variable declarations
self.sorted_indices = DynamicVector[Int]()
^~~~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:382:40: error: 'List[SIMD[si8, 1]]' value has no attribute '_steal_ptr'
var int32_ptr = config_data_raw._steal_ptr().bitcast[DType.int32]()
~~~~~~~~~~~~~~~^~~~~~~~~~~
/root/llama2.mojo/llama2.mojo:469:27: error: 'List[SIMD[si8, 1]]' value has no attribute '_steal_ptr'
var data = tmp._steal_ptr().bitcast[DType.float32]()
~~~^~~~~~~~~~~
/root/.modular/pkg/packages.modular.com_max/bin/mojo: error: failed to parse the provided Mojo
~/src/AI/mojo/llama2.mojo$ mojo llama2.mojo tl-chat.bin \
-r falcon \
-z tok_tl-chat.bin \
-n 256 -t 0 -s 100 -i "<|im_start|>user\nGive me a python function to generate Fibonacci sequence<|im_end|>\n<|im_start|>assistant\n"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4400767004 [ 4196 MB ]
n layers: 22
vocab size: 32003
<|im_start|>user
Give me a python function to generate Fibonacci sequence<|im_end|>
<|im_start|>assistant
¿Quiero debera.io|efes<|
|- [aquíntena|
|-|re|re|
|-|
|-ichas|[estructurañiñu|implementa.py|
|esínda|
¿Quiero|
|Olahi|
Does anyone know how to resolve this?
Are these speed comparisons all in CPU mode? Can we add a comparison with GPU?
Also, if you want to train, you'd want to do the training in Mojo too. Is it necessary to add training-related code in the same way? Will rewriting be time-consuming?
Stumbled on this while trying to run the code on WSL Ubuntu
num parallel workers: 2 SIMD width: 16
checkpoint size: 60816028 [ 57 MB ] | n layers: 6 | vocab size: 32000
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
I downloaded the repo and was super happy to see the story model work!
Then I looked down, saw the chat model, and went and installed it via the wget that was provided in the readme.
But when I try to run it, this happened:
username@username:~/mojo/llama2.mojo$ mojo llama2.mojo tl-chat.bin \
-r falcon \
-z tok_tl-chat.bin \
-n 256 -t 0 -s 100 -i "<|im_start|>user\nGive me a python function to generate Fibonacci sequence<|im_end|>\n<|im_start|>assistant\n"
num hardware threads: 4
SIMD vector width: 16
Killed
(sorry, I accidentally opened the issue before I finished typing it 😢 )
Actually, this even happens if I follow all the instructions, download again and all, in a new folder
I remember Llama 2 uses grouped-query attention. In llama2.c, I found there are kv_heads and kv_dim.
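How they relate in llama2.c, sketched in Mojo with hypothetical config field names: each key/value head has the same head_size as the query heads, but there are only n_kv_heads of them, so the cached K/V rows are kv_dim wide instead of dim wide:

var head_size = config.dim // config.n_heads
var kv_dim = (config.dim * config.n_kv_heads) // config.n_heads  # == n_kv_heads * head_size
# With n_kv_heads == n_heads this reduces to kv_dim == dim (no grouped-query attention).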
After having installed mojo (working) and llama2 as described, running mojo llama2.mojo
on Ubuntu 22.04 with 16 cores, I get:
llama2.mojo $ mojo llama2.mojo
num hardware threads: 16 SIMD vector width: 16
checkpoint size: 60816028
Unhandled exception caught during execution: An error occurred in Python.
mojo: error: execution exited with a non-zero result: 1
It might be worth turning on discussions. It would be helpful to discuss performance improvements so there is a history of what people have tried and any benchmarks run.
I have enabled llama2.c to run the TinyLlama 1.1B chat model on my repo.
It reads the TinyLlama 1.1B model to run it.
We can update the benchmark now.
Idk what your plan is with this project, so I just wanted to ask if you want to grow it and advance it by enabling more features.
We could create different TODO issues for the features, to enable work by the community.
If you don't want to grow it, maybe we could create a community fork building on top of it.
I really like the idea of doing inference in Mojo, so I'm really grateful for this project, and I think this could be a good opportunity to learn more about Mojo by building some features :)
hey team, incredible work being done here.
Wondering if you only support .bin models, or whether it would also work with gguf-quantized models.
If not, then that's a real feature request. Most everyone uses gguf models nowadays, as they are easier to run on consumer-grade hardware.
thanks.
I know it only came out yesterday but ;-)
I'm trying to make llama2.mojo work with TinyLlama-1.1B, which is a GQA model and not a tied-embedding model.
I have now finished converting the model and modifying part of llama2.mojo (following llama.cpp and llama.c).
I have noticed that our tokenizer is not stable compared with the HuggingFace tokenizer.
I spent some time investigating why the parallelized + vectorized version of matmul is slower than the only-vectorized one.
Older matmul examples showed that multi-core + vector was faster. Still, for me, the matmul notebook example on Playground and the matmul example from the repo, run on a GitHub Codespaces instance (4 cores, 16 GB), showed that the multi-core version was slower.
I tried two commands: mojo examples/matmul.mojo, and mojo build examples/matmul.mojo + running the binary. They had the same results: multi-core slower. In addition, using htop, I also made sure that the multi-core version utilizes all cores.
I found this PR, modularml/mojo#742, where you can see that the vector-width value you get from simdwidthof is multiplied. In the case of the GitHub Codespaces instance, my base value from simdwidthof was 8; I benchmarked higher values like 16 (2x), 32 (4x), and 64 (8x). You can see the results below:
I believe adjusting the nelts value should bring additional speed-ups (see the sketch after the line reference below).
Line 24 in 86a34c9
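A minimal sketch of the kind of change meant here, assuming the 0.x-era sys.info API; the 4x multiplier is just one of the values to benchmark, as in modularml/mojo#742:

from sys.info import simdwidthof

alias base_width = simdwidthof[DType.float32]()  # 8 on the Codespaces instance above
alias nelts = 4 * base_width                     # benchmark 2x, 4x, 8x per machine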
CPU details:
System information:
OS : linux
CPU : znver3
Arch : x86_64-unknown-linux-gnu
Num Cores : 4
CPU Features: avx2
mojo 1.0.0+601 installed from Canonical IS Snaps
FileNotFoundError: [Errno 2] No such file or directory: 'juju': 'juju'
/home/y3rawat/.modular/pkg/packages.modular.com_mojo/bin/mojo /home/y3rawat/mojo/llama2.mojo
/home/y3rawat/mojo/llama2.mojo:10:6: error: unable to locate module 'read'
from read import BufReader, File
^
/home/y3rawat/.modular/pkg/packages.modular.com_mojo/bin/mojo: error: failed to parse the provided Mojo
(python38) y3rawat@y3rawat-ASUS-TUF-Gaming-F15-FX507ZC4-FX507ZC4:~/mojo$