
llm.c's Introduction

I like deep neural nets.

llm.c's People

Contributors

ademeure, al0vya, alexziskind1, azret, chrisdryden, dagelf, dorjeduck, eltociear, ent0n29, fesens, karpathy, krrishnarraj, lancerts, msharmavikram, ngc92, nopperl, onuralpszr, patricxu, regrettable-username, ricardicus, rosslwheeler, scotthaleen, soldy, tojen, varunlakkur, vincigit00, zocterminal


llm.c's Issues

A file not found error was encountered while compiling

make train_gpt2
process_begin: CreateProcess(NULL, uname, ...) failed.
Makefile:13: pipe: No error
The system cannot find the path specified.
Makefile:32: OOPS Compiling without OpenMP support
cc -O3 -Ofast -Wno-unused-result train_gpt2.c -lm -o train_gpt2
process_begin: CreateProcess(NULL, cc -O3 -Ofast -Wno-unused-result train_gpt2.c -lm -o train_gpt2, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [Makefile:43: train_gpt2] Error 2

CUDA ERROR an illegal memory access was encountered

commit c02bae2ebc684a2e068c0dc59be00ff43167b44d

I got this error when running train_gpt2cu

[System]
Device 0: NVIDIA GeForce RTX 3060
enable_tf32: 1
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 74
val dataset num_batches: 8
batch size: 4
sequence length: 1024
val_num_batches: 10
num_activations: 2456637440
val loss 4.517170
step 0: train loss 4.367660 (took 140.356816 ms)
step 1: train loss 4.406325 (took 140.393840 ms)
step 2: train loss 4.484698 (took 140.358579 ms)
step 3: train loss 4.345248 (took 140.415392 ms)
step 4: train loss 4.043498 (took 140.366040 ms)
step 5: train loss 4.229475 (took 140.349640 ms)
step 6: train loss 4.175006 (took 140.337543 ms)
step 7: train loss 4.207677 (took 140.347408 ms)
step 8: train loss 4.127449 (took 140.839560 ms)
step 9: train loss 4.220684 (took 141.022061 ms)
val loss 4.517170
step 10: train loss 4.345275 (took 140.932235 ms)
step 11: train loss 4.245928 (took 141.035935 ms)
step 12: train loss 4.160665 (took 140.851916 ms)
step 13: train loss 3.989787 (took 140.892920 ms)
step 14: train loss 4.306040 (took 140.957500 ms)
step 15: train loss 4.340417 (took 141.032707 ms)
step 16: train loss 4.304399 (took 141.010662 ms)
step 17: train loss 4.424002 (took 140.930450 ms)
step 18: train loss 4.314507 (took 141.028123 ms)
step 19: train loss 4.287285 (took 140.992381 ms)
val loss 4.517170
[CUDA ERROR] at file train_gpt2.cu:1211:
an illegal memory access was encountered

nvidia-smi

(.venv) ➜  llm.c git:(dev) ✗ nvidia-smi
Sun Apr 14 05:29:02 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8             13W /  170W |       3MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvcc

(.venv) ➜  llm.c git:(dev) ✗ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Line 1211 of train_gpt2.cu is:

cudaCheck(cudaMemcpy(cpu_probs, probs, model.config.vocab_size * sizeof(float), cudaMemcpyDeviceToHost));
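An illegal memory access usually comes from a kernel launched earlier and only surfaces at the next synchronizing call, which here is this memcpy. One way to localize the actual faulting kernel, assuming the compute-sanitizer tool shipped with the CUDA toolkit is available:

CUDA_LAUNCH_BLOCKING=1 compute-sanitizer ./train_gpt2cu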

no CUDA-capable device is detected

I have already installed nvcc with sudo apt install nvidia-cuda-toolkit. When I run ./train_gpt2cu, it shows:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
[CUDA ERROR] at file train_gpt2.cu:501:
no CUDA-capable device is detected

I am using WSL2 (ubuntu 22.04) and nvcc (11.5)

Are there any packages that I haven't downloaded yet?
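A minimal standalone probe (a sketch, not part of the repository) can help separate a broken WSL2/driver setup from a problem in train_gpt2.cu; if this already fails, the missing piece is the CUDA installation itself rather than llm.c:

// probe.cu -- compile with: nvcc probe.cu -o probe
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("found %d CUDA device(s)\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (compute capability %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}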

Error: backward before forward

Followed instructions from README and it throws the error:

(base) sasank@johnaic:~/code/sycl/llm.c$ ./train_gpt2
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.252006
step 0: train loss 5.356172 (took 2658.361852 ms)
Error: must forward with targets before backward
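For context, the message comes from a guard at the top of gpt2_backward. A minimal self-contained sketch of that check (paraphrased; not the repository's exact code):

// illustrative sketch of the guard that produces this message
#include <stdio.h>
#include <stdlib.h>

typedef struct { float mean_loss; } GPT2;  // only the field relevant to the check

void gpt2_backward(GPT2 *model) {
    // a forward pass with targets sets mean_loss; without targets it stays at -1.0f
    if (model->mean_loss == -1.0f) {
        printf("Error: must forward with targets before backward\n");
        exit(1);
    }
    // ... gradients would be computed here ...
}

int main(void) {
    GPT2 model = { -1.0f };   // forward was never run with targets
    gpt2_backward(&model);    // prints the error above and exits
    return 0;
}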

WSL setup details (resolved: instructions in the comments)

Hi,

I'm having trouble getting the CUDA code to execute. I wanted to share my learnings thus far for this community, in case others are hitting this.

For the CUDA version, the call in main() to cudaSetDevice(0) fails with 100 ("no CUDA-capable device is detected"). This is odd because I've had no problem with CUDA via python in my setup (which is WSL in Windows), a la:

>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.current_device())
0

plus nvidia-smi works fine. During the zero-to-hero lessons, I would watch the utilization (and temperature!) race up as training ran.

To debug the cudaSetDevice() failure in llm.c, I've noticed that if I link with cublas_static (and comment out the references to cublasLt, so cublasLt is not pulled in), the cudaSetDevice(0) works fine, a la:

[System]
Device 0: NVIDIA GeForce RTX 4060 Laptop GPU

Linking with cublasLt_static doesn't help (i.e. CUDA fails again with error 100).

I'm not sure why linking with cublas_static removes the error when using cublas functions, nor why cublasLt doesn't work with either static or shared. I've repro'ed this static/shared behavior with more focused code using the sample/examples from the nvidia cublas github examples.

If I crack this nut, I'll share updates. In the meantime, suggestions welcome as well.
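In case it helps others reproduce, a minimal standalone program along these lines (a sketch; link it once against the shared cuBLAS/cuBLASLt libraries and once against the static ones) isolates the behavior described above from the rest of llm.c:

// repro.cu -- compile e.g.: nvcc repro.cu -lcublasLt -o repro
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublasLt.h>

int main(void) {
    cudaError_t err = cudaSetDevice(0);
    printf("cudaSetDevice(0): %s\n", cudaGetErrorString(err));

    cublasLtHandle_t handle;
    cublasStatus_t status = cublasLtCreate(&handle);
    printf("cublasLtCreate status: %d\n", (int)status);
    if (status == CUBLAS_STATUS_SUCCESS) cublasLtDestroy(handle);
    return 0;
}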

My environment:
Linux xxxx 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.120                Driver Version: 537.58       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   39C    P0              16W /  60W |      0MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

[CUDA ERROR] at file \llm.c\train_gpt2.cu:405: too many resources requested for launch (old version does not have this issue - fyi)

[System]
Device 0: NVIDIA RTX A5500
enable_tf32: 1
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 74
val dataset num_batches: 8
batch size: 4
sequence length: 1024
val_num_batches: 10
num_activations: 2456637440
[CUDA ERROR] at file \llm.c\train_gpt2.cu:405:
too many resources requested for launch

Switch back to the previous version?
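"Too many resources requested for launch" generally means a kernel was launched with a block size that needs more registers (or shared memory) than the device can provide. A small sketch of how the limits can be inspected; my_kernel is a hypothetical stand-in, not the failing kernel at train_gpt2.cu:405:

// resources.cu -- compile with nvcc
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { if (data) data[threadIdx.x] = 0.0f; }

int main(void) {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, 0);
    printf("largest block size that still fits: %d\n", block_size);
    return 0;
}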

When running python train_gpt2.py, it errors out after 10 iterations -- is this normal?

(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 95.7kB/s]
loading cached tokens in data/tiny_shakespeare_val.bin
/Users/billhuang/TEST/llm.c/train_gpt2.py:333: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:205.)
tokens = torch.from_numpy(tokens)
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.270007133483887
iteration 1, loss: 4.059707164764404
iteration 2, loss: 3.375124931335449
iteration 3, loss: 2.8007795810699463
iteration 4, loss: 2.3153889179229736
iteration 5, loss: 1.849020004272461
iteration 6, loss: 1.3946489095687866
iteration 7, loss: 0.9991437196731567
iteration 8, loss: 0.6240723729133606
iteration 9, loss: 0.376505047082901
Traceback (most recent call last):
File "/Users/billhuang/TEST/llm.c/train_gpt2.py", line 380, in <module>
y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
File "/Users/billhuang/miniforge3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/billhuang/TEST/llm.c/train_gpt2.py", line 202, in generate
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
RuntimeError: Currently topk on mps works only for k<=16

project license

It would be nice to add a license file; that makes the license easier to find than having it at the bottom of the README.

Bus ERROR while running `train_gpt2.py`

Explanation:

train_gpt2.py fails after successfully downloading the pretrained weights, with this error:

Error log

python3 train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
[1]    15017 bus error  python3 train_gpt2.py

Library versions

tokenizers               0.15.2
torch                    2.0.1
torch-geometric          2.0.3
torchvision              0.15.2

System Specifications:

Platform: MacBook Pro
Chip: M2 
Memory: 16GB
MacOS: 14.4.1

RuntimeError: must forward with targets before backward

tried to run:

     !OMP_NUM_THREADS=8 ./train_gpt2

RuntimeError:

  [GPT-2]
  max_seq_len: 1024
  vocab_size: 50257
  num_layers: 12
  num_heads: 12
  channels: 768
  num_parameters: 124439808
  train dataset num_batches: 1192
  val dataset num_batches: 128
  num_activations: 73323776
  val loss 5.252006
  step 0: train loss 5.356172 (took 26772.736942 ms)
  Error: must forward with targets before backward

Support older CUDA GPU hardware by default

This project is phenomenal, thank you so much for this.

I think the project is currently aiming for:
(1) Education and simplicity - making it simple to understand and build (avoid bloat).
(2) Efficient performance for experimentation for research - making it useful.

I would also consider adding a new target for

(3) Supporting older, cheap, globally available hardware such as previous-generation GPUs - to lower the barrier for experimentation and research.

Therefore, I would try to support CUDA 10 or CUDA 9 by default rather than starting from CUDA 11, which prevents many older-generation GPUs from running.

Concretely, I think the default should avoid cooperative_groups/reduce.h on older hardware - it was introduced only in CUDA 11 in this commit - 2c81198#diff-bf6b442957e5458cf8baab2a18039fdde86d74199a0864a79e7288fe55f31a98R15

some Rust error

(llmc-env) nyck33@lenovo-gtx1650:/mnt/d/ML/llm.c$ OMP_NUM_THREADS=8 ./train_gpt2
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.251911
step 0: train loss 5.356082 (took 11221.492729 ms)
step 1: train loss 4.300639 (took 10770.195235 ms)
step 2: train loss 4.623087 (took 10621.310078 ms)
step 3: train loss 4.599362 (took 10596.792720 ms)
step 4: train loss 4.616664 (took 10895.748351 ms)
step 5: train loss 4.231427 (took 10596.324310 ms)
step 6: train loss 3.753161 (took 10504.295265 ms)
step 7: train loss 3.650458 (took 11034.112917 ms)
step 8: train loss 4.182242 (took 10612.828309 ms)
step 9: train loss 4.199580 (took 10545.814677 ms)
val loss 4.426364
step 10: train loss 4.288661 (took 10601.000062 ms)
step 11: train loss 3.560642 (took 10510.856707 ms)
step 12: train loss 3.731437 (took 10538.586079 ms)
step 13: train loss 4.158511 (took 10577.684063 ms)
step 14: train loss 3.885633 (took 10703.429353 ms)
step 15: train loss 3.766486 (took 10607.367802 ms)
step 16: train loss 4.144007 (took 10580.083612 ms)     
step 17: train loss 3.961167 (took 10524.691744 ms)     
step 18: train loss 3.796044 (took 10518.218191 ms)     
step 19: train loss 3.371042 (took 10600.572562 ms)     
val loss 4.250554
generated: 50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26
step 20: train loss 3.882789 (took 11224.717460 ms)

Then I copy-paste that into the Python tiktoken code:

import tiktoken
enc = tiktoken.get_encoding("gpt2")
ptok = lambda x: print(enc.decode(list(map(int, x.strip().split()))))
ptok("50256 16773 18162 21986 11 198 13681 263 23875 198 3152 262 11773 2910 198 1169 6002 6386 2583 286 262 11858 198 20424 428 3135 7596 995 3675 13 198 40 481 407 736 17903 11 329 703 6029 706 4082 198 42826 1028 1128 633 263 11 198 10594 407 198 2704 454 680 1028 262 1027 28860 286 198 3237 323")

ptok("50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26")

The first one is from the README, so it works, but the second one, which I copied from my terminal, throws:

(llmc-env) nyck33@lenovo-gtx1650:/mnt/d/ML/llm.c$ python decode.py
<|endoftext|>Come Running Away,
Greater conquer
With the Imperial blood
the heaviest host of the gods
into this wondrous world beyond.
I will not back thee, for how sweet after birth
Netflix against repounder,
will not
flourish against the earlocks of
Allay
thread '<unnamed>' panicked at src/lib.rs:201:64:
no entry found for key
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/mnt/d/ML/llm.c/decode.py", line 6, in <module>
    ptok("50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26")
  File "/mnt/d/ML/llm.c/decode.py", line 3, in <lambda>
    ptok = lambda x: print(enc.decode(list(map(int, x.strip().split()))))
  File "/home/nyck33/miniconda3/envs/llmc-env/lib/python3.9/site-packages/tiktoken/core.py", line 258, in decode
    return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
pyo3_runtime.PanicException: no entry found for key
  1. Why am I not getting the same tokens as the Readme?
  2. What is that Rust error?
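On the second question: GPT-2's vocabulary has 50257 entries (ids 0 to 50256), and the generated sequence above contains 60640, which has no entry in the encoding table, so tiktoken's Rust core panics with "no entry found for key". A small guard (a sketch, not part of the repo) reports the out-of-range ids instead of panicking:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

def safe_ptok(s):
    ids = list(map(int, s.strip().split()))
    # ids at or above enc.n_vocab have no entry in the GPT-2 table and make decode panic
    bad = [t for t in ids if t >= enc.n_vocab]
    if bad:
        print(f"out-of-vocabulary ids: {bad}")
    print(enc.decode([t for t in ids if t < enc.n_vocab]))

safe_ptok("50256 40 373 523 24776 351 534 1986 25 284 1282 290 996 484 561 407 466 340 597 517 621 355 198 5756 514 26 475 508 1683 460 1210 198 39276 257 995 523 1336 11 198 2504 612 1183 40 60640 606 15950 26")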

AssertionError: Torch not compiled with CUDA enabled

Getting the above error when I run python3 train_gpt2.py.

I am running the code on my laptop, which does not have an NVIDIA graphics card. Is the code supposed to run on the GPU first, export the files, and then run on the CPU?

Generation error on MPS (Torch >= 2.2.0, MacOS 14.4)

When running train_gpt2.py, I get all 16 output tokens equal to "!" (token 0).

Here is the complete output:

❯ python3 train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.2700090408325195
iteration 1, loss: 4.059708118438721
iteration 2, loss: 3.375123977661133
iteration 3, loss: 2.800778388977051
iteration 4, loss: 2.315387725830078
iteration 5, loss: 1.8490203619003296
iteration 6, loss: 1.3946478366851807
iteration 7, loss: 0.999144434928894
iteration 8, loss: 0.624073326587677
iteration 9, loss: 0.37650370597839355
<|endoftext|>!!!!!!!!!!!!!!!!
---------------

I am currently using Python 3.12 and PyTorch 2.2.2 on MPS on an M1 Pro MacBook running MacOS 14.4.1.
I have tried it with Python 3.9 and 3.11 as well (all on Torch 2.2.2), and all have the same issue.
When downgrading to torch 2.1.X or using the CPU, this does not happen.
I had the same issue with the original NanoGPT implementation (see karpathy/nanoGPT#458).

I also tried to test this on a friend's M1 Pro MacBook using the following script, and the output is the same.

#!/bin/bash

cd "$HOME" || exit 0
OUT_FNAME="$HOME/results.txt"

python3 --version | tee "$OUT_FNAME"

git clone https://github.com/karpathy/llm.c.git
cd llm.c || exit 1
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

pip3 list | tee -a "$OUT_FNAME"

python3 prepro_tinyshakespeare.py
python3 train_gpt2.py | tee -a "$OUT_FNAME"

rm -rf "$HOME/llm.c"
if [ -z "$XDG_CACHE_DIR" ]; then
    rm -rf "$HOME/.cache/huggingface/hub/models--gpt2"
else
    rm -rf "$XDG_CACHE_DIR/huggingface/hub/models--gpt2"
fi

echo "Done! Output file stored in $OUT_FNAME"

After some debugging, I narrowed it down to a bug using "advanced indexing" when setting the non-top-k logits to -inf in train_gpt2.py:203:

logits[logits < v[:, [-1]]] = -float('Inf')

Indeed, after that line, the logits tensor only contains -inf, which makes the softmax values nan, and sampling those always yields token 0.

I was also able to reproduce the issue using the Python interpreter to modify values in a tensor using conditional indexing.

Since I haven't seen anyone with this same problem, I want to make sure I am not the only one having it, so let me know if anyone did.
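For anyone who wants to check their own machine, here is a minimal reproduction of just the indexing step (a sketch with arbitrary values, independent of the training script):

import torch

def topk_mask(device):
    logits = torch.arange(12, dtype=torch.float32, device=device).reshape(2, 6)
    v, _ = torch.topk(logits, 3)
    # same pattern as train_gpt2.py: everything below the k-th largest value becomes -inf
    logits[logits < v[:, [-1]]] = -float('Inf')
    return logits

print(topk_mask("cpu"))
if torch.backends.mps.is_available():
    # on the affected Torch/MPS combinations this reportedly comes back as all -inf
    print(topk_mask("mps"))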

Dead stores

On commit b0fc807 the I/O looks clean, but scan-build complained about dead stores. I can remediate this if it is a problem.

/opt/homebrew/opt/llvm/bin/scan-build make

In file included from test_gpt2.c:2:
./train_gpt2.c:828:16: warning: Value stored to 'l_ln1b' during its initialization is never read [deadcode.DeadStores]
828 | float* l_ln1b = params.ln1b + l * C;
| ^~~~~~ ~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:830:16: warning: Value stored to 'l_qkvb' during its initialization is never read [deadcode.DeadStores]
830 | float* l_qkvb = params.qkvb + l * 3*C;
| ^~~~~~ ~~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:832:16: warning: Value stored to 'l_attprojb' during its initialization is never read [deadcode.DeadStores]
832 | float* l_attprojb = params.attprojb + l * C;
| ^~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:834:16: warning: Value stored to 'l_ln2b' during its initialization is never read [deadcode.DeadStores]
834 | float* l_ln2b = params.ln2b + l * C;
| ^~~~~~ ~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:836:16: warning: Value stored to 'l_fcb' during its initialization is never read [deadcode.DeadStores]
836 | float* l_fcb = params.fcb + l * 4*C;
| ^~~~~ ~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:838:16: warning: Value stored to 'l_fcprojb' during its initialization is never read [deadcode.DeadStores]
838 | float* l_fcprojb = params.fcprojb + l * C;
| ^~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:858:16: warning: Value stored to 'l_preatt' during its initialization is never read [deadcode.DeadStores]
858 | float* l_preatt = acts.preatt + l * B * NH * T * T;
| ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:860:16: warning: Value stored to 'l_attproj' during its initialization is never read [deadcode.DeadStores]
860 | float* l_attproj = acts.attproj + l * B * T * C;
| ^~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:867:16: warning: Value stored to 'l_fcproj' during its initialization is never read [deadcode.DeadStores]
867 | float* l_fcproj = acts.fcproj + l * B * T * C;
| ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
./train_gpt2.c:868:16: warning: Value stored to 'l_residual3' during its initialization is never read [deadcode.DeadStores]
868 | float* l_residual3 = acts.residual3 + l * B * T * C;
| ^~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

write LLVM optimization passes for train_gpt2

Here is a little example:

Multiplications where one operand is a constant integer power of 2 are optimized into a shift operation; the shift amount is computed as the logBase2 of the constant.

bool optBasicStrengthReduction(Instruction &I) {
  auto OpCode = I.getOpcode();

  if (OpCode != Instruction::Mul) return false;

  Value *Op1 = I.getOperand(0);
  Value *Op2 = I.getOperand(1);
  ConstantInt *CI = nullptr;

  // Check if op is a constant integer and is a power of 2
  auto isConstPowOf2 = [&CI](Value *op) {
    return (CI = dyn_cast<ConstantInt>(op))
      and CI->getValue().isPowerOf2()
      and not CI->isOne();
  };

  if (isConstPowOf2(Op1)) std::swap(Op1, Op2);
  if (not isConstPowOf2(Op2)) return false;

  errs() << "Triggered train_gpt2 optimization\n";

  // Shift amount calculation
  unsigned ShiftAmount = CI->getValue().logBase2();

  // Create a new shift instruction
  Instruction *ShiftInst = BinaryOperator::Create(
    Instruction::Shl,
    Op1, ConstantInt::get(CI->getType(), ShiftAmount)
  );
    
  ShiftInst->insertAfter(&I);
  I.replaceAllUsesWith(ShiftInst);

  return true;
}

and we need to add a call to the opt in a runOnBasicBlock function:

bool runOnBasicBlock(BasicBlock &B) {
  bool globallyModified = false;
  std::set<Instruction*> toBeErased;

  for (auto &I : B) {
    bool locallyModified =
      // here you can add all your opt passes
      optBasicStrengthReduction(I)
        || optExample2(I)
        || optExample3(I)
        || optExample4(I)
        ...
    
    // dead code elimination
    if (locallyModified) {
      toBeErased.insert(&I);
      globallyModified = true;
    }
  }

  for (auto *I : toBeErased) {
    I->eraseFromParent();
  }

  return globallyModified;
}

To apply the passes, we first compile train_gpt2.c to LLVM IR with clang:

$ clang -emit-llvm -c train_gpt2.c -o train_gpt2.bc
# apply the opt pass
$ opt -load ./build/LocalOpts.so -local-opts train_gpt2.bc -o train_gpt2_opt.bc
# compile the optimized bitcode back into an executable
$ clang train_gpt2_opt.bc -o train_gpt2_opt

Alternative tests / consider modularized approach

Thanks for llm.c! Just wanted to suggest an alternative, modularized version of test_gpt2.c that is potentially useful from an educational standpoint. On that note, a small suggestion: perhaps consider modularizing the tensor comparison to make it reusable for different tensors.

#define TESTING
#include "train_gpt2.c"

// Modularize the tensor comparison for reuse
int compare_tensors(float *expected, float *actual, int size, const char *tensor_name) {
    printf("Comparing %s...\n", tensor_name);
    int is_ok = 1; // Assume tensors match until proven otherwise
    for (int i = 0; i < size; i++) {
        if (fabs(expected[i] - actual[i]) > 1e-2) {
            printf("Mismatch in %s at index %d: Expected %f, Got %f\n", tensor_name, i, expected[i], actual[i]);
            is_ok = 0;
            break; // Early exit on first mismatch
        }
    }
    if (is_ok) {
        printf("%s matches expected values.\n", tensor_name);
    } else {
        printf("%s does not match expected values.\n", tensor_name);
    }
    return is_ok;
}

// Simplified test function
int perform_test(GPT2 model, int* x, int* y, float* expected_logits, float* expected_loss, ParameterTensors expected_grads, int B, int T, int V, int C, int maxT, int L) {
    int all_ok = 1; // Track overall test success
    
    // Forward and backward passes
    gpt2_forward(&model, x, y, B, T);
    gpt2_zero_grad(&model);
    gpt2_backward(&model);

    // Check logits
    all_ok &= compare_tensors(expected_logits, model.acts.logits, B * T * V, "Logits");

    // Check loss
    float actual_loss = model.mean_loss;
    if (fabs(*expected_loss - actual_loss) > 1e-2) {
        printf("Loss mismatch: Expected %f, Got %f\n", *expected_loss, actual_loss);
        all_ok = 0;
    } else {
        printf("Loss matches expected value.\n");
    }

    // Check gradients
    // Example for one tensor; apply similarly for all tensors you wish to compare
    all_ok &= compare_tensors(expected_grads.wte, model.grads.wte, V * C, "Gradient wte");

    return all_ok;
}

int main(int argc, char *argv[]) {
    // Initialize GPT-2 model and load checkpoint
    GPT2 model;
    gpt2_build_from_checkpoint(&model, "gpt2_124M.bin");
    
    // Simplified example of setup for a test; assume these variables are properly initialized
    int B, T, V, C, maxT, L; // Batch size, sequence length, etc.
    int* x; // Input tokens
    int* y; // Target tokens
    float* expected_logits; // Expected logits
    float* expected_loss; // Expected loss value
    ParameterTensors expected_grads; // Expected gradients

    // Simplified loading of test data from a file or other source
    // Assuming these variables are filled with correct test data...

    // Perform test and check all tensors
    int test_result = perform_test(model, x, y, expected_logits, expected_loss, expected_grads, B, T, V, C, maxT, L);

    if (test_result) {
        printf("All tests passed successfully.\n");
    } else {
        printf("One or more tests failed.\n");
    }

    // Free resources
    gpt2_free(&model);
    // Free other dynamically allocated memory (e.g., x, y, expected_logits, expected_loss, expected_grads)

    return !test_result; // Return 0 for success, non-zero for failure
}

[Suggestion] Discussions tab for general help

Just as a simple suggestion, I'm wondering if there can be a discussion tab to post problems with running the system as opposed to posting individual issues. This can keep it more organized, and it's easier to track common threads. A simple point of comparison would be the discussion structure on the grok repository. Thoughts?

Suggested to add a check for the return value of Malloc

test_gpt2.c:55-58
train_gpt2.c:484, 531, 634
Calls to malloc should be checked to verify that the memory allocation succeeded. I suggest adding NULL checks after each malloc call and handling allocation failures gracefully.
Thank you!
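A minimal sketch of what such a check could look like (the malloc_check helper name is only an illustration, not existing code in the repository):

#include <stdio.h>
#include <stdlib.h>

// hypothetical helper: allocate or fail loudly instead of dereferencing NULL later
void* malloc_check(size_t size, const char* what) {
    void* ptr = malloc(size);
    if (ptr == NULL) {
        fprintf(stderr, "malloc failed for %s (%zu bytes)\n", what, size);
        exit(EXIT_FAILURE);
    }
    return ptr;
}

int main(void) {
    float* params_memory = (float*)malloc_check(1024 * sizeof(float), "params_memory");
    free(params_memory);
    return 0;
}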

Fused bias with matmul using `cublasLtMatmul`

Just want to mention that cuBLAS (via the newer cuBLASLt API) does offer an interface that fuses matmul with bias addition: cublasLtMatmul() which computes D = A @ B + C.

You can use the bias as C and even get the broadcasting for free by setting the leading dimension of C to 0. The interface is a lot more verbose than cublasSgemm though.

// is there no better way other than just adding bias with a whole separate kernel?
// this is a highly memory-bound operation, should be fused into the matmul kernel
// but i can't seem to find a cuBLAS function that does this
__global__ void add_bias(float* out, float* bias, int B, int T, int OC) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < B * T * OC; i += stride) {
        int col = i % OC;
        out[i] += bias[col];
    }
}
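For reference, a minimal sketch of the call structure (following the D = A @ B + C form and the ld = 0 broadcast described above; I have not verified that every cuBLASLt version accepts this exact configuration, so treat it as an outline rather than a drop-in implementation):

// fused_bias_sketch.cu -- compile e.g.: nvcc fused_bias_sketch.cu -lcublasLt -o fused_bias_sketch
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublasLt.h>

int main(void) {
    const int m = 4, n = 3, k = 2;   // tiny column-major matrices: A is m x k, B is k x n
    float hA[m * k], hB[k * n], hBias[m], hD[m * n];
    for (int i = 0; i < m * k; i++) hA[i] = 1.0f;
    for (int i = 0; i < k * n; i++) hB[i] = 1.0f;
    for (int i = 0; i < m; i++) hBias[i] = 10.0f;

    float *dA, *dB, *dBias, *dD;
    cudaMalloc(&dA, sizeof(hA)); cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dBias, sizeof(hBias)); cudaMalloc(&dD, sizeof(hD));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    cudaMemcpy(dBias, hBias, sizeof(hBias), cudaMemcpyHostToDevice);

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    cublasLtMatrixLayout_t layoutA, layoutB, layoutC, layoutD;
    cublasLtMatrixLayoutCreate(&layoutA, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&layoutB, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&layoutC, CUDA_R_32F, m, n, 0);  // ld = 0: broadcast the bias column
    cublasLtMatrixLayoutCreate(&layoutD, CUDA_R_32F, m, n, m);

    const float alpha = 1.0f, beta = 1.0f;                      // D = 1*(A@B) + 1*C
    cublasStatus_t st = cublasLtMatmul(lt, op, &alpha, dA, layoutA, dB, layoutB,
                                       &beta, dBias, layoutC, dD, layoutD,
                                       NULL, NULL, 0, 0);
    printf("cublasLtMatmul status: %d\n", (int)st);

    cudaMemcpy(hD, dD, sizeof(hD), cudaMemcpyDeviceToHost);
    printf("D[0] = %f (expected k + bias = %f)\n", hD[0], (float)k + 10.0f);

    // cleanup omitted for brevity
    return 0;
}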

[CUDA ERROR] at file train_gpt2.cu:693: out of memory

How much GPU RAM do I need?
I tried training on my GTX 1650 with 4GB of RAM.
The batch size is already 4, so that's going to be difficult to reduce?

Other options the AI is mentioning are:

  • gradient accumulation
  • checkpointing
  • model pruning
  • offloading computation to the CPU
  • lower precision training, mixing 32, 16 bit floating point, is using 8 bit an option?
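For a rough sense of scale (an estimate, assuming the default batch size 4 and sequence length 1024 shown in other reports above): the CUDA version prints num_activations: 2456637440, and at 4 bytes per float that is 2,456,637,440 x 4 ≈ 9.8 GB for activations alone, before counting parameters, gradients and AdamW state, so the default configuration cannot fit on a 4 GB card.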

LOSS MISMATCH AT STEP 0: 2.864161 5.270007

I'm running on an iMac 27" with MacOS 14.4.1 and 'MPS' on an AMD Radeon Pro 5700 XT GPU.
Comments on the message: LOSS MISMATCH AT STEP 0: 2.864161 5.270007

% python train_gpt2.py 
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/library.py:168: UserWarning: Warning only once for all operators,  other operators may also be overrided.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::var_mean.correction(Tensor self, int[1]? dim=None, *, Scalar? correction=None, bool keepdim=False) -> (Tensor, Tensor)
    registered at /Users/davidlaxer/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: MPS
  previous kernel: registered at /Users/davidlaxer/pytorch/build/aten/src/ATen/RegisterCPU.cpp:31470
       new kernel: registered at /dev/null:1881 (Triggered internally at /Users/davidlaxer/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:160.)
  self.m.impl(name, dispatch_key if dispatch_key != "" else "CompositeImplicitAutograd", fn)
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/TinyStories_val.bin
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 2.86417818069458, time: 2833.053ms
iteration 1, loss: 2.071094512939453, time: 224.394ms
iteration 2, loss: 1.5036660432815552, time: 211.509ms
iteration 3, loss: 1.0592901706695557, time: 213.439ms
iteration 4, loss: 0.67463219165802, time: 220.032ms
iteration 5, loss: 0.41782039403915405, time: 212.604ms
iteration 6, loss: 0.23388634622097015, time: 218.274ms
iteration 7, loss: 0.1198703944683075, time: 216.175ms
iteration 8, loss: 0.07279403507709503, time: 214.477ms
iteration 9, loss: 0.05021585151553154, time: 216.127ms
<|endoftext|>Once upon a time, there was a brave little girl named Lily. She loved
---------------
(AI-Feynman) davidlaxer@bluediamond llm.c % ./test_gpt2            
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
[State]
batch_size: 4
seq_len: 64
num_activations: 73323776
-43.431774 -43.431667
-39.836498 -39.836399
-43.066059 -43.065937
OK (LOGITS)
LOSS OK: 2.864161 2.864178
dwte
OK 0.000305 0.000305
OK -0.001153 -0.001153
OK 0.002915 0.002916
OK 0.001172 0.001172
OK 0.001833 0.001833
TENSOR OK
dwpe
OK -0.000876 -0.000873
OK -0.001630 -0.001632
OK 0.000169 0.000171
OK 0.004004 0.004006
OK 0.001349 0.001350
TENSOR OK
dln1w
OK 0.001079 0.001080
OK 0.001543 0.001546
OK 0.005387 0.005396
OK -0.004479 -0.004483
OK 0.002376 0.002377
TENSOR OK
dln1b
OK -0.030615 -0.030608
OK -0.005774 -0.005792
OK 0.007526 0.007541
OK -0.002763 -0.002748
OK -0.005131 -0.005105
TENSOR OK
dqkvw
OK 0.000018 0.000018
OK -0.000016 -0.000016
OK 0.000007 0.000007
OK -0.000046 -0.000046
OK 0.000060 0.000060
TENSOR OK
dqkvb
OK 0.000082 0.000082
OK -0.000057 -0.000057
OK 0.000116 0.000115
OK -0.000353 -0.000353
OK 0.000134 0.000134
TENSOR OK
dattprojw
OK 0.000003 0.000003
OK -0.000052 -0.000052
OK 0.000032 0.000032
OK 0.000001 0.000001
OK 0.000003 0.000003
TENSOR OK
dattprojb
OK -0.000731 -0.000730
OK -0.000462 -0.000462
OK 0.001580 0.001580
OK 0.009938 0.009946
OK -0.010287 -0.010277
TENSOR OK
dln2w
OK 0.001610 0.001621
OK 0.003811 0.003809
OK -0.000234 -0.000234
OK -0.000306 -0.000306
OK 0.001870 0.001871
TENSOR OK
dln2b
OK 0.001294 0.001301
OK 0.002362 0.002358
OK -0.000720 -0.000718
OK 0.007980 0.007986
OK -0.009129 -0.009118
TENSOR OK
dfcw
OK -0.000160 -0.000160
OK -0.000017 -0.000017
OK 0.000180 0.000180
OK 0.000353 0.000354
OK -0.000134 -0.000134
TENSOR OK
dfcb
OK 0.000981 0.000979
OK 0.001020 0.001021
OK 0.000001 0.000001
OK 0.000065 0.000065
OK -0.000360 -0.000360
TENSOR OK
dfcprojw
OK -0.000043 -0.000043
OK 0.000180 0.000180
OK 0.000039 0.000039
OK 0.000073 0.000072
OK 0.000013 0.000013
TENSOR OK
dfcprojb
OK -0.000847 -0.000846
OK -0.000840 -0.000839
OK 0.001708 0.001708
OK 0.001528 0.001530
OK -0.000068 -0.000069
TENSOR OK
dlnfw
OK 0.000495 0.000495
OK 0.000482 0.000482
OK -0.000429 -0.000428
OK -0.000885 -0.000886
OK -0.000293 -0.000293
TENSOR OK
dlnfb
OK -0.002935 -0.002935
OK -0.009005 -0.009004
OK 0.000930 0.000929
OK 0.002993 0.002993
OK 0.007300 0.007300
TENSOR OK
step 0: loss 2.864161 (took 15857.063000 ms)
step 1: loss 2.071029 (took 15620.254000 ms)
step 2: loss 1.503640 (took 15320.660000 ms)
step 3: loss 1.059259 (took 15308.281000 ms)
step 4: loss 0.674684 (took 15283.269000 ms)
step 5: loss 0.418388 (took 15243.285000 ms)
step 6: loss 0.233658 (took 15149.562000 ms)
step 7: loss 0.119678 (took 15595.255000 ms)
step 8: loss 0.072536 (took 15444.902000 ms)
step 9: loss 0.050150 (took 15191.823000 ms)
LOSS MISMATCH AT STEP 0: 2.864161 5.270007
LOSS MISMATCH AT STEP 1: 2.071029 4.059707
LOSS MISMATCH AT STEP 2: 1.503640 3.375123
LOSS MISMATCH AT STEP 3: 1.059259 2.800783
LOSS MISMATCH AT STEP 4: 0.674684 2.315382
LOSS MISMATCH AT STEP 5: 0.418388 1.849029
LOSS MISMATCH AT STEP 6: 0.233658 1.394656
LOSS MISMATCH AT STEP 7: 0.119678 0.999147
LOSS MISMATCH AT STEP 8: 0.072536 0.624080
LOSS MISMATCH AT STEP 9: 0.050150 0.376511
overall okay: 0

A little speed up is possible with a simple modification

In the file train_gpt2.py:

You can replace the line

return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))

with

return 0.5 * input * (1.0 + torch.tanh(0.7978845608 * (input + 0.044715 * torch.pow(input, 3.0))))

since math.sqrt(2.0 / math.pi) is a constant, approximately equal to 0.7978845608.

More instances can be found if the code is scanned carefully. This line alone replaces a divide and a square root (many, many cycles on x64 and ARM) with a single constant.
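A self-contained sketch of the idea (the constant is just math.sqrt(2.0 / math.pi) precomputed once at import time):

import math
import torch

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)  # ~0.7978845608, computed once

def gelu_tanh(input):
    # tanh approximation of GELU with the constant folded out of the hot path
    return 0.5 * input * (1.0 + torch.tanh(SQRT_2_OVER_PI * (input + 0.044715 * torch.pow(input, 3.0))))

x = torch.randn(4)
print(gelu_tanh(x))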

OverflowError: can't convert negative int to unsigned

$ python train_gpt2.py
using device: mps
2024-04-08 14:29:47.147256: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
loading weights from pretrained gpt: gpt2
config.json: 100%|████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 724kB/s]
model.safetensors: 100%|███████████████████████████████████████████████████| 548M/548M [00:41<00:00, 13.2MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████| 124/124 [00:00<00:00, 827kB/s]
loading cached tokens in data/tiny_shakespeare_val.bin
/Users/gautam/llm.c/train_gpt2.py:333: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:212.)
tokens = torch.from_numpy(tokens)
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.2700114250183105
iteration 1, loss: 4.059763431549072
iteration 2, loss: 3.3751978874206543
iteration 3, loss: 2.8009214401245117
iteration 4, loss: 2.3155126571655273
iteration 5, loss: 1.8491268157958984
iteration 6, loss: 1.3947665691375732
iteration 7, loss: 0.9993425607681274
iteration 8, loss: 0.6240987777709961
iteration 9, loss: 0.3764863610267639
Traceback (most recent call last):
File "/Users/gautam/llm.c/train_gpt2.py", line 381, in <module>
print(decode(y[0].tolist()))
File "/Users/gautam/llm.c/train_gpt2.py", line 313, in <lambda>
decode = lambda l: enc.decode(l)
File "/Users/gautam/miniconda3/lib/python3.10/site-packages/tiktoken/core.py", line 254, in decode
return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
OverflowError: can't convert negative int to unsigned

Error while running `make train_gpt2` on a Windows machine

When running the command `make train_gpt2` on a Windows machine:

process_begin: CreateProcess(NULL, uname, ...) failed.
Makefile:13: pipe: No error
The system cannot find the path specified.
Makefile:32: OOPS Compiling without OpenMP support
cc -O3 -Ofast -Wno-unused-result   train_gpt2.c -lm -o train_gpt2
process_begin: CreateProcess(NULL, cc -O3 -Ofast -Wno-unused-result train_gpt2.c -lm -o train_gpt2, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [Makefile:43: train_gpt2] Error 2
PS C:\Users\ADMIN\Desktop\llm.c> 

C++ version: g++.exe (GCC) 11.2.0
python version: 3.12.0
Windows 10

Why not Mojo?

Serious question here. If you're going down to the metal, Mojo provides a potentially large speedup, and the language would benefit significantly from this work. Either way - love the work. Thank you!

Suboptimal warp reductions

It is preferred to implement reductions with __shfl_xor_sync as opposed to __shfl_down_sync. This way all lanes end up with the final value and you no longer need to broadcast the final value to the other lanes.

// warp-level reduction for finding the maximum value
__device__ float warpReduceMax(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
    }
    return val;
}

// warp-level reduction for summing values
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

"OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized." Then it hangs at "python train_gpt2.py"

(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
zsh: abort python train_gpt2.py
(base) billhuang@bh-m1-max llm.c % export KMP_DUPLICATE_LIB_OK=TRUE
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2

Output is not consistent when I load gpt2_124M.bin

I load gpt2_124M.bin with the following code:

import numpy as np
import torch
import tiktoken

from train_gpt2 import GPT, GPTConfig

def read_wightdict(weights_dict, file, L, config):
    idx = 256
    with open(file, "rb") as f:
        data = f.read()
    data_copy = np.copy(data)
    weights = torch.from_numpy(np.frombuffer(data_copy, dtype=np.float32))
    weights_dict["transformer.wte.weight"] = weights[idx:idx+config["vocab_size"]*config["n_embd"]].reshape(config["vocab_size"], config["n_embd"])
    idx += config["vocab_size"]*config["n_embd"]
    weights_dict["transformer.wpe.weight"] = weights[idx:idx+config["block_size"]*config["n_embd"]].reshape(config["block_size"], config["n_embd"])
    idx += config["block_size"]*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.ln_1.weight"] = weights[idx:idx+config["n_embd"]]
        idx += config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.ln_1.bias"] = weights[idx:idx+config["n_embd"]]
        idx += config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.attn.c_attn.weight"] = weights[idx:idx+config["n_embd"]*3*config["n_embd"]].reshape(3*config["n_embd"], config["n_embd"])
        idx += 3*config["n_embd"]*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.attn.c_attn.bias"] = weights[idx:idx+3*config["n_embd"]]
        idx += 3*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.attn.c_proj.weight"] = weights[idx:idx+config["n_embd"]*config["n_embd"]].reshape(config["n_embd"], config["n_embd"])
        idx += config["n_embd"]*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.attn.c_proj.bias"] = weights[idx:idx+config["n_embd"]]
        idx += config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.ln_2.weight"] = weights[idx:idx+config["n_embd"]]
        idx += config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.ln_2.bias"] = weights[idx:idx+config["n_embd"]]
        idx += config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.mlp.c_fc.weight"] = weights[idx:idx+config["n_embd"]*4*config["n_embd"]].reshape(4*config["n_embd"], config["n_embd"])
        idx += 4*config["n_embd"]*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.mlp.c_fc.bias"] = weights[idx:idx+4*config["n_embd"]]
        idx += 4*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.mlp.c_proj.weight"] = weights[idx:idx+config["n_embd"]*4*config["n_embd"]].reshape(config["n_embd"], 4*config["n_embd"])
        idx += 4*config["n_embd"]*config["n_embd"]
    for i in range(L):
        weights_dict[f"transformer.h.{i}.mlp.c_proj.bias"] = weights[idx:idx+config["n_embd"]]
        idx += config["n_embd"]
    weights_dict["transformer.ln_f.weight"] = weights[idx:idx+config["n_embd"]]
    idx += config["n_embd"]
    weights_dict["transformer.ln_f.bias"] = weights[idx:idx+config["n_embd"]]
    return weights_dict

def load_model(file, model_type):
    config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
    config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
    config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints
    # create a from-scratch initialized minGPT model
    config = GPTConfig(**config_args)
    model = GPT.from_pretrained("gpt2")
    weights = model.state_dict()
    weights = read_wightdict(weights, file, config_args['n_layer'], config_args)
    model.load_state_dict(weights, strict=True)
    return model

if __name__ == '__main__':
    enc = tiktoken.get_encoding("gpt2")
    encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
    decode = lambda l: enc.decode(l)
    device = "cpu"
    max_new_tokens = 16
    temperature = 1.0
    top_k = 40
    model = load_model('gpt2_124M.bin', 'gpt2')
    model.eval()
    start = "<|endoftext|>"
    start_ids = encode(start)
    x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
    y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
    print(decode(y[0].tolist()))

but the output differs from that of train_gpt2.py:
the output of train_gpt2.py is "If we want to die, we have to die at the front"
but the output of mine is "My Life As A Man' premieres at 11 p.m. on"
Why?

[build failed] Compiler encountered an internal error

env

  • system
    • ProductName: macOS
    • ProductVersion: 12.6.1
    • Intel version.
  • python:
    • Python 3.12.1 (v3.12.1:2305ca5144, Dec 7 2023, 17:23:39) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin

log


$python train_gpt2.py 
using device: mps
loading weights from pretrained gpt: gpt2
model.safetensors: 100%|██████████████████████████████████████████████████████████████████| 548M/548M [00:38<00:00, 14.2MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 784kB/s]
loading cached tokens in data/tiny_shakespeare_val.bin
/AppleInternal/Library/BuildRoots/a0876c02-1788-11ed-b9c4-96898e02b808/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSLibrary.mm:504: failed assertion `MPSKernel MTLComputePipelineStateCache unable to load function copyNDArrayData.
        Compiler encountered an internal error: (null)
'
Abort trap: 6
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
$
$
$
$
$
$python train_gpt2.py 
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
/AppleInternal/Library/BuildRoots/a0876c02-1788-11ed-b9c4-96898e02b808/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSLibrary.mm:504: failed assertion `MPSKernel MTLComputePipelineStateCache unable to load function copyNDArrayData.
        Compiler encountered an internal error: (null)
'
Abort trap: 6

Not all heroes wear capes

Just wanted to say thank you on behalf of the community.

Thank you Andrej. ❤️

I am closing this issue right after submitting it, so you don't have to.

Calloc NULL not handled

make -f Makefile_gnu  # changed to gcc-13, passing -fanalyzer to CFLAGS

Mostly clean but flagged that calloc can return NULL.

model->m_memory = (float*)calloc(model->num_parameters, sizeof(float));

I implemented Infini-attention from the paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" with the help of GPT, but I do not know the C language. I hope that Infini-attention can be used in future model training.

#include <math.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

extern const int C;
extern const int NH;
extern const int HS; //HS = C / NH
extern const float scale;

void softmax(float* data, int length) {
float maxval = -1e20f;
for (int i = 0; i < length; i++) {
if (data[i] > maxval)
maxval = data[i];
}

float sum = 0.0f;
for (int i = 0; i < length; i++) {
    data[i] = expf(data[i] - maxval);
    sum += data[i];
}

for (int i = 0; i < length; i++) {
    data[i] /= sum;
}

}

// ELU + 1
float elu_plus_one(float x) {
return (x >= 0 ? x : expf(x) - 1) + 1;
}

void infini_attention_forward(float* out, float* preatt, float* att, float* inp, int B, int T, int C, int NH, float* memory, float* zs) {
int HS = C / NH;
float scale = 1.0 / sqrtf(HS);
float* mem = memory;
float* norm = zs;

#pragma omp parallel for collapse(3)
for (int b = 0; b < B; b++) {
    for (int t = 0; t < T; t++) {
        for (int h = 0; h < NH; h++) {
            float* query = inp + (b * T * C) + (t * C) + (h * HS);
            float* preatt_bth = preatt + (b * NH * T * T) + (h * T * T) + (t * T);
            float* att_bth = att + (b * NH * T * T) + (h * T * T) + (t * T);

            // Pass 1: softmax
            float maxval = -1e20f;
            for (int t2 = 0; t2 < T; t2++) {
                float* key = inp + (b * T * C) + (t2 * C) + (h * HS);
                float val = 0.0f;
                for (int i = 0; i < HS; i++) {
                    val += query[i] * key[i];
                }
                val *= scale;
                if (val > maxval) {
                    maxval = val;
                }
                preatt_bth[t2] = val;
            }

            softmax(preatt_bth, T);

            // Pass 2: 
            float* out_bt = out + (b * T * C) + (t * C) + (h * HS);
            memset(out_bt, 0, HS * sizeof(float));
            for (int t2 = 0; t2 < T; t2++) {
                float* value = inp + (b * T * C) + (t2 * C) + (h * HS);
                float weight = att_bth[t2];
                for (int i = 0; i < HS; i++) {
                    out_bt[i] += weight * value[i];
                }

                // 
                float elu_k = elu_plus_one(preatt_bth[t2]);
                for (int i = 0; i < HS; i++) {
                    mem[(h * HS) + i] += elu_k * value[i];
                    norm[(h * HS) + i] += elu_k;
                }
            }
        }
    }
}

}

//
void initialize_data(float* data, int size, float init_value = 0.0f) {
for (int i = 0; i < size; i++) {
data[i] = init_value;
}
}

void initialize_model(int B, int T, int C, int NH, float*& inp, float*& out, float*& preatt, float*& att, float*& memory, float*& zs) {
inp = new float[B * T * C];
out = new float[B * T * C];
preatt = new float[B * NH * T * T];
att = new float[B * NH * T * T];
memory = new float[NH * (C / NH)];
zs = new float[NH * (C / NH)];

// Initialize all to zeros or suitable default values
initialize_data(inp, B * T * C);
initialize_data(out, B * T * C);
initialize_data(preatt, B * NH * T * T);
initialize_data(att, B * NH * T * T);
initialize_data(memory, NH * (C / NH));
initialize_data(zs, NH * (C / NH), 1.0f);  // Normalize factors initialized to 1 to avoid division by zero

}
void cleanup(float* inp, float* out, float* preatt, float* att, float* memory, float* zs) {
delete[] inp;
delete[] out;
delete[] preatt;
delete[] att;
delete[] memory;
delete[] zs;
}

void fill_random(float* data, int size) {
    for (int i = 0; i < size; i++) {
        data[i] = static_cast<float>(rand()) / RAND_MAX;
    }
}

int main() {
srand(time(nullptr));

float *inp, *out, *preatt, *att, *memory, *zs;
int B = 2, T = 10, C = 64, NH = 8;

initialize_model(B, T, C, NH, inp, out, preatt, att, memory, zs);

// Fill input with random data
fill_random(inp, B * T * C);

// Run the attention forward function
infini_attention_forward(out, preatt, att, inp, B, T, C, NH, memory, zs);

// Optionally, add code here to inspect the output, e.g., print or assert conditions

// Cleanup resources
cleanup(inp, out, preatt, att, memory, zs);

return 0;

}

the provided PTX was compiled with an unsupported toolchain.

Does anyone know the CUDA version required to compile train_gpt2cu?
I am getting the error below:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 74
val dataset num_batches: 8
batch size: 4
sequence length: 1024
val_num_batches: 10
num_activations: 2456637440
[CUDA ERROR] at file train_gpt2.cu:341:
the provided PTX was compiled with an unsupported toolchain.

My CUDA version is 12.1. Any recommendation on how to get past this? Thanks.

Waiting for the CUDA implementation

Really appreciate your work; it helped us a lot to understand the core of training an LLM. I have run and understood the C implementation, so it would be awesome if we could learn more about your CUDA implementation.
