Comments (5)

tzipproth commented on June 12, 2024

Thanks for the hint about the static library; after several tries, I was able to compile the test_gpt2cu program successfully using this command:

nvcc -O3 --use_fast_math test_gpt2.cu /usr/local/cuda/lib64/libcublas_static.a /usr/local/cuda/lib64/libcublasLt_static.a /usr/local/cuda/lib64/libcudart_static.a /usr/local/cuda/lib64/libculibos.a -lpthread -ldl -lrt -o test_gpt2cu

There were some warnings:
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libdl.a' when searching for -ldl
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/librt.a' when searching for -lrt

but test_gpt2cu worked.

The train_gpt2cu compile also worked, but it seems the 8 GB of VRAM on an RTX 2070 is not enough; the CUDA driver uses shared memory and training becomes very slow:

[System]
Device 0: NVIDIA GeForce RTX 2070
enable_tf32: 0
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
batch size: 4
sequence length: 1024
train dataset num_batches: 74
val dataset num_batches: 8
num_activations: 2456637440
val loss 4.513920
step 1/74: train loss 4.367857 (49702.271082 ms)
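A rough back-of-the-envelope check of the counts printed in the log above. It assumes every parameter and activation is stored as a 4-byte fp32 value (an assumption; the log only shows enable_tf32: 0) and it ignores gradients and optimizer state, so it is a lower bound rather than the program's actual allocation:

```python
# Lower-bound memory estimate from the counts printed in the log above.
# Assumption: 4 bytes (fp32) per value; gradients and optimizer state are ignored.
BYTES_PER_VALUE = 4
GIB = 2**30

num_parameters = 124_439_808
num_activations = 2_456_637_440

param_bytes = num_parameters * BYTES_PER_VALUE
activation_bytes = num_activations * BYTES_PER_VALUE
total_bytes = param_bytes + activation_bytes

print(f"parameters:  {param_bytes / GIB:.2f} GiB")
print(f"activations: {activation_bytes / GIB:.2f} GiB")
print(f"total:       {total_bytes / GIB:.2f} GiB (card has 8 GiB)")
```

Under that assumption the activations alone come to roughly 9.15 GiB at batch size 4, which would already exceed the memory on the card.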

from llm.c.

colinbrace commented on June 12, 2024

Hi, I made progress on this one. Almost there, but not 100%; read on. I wanted to share my learnings so far. Not surprisingly, the whole static/shared library aspect is a red herring when it comes to getting things to work in Windows Subsystem for Linux (WSL).

Here are repeatable steps to get train_gpt2cu to run in WSL. Really fun to see it execute!

  1. Install Ubuntu fresh via wsl. I'm sharing the version that I used.
~/dev/llm.c$ uname -r
5.15.146.1-microsoft-standard-WSL2
  2. Get the latest bits and bytes and tools that you'll need
sudo apt update
sudo apt upgrade
sudo apt install gcc
sudo apt install python3-pip
  3. Restart the Linux instance
  4. Key point! Get the 12.2 CUDA toolkit. Neither 12.4 nor the default version a la (sudo apt install nvidia-cuda-toolkit) seems to work. Follow the instructions here: https://developer.nvidia.com/cuda-12-2-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=runfile_local
  5. Follow these instructions, per the above installer
 -   PATH includes /usr/local/cuda-12.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root
  6. Git clone away and follow the llm.c CUDA instructions.
  7. Enjoy!
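If the build still cannot find the toolkit after these steps, a quick way to confirm the two environment settings from the installer instructions is a small check like the one below. This is only a sketch: the function names are made up for illustration, and it does not cover the alternative route of adding the library directory to /etc/ld.so.conf.

```python
import os

# Paths named in the installer instructions above.
CUDA_BIN = "/usr/local/cuda-12.2/bin"
CUDA_LIB = "/usr/local/cuda-12.2/lib64"


def has_entry(path_list: str, entry: str) -> bool:
    """Return True if `entry` appears in a colon-separated path list."""
    parts = [p.rstrip("/") for p in path_list.split(":") if p]
    return entry.rstrip("/") in parts


def check_cuda_env(env) -> dict:
    """Report whether PATH and LD_LIBRARY_PATH contain the CUDA 12.2 directories."""
    return {
        "PATH": has_entry(env.get("PATH", ""), CUDA_BIN),
        "LD_LIBRARY_PATH": has_entry(env.get("LD_LIBRARY_PATH", ""), CUDA_LIB),
    }


if __name__ == "__main__":
    for var, ok in check_cuda_env(os.environ).items():
        print(f"{var}: {'ok' if ok else 'missing CUDA 12.2 entry'}")
```

A "missing" result for LD_LIBRARY_PATH is not necessarily a problem if the ldconfig route was used instead.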

P.S.
python3 train_gpt2.py

fails with

Running pytorch 2.2.2+cu121
using device: cuda
wrote gpt2_tokenizer.bin
loading weights from pretrained gpt: gpt2
loading cached tokens in data/tiny_shakespeare_val.bin
wrote gpt2_124M.bin
Traceback (most recent call last):
  File "/home/colin/dev/llm.c/train_gpt2.py", line 403, in <module>
    write_state(model, x, y, logits, loss, "gpt2_124M_debug_state.bin")
  File "/home/colin/dev/llm.c/train_gpt2.py", line 279, in write_state
    grads = {name: param.grad.cpu() for name, param in model.named_parameters()}
  File "/home/colin/dev/llm.c/train_gpt2.py", line 279, in <dictcomp>
    grads = {name: param.grad.cpu() for name, param in model.named_parameters()}
AttributeError: 'NoneType' object has no attribute 'cpu'

so there is one more (Python?) dependency issue to sort out. You can get around this, sadly, by commenting out the line

#write_state(model, x, y, logits, loss, "gpt2_124M_debug_state.bin")

This is obviously a silly solution; I'll need to spend more time later on resolving this one. But getting close to running on WSL!
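The traceback says that at least one parameter's .grad was None when write_state ran. A small diagnostic like the one below could show which parameters are affected before deciding how to handle them. It is a sketch, not the fix used in the repository: names_without_grad is a hypothetical helper, and the stand-in objects only exist so the snippet runs without PyTorch installed.

```python
from types import SimpleNamespace


def names_without_grad(named_params):
    """Return the names of parameters whose .grad is None.

    `named_params` is any iterable of (name, param) pairs, such as the
    result of model.named_parameters() in PyTorch.
    """
    return [name for name, param in named_params if param.grad is None]


# Stand-in objects with made-up names, used instead of real tensors.
fake_params = [
    ("wte.weight", SimpleNamespace(grad=[0.1, 0.2])),
    ("lm_head.weight", SimpleNamespace(grad=None)),
]
print(names_without_grad(fake_params))  # prints ['lm_head.weight']
```

In train_gpt2.py the call would be names_without_grad(model.named_parameters()), placed just before write_state. A parameter's .grad stays None until a backward pass has populated it, so an empty result there would point away from the model itself.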

dagelf commented on June 12, 2024

If for some weird reason you're not on Ubuntu 22.04 ... Ubuntu used to have a notoriously old Python, so I'm pretty sure that would be your issue. But even on Linux it was a struggle to get PyTorch and CUDA versions that match up, just a few months back.

(Incredible that CUDA works at all on WSL! Too many of the things I love about Linux don't work on it: block devices, kernel options, network things. I've even had issues with pipes! I did love coLinux... too bad it didn't survive 😢)

You should try to do a fresh pull though, this project is growing fast!

colinbrace commented on June 12, 2024

I've filed an issue with nvidia; I'll post back if I hear anything:

https://forums.developer.nvidia.com/t/device-not-found-for-shared-cublas-but-found-for-static-cublas-static/289614

colinbrace commented on June 12, 2024

Resolved after a git fetch! I didn't see any changes in train_gpt2.py that would have addressed this, so perhaps the fix was elsewhere. I didn't update the Python version.

For the record, here are the versions of Ubuntu and Python with which this is working:

Welcome to Ubuntu 22.04.4 LTS (GNU/Linux 5.15.146.1-microsoft-standard-WSL2 x86_64)

~/dev/llm.c$ python3 --version
Python 3.10.12

Thanks, dagelf and tzipproth, for your suggestions and ideas. WSL is now working end to end!
