Comments (6)
I regularly write MPI code, so this shouldn't be too complicated to implement. I've started to look through the CPU version to get started. However, I do have questions regarding the ML side.
There are a few options I can see:
- Data parallelism, using MPI_Allreduce to average gradients
  - I think we would do this around here: https://github.com/karpathy/llm.c/blob/master/train_gpt2.c#L906C1-L906C5
- Tensor parallelism (similar to llama.cpp)
- Model parallelism

Is there a preference for how this could be scaled with MPI? If option 2 or 3 seems like the best option, do you have a suggestion as to where in the code I should dig into?
from llm.c.
Sounds great! I expect to get started on the backward pass sometime over the weekend, most likely.
(I spent today optimizing the forward pass still.)
Once we have the backward pass, getting data-parallel training in will be super awesome.
I have this in mind for the Mojo target issue - which is really about having the Makefile support composability like the one for llama.cpp. Probably copy-paste most of what llama.cpp has so the build uses mpicc. We would still need to write the MPI code.
Definitely! But this is pretty far down the line; I think we first need to get the 1-GPU version to be super solid.
I would target MPI-2, as MPI IO is all you need and it is the most widely supported.
The MPI version of this is mostly working at this point; I've tested it on up to 8 nodes. It reduces training time by many hours.
@karpathy Do you still have interest in an NCCL version? If so, are there any multi-GPU resources that you could share?
Related Issues (20)
- Possible NULL Pointer Dereference HOT 2
- void tokenizer_init failed HOT 1
- Possible bugs in the data loading functions
- What would be the main design trade-offs when re-implementing in clean modern C++? HOT 2
- About pull request of custom kernel implementation
- Error: make: *** [Makefile:203: train_gpt2cu] Error 255 HOT 5
- When will llama and other frameworks be supported? HOT 1
- Assertion `graph->check_support(cudnn_handle).is_good()' failed HOT 18
- make: *** [Makefile:194: train_gpt2] Error 2 on Windows HOT 6
- MultiGPU training hangs HOT 9
- How to do Inference on the trained weight of GPT 2 model after finishing the training on CPU using train_gpt2.py and train_gpt2 ?
- more detailed explanation of Multi GPU HOT 3
- Llm on small models
- `make` fails to autodetect GPU compute capability HOT 2
- Is there a plan to support 8bits (FP8 or INT8)? HOT 1
- compute sanitizers HOT 1
- Broader vendor support for hardware acceleration HOT 3
- 2D and 3D tile divisions so that permutation coordinates can be read from threadIdx and blockIdx HOT 3
- ThunderKittens Backend HOT 1
- Mismatch of dweight at layernorm_backward.cu