
Support MPI distributed training · llm.c (OPEN, 6 comments)

karpathy commented on June 12, 2024
Support MPI distributed training


Comments (6)

Yiltan commented on June 12, 2024

I regularly write MPI code, so this shouldn't be too complicated to implement. I've started to look through the CPU version to get started. However, I do have questions regarding the ML side.

There are a few options I can see:

  1. Data parallelism, using MPI_Allreduce to average gradients (see the sketch after this comment)
    - I think we would do this around here:
    - https://github.com/karpathy/llm.c/blob/master/train_gpt2.c#L906C1-L906C5
  2. Tensor parallelism (similar to llama.cpp)
  3. Model parallelism

Is there a preference for how this should be scaled with MPI? If option 2 or 3 seems like the better choice, do you have a suggestion as to where in the code I should dig in?
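
For option 1, here is a minimal sketch of what the gradient averaging could look like, assuming llm.c's convention of holding all gradients in one contiguous buffer (grads_memory); the function name and call site are hypothetical, not an actual patch:

```c
// Hypothetical data-parallel step (option 1): after each rank has run its
// local backward pass, sum the gradients across ranks and divide by the
// world size, so every rank applies an identical optimizer update.
#include <mpi.h>
#include <stddef.h>

void average_gradients(float *grads, size_t num_parameters) {
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // In-place element-wise sum across all ranks.
    // Note: the count argument is an int, which is fine for GPT-2 124M.
    MPI_Allreduce(MPI_IN_PLACE, grads, (int)num_parameters,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (size_t i = 0; i < num_parameters; i++) {
        grads[i] /= (float)world_size;
    }
}
```

This would presumably sit between the backward pass and the optimizer update in the training loop, around the line linked above.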


karpathy commented on June 12, 2024

Sounds great! I expect to get started on the backward pass sometime over the weekend, most likely.
(I spent today still optimizing the forward pass.)
Once we have the backward pass, getting data-parallel training in will be super awesome.


chadbrewbaker commented on June 12, 2024

I have this in mind for the Mojo target issue, which is really about having the Makefile support composability like the one for llama.cpp. We could probably copy most of what llama.cpp has so the build uses mpicc. We would still need to write the MPI code.
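
A hypothetical sketch of that composability, where USE_MPI is an assumed compile-time define (e.g. a Makefile target that builds with mpicc -DUSE_MPI), so the single-process build stays untouched:

```c
/* Sketch only: USE_MPI is an assumed flag, not an existing llm.c define. */
#ifdef USE_MPI
#include <mpi.h>
#endif

int main(int argc, char **argv) {
    int rank = 0, world_size = 1;  // single-process defaults
#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
#endif

    // ... the training loop is unchanged; only gradient sync is conditional ...

#ifdef USE_MPI
    MPI_Finalize();
#endif
    return 0;
}
```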


karpathy commented on June 12, 2024

definitely! but this is pretty far down the line; i think we first need to get the 1-GPU version to be super solid.


chadbrewbaker commented on June 12, 2024

I would target MPI-2, as MPI-IO is all you need and it is the most widely supported.
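
As an illustration of the MPI-IO point, each rank could read its own shard of the tokenized training file directly, with no root process fanning data out. This is a sketch under assumed names; the flat file layout and int32 token type are assumptions:

```c
// Hypothetical MPI-IO shard read: rank r reads tokens
// [r * tokens_per_rank, (r+1) * tokens_per_rank) from a flat binary file.
#include <mpi.h>
#include <stdint.h>
#include <stddef.h>

void read_token_shard(const char *path, int32_t *tokens, size_t tokens_per_rank) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    // Each rank seeks to its own byte offset and reads its slice.
    MPI_Offset offset = (MPI_Offset)rank * tokens_per_rank * sizeof(int32_t);
    MPI_File_read_at(fh, offset, tokens, (int)tokens_per_rank,
                     MPI_INT32_T, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```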


Yiltan commented on June 12, 2024

[screenshot: llm.c training run output]

The MPI version of this is mostly working at this point. I've tested it on up to 8 nodes, and it reduces training time by many hours.

@karpathy Do you still have interest in an NCCL version? If so, are there any multi-GPU resources that you could share?
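
For what it's worth, the standard pattern for an NCCL version is to bootstrap NCCL with MPI: rank 0 creates a unique id, MPI broadcasts it, and each rank joins the NCCL communicator. A sketch (illustrative, not code from this repo; assumes one GPU per rank on a single node):

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

ncclComm_t init_nccl_from_mpi(void) {
    int rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Rank 0 creates the NCCL id; MPI broadcasts it to everyone.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(rank);  // simplification: one GPU per rank, single node
    ncclComm_t comm;
    ncclCommInitRank(&comm, world_size, id, rank);
    return comm;  // gradient averaging then uses ncclAllReduce on-device
}
```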

