kaiyuyue / torchshard
Slicing a PyTorch Tensor Into Parallel Shards
License: Apache License 2.0
When I run the unit tests with two GPU devices, they pass:

CUDA_VISIBLE_DEVICES=0,1 python3 -m unittest discover -v -s tests

But when I run the unit tests with eight GPU devices, they raise ncclSystemError:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m unittest discover -v -s tests

RuntimeError: NCCL error in ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Is it necessary for the unit tests to pass on eight GPU devices?
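A common first step when diagnosing an ncclSystemError is to turn on NCCL's debug logging before any process group is created. A minimal sketch, assuming the standard NCCL_DEBUG and NCCL_SOCKET_IFNAME environment variables (eth0 is a placeholder for your actual network interface):

import os

# Set these before any process group is created, e.g. at the top of the
# test entry point, so NCCL logs which system call fails.
os.environ["NCCL_DEBUG"] = "INFO"
# On multi-NIC hosts, pinning NCCL to a known interface often resolves
# socket-related failures; "eth0" is a placeholder.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")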
torchshard/utils.py", line 2, in check_divisibility
    assert numerator % denominator == 0, \
AssertionError: 85742 is not divisible by 8
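For context, this assertion comes from torchshard's divisibility check: a sharded dimension must split evenly across the parallel group, and 85742 % 8 == 6. A minimal sketch of the constraint plus one possible workaround (pad_to_multiple is a hypothetical helper, not part of torchshard):

def check_divisibility(numerator, denominator):
    # Sketch of the check seen in the traceback: the sharded dimension
    # must split evenly across the parallel group.
    assert numerator % denominator == 0, \
        '{} is not divisible by {}'.format(numerator, denominator)

def pad_to_multiple(n, k):
    # Hypothetical workaround: round the dimension up to a multiple of
    # the group size before building the parallel layer.
    return n + (-n % k)  # 85742 -> 85744, which is divisible by 8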
Very nice project!
I wonder if other operations will be supported, such as conv, bn, relu.
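For what it's worth, elementwise ops such as relu need no dedicated parallel version: they commute with slicing, so each rank can apply them to its local shard directly. A minimal sketch, assuming each rank holds its shard as a plain tensor (conv and bn are different, since they need neighboring data or cross-shard statistics):

import torch
import torch.nn.functional as F

local_shard = torch.randn(4, 1024)  # this rank's slice of a larger tensor
# Applying relu shard-by-shard equals applying it to the full tensor and
# then slicing, so no communication is required.
local_out = F.relu(local_shard)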
Hi, thanks for the excellent work!
When I install it from pip and run:

import torchshard as ts
ts.init_process_group(group_size=2)

an AttributeError occurs:

AttributeError: module 'torchshard' has no attribute 'init_process_group'
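One thing worth checking (an assumption based on the file layout, not a confirmed fix): the process-group helpers are defined in torchshard/distributed/core.py, so the call may need the submodule path, and an outdated pip release may also lack newer top-level attributes:

import torchshard as ts

# Assumption: init_process_group lives in the distributed submodule,
# matching its definition in torchshard/distributed/core.py.
ts.distributed.init_process_group(group_size=2)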
Hello Kaiyu,
I love this awesome project. The API design is elegant and simple, and the software is lightweight and user-friendly. My understanding is that this project realizes a series of PyTorch wrappers for tensor slicing.
When one training epoch finishes, the main_worker function calls ts.collect_state_dict(model, state_dict). But because of limited GPU resources on my machine, this call raises an out-of-memory error.
I found that it gathers the state_dict on GPU; is there any way to gather on CPU instead?
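One general workaround (plain PyTorch, not torchshard's API) is to gather through a gloo process group, since gloo supports CPU tensors while NCCL requires GPU tensors. A minimal sketch of the pattern; the gloo group and the gather_on_cpu helper are assumptions for illustration:

import torch
import torch.distributed as dist

# Assumption: create a gloo group alongside the NCCL one; every rank
# must execute this call.
gloo_group = dist.new_group(backend="gloo")

def gather_on_cpu(local_tensor, world_size, group):
    # Stage the shard on CPU, then all_gather over gloo so GPU memory
    # never holds more than the local shard.
    cpu_tensor = local_tensor.detach().cpu()
    gathered = [torch.empty_like(cpu_tensor) for _ in range(world_size)]
    dist.all_gather(gathered, cpu_tensor, group=group)
    return torch.cat(gathered, dim=0)  # concat along the sharded dim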
Thanks for contributing this great lib. I have one question: which one is faster, dim=0 or dim=1? The documentation seems to only contain accuracy results.
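For context, dim=1 slices a linear layer's weight along the output dimension (column-parallel) and dim=0 along the input dimension (row-parallel); which is faster depends on the layer shape and on the collective each variant needs, so benchmarking both on your own model is the reliable answer. A sketch of the two choices, assuming the ParallelLinear layer and its dim parameter from the project README:

import torchshard as ts

# Assumed API: ts.nn.ParallelLinear(in_features, out_features, dim=...).
col_parallel = ts.nn.ParallelLinear(1024, 4096, dim=1)  # column-parallel
row_parallel = ts.nn.ParallelLinear(4096, 1024, dim=0)  # row-parallel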
I have noticed that group_size is set to world_size in the examples, but in fact group_size can be set to other numbers, according to my understanding.
https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18
I have also found that get_world_size() returns the number of all processes.
These two findings confuse me in a multi-node setting, say 2 nodes with 2 processes each.
If group_size is 2, then there are 2 distinct groups besides the default group (with overlap). However, calling get_world_size() without specifying a group can make a layer be split into 4 parts, where 2 is expected in our case.
Correct me if I am wrong.
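To make the concern concrete, here is the same setting in plain torch.distributed terms (the explicit subgroup below is an illustration, not torchshard code):

import torch.distributed as dist

# 2 nodes x 2 processes each: world_size == 4.
world_size = dist.get_world_size()               # 4: all processes
# With group_size = 2, ranks 0-1 form one group; every rank must call
# new_group with the same ranks list.
group = dist.new_group(ranks=[0, 1])
# For ranks inside the group, this returns 2.
group_size = dist.get_world_size(group=group)

# The point of the issue: a layer sharded within a group of size 2
# should be split into group_size (2) parts, not world_size (4).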