kaiyuyue / torchshard
Slicing a PyTorch Tensor Into Parallel Shards
License: Apache License 2.0
When I run the unit tests with two GPU devices, they pass:

CUDA_VISIBLE_DEVICES=0,1 python3 -m unittest discover -v -s tests

But when I run the unit tests with eight GPU devices, they raise ncclSystemError:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m unittest discover -v -s tests

RuntimeError: NCCL error in ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Is it necessary for the unit tests to pass on eight GPU devices?
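A common first step when diagnosing an ncclSystemError is to turn on NCCL's debug logging before any process group is created. A minimal sketch, assuming the standard NCCL_DEBUG and NCCL_SOCKET_IFNAME environment variables (eth0 is a placeholder for your actual network interface):

import os

# Set these before any process group is created, e.g. at the top of the
# test entry point, so NCCL logs which system call fails.
os.environ["NCCL_DEBUG"] = "INFO"
# On multi-NIC hosts, pinning NCCL to a known interface often resolves
# socket-related failures; "eth0" is a placeholder.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")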
torchshard/utils.py", line 2, in check_divisibility
    assert numerator % denominator == 0, \
AssertionError: 85742 is not divisible by 8
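For context, this assertion comes from torchshard's divisibility check: a sharded dimension must split evenly across the parallel group, and 85742 % 8 == 6. A minimal sketch of the constraint plus one possible workaround (pad_to_multiple is a hypothetical helper, not part of torchshard):

def check_divisibility(numerator, denominator):
    # Sketch of the check seen in the traceback: the sharded dimension
    # must split evenly across the parallel group.
    assert numerator % denominator == 0, \
        '{} is not divisible by {}'.format(numerator, denominator)

def pad_to_multiple(n, k):
    # Hypothetical workaround: round the dimension up to a multiple of
    # the group size before building the parallel layer.
    return n + (-n % k)  # 85742 -> 85744, which is divisible by 8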
Very nice project!
I wonder if other operations will be supported, such as conv, bn, relu.
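For what it's worth, elementwise ops such as relu need no dedicated parallel version: they commute with slicing, so each rank can apply them to its local shard directly. A minimal sketch, assuming each rank holds its shard as a plain tensor (conv and bn are different, since they need neighboring data or cross-shard statistics):

import torch
import torch.nn.functional as F

local_shard = torch.randn(4, 1024)  # this rank's slice of a larger tensor
# Applying relu shard-by-shard equals applying it to the full tensor and
# then slicing, so no communication is required.
local_out = F.relu(local_shard)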
Hi, thanks for the excellent work!
When I install it from pip and run:

import torchshard as ts
ts.init_process_group(group_size=2)

an AttributeError occurs:

AttributeError: module 'torchshard' has no attribute 'init_process_group'
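One thing worth checking (an assumption based on the file layout, not a confirmed fix): the process-group helpers are defined in torchshard/distributed/core.py, so the call may need the submodule path, and an outdated pip release may also lack newer top-level attributes:

import torchshard as ts

# Assumption: init_process_group lives in the distributed submodule,
# matching its definition in torchshard/distributed/core.py.
ts.distributed.init_process_group(group_size=2)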
Hello Kaiyu,
I love this awesome project. The API design is elegant and simple, and the software is lightweight and user-friendly. My understanding is that this project realizes a series of PyTorch wrappers for tensor slicing.
When one training epoch finishes, the main_worker function calls ts.collect_state_dict(model, state_dict). But because of limited GPU resources on my machine, this call raises an out-of-memory error.
I found that it gathers the state_dict on GPU; is there any way to gather on CPU instead?
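One general workaround (plain PyTorch, not torchshard's API) is to gather through a gloo process group, since gloo supports CPU tensors while NCCL requires GPU tensors. A minimal sketch of the pattern; the gloo group and the gather_on_cpu helper are assumptions for illustration:

import torch
import torch.distributed as dist

# Assumption: create a gloo group alongside the NCCL one; every rank
# must execute this call.
gloo_group = dist.new_group(backend="gloo")

def gather_on_cpu(local_tensor, world_size, group):
    # Stage the shard on CPU, then all_gather over gloo so GPU memory
    # never holds more than the local shard.
    cpu_tensor = local_tensor.detach().cpu()
    gathered = [torch.empty_like(cpu_tensor) for _ in range(world_size)]
    dist.all_gather(gathered, cpu_tensor, group=group)
    return torch.cat(gathered, dim=0)  # concat along the sharded dim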
Thanks for contributing this great lib. I have one question: which one is faster, dim=0 or dim=1? The documentation seems to only contain accuracy results.
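For context, dim=1 slices a linear layer's weight along the output dimension (column-parallel) and dim=0 along the input dimension (row-parallel); which is faster depends on the layer shape and on the collective each variant needs, so benchmarking both on your own model is the reliable answer. A sketch of the two choices, assuming the ParallelLinear layer and its dim parameter from the project README:

import torchshard as ts

# Assumed API: ts.nn.ParallelLinear(in_features, out_features, dim=...).
col_parallel = ts.nn.ParallelLinear(1024, 4096, dim=1)  # column-parallel
row_parallel = ts.nn.ParallelLinear(4096, 1024, dim=0)  # row-parallel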
I have noticed that group_size is set to world_size in the examples, but in fact group_size can be set to other numbers, according to my understanding.
https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18
I have also found that get_world_size() returns the number of all processes.
These two findings confuse me in a multi-node setting, say 2 nodes with 2 processes each.
If group_size is 2, then there are 2 distinct groups besides the default group (with overlap). However, calling get_world_size() without specifying a group can make a layer be split into 4 parts, where 2 is expected in our case.
Correct me if I am wrong.
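To make the concern concrete, here is the same setting in plain torch.distributed terms (the explicit subgroup below is an illustration, not torchshard code):

import torch.distributed as dist

# 2 nodes x 2 processes each: world_size == 4.
world_size = dist.get_world_size()               # 4: all processes
# With group_size = 2, ranks 0-1 form one group; every rank must call
# new_group with the same ranks list.
group = dist.new_group(ranks=[0, 1])
# For ranks inside the group, this returns 2.
group_size = dist.get_world_size(group=group)

# The point of the issue: a layer sharded within a group of size 2
# should be split into group_size (2) parts, not world_size (4).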