Giter Site home page Giter Site logo

torchshard's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

torchshard's Issues

8 gpus test example raise error.

When I do Unit Tests, it can pass when use two gpu devices, run command below:
CUDA_VISIBLE_DEVICES=0,1 python3 -m unittest discover -v -s tests

But I do Unit Tests with eight gpu devices, it raise ncclSystemError.
run command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m unittest discover -v -s tests
raise error:
RuntimeError: NCCL error in ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Is it necessary to pass unittest in eights gpu devices?

Other operations

Very nice project!
I wonder if other operations will be supported, such as conv, bn, relu.

Error?

Hi, thanks for the excellent job!
When I install it from pip, and

import torchshard as ts
ts.init_process_group(group_size=2) 

The AttributeError occurs:

AttributeError: module 'torchshard' has no attribute 'init_process_group'

Future Planinig on this project.

Hello Kaiyu,
I love this awesome project. The API design is elegant and simple and the software is lightweight and user-friendly. My understanding is that this project has realized a series of PyTorch wrappers for tensor slicing.

  1. I am curious about the future planning of this project.
  2. Is there some overlap in functionality between torchshard and N-D parallelism proposed in ColossalAI.
  3. How is compatibility with ZeRO? According to am+zero example, the memory footprint has a little change after combination torchshard with ZeRO.

Is it possible to collect state dict in cpu?

When I finish one epoch in trianing, the main_worker function will call ts.collect_state_dict(model, state_dict).
But because the limit of GPU resource, it will raise Out of Memory in my machine, when call ts.collect_state_dict(model, state_dict).
I found that will gather the state_dict in GPU, is it anyway to gather in CPU?

Which one is faster?

Thanks for contributing this great lib. I have one question. Which one is faster (in speed) between dim=0and dim=1? The documentations seem to only contain accuracy results.

Multi-node setting?

https://github.com/KaiyuYue/torchshard/blob/89e21def180bf6063ceb2e312a61631173abc7e7/projects/minGPT/main.py#L150

I have noticed that the group_size is set to world_size in examples, but in fact the group_size can be set to other numbers according to my understanding.

https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18

I have also found that the get_world_size() will return the number of all processes.

The two findings make me confused in a multi-node setting, say 2 nodes with each node with 2 processes.

If the group_size is 2, then there are 2 distinct groups besides the default group (w/ overlap). However, get_world_size() is used without specifying a group can make a layer be splitted to 4 parts, which is expected to be 2 in our case.

Correct me if I am wrong.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.