distributed_tutorial's Issues
Error with distributed mp
Hi, I tried running my code like your example, and I got this error
File "artGAN512_impre_v8.py", line 286, in main
mp.spawn(train, nprocs=args.gpus, args=(args,))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ubuntu/dcgan/artGAN512_impre_v8.py", line 167, in train
world_size=args.world_size, rank=rank)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out
In my train function, I have:
rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://',
world_size=args.world_size, rank=rank)
torch.manual_seed(0)
torch.cuda.set_device(gpu)
I think it has something to do with os.environ['MASTER_ADDR']. Can you explain how you chose the value for that parameter? I'm using an AWS instance.
Thanks.
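A `Connection timed out` raised from `TCPStore` generally means a process could not open a TCP connection to `MASTER_ADDR:MASTER_PORT`; on cloud instances, a firewall or security-group rule that blocks the port is one possible cause. The sketch below is one way to test the connection from a worker machine with only the standard library. `can_reach` is a hypothetical helper, not part of PyTorch, and its result is only informative while something is listening on the master's port (for example, after rank 0 has started).

```python
import socket


def can_reach(addr: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to addr:port succeeds within timeout."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `can_reach(master_addr, master_port)` with the values you export on the worker shows whether the network path is open at all.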
how to determine master address and port?
Thanks for the great tutorial. One thing I still don't understand: how are the master address and port determined? Is this set by my machine, i.e. if I have a machine with 4 GPUs, each one has an IP address and port already assigned to it?
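In short, there is one master address and port per job, not per GPU: `MASTER_ADDR` is the address of the machine that runs the rank 0 process, `MASTER_PORT` is a free port on that machine, and every process on every node must be given the same pair. A minimal sketch, assuming the `env://` init method shown in the issue above; `set_master` is a hypothetical helper and `29500` is only an example port.

```python
import os


def set_master(addr: str = "127.0.0.1", port: int = 29500) -> None:
    """Export the rendezvous address that init_method='env://' reads."""
    os.environ["MASTER_ADDR"] = addr       # host running the rank 0 process
    os.environ["MASTER_PORT"] = str(port)  # environment values must be strings


# Single machine with several GPUs: the loopback address is enough.
set_master("127.0.0.1", 29500)

# Several machines: pass an address of the rank 0 node that the other nodes
# can reach, e.g. set_master("10.0.0.5", 29500)  # example IP, an assumption
```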
How to add DDP with val loader?
How do I add validation evaluation with DDP? Is it the same as for training? @yangkky
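One common approach, offered here as a sketch rather than the tutorial author's answer, is to give the validation set its own `DistributedSampler`, let each rank count correct predictions on its shard, and then sum the counts across ranks (with `dist.all_reduce` in a real script). The plain-Python function below stands in for that reduction to show why counts, rather than per-rank accuracies, should be combined. `combine_counts` is a hypothetical name and the numbers are made up.

```python
from typing import Iterable, Tuple


def combine_counts(per_rank: Iterable[Tuple[int, int]]) -> float:
    """Global accuracy from (correct, total) pairs, one pair per rank.

    Stand-in for summing two counters across ranks with dist.all_reduce.
    """
    correct = 0
    total = 0
    for c, t in per_rank:
        correct += c
        total += t
    return correct / total


# Suppose the two ranks evaluated different numbers of samples.
counts = [(90, 100), (25, 50)]
global_acc = combine_counts(counts)               # 115 / 150
mean_of_accs = sum(c / t for c, t in counts) / 2  # (0.9 + 0.5) / 2
```

The two results differ, which is why averaging per-rank accuracies can mislead. In a real script the per-rank loop would also run under `model.eval()` and `torch.no_grad()`.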
Address already in use while running second time
Thanks for the tutorial. I followed the tutorial, but I got an error at
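`Address already in use` usually indicates that the chosen `MASTER_PORT` is still bound, for example by a process left over from the previous run. A sketch of asking the OS for a free port instead of hard-coding one follows. `find_free_port` is a hypothetical helper, and the approach only suits single-node runs, since on multiple nodes every machine must agree on the port in advance.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


# Set this once in the parent process, before mp.spawn, so that all spawned
# workers inherit the same value.
os.environ["MASTER_PORT"] = str(find_free_port())
```

There is a small window in which another program could claim the port after it is released, so this reduces collisions rather than ruling them out.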
Training does not happen..
I ran the script from two terminals, and both hang; they seem to be waiting for something.
Hi, I am a little bit confused about your code, please give me some help.
I noticed that the PyTorch tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) uses "Save and Load Checkpoints" to synchronize the models in the different processes.
So, I want to know whether there are any implicit synchronization mechanisms in your distributed_tutorial code.
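As far as the DDP documentation describes it, no checkpoint round-trip is needed for this: `DistributedDataParallel` broadcasts the rank 0 model state to the other ranks when it is constructed, and gradients are averaged across ranks during `backward()`, so every replica applies the same update. The toy simulation below, in plain Python with made-up numbers, illustrates the second point. It is not DDP's implementation, and both function names are hypothetical.

```python
def sgd_step(weights, grads, lr):
    """Plain SGD update on a list of scalar weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


def all_reduce_mean(grads_per_rank):
    """Average gradients element-wise across ranks."""
    n = len(grads_per_rank)
    return [sum(gs) / n for gs in zip(*grads_per_rank)]


# Both replicas start from the same weights (as after a broadcast from rank 0).
replicas = [[1.0, -2.0], [1.0, -2.0]]
# Each rank sees different data, so the local gradients differ.
local_grads = [[0.5, 0.1], [0.3, -0.1]]

# Every rank applies the same averaged gradient, so the weights stay identical.
avg = all_reduce_mean(local_grads)
replicas = [sgd_step(w, avg, lr=0.1) for w in replicas]
```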
save or load checkpoint
distributed_tutorial/src/mnist-mixed.py
Lines 108 to 114 in 2446796
Could you add the save-model line to the example to make it more complete? Thanks!
torch.save(model.state_dict(), CHECKPOINT_PATH)
How to do mnist-distributed with checkpointing?
I saw the tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints):
def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
    if rank == 0:
        # All processes should see same parameters as they all start from same
        # random parameters and gradients are synchronized in backward passes.
        # Therefore, saving it in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)
    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn = nn.MSELoss()
    loss_fn(outputs, labels).backward()
    optimizer.step()
    # Not necessary to use a dist.barrier() to guard the file deletion below
    # as the AllReduce ops in the backward pass of DDP already served as
    # a synchronization.
    if rank == 0:
        os.remove(CHECKPOINT_PATH)
    cleanup()
But as you said, that tutorial is not very well written, or is missing something. I was wondering if you could extend your tutorial with checkpointing?
I am personally interested only in processing each batch more quickly by using multiprocessing. What confuses me is why the code above does not simply save the model once training is done, and instead saves it on rank 0 before training starts. As you said, it's confusing. Extending your MNIST example so that the model is saved after all the MNIST data has been processed would be fantastic, or saving every X epochs, as that is the common case.
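One possible shape for this, sketched without PyTorch so that it runs anywhere: only rank 0 writes, and it does so every `save_every` epochs and after the final epoch. `maybe_save` and `save_fn` are hypothetical; in a real script `save_fn` could be something like `lambda: torch.save(model.module.state_dict(), path)`, where `model` is the DDP wrapper.

```python
from typing import Callable


def maybe_save(rank: int, epoch: int, num_epochs: int,
               save_every: int, save_fn: Callable[[], None]) -> bool:
    """Call save_fn on rank 0 every `save_every` epochs and after the last one."""
    if rank != 0:
        return False  # the other ranks hold identical weights; one copy suffices
    is_last = epoch == num_epochs - 1
    if (epoch + 1) % save_every == 0 or is_last:
        save_fn()
        return True
    return False


saved = []
for epoch in range(5):
    # ... one epoch of training would run here ...
    maybe_save(rank=0, epoch=epoch, num_epochs=5, save_every=2,
               save_fn=lambda e=epoch: saved.append(e))
```

The quoted example presumably saves before training because it demonstrates the save and load mechanics rather than a training schedule.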
Btw, thanks for your example, it is fantastic!
where does dist.destroy_process_group() go in your DDP MNIST example?
Where does dist.destroy_process_group() go in your DDP MNIST example (https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py)?
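A common placement, not confirmed by the tutorial author, is at the very end of the per-process `train` function, after the training loop, ideally in a `finally` clause so that it also runs on errors. The sketch uses stand-in callables so that it runs without PyTorch: `init_fn` and `destroy_fn` are placeholders for `dist.init_process_group(...)` and `dist.destroy_process_group()`, and `process_group` is a hypothetical helper.

```python
from contextlib import contextmanager


@contextmanager
def process_group(init_fn, destroy_fn):
    """Run init_fn on entry and destroy_fn on exit, even if the body raises."""
    init_fn()
    try:
        yield
    finally:
        destroy_fn()


events = []
with process_group(lambda: events.append("init"),
                   lambda: events.append("destroy")):
    events.append("train")  # the training loop would go here
```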
[Bug] Multiple dataset created in each train process
Hi, I don't know if wrapping the dataset creation inside the training function is a good idea. One possible issue is that, when using multiple GPUs, the MNIST dataset is downloaded once per process. Perhaps it's better to create a Dataset object in the main function, pass it into the train function, and create the distributed sampler there? Thanks for your help.
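A variant of that suggestion, sketched here: prepare the data once in the parent process before spawning workers, so that the workers only read it. `prepare_once` and `fetch` are hypothetical. With torchvision, `fetch` would correspond to constructing `MNIST(root, download=True)` in `main`, and the workers would then pass `download=False`.

```python
import os
import tempfile


def prepare_once(root: str, fetch) -> bool:
    """Call fetch(root) only if a marker file is absent; return True if it ran."""
    marker = os.path.join(root, ".prepared")
    if os.path.exists(marker):
        return False
    os.makedirs(root, exist_ok=True)
    fetch(root)
    with open(marker, "w") as f:
        f.write("ok")
    return True


calls = []
root = os.path.join(tempfile.mkdtemp(), "data")
first = prepare_once(root, calls.append)   # runs the download stand-in
second = prepare_once(root, calls.append)  # skipped: the marker already exists
```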
Unable to run on a single node with multiple GPUs
Pytorch version: 1.7.1
Cuda: 10.0
python: 3.7.10
I am trying to run it on AWS with one node and 4 GPUs using the command
python mnist-distributed.py -n 1 -g 4 -nr 0
The code hangs at init_process_group.
I tried setting MASTER_ADDR to '127.0.0.1'.
What should I do to make it work?
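`init_process_group` blocks until `world_size` processes have joined, so a hang can mean that fewer processes were started than `world_size` expects, or that two processes share a rank. The sketch below recomputes the rank arithmetic quoted in an earlier issue (`rank = args.nr * args.gpus + gpu`) for a given command line so that the numbers can be checked by hand. It assumes `world_size = nodes * gpus`, and `ranks_on_node` is a hypothetical helper.

```python
def ranks_on_node(nodes: int, gpus: int, nr: int):
    """Return (world_size, global ranks hosted by the node with rank `nr`)."""
    if not 0 <= nr < nodes:
        raise ValueError(f"node rank {nr} is outside 0..{nodes - 1}")
    world_size = nodes * gpus  # assumption: every node contributes `gpus` ranks
    ranks = [nr * gpus + gpu for gpu in range(gpus)]
    return world_size, ranks


# Corresponds to: python mnist-distributed.py -n 1 -g 4 -nr 0
world_size, ranks = ranks_on_node(nodes=1, gpus=4, nr=0)
```

For this command the job expects four processes with ranks 0 to 3, all started from the one invocation.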
Call set_epoch on DistributedSampler
Hi,
thanks for the excellent example of using DistributedDataParallel in PyTorch; it is very easy to understand and is much better than the PyTorch docs.
One important bit that is missing is making the gradient descent truly stochastic in the distributed case. According to the PyTorch docs, set_epoch must be called on the sampler to achieve this. Otherwise, the data points will be sampled in the same order in every epoch, without shuffling (remember, the DataLoader is constructed with shuffle=False). I have also discovered that it is very important to set the epoch to the same value in each worker; otherwise there is a chance that some data points will be visited multiple times, and others not at all.
I hope all this makes sense. I think that future readers will benefit from the addition I am proposing. Once again, thanks for the excellent doc.
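The effect can be illustrated with a simplified plain-Python model of a distributed sampler: indices are shuffled with a generator seeded by `seed + epoch`, and each rank takes every `world_size`-th index. This mirrors the idea behind `DistributedSampler.set_epoch` but is not PyTorch's code; `shard` is a hypothetical function.

```python
import random


def shard(n: int, world_size: int, rank: int, epoch: int, seed: int = 0):
    """Indices that `rank` would visit in `epoch` under a shared shuffle."""
    indices = list(range(n))
    random.Random(seed + epoch).shuffle(indices)  # same order on every rank
    return indices[rank::world_size]


# Same epoch on both ranks: the shards are disjoint and cover the dataset.
a = shard(8, world_size=2, rank=0, epoch=3)
b = shard(8, world_size=2, rank=1, epoch=3)
covered = sorted(a + b)

# If the epoch never changes, a rank sees the identical order every time.
same = shard(8, 2, 0, epoch=0) == shard(8, 2, 0, epoch=0)
```

In the training loop, the corresponding call would be `sampler.set_epoch(epoch)` at the start of each epoch, before iterating over the loader (the variable name `sampler` is an assumption).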
Error distributed run
Hi,
Thanks for the easy-to-follow tutorial on distributed processing.
I followed your example, and it works fine on a single multi-GPU system. When running it on multiple nodes with 2 GPUs each, I get an error at runtime.
```
Traceback (most recent call last):
  File "conv_dist.py", line 117, in <module>
    main()
  File "conv_dist.py", line 51, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
    model = DDP(model, device_ids=[gpu])
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
Not able to figure out the cause of error.
Please help, thanks.
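The trace does not say what the underlying system error was. One way to get more detail, without claiming that it fixes anything, is to turn on NCCL's own logging before the processes are spawned. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are NCCL environment variables; the interface name in the comment is only an example and would have to match the node's actual network interface.

```python
import os

# Ask NCCL to log its initialization steps and error details.
os.environ["NCCL_DEBUG"] = "INFO"

# Optional, when nodes have several network interfaces: pin NCCL to one.
# "eth0" is an example value, not taken from the report above.
# os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```

These lines belong in the parent process before `mp.spawn`, so that the spawned workers inherit the settings.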
multiple dataloader processes with ddp
You write in your blog post (https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html): "It's also possible to have multiple worker processes that fetch data for each GPU." How can I enable this? I am running into a bottleneck because of it.
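Worker processes are enabled through the `num_workers` argument of `torch.utils.data.DataLoader`. Because each GPU process creates its own loader, the workers multiply, so the sketch below splits the available CPU cores between the GPU processes on a node. The even split is a rule of thumb assumed here, not something the blog post prescribes, and `workers_per_gpu` is a hypothetical helper.

```python
import os


def workers_per_gpu(gpus_on_node: int, cpu_count=None) -> int:
    """Split CPU cores evenly across the per-GPU training processes."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores // gpus_on_node)


# Keyword arguments to pass to DataLoader alongside the dataset and sampler.
loader_kwargs = {
    "num_workers": workers_per_gpu(gpus_on_node=4, cpu_count=16),
    "pin_memory": True,  # usually speeds up host-to-GPU copies
    "shuffle": False,    # shuffling is delegated to the DistributedSampler
}
```

With 16 cores and 4 GPU processes, each loader gets 4 workers; the best value depends on how expensive the per-sample preprocessing is.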