yangkky / distributed_tutorial


distributed_tutorial's Issues

Error with distributed mp

Hi, I tried running my code like your example, and I got this error

  File "artGAN512_impre_v8.py", line 286, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ubuntu/dcgan/artGAN512_impre_v8.py", line 167, in train
    world_size=args.world_size, rank=rank)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection timed out

In my train function, I have

rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://', 
                            world_size=args.world_size, rank=rank)
torch.manual_seed(0)
torch.cuda.set_device(gpu)

I think it has something to do with os.environ['MASTER_ADDR']. Can you explain how you chose the value for that parameter? I'm using an AWS instance.

Thanks.

how to determine master address and port?

Thanks for the great tutorial. One thing I still don't understand: how are the master address and port determined? Is this set by my machine, i.e. if I have a machine with 4 GPUs, each one has an IP address and port already assigned to it?
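
A possible reading, offered as a sketch and not as the tutorial author's answer: GPUs do not get their own IP addresses or ports. MASTER_ADDR is the address of the machine that hosts the rank 0 process, and MASTER_PORT is a free port on that machine; every process in the job is given the same two values and uses them to find rank 0. On a single machine the loopback address is enough. With several nodes it has to be an address of node 0 that the other nodes can reach. The helper names and the port number below are assumptions made for illustration.

```python
import os


def configure_rendezvous(master_addr="127.0.0.1", master_port=29500):
    """Set the two variables that init_method='env://' reads.

    Hypothetical helper. The defaults are assumptions for a single machine:
    the loopback address plus an arbitrary unprivileged port.
    """
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)


def global_rank(node_rank, gpus_per_node, local_gpu):
    """Same arithmetic as `rank = args.nr * args.gpus + gpu`."""
    return node_rank * gpus_per_node + local_gpu


# One machine with 4 GPUs: 4 processes, ranks 0..3, one shared address and port.
configure_rendezvous()
ranks = [global_rank(0, 4, gpu) for gpu in range(4)]
print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"], ranks)
```

For a two-node job, both nodes would call configure_rendezvous with the address of node 0, and the second node would compute its ranks with node_rank=1.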

How to add DDP with val loader?

How do I use DDP for validation? Is it the same as for training? @yangkky
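
One way this could look, as a sketch under assumptions and not the tutorial's own code: give the validation set its own DistributedSampler with shuffle=False, count correct predictions on each rank, and sum the counts with all_reduce. The function name evaluate and its parameters are hypothetical. Note that DistributedSampler pads the index list when the dataset size is not divisible by the number of processes, so a few samples may be counted twice and the result is then a close approximation.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def evaluate(model, dataset, rank, world_size, device, batch_size=32):
    """Accuracy over a dataset sharded across ranks (hypothetical helper)."""
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    # counts[0] = correct predictions, counts[1] = samples seen on this rank
    counts = torch.zeros(2, device=device)
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            predictions = model(images).argmax(dim=1)
            counts[0] += (predictions == labels).sum()
            counts[1] += labels.numel()

    # Sum the per-rank counts. Skipped when no process group is running,
    # which lets the function be tried in a single process.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return (counts[0] / counts[1]).item()


# Single-process smoke test on CPU with random data (not MNIST).
data = TensorDataset(torch.randn(10, 4), torch.randint(0, 3, (10,)))
accuracy = evaluate(
    nn.Linear(4, 3), data, rank=0, world_size=1, device=torch.device("cpu")
)
print(f"accuracy: {accuracy:.2f}")
```

Remember to call model.train() again before the next training epoch.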

Address already in use while running second time

Thanks for the tutorial. I followed it, but I got an error at
Training does not happen.
I ran the script from two terminals, and both hang; they seem to be waiting for something.
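
"Address already in use" usually means that some process still holds MASTER_PORT, often a worker left over from an earlier run. The hang may simply be init_process_group waiting until world_size processes have joined. A minimal sketch for the single-machine case, assuming it is acceptable to let the operating system choose the port; find_free_port is a hypothetical helper, and there is a small window in which another program could grab the port after it is released.

```python
import os
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


# Call this once in the parent process, before mp.spawn, so that every
# worker inherits the same value through the environment.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(find_free_port())
print("rendezvous port:", os.environ["MASTER_PORT"])
```

This does not carry over to several nodes as written, because every node has to agree on the port in advance.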

Hi, I'm a little bit confused about your code, please give me some help.

I noticed that the PyTorch tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) uses "Save and Load Checkpoints" to synchronize the models in the different processes.

So I want to know whether there is some implicit synchronization mechanism in your distributed_tutorial code.

save or load checkpoint

distributed_tutorial/src/mnist-mixed.py

Lines 108 to 114 in 2446796

    
    print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(
        epoch + 1,
        args.epochs,
        i + 1,
        total_step,
        loss.item())
    )

Could you add a line that saves the model, to make the example more complete? Thanks!

torch.save(model.state_dict(), CHECKPOINT_PATH)

How to do mnist-distributed with checkpointing?

I saw the tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints):

def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
    if rank == 0:
        # All processes should see same parameters as they all start from same
        # random parameters and gradients are synchronized in backward passes.
        # Therefore, saving it in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn = nn.MSELoss()
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Not necessary to use a dist.barrier() to guard the file deletion below
    # as the AllReduce ops in the backward pass of DDP already served as
    # a synchronization.

    if rank == 0:
        os.remove(CHECKPOINT_PATH)

    cleanup()

But, as you said, that tutorial is not very well written, or is missing something. I was wondering if you could extend your tutorial with checkpointing?

I am personally interested only in processing each batch faster by using multiprocessing. What confuses me is why the code above does not simply save the model once training is done, and instead saves it on rank 0 before training starts. As you said, it's confusing. It would be fantastic if you could extend your MNIST example so that the model is saved after all the data has been processed, or every X epochs, which is the common case.

Btw, thanks for your example, it is fantastic!
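
A sketch of how saving after training, or every few epochs, might look. It rests on the point made in the quoted example's comments: gradients are synchronized in the backward pass, so the copy on rank 0 is sufficient. The helper names, the dictionary keys and the file name are assumptions for illustration, not part of the tutorial.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def unwrap(model):
    """Return the underlying network when the model is wrapped in DDP."""
    return model.module if isinstance(model, DDP) else model


def save_checkpoint(model, optimizer, epoch, path, rank):
    """Write a checkpoint from rank 0 only; other ranks wait at the barrier."""
    if rank == 0:
        torch.save(
            {
                "epoch": epoch,
                # Saving the unwrapped network avoids the "module." prefix
                # that DDP adds to parameter names.
                "model": unwrap(model).state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            path,
        )
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # no rank moves on (or loads) before the file exists


def load_checkpoint(path, model, optimizer, device):
    """Load onto this process's own device and return the saved epoch."""
    state = torch.load(path, map_location=device)
    unwrap(model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]


# Single-process smoke test on CPU with a toy model.
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01)
path = os.path.join(tempfile.gettempdir(), "ddp_sketch.checkpoint")
save_checkpoint(model, optimizer, epoch=3, path=path, rank=0)
restored_epoch = load_checkpoint(path, model, optimizer, torch.device("cpu"))
os.remove(path)
print(restored_epoch)
```

In the training loop, save_checkpoint could be called on every rank after the last epoch, or whenever (epoch + 1) is a multiple of some chosen interval.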

Where does dist.destroy_process_group() go in your DDP MNIST example?

Where does dist.destroy_process_group() go in your DDP MNIST example?

https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py
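
A plausible placement, sketched here and not taken from the repository: as the last statement of the function that mp.spawn runs in each process, after the training loop. Wrapping the loop in try/finally is an added design choice so that the group is also torn down when training raises. The gloo backend and the init_method parameter are assumptions that keep the sketch independent of GPUs; the tutorial itself uses nccl with env://.

```python
import torch.distributed as dist


def train(rank, world_size, init_method):
    """Skeleton of a worker; the training loop itself is elided."""
    dist.init_process_group(
        backend="gloo",  # assumption: CPU backend so the sketch needs no GPU
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
    try:
        pass  # build the model, wrap it in DDP, run the epochs
    finally:
        # Last statement of the worker, reached even if training raises.
        dist.destroy_process_group()
```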

[Bug] Multiple dataset created in each train process

Hi, I don't know whether wrapping the dataset creation inside the training function is a good idea. One possible issue is that, when using multiple GPUs, the MNIST dataset is downloaded twice. Perhaps it's better to create the Dataset object in the main function, pass it into the train function, and create the distributed sampler there? Thanks for your help.

Unable to run on a single node with multiple GPUs

PyTorch version: 1.7.1
CUDA: 10.0
Python: 3.7.10

I am trying to run it on AWS with one node and 4 GPUs using the command
python mnist-distributed.py -n 1 -g 4 -nr 0

The code hangs at init_process_group.

I tried setting MASTER_ADDR to '127.0.0.1'.

What should I do to make it work?

Call set_epoch on DistributedSampler

Hi,

Thanks for the excellent example of using DistributedDataParallel in PyTorch; it is very easy to understand and is much better than the PyTorch docs.

One important bit that is missing is making the gradient descent truly stochastic in the distributed case. According to the PyTorch docs, set_epoch must be called on the sampler at the start of each epoch to achieve this. Otherwise the sampler reuses the same ordering in every epoch, so the data is effectively not reshuffled (remember, DataLoader is constructed with shuffle=False). I have also discovered that it is very important to set the epoch to the same value in each worker; otherwise there is a chance that some data points will be visited multiple times, and others not at all.

I hope all this makes sense. I think that future readers will benefit from the addition I am proposing. Once again, thanks for the excellent doc.
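
The behavior described above can be observed without starting any worker processes. In this sketch, which uses a toy dataset of eight integers in place of MNIST, num_replicas and rank are passed explicitly so that one sampler stands in for each worker.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
world_size = 2

# Passing num_replicas and rank explicitly means no process group is needed
# for this demonstration; one sampler stands in for each worker.
samplers = [
    DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    for rank in range(world_size)
]

orders = {}
for epoch in range(2):
    for sampler in samplers:
        sampler.set_epoch(epoch)  # same epoch value on every rank
    orders[epoch] = [list(sampler) for sampler in samplers]
    print(epoch, orders[epoch])
```

In a training script the equivalent would be a single call such as train_sampler.set_epoch(epoch) at the top of the epoch loop, where train_sampler is whatever name the sampler has in that script.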

Error distributed run

Hi,
Thanks for the easy-to-follow tutorial on distributed processing.
I followed your example, and it works fine on a single multi-GPU system. When I run it on multiple nodes with 2 GPUs each, I get an error at runtime.

Traceback (most recent call last):
  File "conv_dist.py", line 117, in <module>
    main()
  File "conv_dist.py", line 51, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
    model = DDP(model, device_ids=[gpu])
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8


I am not able to figure out the cause of the error.
Please help, thanks.
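
"unhandled system error" on its own says little. A first diagnostic step, sketched here under assumptions, is to have NCCL log what it is doing by setting NCCL_DEBUG. On multi-node runs one possible cause is NCCL choosing a network interface that the other node cannot reach, which NCCL_SOCKET_IFNAME overrides. The interface name eth0 is a placeholder and not something known about this cluster.

```python
import os

# Must be set before dist.init_process_group is called in each process.
os.environ.setdefault("NCCL_DEBUG", "INFO")  # make NCCL log what it is doing
# Placeholder interface name: replace it with one that exists on every node.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

print({k: v for k, v in os.environ.items() if k.startswith("NCCL_")})
```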

multiple dataloader processes with ddp

You write in your blog post (https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html): "It’s also possible to have multiple worker processes that fetch data for each GPU." How can I enable this? I am running into a bottleneck because of it.
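
The usual knob for this is the num_workers argument of DataLoader, which starts that many loading subprocesses for each training process. A minimal sketch with random tensors standing in for MNIST; the values 2 and 16 are arbitrary assumptions, and the sampler is built with explicit num_replicas and rank only so that the snippet runs without a process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Random stand-in for the MNIST training set.
dataset = TensorDataset(torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,)))
sampler = DistributedSampler(dataset, num_replicas=1, rank=0)

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=False,    # the sampler does the shuffling
    sampler=sampler,
    num_workers=2,    # loader subprocesses per training process (assumed value)
    pin_memory=True,  # page-locked host memory for faster copies to the GPU
)
print(loader.num_workers, len(loader))
```

With 4 GPUs and num_workers=2 this gives 8 loader processes on the machine in total, so the value is best chosen with the CPU core count in mind.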
