seba-1511 / dist_tuto.pth
Official code for "Writing Distributed Applications with PyTorch", PyTorch Tutorial
License: Apache License 2.0
Line 106 in 82ee2a3
Environment: Tesla K20 cluster, two nodes, each node has two GPUs.
Process Process-1:
Traceback (most recent call last):
File "/home/zhangzhaoyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/zhangzhaoyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "train_dist.py", line 134, in init_processes
dist.init_process_group(backend, rank=rank, world_size=size)
File "/home/zhangzhaoyu/anaconda2/lib/python2.7/site-packages/torch/distributed/init.py", line 49, in init_process_group
group_name, rank)
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:17
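The "Address already in use" error means the port that init_process_group tries to bind (29500 in the tutorial) is still held by another process, for example a previous run that did not exit cleanly or a second process group started on the same node. A minimal workaround sketch, assuming an init wrapper like the tutorial's (the port 29501 is arbitrary, pick any free port):

```python
import os
import torch.distributed as dist

def init_processes(rank, size, fn, backend='gloo'):
    """Initialize the process group on a port that is not already bound."""
    os.environ['MASTER_ADDR'] = '127.0.0.1'   # address of the rank-0 node
    os.environ['MASTER_PORT'] = '29501'       # any free port; 29500 was already in use
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)
```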
So I ran train_dist.py and added print(param) inside the if block:
```python
if type(param) is torch.Tensor:
    print(param)
    dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM, group=0)
```
param is never printed, which means all_reduce is never called.
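If that loop iterates over model.parameters(), one likely cause (an assumption, since the surrounding loop is not shown) is that the elements are torch.nn.Parameter objects, a subclass of torch.Tensor, so the exact type check type(param) is torch.Tensor is always False and the all_reduce branch is skipped. A small sketch showing the difference:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in model for illustration
for param in model.parameters():
    # model.parameters() yields nn.Parameter, a subclass of torch.Tensor
    print(type(param) is torch.Tensor)         # False: exact type check rejects the subclass
    print(isinstance(param, torch.Tensor))     # True: isinstance accepts subclasses
```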
Sorry, but I cannot find any similar documentation for C++.
Please give me an example of distributed training with libtorch (C++).
Thanks,
When I tried to run train_dist.py, I ran into multiple errors that seem to be due to the fact that this code was written for older versions of Python and PyTorch.
One issue: torch.utils.data.DataLoader expects batch_size to be an integer.
File "", line 77, in partition_dataset
partition, batch_size= bsz, shuffle=True)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 179, in init
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 162, in init
"but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=64.0
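The float 64.0 suggests the batch size comes from a true division such as 128 / float(size), which returns a float on Python 3. A minimal sketch of a fix (the dataset below is a hypothetical stand-in for the tutorial's partitioned MNIST): use integer division, or cast to int before building the DataLoader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the partitioned dataset built in partition_dataset().
partition = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))

size = 2                 # world size that would produce batch_size=64.0 from 128 / float(size)
bsz = 128 // size        # integer division keeps batch_size an int, as DataLoader requires
train_set = DataLoader(partition, batch_size=bsz, shuffle=True)
```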
Another issue; I have no idea what is happening here:
File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = group.allreduce([tensor], opts)
AttributeError: 'int' object has no attribute 'allreduce'
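In recent PyTorch, the group argument of dist.all_reduce must be a ProcessGroup object (or be omitted so the default WORLD group is used). Passing the integer 0, as the older tutorial code does, makes the library treat 0 as the group itself, hence the 'int' object has no attribute 'allreduce' error. A sketch of average_gradients that works on current versions (an approximation of the original function, not a verbatim copy):

```python
import torch.distributed as dist

def average_gradients(model):
    """All-reduce every gradient over the default (WORLD) process group."""
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is None:
            continue
        # No group argument: the default group is used. Do not pass group=0.
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size
```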
My Code:

```python
"""run.py:"""
#!/usr/bin/env python
import os
import sys
import torch
import torch.distributed as dist
import time
from torch.multiprocessing import Process


# """Blocking point-to-point communication."""
def run(rank, size):
    # tensor = torch.rand(50, 100)
    tensor = torch.rand(50, 80)
    processnumber = 64
    resultlist = [[0 for i in range(2)] for j in range(64)]
    tensorsize = sys.getsizeof(tensor.storage())
    traintimes = 3
    dsetination = 63
    # Outer loop: rank 0 sends to each destination rank in turn.
    for dsetination in range(1, processnumber):
        avsendtime = 0
        avrectime = 0
        # Inner loop: repeat the transfer; the first repetition (j == 0) is warm-up.
        for j in range(traintimes):
            if dist.get_rank() == 0:
                # tensor += 1
                # Send the tensor to process `dsetination`
                beginsend = time.time()
                dist.send(tensor=tensor, dst=dsetination)
                endsend = time.time()
                # if dsetination == processnumber - 1:
                #     print("sendtimes is:", j, "spend time is:", endsend - beginsend,
                #           "begin time is:", beginsend, "end time is:", endsend)
                if j != 0:
                    avsendtime += endsend - beginsend
            elif dist.get_rank() == dsetination:
                # Receive tensor from process 0
                beginrec = time.time()
                dist.recv(tensor=tensor, src=0)
                endrec = time.time()
                if j != 0:
                    avrectime += endrec - beginrec
                if dist.get_rank() == processnumber - 1:
                    print("receivetimes is:", j, "spend time is:", endrec - beginrec,
                          "beginre time is:", beginrec, "endre time is:", endrec)
        if dist.get_rank() == 0:
            resultlist[dsetination][0] = avsendtime / (traintimes - 1)
            rtime = avsendtime / (traintimes - 1)
            print("send to:", dsetination, "time is:", rtime,
                  "speed is:", (tensorsize / rtime) / (2 ** 30), "GB/S")
        if dist.get_rank() == dsetination:
            resultlist[dsetination][1] = avrectime / (traintimes - 1)
            rtime = avrectime / (traintimes - 1)
            print("rank is:", dsetination, "rectime is:", rtime,
                  "speed is:", (tensorsize / rtime) / (2 ** 30), "GB/S")
    # print(resultlist)
    # print('Rank ', rank, ' has data ', tensor)
    # mylist = tensor.numpy().tolist()
    # print('Rank ', dist.get_rank(), ' has data ', mylist)


# """ All-Reduce example."""
# def run(rank, size):
#     """ Simple point-to-point communication. """
#     group = dist.new_group([0, 1])
#     tensor = torch.ones(1)
#     dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
#     print('Rank ', rank, ' has data ', tensor[0])


def init_process(rank, size, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    init_process(0, 0, run, backend='mpi')
```
I send a tensor from rank 0 to rank 1 ... rank 63, sending to each rank 1000 times. I got an unacceptable result on the last transfer to rank 63, and I can't understand why it takes so long.
result:
receivetimes is: 991 spend time is: 5.1021575927734375e-05 beginre time is: 1608996863.2943463 endre time is: 1608996863.2943974
receivetimes is: 992 spend time is: 5.078315734863281e-05 beginre time is: 1608996863.294423 endre time is: 1608996863.294474
receivetimes is: 993 spend time is: 5.0067901611328125e-05 beginre time is: 1608996863.2944958 endre time is: 1608996863.294546
receivetimes is: 994 spend time is: 5.078315734863281e-05 beginre time is: 1608996863.2945678 endre time is: 1608996863.2946186
receivetimes is: 995 spend time is: 5.1021575927734375e-05 beginre time is: 1608996863.2946415 endre time is: 1608996863.2946925
receivetimes is: 996 spend time is: 4.982948303222656e-05 beginre time is: 1608996863.2947164 endre time is: 1608996863.2947662
receivetimes is: 997 spend time is: 4.9591064453125e-05 beginre time is: 1608996863.2947881 endre time is: 1608996863.2948377
receivetimes is: 998 spend time is: 5.125999450683594e-05 beginre time is: 1608996863.2948594 endre time is: 1608996863.2949107
receivetimes is: 999 spend time is: 0.12546324729919434 beginre time is: 1608996863.2949336 endre time is: 1608996863.4203968
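A hedged suggestion, not from the original thread: with blocking send/recv and no synchronization between repetitions, a backlog on either side can be charged to whichever iteration happens to block, which may be why only the final repetition shows ~0.125 s. Inserting a barrier before each timed repetition makes the per-iteration numbers comparable; a minimal sketch (timed_send_recv is a hypothetical helper, not part of the original script):

```python
import time
import torch.distributed as dist

def timed_send_recv(tensor, dst, repetitions=1000):
    """Time blocking send/recv pairs, synchronizing all ranks before each one."""
    times = []
    for _ in range(repetitions):
        dist.barrier()                      # keep sender and receiver in lock-step
        start = time.time()
        if dist.get_rank() == 0:
            dist.send(tensor=tensor, dst=dst)
        elif dist.get_rank() == dst:
            dist.recv(tensor=tensor, src=0)
        times.append(time.time() - start)
    return times
```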