Contributors: kwang2049, seba-1511

dist_tuto.pth's Issues

How to launch a job in a cluster?

Environment: Tesla K20 cluster, two nodes, each node has two GPUs.

```
Process Process-1:
Traceback (most recent call last):
  File "/home/zhangzhaoyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhangzhaoyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "train_dist.py", line 134, in init_processes
    dist.init_process_group(backend, rank=rank, world_size=size)
  File "/home/zhangzhaoyu/anaconda2/lib/python2.7/site-packages/torch/distributed/__init__.py", line 49, in init_process_group
    group_name, rank)
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:17
```
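An "Address already in use" error usually means two processes tried to bind the same master address and port, either because a stale process from a previous run still holds the port or because the same rank was launched twice. Below is a minimal launch sketch for a two-node, two-GPU-per-node cluster, assuming the tutorial's `init_processes` pattern; the `NODE_RANK` environment variable, the master IP, and the port number are illustrative, not from the repo:

```python
import os
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    print('Hello from rank', rank, 'of', size)

def init_processes(rank, size, fn, backend='gloo'):
    os.environ['MASTER_ADDR'] = '192.168.0.1'  # IP of the rank-0 node (example value)
    os.environ['MASTER_PORT'] = '29500'        # must be free, and identical on all nodes
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == '__main__':
    node_rank = int(os.environ.get('NODE_RANK', '0'))  # 0 on one node, 1 on the other
    gpus_per_node, world_size = 2, 4
    processes = []
    for local_rank in range(gpus_per_node):
        rank = node_rank * gpus_per_node + local_rank   # globally unique rank
        p = Process(target=init_processes, args=(rank, world_size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```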

distributed training never hits the if branch

So I ran train_dist.py and added a print(param) under the if branch:

```python
if type(param) is torch.Tensor:
    print(param)
    dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM, group=0)
```
param is never printed, which means all_reduce is never called.
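A likely cause (my reading, not confirmed in the repo): `model.parameters()` yields `torch.nn.Parameter` objects, and `type(param) is torch.Tensor` is False for a subclass because `type()` ignores inheritance, so the branch can never be taken. Using `isinstance`, or simply dropping the check, makes the all_reduce reachable. A sketch of the averaging step under that assumption:

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """All-reduce every gradient and divide by the world size."""
    world_size = float(dist.get_world_size())
    for param in model.parameters():   # yields nn.Parameter, a Tensor subclass
        if param.grad is None:         # e.g. frozen or unused parameters
            continue
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size
```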

train_dist.py incompatible with new PyTorch

When I tried to run train_dist.py, I ran into multiple errors that seem to stem from the code having been written for older versions of Python and PyTorch.

One issue: torch.utils.data.DataLoader expects batch_size to be an integer.

File "", line 77, in partition_dataset
partition, batch_size= bsz, shuffle=True)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 179, in init
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 162, in init
"but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=64.0
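The `64.0` points at true division: in Python 3, `/` always returns a float. Assuming the tutorial's `partition_dataset` computes the per-worker batch size as `128 / world_size`, switching to integer division (or casting to `int`) should clear this error. A sketch, assuming the process group is already initialized and with a hypothetical stand-in for the worker-local partition:

```python
import torch
import torch.utils.data
import torch.distributed as dist

# Hypothetical stand-in for the tutorial's worker-local dataset partition:
partition = torch.utils.data.TensorDataset(torch.randn(256, 10))

bsz = 128 // dist.get_world_size()  # // keeps it an int (64 with two workers)
train_set = torch.utils.data.DataLoader(partition, batch_size=bsz, shuffle=True)
```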

Another issue, which I don't understand at all:

File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = group.allreduce([tensor], opts)
AttributeError: 'int' object has no attribute 'allreduce'
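The traceback itself explains this one: `group=0` passes the integer 0 where the newer c10d API expects a ProcessGroup handle, so PyTorch ends up calling `.allreduce` on an int. Omitting the argument, or passing the default world group explicitly, should resolve it:

```python
import torch.distributed as dist

# `param` here is a model parameter, as in average_gradients above.
# group defaults to the world group in the c10d API:
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)

# or, equivalently, name the default group explicitly:
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=dist.group.WORLD)
```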

Unexpectedly long P2P communication time with the MPI backend

My Code:

"""run.py:"""
#!/usr/bin/env python
import os
import sys

import torch
import torch.distributed as dist
import time
from torch.multiprocessing import Process

# """Blocking point-to-point communication."""
#
def run(rank, size):
    # tensor = torch.rand(50,100)
    tensor = torch.rand(50,80)
    processnumber = 64
    resultlist = [[0 for i in range(2)]for j in range(64)]
    tensorsize =sys.getsizeof(tensor.storage())
    traintimes = 3
    dsetination = 63
    for dsetination in range(1, processnumber):
        avsendtime = 0
        avrectime = 0
        for j in range(traintimes):
            if dist.get_rank() == 0:
                # tensor += 1
                # # Send the tensor to process 1
                beginsend = time.time()
                dist.send(tensor=tensor, dst=dsetination)
                endsend = time.time()
                # if dsetination == processnumber -1:
                #     print("sendtimes is:", j, "spend time is:", endsend - beginsend,"begin time is:",beginsend,"end time is:",endsend)
                if j != 0:
                    avsendtime += endsend - beginsend
            elif dist.get_rank() == dsetination:
                # Receive tensor from process 0
                beginrec = time.time()
                dist.recv(tensor=tensor, src=0)
                endrec = time.time()
                if j != 0:
                    avrectime += endrec - beginrec
                    if dist.get_rank() == processnumber - 1:
                        print("receivetimes is:", j, "spend time is:", endrec - beginrec, "beginre time is:", beginrec,
                              "endre time is:", endrec)
        if dist.get_rank() == 0:
            resultlist[dsetination][0] = avsendtime/(traintimes-1)
            rtime = avsendtime/(traintimes-1)
            print("send to:", dsetination, "time is:",rtime,"speed is:",(tensorsize/rtime)/(2**30),"GB/S")
        if dist.get_rank() == dsetination:
            resultlist[dsetination][1] = avrectime/(traintimes-1)
            rtime = avrectime/(traintimes-1)
            print("rank is:", dsetination, "rectime is:",rtime,"speed is:",(tensorsize/rtime)/(2**30),"GB/S")
    # print(resultlist)
    # print('Rank ', rank, ' has data ', tensor)
    # mylist = tensor.numpy().tolist()
    # print('Rank ', dist.get_rank(), ' has data ',mylist)
# """ All-Reduce example."""
# def run(rank, size):
#     """ Simple point-to-point communication. """
#     group = dist.new_group([0, 1])
#     tensor = torch.ones(1)
#     dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
#     print('Rank ', rank, ' has data ', tensor[0])


def init_process(rank, size, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    init_process(0, 0, run, backend='mpi')
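For reference, a script like this is normally launched through MPI (for example, `mpirun -n 64 python run.py`), and `backend='mpi'` is only available when PyTorch was built against an MPI implementation.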

I send a tensor from rank 0 to ranks 1 through 63, 1000 times per destination. The last test, against rank 63, produced an unacceptable outlier, and I can't understand why it takes so long.

Result:
```
receivetimes is: 991 spend time is: 5.1021575927734375e-05 beginre time is: 1608996863.2943463 endre time is: 1608996863.2943974
receivetimes is: 992 spend time is: 5.078315734863281e-05 beginre time is: 1608996863.294423 endre time is: 1608996863.294474
receivetimes is: 993 spend time is: 5.0067901611328125e-05 beginre time is: 1608996863.2944958 endre time is: 1608996863.294546
receivetimes is: 994 spend time is: 5.078315734863281e-05 beginre time is: 1608996863.2945678 endre time is: 1608996863.2946186
receivetimes is: 995 spend time is: 5.1021575927734375e-05 beginre time is: 1608996863.2946415 endre time is: 1608996863.2946925
receivetimes is: 996 spend time is: 4.982948303222656e-05 beginre time is: 1608996863.2947164 endre time is: 1608996863.2947662
receivetimes is: 997 spend time is: 4.9591064453125e-05 beginre time is: 1608996863.2947881 endre time is: 1608996863.2948377
receivetimes is: 998 spend time is: 5.125999450683594e-05 beginre time is: 1608996863.2948594 endre time is: 1608996863.2949107
receivetimes is: 999 spend time is: 0.12546324729919434 beginre time is: 1608996863.2949336 endre time is: 1608996863.4203968
```
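One possible reading (speculation, not confirmed): a single 0.125 s spike among ~50 µs samples looks like scheduling jitter or an MPI protocol switch rather than steady-state transfer cost, and single send/recv timestamps are fragile against such one-off events. A more robust measurement sketch, aligning the ranks with a barrier before each trial and reporting a median; all ranks must call this function, and `timed_p2p` is a hypothetical helper:

```python
import time
import statistics
import torch
import torch.distributed as dist

def timed_p2p(tensor, dst, iters=1000):
    """Median blocking P2P time from rank 0 to `dst` (every rank must call this)."""
    samples = []
    for _ in range(iters):
        dist.barrier()                    # align all ranks before each trial
        t0 = time.perf_counter()
        if dist.get_rank() == 0:
            dist.send(tensor=tensor, dst=dst)
        elif dist.get_rank() == dst:
            dist.recv(tensor=tensor, src=0)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)     # median is robust to rare spikes
```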
