sands-lab / grace

GRACE - GRAdient ComprEssion for distributed deep learning

Home Page: https://sands.kaust.edu.sa/project/grace/

License: BSD 2-Clause "Simplified" License

Python 98.81% Shell 1.19%

grace's Issues

Discussion about TernGrad

Thanks for the great work of integrating gradient compression techniques into Horovod!

I'm also implementing a TensorFlow version of TernGrad under the Horovod framework, and I have tried to minimize the compress/decompress overhead by developing a Horovod op with supporting CUDA kernels. I've already implemented an allgather version, but it seems to cost too much memory when scaling to a large number of nodes. Hence, I'm trying to implement an allreduce version, which needs the "scaler sharing" technique mentioned in the original paper. However, I hit severe convergence issues on the BERT pretraining task compared with the allgather version without scaler sharing that I developed. Have you tried scaler sharing and allreduce for TernGrad?

Thanks!

DGC GPU usage in GRACE

Hi,

Thanks for making this awesome library; it really makes things easy to understand. However, I have encountered a performance issue. When using DGC, the GPU utilization (as reported by WANDB) is only 20%, whereas some techniques I have implemented myself reach over 75%. As a consequence, training time increases many-fold, which is obviously undesirable.

I will also take a deeper dive into this issue myself, as I'm currently comparing DGC to other methods. If anybody has tips on where the bottlenecks might be, let me know.

The batch size is definitely large enough, as other methods using a batch size of 128 show high utilization. Could it be that some tensor is located on the CPU?

Best,

Mario

bug: tensorflow natural compression returns 'Python int too large to convert to C long'

Problematic error: training is suddenly interrupted

Hi, I am working with your code, but for a specific reason I cannot install an environment that matches your requirements.
On the ABCI cloud, Python 3.7 requires GCC 9.3.0, but CUDA 10.1 requires GCC <= 8. So I installed:

Python 3.7.10
CUDA 11.0.3 # requires 10.1
CuDNN 8.0.5 # requires 7.6
OpenMPI 4.0.5
NCCL 2.7.8
mpi4py 3.1.1 # requires 3.0.0
horovod 21.0 # same as the requirements
cupy-cuda110 8.2.0

Training is suddenly interrupted: sometimes it reaches epoch 2, but most of the time it stops in epoch 1.

The reported error is "RuntimeError: CUDA error: device-side assert triggered - 'Index out of bounds' failed". When I changed the code to use the default Horovod compressor, it worked well.
So I assume there is a bug that I could not resolve. Could you check it? Many thanks.

Some questions about GRACE

Hi, I'm curious about why the Memory module is needed here. Why is it necessary to compensate before compressing and to update after compressing?

Also, does GRACE support local gradient accumulation?
For example, when I use topk or dgc compression, will it automatically accumulate the gradients that are too small in this round and reduce them once they are big enough? If so, do I have to control when zero_grad is called to clear the gradients?
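
The accumulation behaviour asked about above is what a residual (error-feedback) memory is generally meant to provide. The following is a minimal sketch in plain Python, assuming a top-k compressor. The class and function names (`ResidualMemorySketch`, `topk_sparsify`) are hypothetical and only borrow the compensate/update terminology from the question; this is not GRACE's actual implementation.

```python
class ResidualMemorySketch:
    """Toy error-feedback memory: keeps what compression dropped."""

    def __init__(self):
        self.residuals = {}  # tensor name -> list of leftover values

    def compensate(self, grad, name):
        # Add the leftover from previous rounds before compressing.
        residual = self.residuals.get(name, [0.0] * len(grad))
        return [g + r for g, r in zip(grad, residual)]

    def update(self, compensated, name, sent):
        # Store whatever was NOT transmitted this round.
        self.residuals[name] = [c - s for c, s in zip(compensated, sent)]


def topk_sparsify(values, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(values)]


memory = ResidualMemorySketch()
grad = [0.5, 0.25, 0.0]  # the same gradient arrives in every round
history = []
for _ in range(3):
    compensated = memory.compensate(grad, "w")
    sent = topk_sparsify(compensated, k=1)
    memory.update(compensated, "w", sent)
    history.append(sent)
```

In this sketch the second coordinate is never large enough to be selected on its own, but its leftovers add up and it is transmitted in the third round. The leftovers live in the memory object rather than in the gradient buffer, so clearing the gradient buffer would not touch them here; whether the same holds for a given GRACE memory class would need to be checked in its source.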

Example of `pack`, `unpack`

Hi,
A very useful package and a reference implementation!
I quote from the paper at https://repository.kaust.edu.sa/bitstream/handle/10754/662495/gradient-compression-survey.pdf?sequence=1&isAllowed=y

... pack can further compress the representation by encoding several values originally stored in 32-bit into a single 32-bit value. For instance, a 1-bit quantization can pack 32 values into a single 32-bit integer

However, I couldn't find any example of pack and unpack. Can you point me to one if it exists? If not, could you add such functionality to, for example, SignSGDCompressor?
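
To illustrate the idea in the quoted passage, here is a minimal sketch of packing sign bits into 32-bit words, written in plain Python. The function names `pack_signs` and `unpack_signs` are hypothetical and are not part of GRACE; a GPU implementation would operate on tensors instead of lists.

```python
def pack_signs(values):
    """Pack the sign of each value into 32-bit words (bit set = negative)."""
    words = []
    for start in range(0, len(values), 32):
        word = 0
        for offset, v in enumerate(values[start:start + 32]):
            if v < 0:
                word |= 1 << offset
        words.append(word)
    # The element count is needed to ignore padding bits when unpacking.
    return words, len(values)


def unpack_signs(words, count):
    """Recover +1.0 / -1.0 for each original element from the packed words."""
    signs = []
    for index in range(count):
        bit = (words[index // 32] >> (index % 32)) & 1
        signs.append(-1.0 if bit else 1.0)
    return signs


gradient = [0.3, -1.2, 0.0, -0.7] * 10  # 40 values fit in 2 words
packed, count = pack_signs(gradient)
restored = unpack_signs(packed, count)
```

Here 40 values are represented by two integers that each fit in 32 bits, instead of 40 separate 32-bit floats. Zero is treated as positive in this sketch, which is a design choice rather than a requirement.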

horovodrun command reports an error and cannot run the examples

I installed the environment exactly according to the steps in INSTALLING.md. When I ran the command from TRAINING.md to test it, the following error occurred:

horovodrun -np 2 -H 192.168.31.6:2 --verbose python examples/torch/pytorch_mnist.py

output:

Filtering local host names.
Remote host found:
All hosts are local, finding the interfaces with address 127.0.0.1
Local interface found lo
mpirun --allow-run-as-root --tag-output -np 2 -H 192.168.31.6:2 -bind-to none -map-by slot -mca btl_tcp_if_include lo -x NCCL_SOCKET_IFNAME=lo -x ADDR2LINE -x AR -x AS -x BROWSER -x CC -x CFLAGS -x CMAKE_PREFIX_PATH -x COLORTERM -x CONDA_BACKUP_ADDR2LINE -x CONDA_BACKUP_AR -x CONDA_BACKUP_AS -x CONDA_BACKUP_CC -x CONDA_BACKUP_CFLAGS -x CONDA_BACKUP_CMAKE_PREFIX_PATH -x CONDA_BACKUP_CONDA_BUILD_SYSROOT -x CONDA_BACKUP_CPP -x CONDA_BACKUP_CPPFLAGS -x CONDA_BACKUP_CXX -x CONDA_BACKUP_CXXFILT -x CONDA_BACKUP_CXXFLAGS -x CONDA_BACKUP_DEBUG_CFLAGS -x CONDA_BACKUP_DEBUG_CPPFLAGS -x CONDA_BACKUP_DEBUG_CXXFLAGS -x CONDA_BACKUP_ELFEDIT -x CONDA_BACKUP_GCC -x CONDA_BACKUP_GCC_AR -x CONDA_BACKUP_GCC_NM -x CONDA_BACKUP_GCC_RANLIB -x CONDA_BACKUP_GPROF -x CONDA_BACKUP_GXX -x CONDA_BACKUP_HOST -x CONDA_BACKUP_LD -x CONDA_BACKUP_LDFLAGS -x CONDA_BACKUP_LD_GOLD -x CONDA_BACKUP_NM -x CONDA_BACKUP_OBJCOPY -x CONDA_BACKUP_OBJDUMP -x CONDA_BACKUP_RANLIB -x CONDA_BACKUP_READELF -x CONDA_BACKUP_SIZE -x CONDA_BACKUP_STRINGS -x CONDA_BACKUP_STRIP -x CONDA_BACKUP__CONDA_PYTHON_SYSCONFIGDATA_NAME -x CONDA_BUILD_SYSROOT -x CONDA_CUPY_CUDA_PATH -x CONDA_DEFAULT_ENV -x CONDA_EXE -x CONDA_PREFIX -x CONDA_PREFIX_1 -x CONDA_PREFIX_2 -x CONDA_PREFIX_3 -x CONDA_PREFIX_4 -x CONDA_PREFIX_5 -x CONDA_PREFIX_6 -x CONDA_PREFIX_7 -x CONDA_PROMPT_MODIFIER -x CONDA_PYTHON_EXE -x CONDA_SHLVL -x CPP -x CPPFLAGS -x CUDA_PATH -x CXX -x CXXFILT -x CXXFLAGS -x DBUS_SESSION_BUS_ADDRESS -x DEBUG_CFLAGS -x DEBUG_CPPFLAGS -x DEBUG_CXXFLAGS -x ELFEDIT -x GCC -x GCC_AR -x GCC_NM -x GCC_RANLIB -x GIT_ASKPASS -x GPROF -x GXX -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOST -x LANG -x LANGUAGE -x LD -x LDFLAGS -x LD_GOLD -x LESSCLOSE -x LESSOPEN -x LOGNAME -x LS_COLORS -x MOTD_SHOWN -x NCCL_SOCKET_IFNAME -x NM -x OBJCOPY -x OBJDUMP -x PATH -x PWD -x RANLIB -x READELF -x SHELL -x SHLVL -x SIZE -x 
SSH_CLIENT -x SSH_CONNECTION -x STRINGS -x STRIP -x TERM -x TERM_PROGRAM -x TERM_PROGRAM_VERSION -x USER -x VSCODE_GIT_ASKPASS_EXTRA_ARGS -x VSCODE_GIT_ASKPASS_MAIN -x VSCODE_GIT_ASKPASS_NODE -x VSCODE_GIT_IPC_HANDLE -x VSCODE_IPC_HOOK_CLI -x XDG_DATA_DIRS -x XDG_RUNTIME_DIR -x XDG_SESSION_CLASS -x XDG_SESSION_ID -x XDG_SESSION_TYPE -x _ -x _CE_CONDA -x _CE_M -x _CONDA_PYTHON_SYSCONFIGDATA_NAME python examples/torch/pytorch_mnist.py
[mpiexec@gpu-server-1] match_arg (lib/utils/args.c:166): unrecognized argument allow-run-as-root
[mpiexec@gpu-server-1] HYDU_parse_array (lib/utils/args.c:181): argument matching returned error
[mpiexec@gpu-server-1] parse_args (mpiexec/get_parameters.c:315): error parsing input array
[mpiexec@gpu-server-1] HYD_uii_mpx_get_parameters (mpiexec/get_parameters.c:47): unable to parse user arguments
[mpiexec@gpu-server-1] main (mpiexec/mpiexec.c:49): error parsing parameters

I could not find a solution to this error on the Internet. I suspect that the MPI version is too new: mpirun --version reports 4.1.1. I don't know how to solve this problem. I tried various things, such as switching to an older server with a completely different configuration, but the same problem still occurred.

I hope to get your help. Thank you.

Threshold compression hangs.

When I use GRACE's dist package to do threshold compression with Allgather, it freezes during communication. I found two issues, one on each backend.

NCCL

GRACE's Allgather transfers the tensor to the default CUDA device. In my tests, NCCL cannot handle communication when several processes use the same GPU, whereas GLOO can. It runs after I changed the line to:
local_sizes = torch.tensor([t.numel() for t in tensors]).cuda(tensors[0].device.index)

GLOO

In an extreme situation, the threshold is larger than every gradient value, so the value and index tensors are empty. NCCL can gather an empty list, but GLOO gets stuck here.
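
One possible way to avoid the empty payload described above is to fall back to sending a single element when nothing passes the threshold. The following is a minimal sketch in plain Python; `threshold_select` is a hypothetical helper, not a GRACE function, and the fallback slightly changes what the compressor transmits.

```python
def threshold_select(values, threshold):
    """Return (indices, selected values) with |value| > threshold.

    The result is never empty for a non-empty input.
    """
    indices = [i for i, v in enumerate(values) if abs(v) > threshold]
    if not indices and values:
        # Fallback (assumption): send the single largest-magnitude entry
        # so that the collective never receives an empty tensor.
        indices = [max(range(len(values)), key=lambda i: abs(values[i]))]
    return indices, [values[i] for i in indices]


normal = threshold_select([0.1, -0.4, 0.2], 0.15)
fallback = threshold_select([0.1, -0.4, 0.2], 1.0)
```

An alternative with the same effect would be to pad the payload with a dummy index and a zero value; which option is preferable depends on how the receiver reconstructs the gradient.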

Any plans to port to the latest Horovod (0.24)?

First, thanks for the great effort here. Horovod really needs a good gradient compression library! Today's cloud environments really need this, which is why the official DDP has also started supporting gradient compression algorithms. FP16 is not enough anymore.

However, I can see that only an old version is supported. Are there any plans or a roadmap to port to newer versions?

Thanks,

EFSignSGDCompressor does not seem to use memory

Hi, maybe I'm misunderstanding something, but it seems that EFSignSGDCompressor (parameter compressor set to efsignsgd) never uses its residual memory (error feedback).

(The TensorFlow version has compensate and update in the compressor, but they are never called.)

Thank you

About the usage of GRACE

Is there any way to measure and record the communication time between compute nodes?
I want to obtain two separate numbers: the communication time between nodes, excluding the time spent in the compression algorithm, and the time spent in the compression algorithm itself. I couldn't find an appropriate way to do this.
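
A generic way to separate the two costs asked about above is to accumulate wall-clock time per phase. The following is a minimal sketch in plain Python; `timed`, `fake_compress`, and `fake_communicate` are hypothetical names used for illustration and are not part of GRACE or Horovod.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # phase name -> accumulated seconds


@contextmanager
def timed(phase):
    """Accumulate the wall-clock time spent inside the block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] += time.perf_counter() - start


def fake_compress(values):
    # Stand-in for a real compressor (hypothetical).
    return values[::2]


def fake_communicate(payload):
    # Stand-in for an allreduce/allgather call (hypothetical).
    time.sleep(0.01)
    return payload


gradient = list(range(1000))
with timed("compress"):
    payload = fake_compress(gradient)
with timed("communicate"):
    received = fake_communicate(payload)
```

On GPUs, kernels run asynchronously, so a wall-clock measurement is only meaningful if the device is synchronized (for example with `torch.cuda.synchronize()`) before each clock read. With asynchronous collectives, the communication time also has to include the wait on the returned handle. Both points add overhead, so the numbers are best treated as estimates.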

Question about the time complexity.

@hangxu0304 I combined two methods (TopK and QSGD) to compress the gradient vector. However, one training epoch now takes much longer than with either method alone, and I can't find the reason. The compressor code is shown below:

    import torch

    def compress(self, tensor, name):
        # Flatten and keep only the k largest-magnitude entries (Top-K).
        shape = tensor.size()
        numel = tensor.numel()
        tensor = tensor.flatten()

        k = max(1, int(numel * self.compress_ratio))
        _, indices = torch.topk(tensor.abs(), k)

        values = tensor[indices]

        # QSGD-style stochastic quantization of the selected values to 127 levels.
        upperbound = torch.max(values.abs()).flatten()

        abs_gradient = values.abs() * (127 / upperbound)

        # Round up with probability equal to the fractional part.
        now_level = abs_gradient.floor()
        prob = torch.empty_like(values).uniform_()
        is_next_level = (prob < (abs_gradient - now_level)).type(torch.float32)
        now_level += is_next_level

        # Signed levels lie in [-127, 127], so they fit in int8.
        tensor_compressed = (now_level * values.sign()).type(torch.int8)

        ctx = numel, shape

        return [tensor_compressed, indices, upperbound], ctx

    def decompress(self, tensor_compressed, ctx):
        numel, shape = ctx
        values, indices, upperbound = tensor_compressed

        # Map the signed int8 levels back to floats in [-upperbound, upperbound].
        decode_output = values.type(torch.float32) * (upperbound / 127)

        # Scatter the k decoded values into a dense tensor of zeros.
        tensor_decompressed = torch.zeros(numel, dtype=decode_output.dtype, layout=decode_output.layout, device=decode_output.device)
        tensor_decompressed.scatter_(0, indices, decode_output)

        return tensor_decompressed.view(shape)

I also ran an experiment on the compress and decompress functions to check whether they have high computational cost. They take very little time even for a vector with up to 1,000,000 elements: about 0.002 seconds.

I would appreciate your advice and help!
Sincerely yours.

Environment build failed on AWS EC2 g4dn instance with OS image (Deep Learning AMI, ami-0184e674549ab8432)

Hi there,

Thanks for sharing the framework! I am trying to experiment with the codebase, but I am not able to create a suitable environment on AWS. I am using Deep Learning AMI (Ubuntu 18.04) Version 60.4 as the instance image; the corresponding AMI is ami-0184e674549ab8432.

When I run ./create-grace-env-tf1.15.sh in the root folder of the project, the installation of Horovod raises the following error. The installation process does not stop, and I can see that horovod-0.21.0 is installed in the conda environment, but its configuration is not correct.

 make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
  Makefile:146: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
      return run_commands(dist)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
      dist.run_commands()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
      self.run_command(cmd_name)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
      super().run_command(command)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
      cwd=self.build_temp)
    File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero
 exit status 2.
  ----------------------------------------
  ERROR: Failed building wheel for horovod
  Running setup.py clean for horovod
Failed to build horovod

Running the command horovodrun -cb gives the following message, which indicates that the PyTorch extension is not enabled.

(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo

Besides, even though the TensorFlow extension for Horovod seems to be ready, actual training shows that it does not work properly. When running tensorflow_mnist.py with horovodrun, only the first GPU does the computation, which means that distributed training is not working.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   45C    P0    27W /  70W |    390MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   37C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10049      C   python                             97MiB |
|    0   N/A  N/A     10050      C   python                             97MiB |
|    0   N/A  N/A     10051      C   python                             97MiB |
|    0   N/A  N/A     10052      C   python                             97MiB |
+-----------------------------------------------------------------------------+

Any suggestions? It would be great if you could share a Docker environment or a more detailed system configuration.

A question about the one-bit quantization implementation in the TensorFlow backend

IMHO, the return type of tf.math.less is tf.bool, which is represented by 8 bits. Hence, isn't the code at grace_dl/tensorflow/compressor/onebit.py#L21 eight-bit quantization rather than true one-bit quantization?

ValueError: Tensor type torch.cuda.CharTensor is not supported.

@hangxu0304 I ran the QSGD algorithm on 4 GPUs and got an error.

compression = Allreduce(QSGDCompressor(127), NoneMemory())

# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer, compression, named_parameters=model.named_parameters())

The error message is

Traceback (most recent call last):
  File "./trainer.py", line 442, in <module>
    main()
  File "./trainer.py", line 235, in main
    train(train_loader, model, criterion, optimizer, epoch, log=log)
  File "./trainer.py", line 292, in train
    loss.backward()
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 150, in hook
    handle, ctx = self._communicate_grad_async(p)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 132, in _communicate_grad_async
    handle, ctx = self.grace.send_step(tensor, name)
  File "/GPUFS/nudt_chkwu_2/kfhu/grace-master/grace_dl/torch/__init__.py", line 54, in send_step
    handles = self.async_send(tensors_compressed, name)
  File "/GPUFS/nudt_chkwu_2/kfhu/grace-master/grace_dl/torch/communicator/allreduce.py", line 10, in async_send
    handles.append(allreduce_async_(tensor_compressed, self.compressor.average, name + str(i)))
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 186, in allreduce_async_
    return _allreduce_async(tensor, tensor, average, name)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 88, in _allreduce_async
    function = _check_function(_allreduce_function_factory, tensor)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 72, in _check_function
    raise ValueError('Tensor type %s is not supported.' % tensor.type())
ValueError: Tensor type torch.cuda.CharTensor is not supported.
yhrun: error: gpu16: tasks 0-3: Exited with exit code 1

Could you help me? Thanks.

A question about the QSGD algorithm in the paper

As shown in the figure.

In the case of s=4, l=3, the probability p(i) that I calculate according to the QSGD algorithm is not 0.2336. I obtain p(i)=0.2336 only in the case of l=0.

However, the g(i) corresponding to the probability 0.2336 should then be 0 instead of 1/4. This also agrees with the coordinate axis below the picture, where 0.2336 is closer to the endpoint 0.

Is this understanding correct? I sincerely look forward to your reply!
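
For reference, the stochastic rounding rule of QSGD can be sketched as follows. For a normalized magnitude a = |v_i| / ||v|| lying in the interval [l/s, (l+1)/s], the value is rounded up to (l+1)/s with probability a*s - l and down to l/s otherwise. The function name `qsgd_rounding` is hypothetical, and the input 0.0584 is an assumed value chosen only so that the probability comes out as 0.2336; the actual input in the figure is not reproduced here.

```python
import math


def qsgd_rounding(normalized, s):
    """For a normalized magnitude in [0, 1] and s levels, return
    (lower endpoint, upper endpoint, probability of rounding up)."""
    scaled = normalized * s
    l = min(math.floor(scaled), s - 1)  # interval index, 0 <= l < s
    p_up = scaled - l
    return l / s, (l + 1) / s, p_up


low, high, p_up = qsgd_rounding(0.0584, s=4)
# low is 0.0, high is 0.25, and p_up is approximately 0.2336
```

Under this rule, 0.2336 would be the probability of rounding up to 1/4, while 0 is chosen with the complementary probability 0.7664. The outcome is random rather than a deterministic choice of the nearest endpoint, which keeps the quantized value unbiased in expectation.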

ImportError: Extension horovod.torch has not been built

from horovod.tensorflow import allreduce_async_, synchronize

The program breaks off at the line above.
The error message is shown below:

Traceback (most recent call last):
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/torch/__init__.py", line 32, in <module>
    __file__, 'mpi_lib_v2')
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/torch/__init__.py", line 35, in <module>
    __file__, 'mpi_lib', '_mpi_lib')
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

Can you suggest a solution? I appreciate your help!

Global or Local compression

Dear authors,

I am using the issue form to ask a general question about implementation details. The framework works just fine. Thanks, great job! :)

You use Horovod's API to construct the compression (Compressor) and error-feedback (ResidualMemory) classes. As far as I know, the key methods (compress/decompress) are applied to each tensor (each layer's parameters) separately (please correct me if I am wrong). In some cases (top-k compression), statistics obtained from all layers are required (see https://arxiv.org/pdf/1712.01887.pdf as an example).

My question is: is it correct to apply compression methods that were designed to compress the entire update (like top-k) to each layer separately?

If possible, please share your view!

Seeking suggestions for embedding into DDP

Hi, PyTorch 1.8 has the new hook torch.nn.parallel.DistributedDataParallel.register_comm_hook(). Any advice on how to integrate GRACE into DDP using the dist examples?

How should I use GRACE if I change the compression strategy during training?

I want to change the compression method during the training process. In this case, how should I use GRACE? The pseudo-code is shown below:

compression = A()

optimizer = hvd.DistributedOptimizer(optimizer,
                                     compression,
                                     named_parameters=model.named_parameters())

for epoch in range(0, 100):
    if epoch > 50:
        compression = B()
        # How can the new compression be applied to the DistributedOptimizer here?

    train()
    test()

    .....

Could anyone help me solve this problem? I appreciate your help!
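
One generic pattern for this situation is to hand the optimizer a single wrapper object once and swap the compressor inside it. The following is a minimal sketch in plain Python, under the assumption that the optimizer keeps a reference to the object it was given and calls its compress and decompress methods on every step, and that ctx stays on the local process. `SwitchableCompressor`, `Identity`, and `Sign` are hypothetical names and are not part of GRACE.

```python
class SwitchableCompressor:
    """Delegates to whichever compressor is currently active."""

    def __init__(self, initial):
        self.active = initial

    def switch_to(self, compressor):
        self.active = compressor

    def compress(self, tensor, name):
        compressed, ctx = self.active.compress(tensor, name)
        # Remember which compressor produced this payload so that
        # decompress uses the matching one even right after a switch.
        return compressed, (self.active, ctx)

    def decompress(self, compressed, ctx):
        producer, inner_ctx = ctx
        return producer.decompress(compressed, inner_ctx)


class Identity:  # toy compressor A (hypothetical)
    def compress(self, tensor, name):
        return list(tensor), None

    def decompress(self, compressed, ctx):
        return list(compressed)


class Sign:  # toy compressor B (hypothetical): keeps only the sign
    def compress(self, tensor, name):
        return [1.0 if v >= 0 else -1.0 for v in tensor], None

    def decompress(self, compressed, ctx):
        return list(compressed)


compression = SwitchableCompressor(Identity())
outputs = []
for epoch in range(4):
    if epoch == 2:
        compression.switch_to(Sign())
    payload, ctx = compression.compress([0.5, -2.0], "w")
    outputs.append(compression.decompress(payload, ctx))
```

A real compressor may expose further attributes that the surrounding code reads, and any residual memory tied to the old compressor may need to be reset when switching. Both would have to be handled in addition to this sketch.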

DGC warmup

I know that the DGC algorithm uses warm-up training to exponentially increase the gradient sparsity.

Do you implement the warm-up algorithm in your code?
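
To make the warm-up idea concrete, here is a minimal sketch of an exponential sparsity schedule in plain Python. The function name `warmup_compress_ratio` and the default values (4 warm-up epochs, base 0.25, final ratio 0.001) are assumptions chosen for illustration; they are not taken from the GRACE code.

```python
def warmup_compress_ratio(epoch, warmup_epochs=4, base=0.25, final_ratio=0.001):
    """Fraction of gradient entries kept at a given epoch.

    During warm-up the kept fraction shrinks exponentially, so the
    sparsity (1 - ratio) grows; afterwards it stays at final_ratio.
    """
    if epoch < warmup_epochs:
        return max(base ** (epoch + 1), final_ratio)
    return final_ratio


schedule = [warmup_compress_ratio(e) for e in range(6)]
# -> [0.25, 0.0625, 0.015625, 0.00390625, 0.001, 0.001]
```

With these defaults the sparsity goes from 75% to 93.75%, 98.4375%, and about 99.6% before settling at 99.9%. Such a schedule could be applied by updating the compressor's compression ratio at the start of each epoch, assuming the compressor reads that ratio on every call.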

GRACE PyTorch Distributed

Does the PyTorch Distributed implementation still require Horovod to work? I see that the communication uses PyTorch DDP calls, but other function calls, such as broadcast_parameters and broadcast_optimizer_state, still use Horovod.

TypeError: compensate() missing 1 required positional argument: 'name'

I use GRACE to train an NLP model on the PennTreeBank dataset in a distributed manner, and I get the following error, which I could not solve.

Traceback (most recent call last):
  File "main.py", line 299, in <module>
    train(epoch, trainloader, model, optimizer, device, batch_size)
  File "main.py", line 132, in train
    loss.backward()
  File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/horovod/torch/__init__.py", line 150, in hook
    handle, ctx = self._communicate_grad_async(p)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/horovod/torch/__init__.py", line 132, in _communicate_grad_async
    handle, ctx = self.grace.send_step(tensor, name)
  File "../grace_dl/torch/__init__.py", line 51, in send_step
    tensor = self.memory.compensate(tensor, name)
TypeError: compensate() missing 1 required positional argument: 'name'

It seems that the function compensate can't get the right argument name. Any help is appreciated. Thanks!

[DDP] weight in classifier

I ran the example in examples/dist/CIFAR10-dawndist/dawn.py. In the network definition, there is weight = 0.125. Why is it necessary to multiply by weight here? How should the value of weight be chosen, and do other networks also need it? Are there other network definitions that can be used for reference?