sands-lab / grace
GRACE - GRAdient ComprEssion for distributed deep learning
Home Page: https://sands.kaust.edu.sa/project/grace/
License: BSD 2-Clause "Simplified" License
Thanks for the great work of integrating gradient compression techniques into Horovod!
I'm also implementing a TensorFlow version of TernGrad under the Horovod framework, and I tried to minimize the compress/decompress overhead by developing a Horovod op with supporting CUDA kernels. I've already implemented an allgather version, but it seems to cost too much memory when scaling to a large number of nodes. Hence, I'm trying to implement an allreduce version, which needs the "scaler sharing" technique mentioned in the original paper. But I ran into a severe convergence issue on the BERT pretraining task, compared to the allgather version without scaler sharing that I developed. Have you tried scaler sharing and allreduce for TernGrad?
Thanks!
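For readers following this thread, here is a minimal sketch of what scaler sharing changes, written in plain Python rather than as a Horovod op. It assumes the usual TernGrad ternarization (each element becomes -1, 0, or +1, with the probability of a nonzero equal to |g| / scaler). The function names and the list-based "workers" are hypothetical, and the max over workers stands in for an allreduce with a max operator.

```python
import random

def ternarize(grad, scaler, rng):
    """Stochastically map each element to -1, 0, or +1 with P(nonzero) = |g| / scaler."""
    codes = []
    for g in grad:
        p_nonzero = abs(g) / scaler if scaler > 0 else 0.0
        if rng.random() < p_nonzero:
            codes.append(1 if g > 0 else -1)
        else:
            codes.append(0)
    return codes

def shared_scaler(worker_grads):
    """Scaler sharing: one scaler for all workers (stand-in for a max-allreduce)."""
    return max(abs(g) for grad in worker_grads for g in grad)

def allreduce_ternary(worker_grads, rng):
    """Sum ternary codes across workers, then rescale once with the shared scaler."""
    s = shared_scaler(worker_grads)
    codes = [ternarize(grad, s, rng) for grad in worker_grads]
    summed = [sum(column) for column in zip(*codes)]
    n = len(worker_grads)
    return [s * c / n for c in summed]

rng = random.Random(0)
workers = [[0.5, -0.1, 0.0], [0.25, 0.3, -0.5]]
avg = allreduce_ternary(workers, rng)
```

Without a shared scaler, the ternary codes from different workers are expressed in different units, so summing the codes before rescaling no longer equals the average of the decompressed gradients.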
Hi,
Thanks for making this awesome library, it really makes things easy to understand. However, I have encountered some performance issues. When using DGC, the GPU utilization (as reported by WANDB) is only 20%. I was wondering what the cause could be, since with some techniques I have implemented I get over 75% utilization. As a consequence, training times increase many-fold, which is obviously undesirable.
I will also take a deeper dive into this issue myself, as I'm currently comparing DGC to other methods. If anybody has tips on where the bottlenecks might be, let me know.
The batch size is definitely large enough, as other methods using a batch size of 128 reach high utilization. I thought maybe some tensor is located on the CPU?
Best,
Mario
Hi, I am working with your code, but for platform-specific reasons I cannot install an environment that matches your requirements.
On the ABCI cloud, Python 3.7 requires GCC 9.3.0 but CUDA 10.1 requires GCC <= 8. So I tried to install:
Training gets interrupted unexpectedly: sometimes it reaches epoch 2, but most of the time it stops in epoch 1.
It reported "RuntimeError: CUDA error: device-side assert triggered - 'Index out of bounds' failed", but when I changed the code to use the default Horovod compressor it worked well.
So I assume there is a bug that I couldn't resolve. Could you check it? Many thanks.
Hi, I'm curious about why the Memory module is needed here. Why is it necessary to compensate before compressing and update after compressing?
Also, does grace support local gradient accumulation?
E.g., when I use topk or dgc compression, it would automatically accumulate the gradients that are too small in this round and reduce them once they are big enough. If so, do I have to control when to call zero_grad to clear the gradients?
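As background for the question, the compensate-then-update pattern is how error feedback is commonly structured. The sketch below is a toy, list-based illustration under that assumption; `ResidualMemorySketch` and `top1` are hypothetical names and not GRACE's classes.

```python
class ResidualMemorySketch:
    """Hypothetical error-feedback memory keyed by tensor name."""

    def __init__(self):
        self.residuals = {}

    def compensate(self, grad, name):
        # Add back whatever earlier rounds failed to send.
        residual = self.residuals.get(name, [0.0] * len(grad))
        return [g + r for g, r in zip(grad, residual)]

    def update(self, compensated, name, sent):
        # Remember the part of the compensated gradient that was not sent.
        self.residuals[name] = [c - s for c, s in zip(compensated, sent)]

def top1(grad):
    """Toy compressor: keep only the largest-magnitude element, zero the rest."""
    k = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return [g if i == k else 0.0 for i, g in enumerate(grad)]

memory = ResidualMemorySketch()
sent_log = []
for step_grad in ([1.0, 0.6], [1.0, 0.6]):
    compensated = memory.compensate(step_grad, "layer.weight")
    sent = top1(compensated)
    memory.update(compensated, "layer.weight", sent)
    sent_log.append(sent)
```

In the second step the small entry has accumulated to 1.2 and is the one that gets sent. Under this pattern the accumulation lives in the memory object rather than in `.grad`, so a regular `zero_grad` per step would not discard it. Whether GRACE's own memory classes behave the same way is for the maintainers to confirm.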
Hi,
A very useful package and a reference implementation!
I quote from the paper at https://repository.kaust.edu.sa/bitstream/handle/10754/662495/gradient-compression-survey.pdf?sequence=1&isAllowed=y
... pack can further compress the representation by encoding several values originally stored in 32-bit into a single 32-bit value. For instance, a 1-bit quantization can pack 32 values into a single 32-bit integer
However, I couldn't find any example of pack / unpack. Can you point me to one if it exists? If not, can you add such functionality to, for example, SignSGDCompressor?
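To make the quoted idea concrete, here is a minimal pure-Python sketch of packing sign bits into 32-bit words and unpacking them again. The names `pack_signs` and `unpack_signs` are hypothetical and are not part of GRACE. Mapping zero to +1 is an assumption of this sketch.

```python
def pack_signs(values):
    """Pack one sign bit per value (1 for >= 0, 0 for < 0) into 32-bit words."""
    words = []
    for start in range(0, len(values), 32):
        word = 0
        for offset, v in enumerate(values[start:start + 32]):
            if v >= 0:
                word |= 1 << offset
        words.append(word)
    # The element count is needed to ignore padding bits in the last word.
    return words, len(values)

def unpack_signs(words, count):
    """Recover +1.0 / -1.0 per element from the packed words."""
    signs = []
    for i in range(count):
        bit = (words[i // 32] >> (i % 32)) & 1
        signs.append(1.0 if bit else -1.0)
    return signs

grad = [0.3, -1.2, 0.0, -0.7] * 10  # 40 values fit in 2 words
words, count = pack_signs(grad)
restored = unpack_signs(words, count)
```

A tensor-based version would follow the same layout, with the shifts and ORs done as vectorized integer operations.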
I installed the environment exactly according to the steps in INSTALLING.md. When I used the commands in TRAINING.md to test, the following error occurred
horovodrun -np 2 -H 192.168.31.6:2 --verbose python examples/torch/pytorch_mnist.py
output:
Filtering local host names.
Remote host found:
All hosts are local, finding the interfaces with address 127.0.0.1
Local interface found lo
mpirun --allow-run-as-root --tag-output -np 2 -H 192.168.31.6:2 -bind-to none -map-by slot -mca btl_tcp_if_include lo -x NCCL_SOCKET_IFNAME=lo -x ADDR2LINE -x AR -x AS -x BROWSER -x CC -x CFLAGS -x CMAKE_PREFIX_PATH -x COLORTERM -x CONDA_BACKUP_ADDR2LINE -x CONDA_BACKUP_AR -x CONDA_BACKUP_AS -x CONDA_BACKUP_CC -x CONDA_BACKUP_CFLAGS -x CONDA_BACKUP_CMAKE_PREFIX_PATH -x CONDA_BACKUP_CONDA_BUILD_SYSROOT -x CONDA_BACKUP_CPP -x CONDA_BACKUP_CPPFLAGS -x CONDA_BACKUP_CXX -x CONDA_BACKUP_CXXFILT -x CONDA_BACKUP_CXXFLAGS -x CONDA_BACKUP_DEBUG_CFLAGS -x CONDA_BACKUP_DEBUG_CPPFLAGS -x CONDA_BACKUP_DEBUG_CXXFLAGS -x CONDA_BACKUP_ELFEDIT -x CONDA_BACKUP_GCC -x CONDA_BACKUP_GCC_AR -x CONDA_BACKUP_GCC_NM -x CONDA_BACKUP_GCC_RANLIB -x CONDA_BACKUP_GPROF -x CONDA_BACKUP_GXX -x CONDA_BACKUP_HOST -x CONDA_BACKUP_LD -x CONDA_BACKUP_LDFLAGS -x CONDA_BACKUP_LD_GOLD -x CONDA_BACKUP_NM -x CONDA_BACKUP_OBJCOPY -x CONDA_BACKUP_OBJDUMP -x CONDA_BACKUP_RANLIB -x CONDA_BACKUP_READELF -x CONDA_BACKUP_SIZE -x CONDA_BACKUP_STRINGS -x CONDA_BACKUP_STRIP -x CONDA_BACKUP__CONDA_PYTHON_SYSCONFIGDATA_NAME -x CONDA_BUILD_SYSROOT -x CONDA_CUPY_CUDA_PATH -x CONDA_DEFAULT_ENV -x CONDA_EXE -x CONDA_PREFIX -x CONDA_PREFIX_1 -x CONDA_PREFIX_2 -x CONDA_PREFIX_3 -x CONDA_PREFIX_4 -x CONDA_PREFIX_5 -x CONDA_PREFIX_6 -x CONDA_PREFIX_7 -x CONDA_PROMPT_MODIFIER -x CONDA_PYTHON_EXE -x CONDA_SHLVL -x CPP -x CPPFLAGS -x CUDA_PATH -x CXX -x CXXFILT -x CXXFLAGS -x DBUS_SESSION_BUS_ADDRESS -x DEBUG_CFLAGS -x DEBUG_CPPFLAGS -x DEBUG_CXXFLAGS -x ELFEDIT -x GCC -x GCC_AR -x GCC_NM -x GCC_RANLIB -x GIT_ASKPASS -x GPROF -x GXX -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOST -x LANG -x LANGUAGE -x LD -x LDFLAGS -x LD_GOLD -x LESSCLOSE -x LESSOPEN -x LOGNAME -x LS_COLORS -x MOTD_SHOWN -x NCCL_SOCKET_IFNAME -x NM -x OBJCOPY -x OBJDUMP -x PATH -x PWD -x RANLIB -x READELF -x SHELL -x SHLVL -x SIZE -x 
SSH_CLIENT -x SSH_CONNECTION -x STRINGS -x STRIP -x TERM -x TERM_PROGRAM -x TERM_PROGRAM_VERSION -x USER -x VSCODE_GIT_ASKPASS_EXTRA_ARGS -x VSCODE_GIT_ASKPASS_MAIN -x VSCODE_GIT_ASKPASS_NODE -x VSCODE_GIT_IPC_HANDLE -x VSCODE_IPC_HOOK_CLI -x XDG_DATA_DIRS -x XDG_RUNTIME_DIR -x XDG_SESSION_CLASS -x XDG_SESSION_ID -x XDG_SESSION_TYPE -x _ -x _CE_CONDA -x _CE_M -x _CONDA_PYTHON_SYSCONFIGDATA_NAME python examples/torch/pytorch_mnist.py
[mpiexec@gpu-server-1] match_arg (lib/utils/args.c:166): unrecognized argument allow-run-as-root
[mpiexec@gpu-server-1] HYDU_parse_array (lib/utils/args.c:181): argument matching returned error
[mpiexec@gpu-server-1] parse_args (mpiexec/get_parameters.c:315): error parsing input array
[mpiexec@gpu-server-1] HYD_uii_mpx_get_parameters (mpiexec/get_parameters.c:47): unable to parse user arguments
[mpiexec@gpu-server-1] main (mpiexec/mpiexec.c:49): error parsing parameters
It is difficult to find a solution to this error on the Internet. I suspect that the MPI version is too new: the mpirun --version command reports version 4.1.1. But I don't know how to solve this problem. I tried various approaches, such as switching to an older server with a completely different configuration, but the same problem still occurred.
I hope to get your help, thank you.
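One thing worth checking here: the `[mpiexec@...] HYDU_parse_array` lines are characteristic of MPICH's Hydra launcher, whereas `--allow-run-as-root` and `-mca` in the generated command are Open MPI options. The sketch below is a hypothetical helper that classifies the text printed by `mpirun --version`. The sample strings are illustrative, not captured from the reporter's machine.

```python
def mpi_flavor(version_output):
    """Guess the MPI launcher family from the text printed by `mpirun --version`."""
    text = version_output.lower()
    if "open mpi" in text or "open-mpi" in text:
        return "openmpi"
    if "hydra" in text or "mpich" in text:
        return "mpich"
    return "unknown"

# Illustrative inputs (assumed shapes of the version banners).
open_mpi_banner = "mpirun (Open MPI) 4.1.1"
hydra_banner = "HYDRA build details:\n    Version: 4.1.1"

flavors = [mpi_flavor(open_mpi_banner), mpi_flavor(hydra_banner)]
```

If the launcher on the PATH turns out to be MPICH, possible directions are putting an Open MPI `mpirun` first on the PATH or, if Horovod was built with Gloo support, trying `horovodrun --gloo`. Both are suggestions to verify, not a confirmed fix.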
When I use grace's dist package to do threshold compression with Allgather, it freezes during communication. I found two issues on different backends.
Grace's Allgather transfers the tensor to the default CUDA device. In my tests, NCCL cannot handle communication from the same GPU, but GLOO can. It runs after I changed it to:
local_sizes = torch.tensor([t.numel() for t in tensors]).cuda(tensors[0].device.index)
In an extreme situation, the threshold is bigger than all gradients, so the value and index tensors are empty. NCCL can gather an empty list, but GLOO gets stuck here.
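One way to sidestep the empty-tensor case is to make sure the selection is never empty. The sketch below is a plain-Python illustration of that guard. `threshold_select` is a hypothetical name, and falling back to the single largest-magnitude element is an assumption of this sketch that slightly changes the compression semantics.

```python
def threshold_select(values, threshold):
    """Return (kept_values, kept_indices); a non-empty input never yields an empty selection."""
    indices = [i for i, v in enumerate(values) if abs(v) >= threshold]
    if not indices and values:
        # Fallback (assumption): keep the single largest-magnitude element.
        indices = [max(range(len(values)), key=lambda i: abs(values[i]))]
    return [values[i] for i in indices], indices

normal = threshold_select([0.1, -0.4, 0.2], 0.3)
extreme = threshold_select([0.1, -0.4, 0.2], 5.0)  # threshold above every gradient
```

An alternative that preserves the semantics would be to pad every payload with one dummy element and strip it after the gather.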
First, thanks for the great effort here; Horovod really needs a good gradient compression library! Today's cloud environments really need this, which is why the official DDP has also started supporting gradient compression algorithms. FP16 is not enough anymore.
However, I can see that only an old version is supported. Are there any plans or a roadmap to port to newer versions?
Thanks,
Hi, maybe I'm misunderstanding something, but it seems that EFSignSGDCompressor (parameter compressor set to efsignsgd) never uses its residual memory (error feedback).
(The TensorFlow version has compensate and update in the compressor, but they are never called.)
Thank you
Is there any way to measure and record the communication time between compute nodes?
I want to measure the communication time between nodes excluding the time spent in the compression algorithm, and separately the time spent in the compression algorithm itself. I couldn't find an appropriate way to do it.
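A generic way to separate the two costs is to time each phase on its own. The sketch below is a hypothetical helper, not a GRACE feature. The `sync` hook is there because GPU work is asynchronous, so on CUDA one would pass something like `torch.cuda.synchronize` to get meaningful wall-clock numbers. The phase bodies are placeholders.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock seconds per named phase."""

    def __init__(self, sync=None):
        # Optional callable run before each clock read (e.g. a device synchronize).
        self.sync = sync
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        if self.sync:
            self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync:
                self.sync()
            self.totals[name] += time.perf_counter() - start

timer = PhaseTimer()
with timer.phase("compress"):
    payload = sorted(range(1000), reverse=True)  # placeholder for compress(...)
with timer.phase("communicate"):
    time.sleep(0.01)                             # placeholder for the collective call
with timer.phase("decompress"):
    restored = sorted(payload)                   # placeholder for decompress(...)
```

Synchronizing around every phase removes any overlap between computation and communication, so the totals describe the cost of each phase in isolation rather than their contribution to the end-to-end step time.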
@hangxu0304 I combined two methods (TopK and QSGD) to compress the gradient vector. But I found that one epoch of training takes much more time than with either method alone, and I can't find the reason. The compressor code is shown below:
def compress(self, tensor, name):
    shape = tensor.size()
    numel = tensor.numel()
    tensor = tensor.flatten()
    # TopK step: keep the k largest-magnitude entries.
    k = max(1, int(numel * self.compress_ratio))
    _, indices = torch.topk(tensor.abs(), k)
    values = tensor[indices]
    # QSGD-style step: stochastically round the kept values to integer levels in [-127, 127].
    upperbound = torch.max(values.abs()).flatten()
    abs_gradient = values.abs()
    abs_gradient *= 127 / upperbound
    now_level = abs_gradient.floor()
    prob = torch.empty_like(values).uniform_()
    is_next_level = (prob < (abs_gradient - now_level)).type(torch.float32)
    now_level += is_next_level
    tensor_compressed = (now_level * values.sign()).type(torch.int16)
    tensor_compressed = tensor_compressed.type(torch.int8)
    ctx = numel, shape
    return [tensor_compressed, indices, upperbound], ctx

def decompress(self, tensor_compressed, ctx):
    numel, shape = ctx
    values, indices, upperbound = tensor_compressed
    # Map the integer levels back to floating-point magnitudes, then restore the signs.
    decode_output = values.type(torch.float32).abs()
    decode_output *= upperbound / 127
    decode_output *= values.sign()
    # Scatter the kept values back into a dense zero tensor of the original size.
    tensor_decompressed = torch.zeros(numel, dtype=decode_output.dtype, layout=decode_output.layout, device=decode_output.device)
    tensor_decompressed.scatter_(0, indices, decode_output)
    return tensor_decompressed.view(shape)
I also ran an experiment on the compress and decompress functions to check whether they have a high computational cost. But I found that they take little time even for a vector with up to 1,000,000 elements: only about 0.002 seconds.
I appreciate your advice and help!
Sincerely yours.
Hi there,
Thanks for sharing the framework! I am trying to experiment with the codebase, but I am not able to create a suitable environment on the AWS platform. I am using Deep Learning AMI (Ubuntu 18.04) Version 60.4 as the cloud instance system; the corresponding AMI is ami-0184e674549ab8432.
When I run the command ./create-grace-env-tf1.15.sh under the root folder of the project, the installation of horovod raises the following error. The installation process didn't stop, and I can see that horovod-0.21.0 is installed in the conda environment, but the configuration is not correct.
make[1]: Leaving directory '/tmp/pip-install-hopus90d/horovod/build/temp.linux-x86_64-cpython-37'
Makefile:146: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-hopus90d/horovod/setup.py", line 193, in <module>
'horovodrun = horovod.runner.launch:run_commandline'
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
return run_commands(dist)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
dist.run_commands()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
self.run_command(cmd)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
super().run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
super().run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
self.run_command(cmd_name)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/dist.py", line 1229, in run_command
super().run_command(command)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/tmp/pip-install-hopus90d/horovod/setup.py", line 91, in build_extensions
cwd=self.build_temp)
File "/home/ubuntu/grace/env-tf1.15/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'RelWithDebInfo', '--', '-j8', 'VERBOSE=1']' returned non-zero exit status 2.
----------------------------------------
ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Failed to build horovod
Running the command horovodrun -cb gives the following message, which indicates that the PyTorch extension is not enabled.
(/home/ubuntu/grace/env-tf1.15) ubuntu@ip-172-31-82-84:~/grace$ horovodrun -cb
Horovod v0.21.0:
Available Frameworks:
[X] TensorFlow
[ ] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
Besides, even though the TensorFlow extension for Horovod seems ready, actual training indicates that it isn't working properly. When running tensorflow_mnist.py with horovodrun, only the first GPU does the computation. This means distributed training isn't working.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 45C P0 27W / 70W | 390MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 37C P8 15W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 37C P8 14W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 35C P8 14W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10049 C python 97MiB |
| 0 N/A N/A 10050 C python 97MiB |
| 0 N/A N/A 10051 C python 97MiB |
| 0 N/A N/A 10052 C python 97MiB |
+-----------------------------------------------------------------------------+
Any suggestions? It would be great if you could share a Docker environment or a more detailed system configuration.
IMHO, the return type of the function tf.math.less is tf.bool, which is represented by 8 bits. Hence, is the code at grace_dl/tensorflow/compressor/onebit.py#L21 not really one-bit quantization but eight-bit quantization?
@hangxu0304 I ran the QSGD algorithm on 4 GPUs and got an error.
compression = Allreduce(QSGDCompressor(127), NoneMemory())
# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer, compression, named_parameters=model.named_parameters())
The error message is
Traceback (most recent call last):
File "./trainer.py", line 442, in <module>
main()
File "./trainer.py", line 235, in main
train(train_loader, model, criterion, optimizer, epoch, log=log)
File "./trainer.py", line 292, in train
loss.backward()
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 150, in hook
handle, ctx = self._communicate_grad_async(p)
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 132, in _communicate_grad_async
handle, ctx = self.grace.send_step(tensor, name)
File "/GPUFS/nudt_chkwu_2/kfhu/grace-master/grace_dl/torch/__init__.py", line 54, in send_step
handles = self.async_send(tensors_compressed, name)
File "/GPUFS/nudt_chkwu_2/kfhu/grace-master/grace_dl/torch/communicator/allreduce.py", line 10, in async_send
handles.append(allreduce_async_(tensor_compressed, self.compressor.average, name + str(i)))
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 186, in allreduce_async_
return _allreduce_async(tensor, tensor, average, name)
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 88, in _allreduce_async
function = _check_function(_allreduce_function_factory, tensor)
File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 72, in _check_function
raise ValueError('Tensor type %s is not supported.' % tensor.type())
ValueError: Tensor type torch.cuda.CharTensor is not supported.
yhrun: error: gpu16: tasks 0-3: Exited with exit code 1
Could you help me? Thanks.
For s=4 and l=3, the probability p(i) that I calculate according to the QSGD algorithm is not 0.2336; I only get p(i)=0.2336 for l=0.
But the g(i) corresponding to the probability 0.2336 should be 0 instead of 1/4. This is also in line with the coordinate map below the picture, where the distance from 0.2336 to the endpoint (0) should be the shorter one.
Is this understanding correct? I sincerely look forward to your reply!
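For reference, here is a minimal sketch of the QSGD rounding rule as it is usually stated: with r = |v_i| / ||v|| and s levels, l = floor(r * s), and the value is rounded up to (l + 1) / s with probability r * s - l, otherwise down to l / s. The input r = 0.0584 is a hypothetical value chosen only so that r * s = 0.2336. Which quantity the figure labels as p(i) is not something this sketch can settle.

```python
import math
import random

def qsgd_level_and_prob(ratio, s):
    """For ratio = |v_i| / ||v|| in [0, 1], return (lower level index, P(round up))."""
    scaled = ratio * s
    lower = min(math.floor(scaled), s - 1)
    p_up = scaled - lower
    return lower, p_up

def qsgd_quantize(ratio, s, rng):
    """Draw one quantized magnitude in {0, 1/s, ..., 1}."""
    lower, p_up = qsgd_level_and_prob(ratio, s)
    return (lower + 1) / s if rng.random() < p_up else lower / s

lower, p_up = qsgd_level_and_prob(0.0584, 4)
samples = {qsgd_quantize(0.0584, 4, random.Random(seed)) for seed in range(50)}
```

Under this rule, a rounding-up probability of 0.2336 at l = 0 means the element becomes 1/4 with probability 0.2336 and 0 with probability 0.7664, so 0 is the more likely outcome while the expectation stays equal to r.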
from horovod.tensorflow import allreduce_async_, synchronize
The program breaks off at the line above.
The error info is below:
Traceback (most recent call last):
File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/torch/__init__.py", line 32, in <module>
__file__, 'mpi_lib_v2')
File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/common/util.py", line 56, in check_extension
'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/torch/__init__.py", line 35, in <module>
__file__, 'mpi_lib', '_mpi_lib')
File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/common/util.py", line 56, in check_extension
'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Can you suggest a resolution? I appreciate your help!
Dear authors,
I am using the issue form to ask a general question about the implementation details. The framework works just fine. Thanks, great job! :)
You utilized Horovod's API to construct the compression (Compressor) and error-feedback (ResidualMemory) classes. As far as I know, the key methods (compress/decompress) are applied to each tensor (each layer's parameters) separately (please correct me if I am wrong). In some cases (top-k compression), statistics obtained from all the layers are required (see https://arxiv.org/pdf/1712.01887.pdf as an example).
My question is: is it correct to apply compression methods designed to compress the entire update (like top-k) to each layer separately?
If possible, please share your standpoint!
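To illustrate the distinction being asked about, the toy sketch below compares selecting the top entries inside each layer with selecting them across all layers at once. All names are hypothetical and the two "layers" are made-up lists; it makes no claim about which variant GRACE or the cited paper uses.

```python
def topk_indices(values, k):
    """Indices of the k largest-magnitude entries, in ascending index order."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return sorted(order[:k])

def per_layer_topk(layers, ratio):
    """Select max(1, ratio * size) entries inside each layer independently."""
    return {name: topk_indices(g, max(1, int(len(g) * ratio))) for name, g in layers.items()}

def global_topk(layers, ratio):
    """Select ratio * total entries across all layers at once."""
    flat = [(abs(v), name, i) for name, g in layers.items() for i, v in enumerate(g)]
    k = max(1, int(len(flat) * ratio))
    picked = {name: [] for name in layers}
    for _, name, i in sorted(flat, reverse=True)[:k]:
        picked[name].append(i)
    return {name: sorted(idx) for name, idx in picked.items()}

layers = {
    "conv": [0.9, -0.8, 0.7, 0.6],      # uniformly large gradients
    "fc": [0.05, -0.01, 0.02, 0.03],    # uniformly small gradients
}
local = per_layer_topk(layers, 0.5)
overall = global_topk(layers, 0.5)
```

With the same ratio, the per-layer variant sends two entries from each layer, while the global variant spends its whole budget on "conv" and sends nothing from "fc". The two are therefore different compressors whenever gradient magnitudes differ across layers.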
Hi, PyTorch 1.8 has this new hook, torch.nn.parallel.DistributedDataParallel.register_comm_hook(). Any advice on how to integrate grace into DDP using the dist examples?
I want to change the compression during the training process. In this case, how should I use GRACE? The pseudo-code is shown below:
compression = A()
optimizer = hvd.DistributedOptimizer(optimizer,
                                     compression,
                                     named_parameters=model.named_parameters())
for epoch in range(0, 100):
    if epoch > 50:
        compression = B()
        # How to apply the compression above to the DistributedOptimizer?
    train()
    test()
    .....
Could anyone help me solve this problem? I appreciate your help!
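One possible approach is to hand the optimizer a single wrapper object once and switch what it delegates to, since rebinding the local name `compression` does not change the object the optimizer already holds. The sketch below is hypothetical: `SwitchableCompressor`, `Identity`, and `SignOnly` are made-up stand-ins, and it assumes the optimizer calls `compress`/`decompress` and reads `average` on the object it was given at every step.

```python
class SwitchableCompressor:
    """Hypothetical wrapper that forwards to whichever compressor is active."""

    def __init__(self, compressors, active):
        self.compressors = dict(compressors)
        self.active = active

    @property
    def average(self):
        # Forward the flag that the communicator reads from the compressor.
        return self.compressors[self.active].average

    def switch_to(self, key):
        if key not in self.compressors:
            raise KeyError(key)
        self.active = key

    def compress(self, tensor, name):
        payload, ctx = self.compressors[self.active].compress(tensor, name)
        # Record which compressor produced the payload so decompress matches it.
        return payload, (self.active, ctx)

    def decompress(self, payload, ctx):
        key, inner_ctx = ctx
        return self.compressors[key].decompress(payload, inner_ctx)

class Identity:  # toy stand-in for A()
    average = True
    def compress(self, tensor, name):
        return [list(tensor)], None
    def decompress(self, payload, ctx):
        return payload[0]

class SignOnly:  # toy stand-in for B()
    average = False
    def compress(self, tensor, name):
        return [[1.0 if v >= 0 else -1.0 for v in tensor]], None
    def decompress(self, payload, ctx):
        return payload[0]

compression = SwitchableCompressor({"A": Identity(), "B": SignOnly()}, active="A")
outputs = {}
for epoch in (50, 51):
    if epoch > 50:
        compression.switch_to("B")
    payload, ctx = compression.compress([0.5, -2.0], "w")
    outputs[epoch] = compression.decompress(payload, ctx)
```

In a distributed run every worker would need to switch at the same step, and any per-compressor memory state would need to be handled at the switch. Both points are outside this sketch.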
I know the DGC algorithm uses warm-up training to exponentially increase the gradient sparsity.
Do you implement the warm-up algorithm in your code?
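For illustration, an exponential sparsity warm-up can be expressed as a per-epoch compress ratio. The sketch below is an assumption-laden example: the function name, the base of 0.25, and the final ratio of 0.001 (75% sparsity ramping toward 99.9%) are chosen to show the shape of such a schedule and should be checked against the DGC paper. None of it is taken from GRACE's code.

```python
def warmup_compress_ratio(epoch, final_ratio=0.001, base=0.25):
    """Fraction of gradient entries kept at a given 0-based epoch."""
    return max(base ** (epoch + 1), final_ratio)

# Kept fraction for the first six epochs, and the matching sparsity.
schedule = [warmup_compress_ratio(e) for e in range(6)]
sparsity = [1.0 - r for r in schedule]
```

A training loop would then set the compressor's ratio from this function at the start of each epoch, assuming the compressor exposes such a parameter.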
Does the PyTorch Distributed implementation still require Horovod to work? I see that the communication protocols use PyTorch DDP calls, but there are other function calls, such as broadcast_parameters and broadcast_optimizer_state, that still use Horovod.
I am using GRACE to train an NLP model on the PennTreeBank dataset in a distributed manner, and I get the following error, which I couldn't solve.
Traceback (most recent call last):
File "main.py", line 299, in <module>
train(epoch, trainloader, model, optimizer, device, batch_size)
File "main.py", line 132, in train
loss.backward()
File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/horovod/torch/__init__.py", line 150, in hook
handle, ctx = self._communicate_grad_async(p)
File "/GPUFS/nudt_chkwu_2/.conda/envs/Horovod/lib/python3.6/site-packages/horovod/torch/__init__.py", line 132, in _communicate_grad_async
handle, ctx = self.grace.send_step(tensor, name)
File "../grace_dl/torch/__init__.py", line 51, in send_step
tensor = self.memory.compensate(tensor, name)
TypeError: compensate() missing 1 required positional argument: 'name'
It seems that the function compensate can't get the right argument name. Any help is appreciated. Thanks!
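One common way to get exactly this message in Python is to pass a class where an instance is expected, so the first positional argument is consumed as `self`. The sketch below reproduces that mechanism with a hypothetical `MemorySketch` class. Whether this is the cause in the reporter's script cannot be determined from the traceback alone.

```python
class MemorySketch:
    """Hypothetical memory class with a compensate(tensor, name) method."""

    def compensate(self, tensor, name):
        return tensor

def call_compensate(memory):
    """Call compensate the way the traceback does and report the outcome."""
    try:
        memory.compensate([1.0, 2.0], "layer.weight")
        return "ok"
    except TypeError as exc:
        return str(exc)

with_instance = call_compensate(MemorySketch())  # instance: self is bound automatically
with_class = call_compensate(MemorySketch)       # class object: the tensor lands in self
```

If this is the cause, constructing the memory object (note the parentheses) before passing it to the communicator would resolve it.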
I ran the example in examples/dist/CIFAR10-dawndist/dawn.py. In the network definition, there is weight = 0.125. Why is it necessary to multiply by weight here? How should the value of weight be chosen, and do other networks also need it? Are there other network definitions available for reference?