
gloo's Introduction

Gloo


Gloo is a collective communications library. It comes with a number of collective algorithms useful for machine learning applications. These include a barrier, broadcast, and allreduce.

Transport of data between participating machines is abstracted so that IP can be used at all times, or InfiniBand (or RoCE) when available. When the InfiniBand transport is used, GPUDirect can be used to accelerate cross-machine GPU-to-GPU memory transfers.

Where applicable, algorithms have an implementation that works with system memory buffers, and one that works with NVIDIA GPU memory buffers. In the latter case, it is not necessary to copy memory between host and device; this is taken care of by the algorithm implementations.
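
For illustration, here is a minimal sketch of how the pieces fit together, assembled from the usage patterns that appear in the issues further down (the header paths, the file-based rendezvous store, and the two-process setup are assumptions, not an official example):

#include <cstdlib>
#include <memory>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/transport/tcp/device.h"

int main(int argc, char** argv) {
  // Rank comes from the command line; run once with 0 and once with 1.
  const int rank = argc > 1 ? std::atoi(argv[1]) : 0;
  const int size = 2;

  // TCP transport; rendezvous through a store on shared storage.
  auto dev = gloo::transport::tcp::CreateDevice("localhost");
  gloo::rendezvous::FileStore store("/tmp/gloo");
  auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
  context->connectFullMesh(store, dev);

  // Sum-allreduce a small buffer in place across all participants.
  std::vector<float> data(4, 1.0f);
  std::vector<float*> ptrs = {data.data()};
  gloo::AllreduceRing<float> allreduce(context, ptrs, data.size());
  allreduce.run();
  return 0;
}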

Requirements

Gloo is built to run on Linux and has no hard dependencies other than libstdc++. That said, it will generally only be useful when used in combination with a few optional dependencies below.

Optional dependencies are:

  • CUDA and NCCL -- for CUDA aware algorithms, tests, and benchmark
  • Google Test -- to build and run tests
  • Hiredis -- for coordinating machine rendezvous through Redis
  • MPI -- for coordinating machine rendezvous through MPI

Documentation

Please refer to docs/ for detailed documentation.

Building

You can build Gloo using CMake.

Since it is a library, it is most convenient to vendor it in your own project and include the project root in your own CMake configuration.

Test

Building the tests requires Google Test version 1.8 or higher. On Ubuntu, this version ships with release 17.10 and newer. If you run an older release, you'll have to install Google Test yourself and set the GTEST_ROOT CMake variable.

You can install Google Test using conda with:

conda install -c anaconda gmock gtest

Be careful: you may need to hunt for a package build that is compatible with your glibc.

To build the tests, run:

mkdir -p build
cd build
cmake ../ -DBUILD_TEST=1 -DGTEST_ROOT=/some/path   # GTEST_ROOT only needed for a custom install
make
ls -l gloo/test/gloo_test*

To test the CUDA algorithms, also specify -DUSE_CUDA=ON; the CUDA tests are then built at gloo/test/gloo_test_cuda.

Benchmark

First install the dependencies required by the benchmark tool. On Ubuntu, you can do so by running:

sudo apt-get install -y libhiredis-dev

Then, to build the benchmark, run:

mkdir build
cd build
cmake ../ -DBUILD_BENCHMARK=1
make
ls -l gloo/benchmark/benchmark

Benchmarking

The benchmark tool depends on Redis/Hiredis for rendezvous. The benchmark tool for CUDA algorithms additionally depends on CUDA and NCCL.

To run a benchmark:

  1. Copy the benchmark tool to all participating machines

  2. Start a Redis server on any host (either a client machine or one of the machines participating in the test). Note that Redis Cluster is not supported.

  3. Determine a unique ID for the benchmark run (e.g. using the uuid tool, or just some number).

  4. On each machine, run (or pass --help for more options):

    ./benchmark \
      --size <number of machines> \
      --rank <index of this machine, starting at 0> \
      --redis-host <Redis host> \
      --redis-port <Redis port> \
      --prefix <unique identifier for this run> \
      --transport tcp \
      --elements <number of elements; -1 for a sweep> \
      --iteration-time 1s \
      allreduce_ring_chunked
    

Example output (running on 4 machines with a 40GbE network):

   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
          1        195        263        342        437       3921
          2        195        261        346        462       4039
          5        197        261        339        402       3963
         10        197        263        338        398       3749
         20        199        268        343        395       4146
         50        200        265        344        401       3889
        100        205        265        351        414       3645
        200        197        264        328        387       3960
        500        201        264        329        394       4274
       1000        200        267        330        380       3344
       2000        205        263        323        395       3682
       5000        240        335        424        460       3277
      10000        271        346        402        457       2721
      20000        283        358        392        428       2719
      50000        342        438        495        649       1654
     100000        413        487        669        799       1687
     200000       1113       1450       1837       2801        669
     500000       1099       1294       1665       1959        560
    1000000       1858       2286       2779       6100        320
    2000000       3546       3993       4364       4886        252
    5000000      10030      10608      11106      11628         92

License

Gloo is BSD-licensed.

gloo's People

Contributors

andrewwdye, gchanan, hgaiser, jiayisuse, kirteshpatil, malfet, manojkris, minsii, osalpekar, pbelevich, peterbell10, petrex, pietern, plapukhov, pritamdamania, pruthvistony, r-barnes, raminudelman, rathir, rmaz, rohan-varma, slayton58, smessmer, soumith, stanislavglebik, wesolwsk, xunnanxu, xw285cornell, yangqing, zicky


gloo's Issues

Error at broadcast_ops_gpu.cc:38

Hello,

I decided to move to a container. I have the latest clone of the Caffe2 repo and built the Docker image using the caffe2/docker/ubuntu16.04-cuda8-cudnn7 Dockerfile.

Everything is fine until this:

liangluo@n38:~/caffe2/docker/ubuntu-16.04-cuda8-cudnn7-all-options$ sh run.sh 2 0 3456
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 49984
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
WARNING:data_parallel_model:Distributed broadcast of computed params is not implemented yet
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0321569442749 secs
Traceback for operator 1212 in network resnet50_init
/usr/local/caffe2/python/data_parallel_model.py:1051
/usr/local/caffe2/python/data_parallel_model.py:1060
/usr/local/caffe2/python/data_parallel_model.py:964
/usr/local/caffe2/python/data_parallel_model.py:323
resnet50_trainer.py:443
resnet50_trainer.py:605
resnet50_trainer.py:609
Traceback (most recent call last):
File "resnet50_trainer.py", line 609, in
main()
File "resnet50_trainer.py", line 605, in main
Train(args)
File "resnet50_trainer.py", line 446, in Train
workspace.RunNetOnce(train_model.param_init_net)
File "/usr/local/caffe2/python/workspace.py", line 207, in RunNetOnce
StringifyProto(net),
File "/usr/local/caffe2/python/workspace.py", line 190, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at broadcast_ops_gpu.cc:38] false. Unhandled type: int Error from operator:
input: "broadcast_0_cw" input: "gpu_0/label" output: "gpu_0/label" name: "label" type: "Broadcast" arg { name: "status_blob" s: "broadcast_label_status" } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "GLOO" control_input: "gpu_0/data"

Where run.sh is simply:

nvidia-docker run --rm --net=host -it docker.sampa:5000/caffe2 bash -c "cd /caffe2/caffe2/python/examples && python resnet50_trainer.py --train_data null --image_size 256
--gpus 0 --batch_size 32 --num_labels 1000 --epoch_size 50000 --num_epochs 10 --num_shards $1 --shard_id $2 --redis_host 10.2.5.45 --distributed_transport tcp --distributed_interfaces ib0 --run_id $3"

I intend to train over IPoIB, so I used TCP with ib0. Note the warnings above. I compiled with USE_IBVERBS but didn't actually use it. Let me know if there is something clearly wrong.

Thanks!

Are the collective APIs provided by gloo thread-safe?

Dear gloo team,

Deep learning frameworks like Caffe backprop in a layer-wise fashion. It is better to run an allreduce immediately after the gradients of each layer are computed. To overlap not only communication with computation, but also the communication of different layers with each other, these allreduces should run in separate threads rather than being queued on a single thread.

Unfortunately, most MPI-like software, such as Open MPI and baidu-allreduce, does not guarantee thread safety.

Is this also your concern? How can we achieve this using gloo?

Thanks
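
A pattern that appears later in this list (see "The performance of using multiple gloo Context in multiple threads") is to give each thread its own context and algorithm instance, so collectives on different contexts can proceed independently. A minimal sketch of that pattern, assuming the store, device, rank, and size are set up as in the other examples:

#include <memory>
#include <string>
#include <thread>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/prefix_store.h"
#include "gloo/rendezvous/store.h"
#include "gloo/transport/device.h"

// One context (and one algorithm instance) per thread; nothing is shared
// between threads except the already-connected contexts.
void parallelAllreduce(gloo::rendezvous::Store& store,
                       std::shared_ptr<gloo::transport::Device>& dev,
                       int rank, int size,
                       std::vector<float*>& buffers, int count) {
  const int numThreads = buffers.size();
  std::vector<std::shared_ptr<gloo::rendezvous::Context>> contexts;
  for (int i = 0; i < numThreads; i++) {
    // Separate rendezvous prefix per context to avoid key collisions
    // (the benchmark runner uses PrefixStore the same way).
    gloo::rendezvous::PrefixStore prefixStore(std::to_string(i), store);
    auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
    context->connectFullMesh(prefixStore, dev);
    contexts.push_back(context);
  }
  std::vector<std::thread> threads;
  for (int i = 0; i < numThreads; i++) {
    threads.emplace_back([&, i] {
      // Each thread drives its own allreduce on its own context.
      gloo::AllreduceRing<float> allreduce(contexts[i], {buffers[i]}, count);
      allreduce.run();
    });
  }
  for (auto& t : threads) {
    t.join();
  }
}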

Cannot run with SoftRoCE

Hello!

Gloo runs fine with RoCE, but it seems to get stuck with SoftRoCE.

It should just run out of the box, but it looks like it cannot get past the send. The rendezvous seems fine.

Do you have any ideas?

Running benchmark Error: reply->type == REDIS_REPLY_INTEGER. 6 vs 3

Dear All,
When I used the following command,
./benchmark --size 2 --rank 0 --redis-host 172.16.18.218 --redis-port 7000 --prefix 152 --transport tcp --elements 1000 --iteration-time 1s allreduce_ring_chunked

There was an error with ' what(): [enforce fail at /home/xulinquan/gloo/gloo/rendezvous/redis_store.cc:92] reply->type == REDIS_REPLY_INTEGER. 6 vs 3'.
How can I fix this error?

PS:
The version of Redis is 3.2.8
I used two machines (172.16.18.216 and 172.16.18.218).
On the machine 172.16.18.218, I started 3 nodes: 7000, 7001, 7002.
On the other machine, 172.16.18.216, I also started 3 nodes: 7003, 7004, 7005.
1) start the server: redis-server redis.conf
xulinquan@root0-SCW4350-216:/redis/7005$ ps -ef|grep redis
xulinqu+ 6779 1 0 15:40 ? 00:00:00 redis-server 172.16.18.216:7003 [cluster]
xulinqu+ 6788 1 0 15:40 ? 00:00:00 redis-server 172.16.18.216:7004 [cluster]
xulinqu+ 6808 1 0 15:41 ? 00:00:00 redis-server 172.16.18.216:7005 [cluster]
xulinquan@root0-PR4764GW-218:~/redis$ ps aufx|grep redis
xulinqu+ 11128 0.0 0.0 15944 2604 pts/30 S+ 15:43 0:00 _ grep --color=auto redis
xulinqu+ 11114 0.0 0.0 40452 9236 ? Ssl 15:43 0:00 redis-server 172.16.18.218:7001 [cluster]
xulinqu+ 11118 0.0 0.0 40452 9224 ? Ssl 15:43 0:00 redis-server 172.16.18.218:7000 [cluster]
xulinqu+ 11122 0.0 0.0 40452 9128 ? Ssl 15:43 0:00 redis-server 172.16.18.218:7002 [cluster]

2) Created the cluster:
xulinquan@root0-PR4764GW-218:~/redis$ ./redis-trib.rb create --replicas 1 172.16.18.218:7000 172.16.18.218:7001 172.16.18.218:7002 172.16.18.216:7003 172.16.18.216:7004 172.16.18.216:7005

Creating cluster
Performing hash slots allocation on 6 nodes...
Using 3 masters:
172.16.18.218:7000
172.16.18.216:7003
172.16.18.218:7001
Adding replica 172.16.18.216:7004 to 172.16.18.218:7000
Adding replica 172.16.18.218:7002 to 172.16.18.216:7003
Adding replica 172.16.18.216:7005 to 172.16.18.218:7001
M: 9f83308678f8bcf9e8c2faa7d0882689963d2276 172.16.18.218:7000
slots:0-5460 (5461 slots) master
M: 46e379a567b4f6ec5d84114973141b5a7a7bb966 172.16.18.218:7001
slots:10923-16383 (5461 slots) master
S: 37c102a7b6d6004ccce69d74ca2f62ef93eaa1d9 172.16.18.218:7002
replicates f656bcb8bb63b8384bc3d2cdde07feb34850e2bc
M: f656bcb8bb63b8384bc3d2cdde07feb34850e2bc 172.16.18.216:7003
slots:5461-10922 (5462 slots) master
S: 4fcaad8fb6481267ce58581f4b2f8ec0fe70bcf8 172.16.18.216:7004
replicates 9f83308678f8bcf9e8c2faa7d0882689963d2276
S: 06571f35518f4eb4570d2a49f0cf23d4681e6c09 172.16.18.216:7005
replicates 46e379a567b4f6ec5d84114973141b5a7a7bb966
Can I set the above configuration? (type 'yes' to accept): yes
Nodes configuration updated
Assign a different config epoch to each node
Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join....
Performing Cluster Check (using node 172.16.18.218:7000)
M: 9f83308678f8bcf9e8c2faa7d0882689963d2276 172.16.18.218:7000
slots:0-5460 (5461 slots) master
1 additional replica(s)
S: 4fcaad8fb6481267ce58581f4b2f8ec0fe70bcf8 172.16.18.216:7004
slots: (0 slots) slave
replicates 9f83308678f8bcf9e8c2faa7d0882689963d2276
S: 06571f35518f4eb4570d2a49f0cf23d4681e6c09 172.16.18.216:7005
slots: (0 slots) slave
replicates 46e379a567b4f6ec5d84114973141b5a7a7bb966
M: 46e379a567b4f6ec5d84114973141b5a7a7bb966 172.16.18.218:7001
slots:10923-16383 (5461 slots) master
1 additional replica(s)
M: f656bcb8bb63b8384bc3d2cdde07feb34850e2bc 172.16.18.216:7003
slots:5461-10922 (5462 slots) master
1 additional replica(s)
S: 37c102a7b6d6004ccce69d74ca2f62ef93eaa1d9 172.16.18.218:7002
slots: (0 slots) slave
replicates f656bcb8bb63b8384bc3d2cdde07feb34850e2bc
[OK] All nodes agree about slots configuration.
Check for open slots...
Check slots coverage...
[OK] All 16384 slots covered.
xulinquan@root0-PR4764GW-218:~/redis$

Does anyone know how to configure Redis for testing the benchmark? I would be grateful.

gloo cannot find system nccl

I've posted this to the PyTorch repo, but it's clearly related to the gloo project, so I am reporting it here too.

🐛 Bug

When I build PyTorch from the latest repo, it produces an unusual error:

CMake Warning (dev) at cmake/Dependencies.cmake:846 (add_dependencies):
  Policy CMP0046 is not set: Error on non-existent dependency in
  add_dependencies.  Run "cmake --help-policy CMP0046" for policy details.
  Use the cmake_policy command to set the policy and suppress this warning.

  The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:201 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.
In file included from /u3/setup/pytorch/pytorch/third_party/gloo/gloo/nccl/nccl.cu:10:0:
/u3/setup/pytorch/pytorch/third_party/gloo/gloo/nccl/nccl.h:12:18: fatal error: nccl.h: No such file or directory
 #include <nccl.h>
                  ^
compilation terminated.
[ 20%] Linking CXX executable ../../bin/c10_DeviceGuard_test
[ 20%] Building CXX object c10/test/CMakeFiles/c10_logging_test.dir/logging_test.cpp.o
CMake Error at gloo_cuda_generated_nccl.cu.o.Release.cmake:215 (message):
  Error generating
  /u3/setup/pytorch/pytorch/build/third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/./gloo_cuda_generated_nccl.cu.o


make[2]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/gloo_cuda_generated_nccl.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....

To Reproduce

Just run python setup.py bdist_wheel if you have the wheel package from pip.

Expected behavior

Environment

PyTorch version: 1.0.0a0+60e7d04
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
CMake version: version 2.8.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 410.72
cuDNN version: Probably one of the following:
/usr/local/cudnn_6.0-cuda_8.0/lib64/libcudnn.so.6.0.21
/usr/local/cudnn_6.0-cuda_8.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.3-cuda_9.0/lib64/libcudnn.so.7.0.3
/usr/local/cudnn_7.0.3-cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.4-cuda_8.0/lib64/libcudnn.so.7.0.4
/usr/local/cudnn_7.0.4-cuda_8.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.4-cuda_9.0/lib64/libcudnn.so.7.0.4
/usr/local/cudnn_7.0.4-cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.0.5+cuda_9.1/lib64/libcudnn.so.7.0.5
/usr/local/cudnn_7.0.5+cuda_9.1/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.3+cuda_9.1/lib64/libcudnn.so.7.1.3
/usr/local/cudnn_7.1.3+cuda_9.1/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.4+cuda_9.0/lib64/libcudnn.so.7.1.4
/usr/local/cudnn_7.1.4+cuda_9.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.1.4+cuda_9.2/lib64/libcudnn.so.7.1.4
/usr/local/cudnn_7.1.4+cuda_9.2/lib64/libcudnn_static.a
/usr/local/cudnn_7.2.1+cuda_9.2/lib64/libcudnn.so.7.2.1
/usr/local/cudnn_7.2.1+cuda_9.2/lib64/libcudnn_static.a
/usr/local/cudnn_7.3.0+cuda_10.0/lib64/libcudnn.so.7.3.0
/usr/local/cudnn_7.3.0+cuda_10.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.3.1+cuda_10.0/lib64/libcudnn.so.7.3.1
/usr/local/cudnn_7.3.1+cuda_10.0/lib64/libcudnn_static.a
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn.so.7.3.1
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn.so.7.4.1
/usr/local/cudnn_7.4.1.5+cuda_10.0/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

Additional context

Can't compile gloo on a 32-bit system

The issue is straightforward: I need PyTorch on a Raspberry Pi, and PyTorch requires gloo, which can only be compiled on a 64-bit system. How can I get around this? Please help.

reference benchmarking on InfiniBand/RoCE

Hi!
Firstly, thanks for the nice work.
It's good to see the brief benchmark figures in README.md.

It would be great if anybody could show benchmarking results for --transport ibverbs in the same or a similar configuration (4 machines with a 40GbE network). Thanks!

How to use fixed port to make connection through TCP connection of gloo

I found that Gloo makes TCP connections on random ports every time, but this is not suitable for my setup: in my work, I need to open specific ports for it.
I got the suggestion from @pietern to have Gloo pick from a predefined set of ports, but I am still confused. Is there any way to solve this, or any detail about it? By the way, I use Gloo for distributed training with Caffe2.
Many thanks!

@pietern @zpao @yfeldblum @achao @gfosco

ProcessGroupGloo RuntimeError: Wait timeout

When I was trying "pytorch.distributed.init_process_group(rank=0,backend="gloo",init_method='file:///home/simon/Desktop/shared_folder/gloo_shared_file',world_size=2)", "RuntimeError: Wait timeout" came out.
The above path is a shared folder (Ubuntu, via Samba), and it works well.
The gloo_shared_file exists and can be found.

Traceback (most recent call last):
File "/home/simon/Desktop/dist_rl/dist_rl.py", line 13, in
dist.init_process_group(rank=0,backend="gloo",init_method='file:///home/simon/Desktop/shared_folder/gloo_shared_file',world_size=2)
File "/home/simon/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 283, in init_process_group
_default_pg = ProcessGroupGloo(store, rank, world_size)
RuntimeError: Wait timeout

Could somebody give me a hand? Thanks indeed.

Getting undefined template error when compiling with clang5 for caffe2

Getting undefined template error when compiling with clang5 for caffe2.

Probably caused by a missing #include <array>

caffe2/third_party/gloo/gloo/common/linux.cc:102:25: error: implicit instantiation of undefined template 'std::__1::array<char, 256>'
  std::array<char, 256> buf;
                        ^
/usr/bin/../include/c++/v1/__tuple:223:64: note: template is declared here
template <class _Tp, size_t _Size> struct _LIBCPP_TEMPLATE_VIS array;

[build/linux] missing -lpthread flag when linking gloo_test

[100%] Linking CXX executable gloo_test
cd /home/lumin/packages/gloo.pkg/gloo/obj-x86_64-linux-gnu/gloo/test && /usr/bin/cmake -E cmake_link_script CMakeFiles/gloo_test.dir/link.txt --verbose=1
/usr/bin/c++  -g -O2 -fdebug-prefix-map=/home/lumin/packages/gloo.pkg/gloo=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -std=c++11 -fPIC  -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -rdynamic CMakeFiles/gloo_test.dir/allreduce_builder_test.cc.o CMakeFiles/gloo_test.dir/allreduce_test.cc.o CMakeFiles/gloo_test.dir/barrier_test.cc.o CMakeFiles/gloo_test.dir/broadcast_builder_test.cc.o CMakeFiles/gloo_test.dir/broadcast_test.cc.o CMakeFiles/gloo_test.dir/linux_test.cc.o CMakeFiles/gloo_test.dir/main.cc.o  -o gloo_test -Wl,-rpath,/home/lumin/packages/gloo.pkg/gloo/obj-x86_64-linux-gnu/gloo ../libgloo_builder.so.0.5.0 ../libgloo.so.0.5.0 /usr/local/lib/libgtest.a 
/usr/bin/ld: /usr/local/lib/libgtest.a(gtest-all.cc.o): undefined reference to symbol 'pthread_key_delete@@GLIBC_2.2.5'
/usr/bin/ld: //lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
dh_auto_configure -- \
	-DUSE_REDIS=ON \
	-DUSE_IBVERBS=ON \
	-DUSE_MPI=ON \
	-DUSE_CUDA=OFF \
	-DUSE_NCCL=OFF \
	-DBUILD_TEST=ON \
	-DBUILD_BENCHMARK=OFF \
	-DBUILD_SHARED_LIBS=ON
	cd obj-x86_64-linux-gnu && cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=None -DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_INSTALL_LOCALSTATEDIR=/var -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_INSTALL_RUNSTATEDIR=/run "-GUnix Makefiles" -DUSE_REDIS=ON -DUSE_IBVERBS=ON -DUSE_MPI=ON -DUSE_CUDA=OFF -DUSE_NCCL=OFF -DBUILD_TEST=ON -DBUILD_BENCHMARK=OFF -DBUILD_SHARED_LIBS=ON ..

how to obtain communication time

hi @pietern, how can I measure communication time when a program runs with gloo on multiple nodes?
What about nccl?
Is there any method to separate computation time from nccl communication time, and from the communication time among different nodes?
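
For the gloo side, the bundled benchmark tool (gloo/benchmark) reports per-iteration latency distributions. Inside an application, a minimal sketch is to wrap run() in a wall-clock timer; note this measures only the blocking collective call, and says nothing about NCCL or about overlap with computation:

#include <chrono>
#include <cstdio>

// Time a single collective; `algorithm` is any constructed gloo
// algorithm instance (e.g. an AllreduceRing).
template <typename Algorithm>
double timedRunMs(Algorithm& algorithm) {
  auto start = std::chrono::steady_clock::now();
  algorithm.run();  // blocks until the collective completes on this rank
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage: std::printf("allreduce took %.3f ms\n", timedRunMs(allreduce));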

How can I compile gloo with openmpi?

Hi, I want to use MPI to run jobs across machines. I compiled gloo with:
cmake ../ -DBUILD_TEST=1 -DBUILD_BENCHMARK=1 -USE_MPI=ON

but it didn't compile the mpi directory. I checked the CMake files, and there seems to be no USE_MPI flag.

How can I build gloo with mpi?

benchmark comparison with NCCL

Through tests comparing gloo with NCCL, we found gloo has poor performance for large all-reduce sizes. We'd appreciate it if you could provide reasons or suggestions. Below are some of the tests.

test1

testbed

2 different Ubuntu Server each with 8 V100 GPU
with network IB RDMA 100Gbps (may have straggler)
NCCL 2.3.7
gloo(using gloo/benchmark/benchmark) commit 1d9e62a

gloo instruction

./benchmark --transport ibverbs --ib-device=*** --iteration-count 5000 --redis-host *** --prefix ** -s 2 -r * allreduce_ring 2>/dev/null

result

size     NCCL      gloo
1KB      0.16ms    0.039ms
2KB      0.06ms    0.04ms
4KB      0.06ms    0.042ms
8KB      0.12ms    0.05ms
16KB     0.13ms    0.064ms
32KB     0.15ms    0.088ms
64KB     0.13ms    0.155ms
128KB    0.09ms    1.073ms
256KB    0.13ms    20.161ms

test2

testbed

2 different Ubuntu Server each with 8 different GPU
with TCP/IP socket
NCCL 2.3.7
gloo(using gloo/benchmark/benchmark) commit 1d9e62a

gloo instruction

./benchmark --transport tcp --iteration-count 5000 --redis-host *** --prefix ** -s 2 -r * allreduce_ring 2>/dev/null

result

size     NCCL      gloo
1KB      1.89ms    0.354ms
2KB      1.46ms    0.417ms
4KB      2.06ms    0.472ms
8KB      2.61ms    0.731ms
16KB     2.23ms    1.297ms
32KB     2.96ms    2.160ms
64KB     4.02ms    3.813ms
128KB    3.66ms    5.614ms
256KB    6.25ms    10.980ms
512KB    7.12ms    20.544ms
1MB      11.91ms   39.855ms
2MB      20.9ms    78.237ms

CudaAllreduceHalvingDoubling writev error

Hi! This is a duplicate of a post from the Caffe2 GitHub page. This is my original question. I think this thread may also be related to the problem, or to an incorrect software/hardware setup.
The error is Write must always succeed in sync mode in Pair::write. This was a quick investigation:

There's something wrong with my hardware/software setup that I cannot figure out. I tried to run resnet50_trainer.py (no modifications) on single (2x K-80) and multiple nodes with Redis rendezvous, and it fails. In single-device mode (num_shards=1) it works OK.
It fails soon after starting, while performing the CudaAllreduceHalvingDoubling Gloo operation (transport: TCP). The method that fails is Pair::write in the Gloo library. It turns out that at some point it wants to send 4096040 bytes, waits for some time (sending??? - here something is wrong), and returns from the writev call with the number of bytes sent = 1376004. The failing call to Pair::write is not the first one; some previous calls work correctly.
Could you guys guide me on what I should pay attention to in order to resolve this?
Thanks!
PS: This is what works OK:
Caffe and Caffe2 run OK on 1, 2 and 4 GPUs (data parallel schema).
p2pBandwidthTest from CUDA samples runs OK.
Benchmarks from Gloo library run OK.
Firewalls are disabled.

I am planning to dig into this issue this weekend, but maybe you know what the cause is, or can give me a hint about where to start.
Thanks!

[Question] creating an ibverbs context, or any context

Hi! I'm very interested in trying out Gloo but I'm pretty new so the question may be dumb:

I'm trying to integrate Gloo into my code, and it seems all algorithms require a context object. I am attempting to create such a context with InfiniBand devices...

gloo::transport::ibverbs::attr verbsAttr;
verbsAttr.name = "mlx4_0";
dev = CreateDevice(verbsAttr);
gloo::Context context = ???

Any pointer is appreciated. Thank you!
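
For reference, the pattern that other issues in this list use (e.g. the InfiniBand allreduce report further down) is to wrap the device in a rendezvous context and connect it through a store. A minimal sketch, assuming a shared filesystem for rendezvous; the attr port/index fields and the store path are assumptions:

#include <memory>

#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/transport/ibverbs/device.h"

std::shared_ptr<gloo::rendezvous::Context> makeIbverbsContext(
    int rank, int size) {
  gloo::transport::ibverbs::attr verbsAttr;
  verbsAttr.name = "mlx4_0";  // HCA name, as in the snippet above
  verbsAttr.port = 1;         // assumed fields; defaults may suffice
  verbsAttr.index = 0;
  auto dev = gloo::transport::ibverbs::CreateDevice(verbsAttr);

  // Rendezvous through a directory on shared storage (path is an example).
  gloo::rendezvous::FileStore store("/path/to/shared/dir");
  auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
  context->connectFullMesh(store, dev);
  return context;  // pass to any algorithm constructor
}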

The performance of using multiple gloo Context in multiple threads

Hi, thanks for your great work.
I've found the multiple-Context-in-multiple-threads approach here: #35, and tried it, but found that we can't benefit much from it. My code starts 4 threads, each handling an allreduce-sum over 20 x 4MB buffers, compared with 4x20x4MB buffers in 1 thread, on an 8-core, 32-thread E5 machine with a GeForce GTX TITAN X. I use mpirun to start 4 processes. The result:

is_gpu=true, time for [4] thread: 625.00 ms, time for single thread: 827.00 ms.
is_gpu=false, time for [4] thread: 312.00 ms, time for single thread: 362.00 ms.

Am I using multiple gloo Contexts in the wrong way? Or are there limits on using multiple threads?

Here is my code:

#include <cassert>
#include <iostream>
#include <memory>
#include <gflags/gflags.h>
#include <glog/logging.h>

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <thread>

#include "timer.hpp"
#include "syncedmem.hpp"
#include "gloo/transport/tcp/device.h"
#include "gloo/allreduce_ring.h"
#include "gloo/cuda_allreduce_ring_chunked.h"
#include "gloo/allreduce_ring_chunked.h"
#include "gloo/mpi/context.h"

using std::shared_ptr;
using std::vector;
using std::thread;
using std::bind;

void thread_func(vector<float*> data, int count, bool is_gpu, shared_ptr<gloo::mpi::Context>& context) {
	if (is_gpu) {
		for (size_t i = 0; i < data.size(); ++i) {
			gloo::CudaAllreduceRingChunked<float> allreduce(context, {data[i]}, count);
			allreduce.run();
		}
	}else{
		for (size_t i = 0; i < data.size(); ++i) {
			gloo::AllreduceRingChunked<float> allreduce(context, {data[i]}, count, gloo::ReductionFunction<float>::sum);
			allreduce.run();
		}
	}	
}

void multi_thread_performance(bool is_gpu) {
  auto dev = gloo::transport::tcp::CreateDevice("localhost");
  
  int thread_num = 4;    
  vector<shared_ptr<gloo::mpi::Context> > contexts(thread_num+1);
  for (int i = 0; i < thread_num+1; ++i) {
	contexts[i] = std::make_shared<gloo::mpi::Context>(MPI_COMM_WORLD);
	contexts[i]->connectFullMesh(dev);  
  }
  
  const int count = 1024*1024;
  const int P = 20;
  const int N = thread_num * P;
  vector<float*> data_vec(N);
  // SyncedMemory is a cpu&gpu memory manager which is copied from Caffe.
  SyncedMemory mem(N * count * sizeof(float));
  for (int i = 0; i < N; ++i) {
	if (is_gpu) {
		data_vec[i] = (float*)mem.mutable_gpu_data() + i * count;
	}else{
		data_vec[i] = (float*)mem.mutable_cpu_data() + i * count;
	}	
  }
  
  // I found the first time is running much slower, so run each context in advance.
  for (int i = 0; i < thread_num+1; ++i) {
	thread_func(data_vec, count, is_gpu, contexts[i]);
  }
  
  CPUTimer timer;
  timer.Start();
  thread_func(data_vec, count, is_gpu, contexts[thread_num]);
  float s_elapse = timer.MilliSeconds();
  
  vector<std::thread> threads;
  timer.Start();
  for (int i = 0; i < thread_num; ++i) {
	  vector<float*> data(data_vec.begin() + i*P, data_vec.begin() + (i+1)*P);
	  threads.push_back(std::thread(std::bind(thread_func, data, count, is_gpu, contexts[i])));
  }
  for (int i = 0; i < thread_num; ++i) {
	  threads[i].join();
  }
  float m_elapse = timer.MilliSeconds();

  int rank = contexts[0]->rank;
  if (0 == rank) {
	printf("is_gpu=%s, time for [%d] thread: %.2f ms, time for single thread: %.2f ms.\n", 
			is_gpu ? "true" : "false", thread_num, m_elapse, s_elapse);
  }
	  
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  
  multi_thread_performance(true);
  multi_thread_performance(false);
  
  MPI_Finalize();
  return 0;
}

Add SOVERSION support to cmake build

I've created a debian package for gloo here: https://salsa.debian.org/lumin-guest/gloo

To meet Debian's requirements, I patched gloo as follows:

Purpose: Add SOVERSION to shared object
Forward:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 09c7ac4..42f96b2 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -7,6 +7,10 @@ set(GLOO_VERSION_MINOR 5)
 set(GLOO_VERSION_PATCH 0)
 set(GLOO_VERSION
     "${GLOO_VERSION_MAJOR}.${GLOO_VERSION_MINOR}.${GLOO_VERSION_PATCH}")
+set(DEB_MAINT_VERSION
+    "${GLOO_VERSION_MAJOR}.${GLOO_VERSION_MINOR}.${GLOO_VERSION_PATCH}")
+set(DEB_MAINT_SOVERSION
+    "${GLOO_VERSION_MAJOR}.${GLOO_VERSION_MINOR}.${GLOO_VERSION_PATCH}")
 
 # Gloo assumes 64-bit and doesn't run builds/tests for anything else.
 if(NOT CMAKE_SIZEOF_VOID_P EQUAL 8)

diff --git a/gloo/CMakeLists.txt b/gloo/CMakeLists.txt
index 7327905..727d751 100644
--- a/gloo/CMakeLists.txt
+++ b/gloo/CMakeLists.txt
@@ -83,6 +83,12 @@ configure_file(config.h.in config.h)
 
 add_library(gloo ${GLOO_STATIC_OR_SHARED} ${GLOO_SRCS})
 add_library(gloo_builder ${GLOO_STATIC_OR_SHARED} ${GLOO_BUILDER_SRCS})
+set_target_properties(gloo PROPERTIES
+       VERSION ${DEB_MAINT_VERSION}
+       SOVERSION ${DEB_MAINT_SOVERSION})
+set_target_properties(gloo_builder PROPERTIES
+       VERSION ${DEB_MAINT_VERSION}
+       SOVERSION ${DEB_MAINT_SOVERSION})
 target_link_libraries(gloo PRIVATE ${gloo_DEPENDENCY_LIBS})
 target_link_libraries(gloo_builder PUBLIC gloo)
 if(USE_CUDA)

https://salsa.debian.org/lumin-guest/gloo/blob/master/debian/patches/deb_soversion.patch

The variable names are not appropriate for a pull request, so I just paste the patch here.

benchmark --verify error

Hi!

I'm testing the benchmark program. When I use the --verify flag, I am getting some complaints.
what(): [enforce fail at /home/ubuntu/gloo/gloo/benchmark/main.cc:91] T(offset + expected) == input[i]. 2.4e+07 vs 2.4e+07. Mismatch at index: 375000
terminate called after throwing an instance of 'gloo::EnforceNotMet'

The command I used is:
benchmark -s ${totalClients} -r ${idx} -h xxx.xxx.xxx.xxx -p 6379 -t tcp --sync true --inputs 1 --elements 1000000 --iteration-count 1 --verify allreduce_ring_chunked

I ran this across 8 machines, so ${totalClients}=8, and ${idx} ranges from 0-7.

Did I do something obviously wrong?

This is running on Ubuntu, and has the latest master checked out.

Thanks!

Is it possible to use Gloo without redis?

I am trying to run the benchmark but I don't want to use Redis.

Is it possible to use just MPI to create the context and then run the included benchmark using mpirun or mpiexec?

I tried building without Redis by setting -DUSE_REDIS=0, but the cmake step fails. Is Redis a required dependency for Gloo?
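
The library itself can rendezvous over MPI instead of Redis: a later issue in this list ("The performance of using multiple gloo Context in multiple threads") uses gloo::mpi::Context for exactly this. A minimal sketch, assuming gloo was built with MPI support (whether the bundled benchmark binary can be driven this way is a separate question):

#include <mpi.h>

#include <memory>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/mpi/context.h"
#include "gloo/transport/tcp/device.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  {
    // Rendezvous over MPI_COMM_WORLD; no Redis involved. Use the real
    // hostname instead of "localhost" for multi-machine runs.
    auto dev = gloo::transport::tcp::CreateDevice("localhost");
    auto context = std::make_shared<gloo::mpi::Context>(MPI_COMM_WORLD);
    context->connectFullMesh(dev);

    std::vector<float> data(1024, 1.0f);
    std::vector<float*> ptrs = {data.data()};
    gloo::AllreduceRing<float> allreduce(context, ptrs, data.size());
    allreduce.run();
  }
  MPI_Finalize();
  return 0;
}

Launched with, e.g., mpirun -np 2 ./example.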

Fix use of BUILD_SHARED_LIBS

Right now it must be set to "SHARED" or "STATIC", but it should take "ON" or "OFF" like any other stock CMake variable.

Compile error with glibc 2.26

Compiling with the newly released glibc 2.26 gives the following error:

[ 72%] Built target gloo
[ 76%] Building NVCC (Device) object gloo/CMakeFiles/gloo_cuda.dir/gloo_cuda_generated_cuda_private.cu.o
[ 80%] Building NVCC (Device) object gloo/CMakeFiles/gloo_cuda.dir/gloo_cuda_generated_cuda.cu.o
/usr/include/bits/floatn.h(61): error: invalid argument to attribute "__mode__"

/usr/include/bits/floatn.h(73): error: identifier "__float128" is undefined

/usr/include/bits/floatn.h(61): error: invalid argument to attribute "__mode__"

/usr/include/bits/floatn.h(73): error: identifier "__float128" is undefined

2 errors detected in the compilation of "/tmp/tmpxft_000049cd_00000000-17_cuda.compute_61.cpp1.ii".
CMake Error at gloo_cuda_generated_cuda.cu.o.Release.cmake:278 (message):
  Error generating file
  /home/user/aur/gloo-git/src/gloo/build/gloo/CMakeFiles/gloo_cuda.dir//./gloo_cuda_generated_cuda.cu.o


make[2]: *** [gloo/CMakeFiles/gloo_cuda.dir/build.make:65: gloo/CMakeFiles/gloo_cuda.dir/gloo_cuda_generated_cuda.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
2 errors detected in the compilation of "/tmp/tmpxft_000049cc_00000000-17_cuda_private.compute_61.cpp1.ii".
CMake Error at gloo_cuda_generated_cuda_private.cu.o.Release.cmake:278 (message):
  Error generating file
  /home/user/aur/gloo-git/src/gloo/build/gloo/CMakeFiles/gloo_cuda.dir//./gloo_cuda_generated_cuda_private.cu.o


make[2]: *** [gloo/CMakeFiles/gloo_cuda.dir/build.make:72: gloo/CMakeFiles/gloo_cuda.dir/gloo_cuda_generated_cuda_private.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:98: gloo/CMakeFiles/gloo_cuda.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

OS: Arch Linux x86_64
gloo: git master r203.g804b537
Compiler: gcc 7.2.0
glibc: 2.26

note: caffe2 git master currently also gives a similar error, with and without gloo support.

Make failed

When I run the following command:

cmake ../ -DBUILD_TEST=1 -DBUILD_BENCHMARK=1 -DUSE_IBVERBS_DEFAULT=ON -DUSE_CUDA=1 && make

I have an issue:
~/gloo/build# cmake ../ -DBUILD_TEST=1 -DBUILD_BENCHMARK=1 -DUSE_IBVERBS_DEFAULT=ON -DUSE_CUDA=1 && make
-- Build type not set -- defaulting to Release
-- CUDA detected: 8.0
-- Added CUDA NVCC flags for: sm_30 sm_35 sm_50 sm_52 sm_60 sm_61
-- Found libcuda: /usr/local/cuda-8.0/lib64/stubs/libcuda.so
-- Found libnvrtc: /usr/local/cuda-8.0/lib64/libnvrtc.so
CMake Warning (dev) at gloo/test/CMakeLists.txt:11 (add_executable):
Policy CMP0037 is not set: Target names should not be reserved and should
match a validity pattern. Run "cmake --help-policy CMP0037" for policy
details. Use the cmake_policy command to set the policy and suppress this
warning.

The target name "test" is reserved or not valid for certain CMake features,
such as generator expressions, and may result in undefined behavior.
This warning is for project developers. Use -Wno-dev to suppress it.

-- Configuring done
-- Generating done
-- Build files have been written to: /root/gloo/build
[ 3%] Built target gtest
[ 6%] Built target gtest_main
[ 45%] Built target gloo
[ 58%] Built target gloo_cuda
[ 62%] Built target gloo_builder
[ 75%] Built target test
[ 87%] Built target test_cuda
[ 88%] Building CXX object gloo/benchmark/CMakeFiles/benchmark.dir/runner.cc.o
/root/gloo/gloo/benchmark/runner.cc: In member function ‘void gloo::benchmark::Runner::rendezvousFileSystem()’:
/root/gloo/gloo/benchmark/runner.cc:171:3: error: ‘PrefixStore’ is not a member of ‘gloo::rendezvous’
rendezvous::PrefixStore prefixStore(options_.prefix, fileStore);
^
/root/gloo/gloo/benchmark/runner.cc:174:35: error: ‘prefixStore’ was not declared in this scope
backingContext->connectFullMesh(prefixStore, transportDevices_.front());
^
/root/gloo/gloo/benchmark/runner.cc: In member function ‘void gloo::benchmark::Runner::run(gloo::benchmark::Runner::BenchmarkFn&, int)’:
/root/gloo/gloo/benchmark/runner.cc:253:19: warning: lambda capture initializers only available with -std=c++14 or -std=gnu++14
auto fn = [&benchmark = benchmarks[i]] { benchmark->run(); };
^
/root/gloo/gloo/benchmark/runner.cc:281:17: warning: lambda capture initializers only available with -std=c++14 or -std=gnu++14
auto fn = [&benchmark = benchmarks[i]] { benchmark->run(); };
^
gloo/benchmark/CMakeFiles/benchmark.dir/build.make:110: recipe for target 'gloo/benchmark/CMakeFiles/benchmark.dir/runner.cc.o' failed
make[2]: *** [gloo/benchmark/CMakeFiles/benchmark.dir/runner.cc.o] Error 1
CMakeFiles/Makefile2:561: recipe for target 'gloo/benchmark/CMakeFiles/benchmark.dir/all' failed
make[1]: *** [gloo/benchmark/CMakeFiles/benchmark.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

How can I solve it?

Thanks

Compile error

I'm getting the following compile error:

Scanning dependencies of target gloo
[  3%] Building CXX object gloo/CMakeFiles/gloo.dir/algorithm.cc.o
In file included from /home/user/src/gloot-git/gloo/algorithm.h:15:0,
                 from /home/user/src/gloot-git/gloo/algorithm.cc:10:
/home/user/src/gloot-git/gloo/math.h: In function ‘void gloo::sum(T*, const T*, size_t) [with T = gloo::float16; size_t = long unsigned int]’:
/home/user/src/gloot-git/gloo/math.h:103:3: error: ‘assert’ was not declared in this scope
   assert(is_aligned(y, 32));
   ^~~~~~
/home/user/src/gloot-git/gloo/math.h:103:3: note: suggested alternative: ‘qsort’
   assert(is_aligned(y, 32));
   ^~~~~~
   qsort
/home/user/src/gloot-git/gloo/math.h: In function ‘void gloo::product(T*, const T*, size_t) [with T = gloo::float16; size_t = long unsigned int]’:
/home/user/src/gloot-git/gloo/math.h:128:3: error: ‘assert’ was not declared in this scope
   assert(is_aligned(y, 32));
   ^~~~~~
/home/user/src/gloot-git/gloo/math.h:128:3: note: suggested alternative: ‘qsort’
   assert(is_aligned(y, 32));
   ^~~~~~
   qsort
/home/user/src/gloot-git/gloo/math.h: In function ‘void gloo::max(T*, const T*, size_t) [with T = gloo::float16; size_t = long unsigned int]’:
/home/user/src/gloot-git/gloo/math.h:153:3: error: ‘assert’ was not declared in this scope
   assert(is_aligned(y, 32));
   ^~~~~~
/home/user/src/gloot-git/gloo/math.h:153:3: note: suggested alternative: ‘qsort’
   assert(is_aligned(y, 32));
   ^~~~~~
   qsort
/home/user/src/gloot-git/gloo/math.h: In function ‘void gloo::min(T*, const T*, size_t) [with T = gloo::float16; size_t = long unsigned int]’:
/home/user/src/gloot-git/gloo/math.h:178:3: error: ‘assert’ was not declared in this scope
   assert(is_aligned(y, 32));
   ^~~~~~
/home/user/src/gloot-git/gloo/math.h:178:3: note: suggested alternative: ‘qsort’
   assert(is_aligned(y, 32));
   ^~~~~~
   qsort
make[2]: *** [gloo/CMakeFiles/gloo.dir/build.make:63: gloo/CMakeFiles/gloo.dir/algorithm.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:139: gloo/CMakeFiles/gloo.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

Using these steps:

$ git clone https://github.com/facebookincubator/gloo.git gloot-git
$ cd gloo-git
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DNCCL_ROOT_DIR=/opt/cuda -DCUDA_HOST_COMPILER=/usr/bin/g++-5 -DUSE_CUDA=on -DCMAKE_CXX_FLAGS:STRING="-march=native" -DCMAKE_C_FLAGS:STRING="-march=native"
$ make

System information:
CPU: Intel Haswell i7-4790K
OS: Arch Linux x86_64
Compiler: gcc 7.1.1
CUDA: 8.0.61
NCCL: 1.3.4-1

This error seems to be related to the -march=native option. When I remove this option it compiles fine. (The message suggests that math.h uses assert() without including <cassert>; the affected code paths are presumably only compiled when -march=native enables the relevant vector instructions.)

It does not seem to be gcc7-specific, because the same error shows up when using gcc5 with the -march=native option.

Complete output is attached.

gloo-compile-error.txt

Is it possible to run the IB benchmark on Ethernet via RoCE?

I tried to run the benchmark on 100G Ethernet via RoCE, but it failed with this error:

wc->status == IBV_WC_SUCCESS. 5 vs 0. Send for slot 0: Work Request Flushed Error

I wonder if it is possible to do this?

Error building Gloo in Caffe2

Hi, when I'm following the instructions here https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud,

I ran into the following errors when running make.

CMake Error at gloo_cuda_generated_nccl.cu.o.Release.cmake:203 (message):
Error generating
/sampa/home/liangluo/caffe2/build/third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/./gloo_cuda_generated_nccl.cu.o

CMake Error at gloo_cuda_generated_cuda.cu.o.Release.cmake:203 (message):
Error generating
/sampa/home/liangluo/caffe2/build/third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir//./gloo_cuda_generated_cuda.cu.o

CMake Error at gloo_cuda_generated_cuda_private.cu.o.Release.cmake:203 (message):
Error generating
/sampa/home/liangluo/caffe2/build/third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir//./gloo_cuda_generated_cuda_private.cu.o

make[2]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/nccl/gloo_cuda_generated_nccl.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/gloo_cuda_generated_cuda.cu.o] Error 1
make[2]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/gloo_cuda_generated_cuda_private.cu.o] Error 1
make[1]: *** [third_party/gloo/gloo/CMakeFiles/gloo_cuda.dir/all] Error 2

Any idea what this is about? Thanks a lot!

Allreduce stuck at ~200000 elements on EC2

Hi there,

Thanks for actively looking into these issues!

Again this issue is related to latest gloo.

I fired up 8 random machines on EC2 (c4.8xlarge, to be specific), which have a 10Gbps advertised link speed.
When I ran the benchmark program, it seems like allreduce_halving_doubling just cannot process more than 200000 elements before it times out.

I tried to change the default timeout to 10 minutes, but this time it seems like every process is just stuck (no network activity on the machines).

Since these are TCP connections and the machines have no problem talking to each other, what are the possible reasons for this (I set it to run only 1 iteration and sweep)?

Thanks again!

 elements   min (us)   p50 (us)   p99 (us)   max (us)   avg (GB/s)   samples
      100       1326       1326       1326       1326        0.000         1
      200       1524       1524       1524       1524        0.000         1
      500       1345       1345       1345       1345        0.001         1
     1000       1534       1534       1534       1534        0.002         1
     2000       1475       1475       1475       1475        0.005         1
     5000       1639       1639       1639       1639        0.011         1
    10000       1685       1685       1685       1685        0.022         1
    20000       2375       2375       2375       2375        0.031         1
    50000       2843       2843       2843       2843        0.066         1
   100000       4293       4293       4293       4293        0.087         1
   200000       6696       6696       6696       6696        0.111         1
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.0.155]:11150
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.1.174]:56243
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.2.64]:25830
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.18.43]:39576
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.2.176]:51846
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.2.53]:48297
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.11.55]:47757
terminate called after throwing an instance of 'gloo::IoException'
what(): [/home/ubuntu/gloo/gloo/transport/tcp/pair.cc:302] Write timeout [172.31.8.151]:20659

Does it support link-aggregated Ethernet NICs?

Hi,
I have two nodes, each with 2 Ethernet NICs bonded via link aggregation, but they appear to use only a single NIC's bandwidth when I use MPI or NCCL with PyTorch. I wonder about Gloo: does it support multiple connections between two nodes when they have bonded NICs? In short, does the current Gloo support multiple NICs and link aggregation? If not, is there any plan to support it?

Thanks!

error when running with ibverbs

Hi All,
I just pulled the latest Gloo repo and tried both the TCP and ibverbs transports. The TCP approach works well, but ibverbs does not. Could someone tell me whether I did anything wrong, or whether there is a bug in the gloo library?

I have two servers: ubuntu01 and ubuntu02. Those two servers are connected by 100Gb/s EDR Infiniband. The IPv4 of two servers are 192.168.254.217 and 192.168.254.218. The IPoIB of two servers are 10.149.0.1 and 10.149.0.2. Here ubuntu01 is used as redis server and ubuntu02 is used as redis client.

First I run redis-server redis.conf on ubuntu01. The following are the configurations in redis.conf
bind 192.168.254.217 10.149.0.1
protected-mode no
daemonize yes

Then I compile gloo with -DUSE_IBVERBS=1 -DBUILD_TEST=1 -DBUILD_BENCHMARK=1
Then I run benchmark on both servers:
on ubuntu01:
./benchmark --size 2 --rank 0 --redis-host 10.149.0.1 --redis-port 6379 --prefix 1303 --transport ibverbs --elements -1 allreduce_ring_chunked

on ubuntu02:
./benchmark --size 2 --rank 1 --redis-host 10.149.0.1 --redis-port 6379 --prefix 1303 --transport ibverbs --elements -1 allreduce_ring_chunked

But both nodes have the errors:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /home/rengan/gloo/gloo/transport/ibverbs/pair.cc:466] wc->status == IBV_WC_SUCCESS. 12 vs 0. Memory region send for slot 0: transport retry counter exceeded
Aborted (core dumped)

Has anyone seen this error before and knows how to fix it? Thanks.

Regards,
Rengan

I got "cmake fail".

$ cmake ../ -DBUILD_TEST=1 -DBUILD_BENCHMARK=1

CMake Error at gloo/CMakeLists.txt:122 (string):
  string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
  command.

I found the error message in "CMakeFiles/CMakeError.log":

CMakeFiles/cmTC_70939.dir/CheckSymbolExists.c.o: In function `main': CheckSymbolExists.c:(.text+0x16): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_70939.dir/build.make:97: recipe for target 'cmTC_70939' failed
make[1]: *** [cmTC_70939] Error 1
make[1]: Leaving directory '/disk3/daum/release/gloo/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_70939/fast' failed

Linking C executable cmTC_562bd
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_562bd.dir/link.txt --verbose=1
/usr/bin/cc -DCHECK_FUNCTION_EXISTS=pthread_create CMakeFiles/cmTC_562bd.dir/CheckFunctionExists.c.o -o cmTC_562bd -rdynamic -lpthreads
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_562bd.dir/build.make:97: recipe for target 'cmTC_562bd' failed
make[1]: *** [cmTC_562bd] Error 1
make[1]: Leaving directory '/disk3/daum/release/gloo/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_562bd/fast' failed
make: *** [cmTC_562bd/fast] Error 2

Connection Refused error while running test_cuda

The test_cuda sample provided with gloo gives the following error occasionally, when run multiple times. (Running on 1 machine with 2 GPUs)

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /home/sayantan/gloo_GIT/gloo/gloo/transport/tcp/pair.cc:498] optval == 0. 111 vs 0. SO_ERROR: Connection refused
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 29013 on node 172.17.27.12 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Alternatively, I get this error at times:

unknown file: Failure
C++ exception with description "[/home/sayantan/gloo_GIT/gloo/gloo/transport/tcp/pair.cc:135] bind: Address already in use" thrown in the test body.
AllreduceHalvingDoublingPipelined/CudaAllreduceTest.MultiPointerAsync/14, where GetParam() = (32, 64, 32-byte object <80-C9 28-03 00-00 00-00 B8-EA 2A-03 00-00 00-00 D0-BA 4B-00 00-00 00-00 60-B8 4B-00 00-00 00-00>)

And at times it succeeds.

Log files:
Failure_1node_1.txt
Failure_1node_2.txt
Failure_1node_3.txt

(Moved this issue here from facebookarchive/caffe2#360 as it seems to be more appropriate)

UPDATED:
Stacktrace_GDB.txt

Infiniband AllReduceHalvingAndDoubling error

Hi! Thanks for your time! I'm running into this error when using InfiniBand:

what(): [enforce fail at /sampa/home/liangluo/gloo/gloo/transport/ibverbs/pair.cc:470] wc->status == IBV_WC_SUCCESS. 8 vs 0. Memory region send for slot 0: local access error

while doing this:

pContext = std::make_shared<gloo::rendezvous::Context>(Postoffice::Get()->my_rank(), Postoffice::Get()->num_workers());
gloo::rendezvous::FileStore fileSync("/shared/Gloo");
pContext->connectFullMesh(fileSync, dev);

Key key = keys[0];
std::vector<float*> ptrs(1,(float*)vKeyAddress.at(key));
int elementCount = keySize.at(key) / sizeof(float);
gloo::AllreduceHalvingDoubling<float> reducer(pContext, ptrs, elementCount);
reducer.run();

This error is generally associated with incorrect memory registration, but first I just need to know whether I am using Gloo correctly.

Thanks!

hostname issue in Gloo

Hi All,

By checking the Gloo code, I found that Gloo first determines the hostname of a server using the gethostname() function, and then picks the corresponding network interface. However, the default hostname on my server resolves to an Ethernet address. I want to use a hostname that resolves to an IPoIB address, but it never succeeds.

For instance, the default hostname on my node is node001, which corresponds to the Ethernet address, but I want to use the hostname node001.ib.cluster, which corresponds to the IPoIB address. My current workaround is to change the hostname in /proc/sys/kernel/hostname, but this requires root permission.

So my questions are:

  1. Where does gethostname() exactly fetch the hostname?
  2. Is there any way to use a different hostname to use IPoIB instead of Ethernet address?

Thanks.
Rengan
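
One thing visible in the API as used elsewhere in this list: the tcp transport's CreateDevice takes a hostname directly (cf. the CreateDevice("localhost") call in the multi-threading issue above), so the IPoIB name can be passed explicitly instead of relying on gethostname(). A minimal sketch; the attr.iface variant assumes the tcp attr struct exposes an interface field in your version:

#include <memory>

#include "gloo/transport/device.h"
#include "gloo/transport/tcp/device.h"

std::shared_ptr<gloo::transport::Device> makeIpoibDevice() {
  // Bind to the IPoIB hostname explicitly rather than the default
  // derived from gethostname(). No root permission required.
  return gloo::transport::tcp::CreateDevice("node001.ib.cluster");

  // Alternative (assumed field): select the IPoIB interface directly.
  // gloo::transport::tcp::attr attr;
  // attr.iface = "ib0";
  // return gloo::transport::tcp::CreateDevice(attr);
}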

Error with world size of 1 (in Pytorch 1.0)

Hi,
I was trying Pytorch 1.0 with torch.distributed and with different backend and configs.
MPI and NCCL work fine with different world size.

I encountered an error when using Gloo allreduce with a world size of 1:

RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/allreduce.cc:37] context->getPair(recvRank). missing connection between rank 0 (this process) and rank 0

Here is the command line:

mkdir -p /tmp/com_file && python3 main.py --init-method 'file:///tmp/com_file/shared_file' --backend gloo --rank 0 --world-size 1

The script I am using:
https://github.com/Kh4L/pytorch-distributed-example/blob/master/mnist/main.py

I am not sure if I should open an issue here or on pytorch repo, but let's try here.
Thank you

Barrier followed by context destruction causes some participants to fail

The shutdown code may be a bit too aggressive here. We should have a stress test for the termination scenario where we loop on (context creation, barrier, context destruction) and verify nobody raises.

May be related to write queues not draining when getting a closure signal from another pair.
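
A minimal sketch of such a stress test, assuming the usual store/device setup and gloo's BarrierAllToAll (the per-round PrefixStore mirrors what the benchmark runner does):

#include <memory>
#include <string>

#include "gloo/barrier_all_to_all.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/prefix_store.h"
#include "gloo/rendezvous/store.h"
#include "gloo/transport/device.h"

// Loop on (context creation, barrier, context destruction); every
// iteration should complete without any participant raising.
void stressTerminate(gloo::rendezvous::Store& store,
                     std::shared_ptr<gloo::transport::Device>& dev,
                     int rank, int size, int rounds) {
  for (int i = 0; i < rounds; i++) {
    // Separate rendezvous prefix per round to avoid key collisions.
    gloo::rendezvous::PrefixStore prefixStore(std::to_string(i), store);
    auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
    context->connectFullMesh(prefixStore, dev);
    gloo::BarrierAllToAll barrier(context);
    barrier.run();
    // Context is destroyed here; the shutdown path is what's under test.
  }
}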
