baidu-research / baidu-allreduce

License: Apache License 2.0

Makefile 2.69% Cuda 57.01% C++ 40.30%

baidu-allreduce's Introduction

`baidu-allreduce`

baidu-allreduce is a small C++ library, demonstrating the ring allreduce and ring allgather techniques. The goal is to provide a template for deep learning framework authors to use when implementing these communication algorithms within their respective frameworks.

A description of the ring allreduce with its application to deep learning is available on the Baidu SVAIL blog.
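
To make the technique concrete, here is a minimal single-process sketch of a ring allreduce (element-wise sum). It is not the library's implementation: there is no MPI or CUDA, the "ranks" are just entries of a vector, and the name `SimulateRingAllreduce` is hypothetical. It only illustrates the two phases, a scatter-reduce followed by an allgather, each taking N - 1 steps around the ring.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: simulates a ring allreduce (sum)
// over bufs.size() "ranks" inside one process. bufs[r] is rank r's buffer.
// Afterwards every buffer holds the element-wise sum of all input buffers.
void SimulateRingAllreduce(std::vector<std::vector<float>>& bufs) {
    const std::size_t n = bufs.size();
    if (n < 2) return;  // nothing to exchange
    const std::size_t len = bufs[0].size();
    for (const auto& b : bufs) assert(b.size() == len);  // lengths must match

    // Split [0, len) into n nearly equal segments; segment s is
    // [start[s], start[s + 1]).
    std::vector<std::size_t> start(n + 1, 0);
    for (std::size_t s = 0; s < n; ++s)
        start[s + 1] = start[s] + len / n + (s < len % n ? 1 : 0);

    // Phase 1, scatter-reduce: in step i, rank r sends segment (r - i) mod n
    // to rank r + 1, which adds it to its own copy. Within a step each rank
    // sends one segment and receives a different one, so updating the
    // buffers in place is safe in this simulation.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        for (std::size_t r = 0; r < n; ++r) {
            const std::size_t seg = (r + n - i) % n;
            const std::vector<float>& src = bufs[r];
            std::vector<float>& dst = bufs[(r + 1) % n];
            for (std::size_t k = start[seg]; k < start[seg + 1]; ++k)
                dst[k] += src[k];
        }
    }

    // Rank r now holds the fully reduced segment (r + 1) mod n.
    // Phase 2, allgather: in step i, rank r sends segment (r + 1 - i) mod n
    // to rank r + 1, which overwrites its copy with the reduced values.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        for (std::size_t r = 0; r < n; ++r) {
            const std::size_t seg = (r + 1 + n - i) % n;
            const std::vector<float>& src = bufs[r];
            std::vector<float>& dst = bufs[(r + 1) % n];
            for (std::size_t k = start[seg]; k < start[seg + 1]; ++k)
                dst[k] = src[k];
        }
    }
}
```

In this sketch each rank sends and receives roughly 2 * (N - 1) / N times the buffer length in total, whatever the number of ranks.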

Installation

Prerequisites: Before compiling baidu-allreduce, make sure you have installed CUDA (7.5 or greater) and an MPI implementation.

baidu-allreduce has been tested with OpenMPI, but should work with any CUDA-aware MPI implementation, such as MVAPICH.

To compile baidu-allreduce, run

# Modify MPI_ROOT to point to your installation of MPI.
# You should see $MPI_ROOT/include/mpi.h and $MPI_ROOT/lib/libmpi.so.
# Modify CUDA_ROOT to point to your installation of CUDA.
# The Makefile appends /lib64 and /include to CUDA_ROOT itself.
make MPI_ROOT=/usr/lib/openmpi CUDA_ROOT=/path/to/cuda

You may need to modify your LD_LIBRARY_PATH environment variable to point to your MPI implementation as well as your CUDA libraries.

To run the baidu-allreduce tests after compiling it, run

# On CPU.
mpirun --np 3 allreduce-test cpu

# On GPU. Requires a CUDA-aware MPI implementation.
mpirun --np 3 allreduce-test gpu

Interface

The baidu-allreduce library provides the following C++ functions:

// Initialize the library, including MPI and if necessary the CUDA device.
// If device == NO_DEVICE, no GPU is used; otherwise, the device specifies which CUDA
// device should be used. All data passed to other functions must be on that device.
#define NO_DEVICE -1
void InitCollectives(int device);

// The ring allreduce. The lengths of the data chunks passed to this function
// must be the same across all MPI processes. The output memory will be
// allocated and written into `output`.
void RingAllreduce(float* data, size_t length, float** output);

// The ring allgather. The lengths of the data chunks passed to this function
// may differ across different devices. The output memory will be allocated and
// written into `output`.
void RingAllgather(float* data, size_t length, float** output);

The interface is deliberately simple and inflexible; it is meant as a demonstration. The code is fairly straightforward, and the same technique can be integrated into existing codebases in a variety of ways.
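
As an illustration of the allgather case, where per-rank lengths may differ, the following single-process sketch simulates a ring allgather. It is an assumption-laden toy rather than the library's code: `SimulateRingAllgather` is a hypothetical name, no MPI or CUDA is involved, and all lengths are known up front (a real multi-process version would first have to exchange them).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: simulates a ring allgather over
// inputs.size() "ranks" inside one process. inputs[r] is rank r's
// contribution; lengths may differ between ranks. Returns one output per
// rank, each the concatenation inputs[0], inputs[1], ..., inputs[n - 1].
std::vector<std::vector<float>> SimulateRingAllgather(
    const std::vector<std::vector<float>>& inputs) {
    const std::size_t n = inputs.size();

    // offset[r] is where rank r's segment starts in the gathered output.
    std::vector<std::size_t> offset(n + 1, 0);
    for (std::size_t r = 0; r < n; ++r)
        offset[r + 1] = offset[r] + inputs[r].size();

    // Each rank starts with only its own segment filled in.
    std::vector<std::vector<float>> out(n, std::vector<float>(offset[n], 0.0f));
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t k = 0; k < inputs[r].size(); ++k)
            out[r][offset[r] + k] = inputs[r][k];

    // In step i, rank r forwards segment (r - i) mod n to rank r + 1: first
    // its own segment, then whatever it received in the previous step.
    // A rank sends and receives different segments within a step, so copying
    // directly between the buffers is safe in this simulation.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        for (std::size_t r = 0; r < n; ++r) {
            const std::size_t seg = (r + n - i) % n;
            const std::vector<float>& src = out[r];
            std::vector<float>& dst = out[(r + 1) % n];
            for (std::size_t k = offset[seg]; k < offset[seg + 1]; ++k)
                dst[k] = src[k];
        }
    }
    return out;
}
```

After N - 1 steps every simulated rank holds all N segments, including empty ones, in rank order.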

baidu-allreduce's Issues

please add license

What about Reduce, Gather, Bcast?

Dear staff,
According to my tests, the ring-based algorithm definitely beats OpenMPI for allreduce and allgather. Indeed, these are the two MPI collectives that matter most for data parallelism in deep learning.
I wonder whether this algorithm is also suitable for other MPI collectives, such as Reduce, Gather, and Bcast. Would the same magic happen?

Why not NCCL?

What's the benefit of using this implementation as opposed to using NCCL?

Asynchronous allreduce?

Hi baidu research team,
Is it possible to build an asynchronous allreduce based on this project? I think it is quite important when integrating allreduce into a deep learning framework such as Caffe. Could you shed some light on this?

Thanks

Comment

Not so much an issue, as a comment/recommendation for future evolution.

https://arxiv.org/abs/1711.04883

There are significant (10x) gains possible under Intel Omni-Path, and a study is linked above.

Hope you find it useful.

Small change needed to build by default on RHEL / Fedora / CentOS

I'm not sure why, but RHEL / Fedora / CentOS split their libraries and headers into separate directory structures in the openmpi / openmpi-devel packages. The patch below makes the build work by default; perhaps MPI_INCLUDE_ROOT should default to MPI_ROOT to make things easier on OSes that don't have this split?

diff -u baidu-allreduce.orig/Makefile baidu-allreduce/Makefile
--- baidu-allreduce.orig/Makefile	2018-01-22 15:35:18.739557843 -0500
+++ baidu-allreduce/Makefile	2018-01-22 15:26:32.231119210 -0500
@@ -3,6 +3,11 @@
 $(error Could not find MPI in "$(MPI_ROOT)")
 endif
 
+# Check that MPI include path exists.
+ifeq ("$(wildcard $(MPI_INCLUDE_ROOT))","")
+$(error Could not find MPI in "$(MPI_INCLUDE_ROOT)")
+endif
+
 # Check that CUDA path exists.
 ifeq ("$(wildcard $(CUDA_ROOT))","")
 $(error Could not find CUDA in "$(CUDA_ROOT)")
@@ -11,7 +16,7 @@
 CC:=mpic++
 NVCC:=nvcc
 LDFLAGS:=-L$(CUDA_ROOT)/lib64 -L$(MPI_ROOT)/lib -lcudart -lmpi -DOMPI_SKIP_MPICXX=
-CFLAGS:=-std=c++11 -I$(MPI_ROOT)/include -I. -I$(CUDA_ROOT)/include -DOMPI_SKIP_MPICXX=
+CFLAGS:=-std=c++11 -I$(MPI_INCLUDE_ROOT) -I$(MPI_ROOT)/include -I. -I$(CUDA_ROOT)/include -DOMPI_SKIP_MPICXX=
 EXE_NAME:=allreduce-test
 SRC:=$(wildcard *.cpp test/*.cpp)
 CU_SRC:=$(wildcard *.cu)

The link to Baidu's main research page is broken

This link mentioned in the main readme file is broken:

http://research.baidu.com/bringing-hpc-techniques-deep-learning/
