MSCCL-tools

This repo contains the developer tool stack of the Microsoft Collective Communication Library (MSCCL), a platform for programmable communication on GPUs. Algorithms created with MSCCL can:

  • Implement either MPI-style collectives like Allreduce or any application-specific communication pattern.
  • Target specific hardware and interconnect topologies, unlocking their full potential.
  • Optimize for the data sizes in your application, making the best tradeoff between latency and bandwidth utilization.

MSCCL-tools also contains pre-made algorithms targeting various Azure multi-GPU VM types. See the Available Algorithms section to find out what is currently available.

MSCCL has two ways of creating new algorithms:

  1. MSCCLang, a high-level DSL that describes communication in an intuitive chunk-oriented form. See the MSCCLang section for how to get started.
  2. Synthesis, which automatically solves for optimal algorithms on a given hardware topology. Making synthesis general enough for common use cases is an ongoing research project. See the synthesis readme for an introduction.

Usage

The MSCCL Python package ships with a registry of synthesis strategies and hand-optimized algorithms. These can be loaded into the runtime through the msccl.init function, which must be called before the application creates its NCCL communicator. For PyTorch this means before torch.distributed is initialized.

The following snippet requests msccl.init to provide an Alltoall algorithm in a configuration of 2 Azure NDv2 machines:

import msccl
msccl.init('ndv2', 2, (msccl.Collective.alltoall, ('1MB')))

This finds an algorithm provider that can create an Alltoall algorithm expected to perform well at 1 MB of data. In this case, that provider calls a synthesis routine which writes the algorithm to disk. msccl.init then passes a configuration file pointing to this algorithm to the runtime through environment variables. If the VM SKU is unknown, 'auto' can be passed instead.
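
For example, with automatic detection (a minimal variation of the snippet above; the size spec stays illustrative):

import msccl
msccl.init('auto', 2, (msccl.Collective.alltoall, ('1MB')))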

See the examples for more on msccl.init usage.

Available Algorithms

MSCCL's built-in algorithms are registered for combinations of hardware configuration and input data size for which we have benchmarked them to provide a speedup over NCCL. To list the algorithms currently in MSCCL's built-in registry, run msccl plans list on the command line. This prints out the following table (as of 4/22/2022):

Machine  Collective  # machines   From    To        Protocol  Priority  Plan name
ndv2     alltoall    >=2          1 MB    infinity  Simple    0         call synthesize_ndv2_relay_alltoall
ndv4     allreduce   1            256 KB  20 MB     LL128     0         run ndv4_ring_allreduce
ndv4     alltoall    8,16,32,64   1 MB    32 MB     LL128     0         run ndv4_alltoall_hierarchical
ndv4     alltoall    8,16,32      32 MB   infinity  Simple    0         run ndv4_alltoall_hierarchical
ndv4     alltoall    64           32 MB   infinity  Simple    0         run ndv4_alltoall_three_step

Each line lists an algorithm registration and the conditions under which it is triggered. For example, the ndv4_alltoall_hierarchical algorithm will be used with NCCL's lower-latency LL128 protocol when:

  • the user has called Alltoall,
  • there are 8, 16, 32 or 64 Azure NDv4 machines, and
  • the data size is from 1 MB to 32 MB.
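
For instance, an msccl.init call along the following lines should match this registration (a sketch; the machine count and data size are illustrative values inside the registered range):

import msccl
msccl.init('ndv4', 8, (msccl.Collective.alltoall, ('2MB')))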

The parasailteam/msccl-presynth repository offers additional algorithms that have been pre-synthesized for fixed configurations. To enable them, install the package and import it before the call to msccl.init.
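
A rough sketch of enabling it (the install URL and Python module name are assumptions inferred from the repository name; check that repository's README for the authoritative instructions):

pip install git+https://github.com/parasailteam/msccl-presynth.git

import msccl_presynth  # assumed import name; importing it registers the extra algorithms
import msccl
msccl.init('ndv2', 2, (msccl.Collective.alltoall, ('1MB')))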

MSCCLang

MSCCLang is a high-level language for specifying collective communication algorithms in an intuitive chunk-oriented form. The language is available as a Python-integrated DSL.

The language is still under development and lacks comprehensive documentation. For now, please refer to the pre-print of our upcoming paper and the examples in examples/mscclang.
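
To give a flavor of the DSL, below is a rough sketch of a ring Allgather written in the style of the programs in examples/mscclang. Treat it as illustrative rather than authoritative: the exact constructor arguments and method names (e.g. copy) may differ slightly from the current examples.

from msccl.language import *
from msccl.topologies import fully_connected
from msccl.language.collectives import AllGather

def allgather_ring(size, instances=1):
    topology = fully_connected(size)
    # One chunk per rank, in-place: rank r's chunk starts at output index r.
    collective = AllGather(size, 1, True)
    with MSCCLProgram('allgather_ring', topology, collective, instances):
        # Each step forwards every chunk one hop around the ring.
        for step in range(size - 1):
            for index in range(size):
                rank = (index + step) % size
                next_rank = (index + step + 1) % size
                chunk(rank, Buffer.output, index).copy(next_rank, Buffer.output, index)
        XML()    # emit the algorithm as MSCCL runtime XML
        Check()  # verify that the program implements Allgather

allgather_ring(8)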

Synthesis

MSCCL started out as a synthesizer for collective algorithms, and general synthesis of collective algorithms is an ongoing research project. See this readme for using MSCCL as a synthesizer.
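
As a quick taste of the Python synthesis API, the following minimal sketch mirrors the scripts quoted in the issues below (dgx1() is assumed to be the DGX-1 topology constructor, analogous to the dgx_a100() used there):

from msccl.topologies import dgx1
from msccl.collectives import alltoall
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.serialization import MSCCLEncoder

topology = dgx1()
collective = alltoall(topology.num_nodes())
# Ask the synthesizer for an algorithm that finishes in 3 steps on this topology.
algo = solve_instance(topology, collective, Instance(steps=3), logging=True)
with open('alltoall.json', 'w') as f:
    f.write(MSCCLEncoder().encode(algo))
# The JSON can then be converted for the runtime with: msccl ncclize alltoall.json -o alltoall.xml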

Installation

Python Package Installation

To install, either clone this repo and run "pip install ." or run:

pip install git+https://github.com/microsoft/msccl-tools.git

Installing the MSCCL Python package also installs the msccl command line tool. To enable Bash completion for the msccl tool:

echo 'eval "$(register-python-argcomplete msccl)"' >> ~/.bashrc

Runtime Installation

Algorithms are executed by the Microsoft Collective Communication Library (MSCCL), which is API compatible with NCCL. See https://github.com/microsoft/msccl for instructions.

To use MSCCL with PyTorch, the built-in NCCL submodule has to be replaced with MSCCL's version. Additionally, to expose the new native Alltoall support that MSCCL adds, PyTorch's torch.distributed package can optionally be patched. The following commands perform these steps and install PyTorch with MSCCL:

git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout tags/v1.9.0 -b v1.9.0_msccl
perl -p -i -e  's/url = https:\/\/github\.com\/NVIDIA\/nccl/url = https:\/\/github\.com\/microsoft\/msccl/g' .gitmodules
git submodule sync third_party/nccl
git submodule update --init --recursive
git submodule update --init --recursive --remote third_party/nccl
git apply third_party/nccl/nccl/patches/nccl.cpp.patch
python setup.py install
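
After the build, a quick sanity check that Alltoall goes through MSCCL can look like the following (a hypothetical sketch, not part of this repo; it assumes one process per GPU started with PyTorch's distributed launcher, with msccl.init called before the process group is created as described in the Usage section):

import torch
import torch.distributed as dist
import msccl

# Request an MSCCL Alltoall plan before any NCCL communicator is created.
msccl.init('auto', 2, (msccl.Collective.alltoall, ('1MB')))

dist.init_process_group('nccl')
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank sends one equal slice to every other rank.
x = torch.full((world, 1 << 18), float(rank), device='cuda')
y = torch.empty_like(x)
dist.all_to_all_single(y, x)
print(rank, y[:, 0])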

Note on Azure NDv2

Azure NDv2 does not expose the true PCIe topology of the machines to the VM and, worse, does not assign PCIe devices consistently to the virtual paths inside the VM. Because MSCCL generates topology-aware algorithms, this device ordering must be made consistent. The msccl_ndv2_launcher.sh script can be used to fix this problem: it solves for the automorphisms from the local VM's NVLink topology to the reference topology and, based on the measured placement of the InfiniBand card, selects one of the 4 automorphisms such that GPU 0 is close to the NIC. A tool called inspector-topo needs to be available for this latter step.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Contributors

abhijangda, olsaarik, saeedmaleki

msccl-tools's Issues

Provide just Gather to Gather-Scatter Alltoall distributor

The alltoall-gather-scatter distributor takes in a Gather and a Scatter algorithm for a local topology and creates an Alltoall by stitching them together for a distributed topology made of copies of the local one.

Providing the Scatter is redundant, as it can be produced by reversing the Gather. This works whenever the topology is symmetric.
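
For reference, the current two-input flow looks like the commands used in the test suite (quoted from the CI failure below); the proposal is to derive the second input automatically from the first:

sccl solve instance DGX1 Gather --root 5 --steps 2 -o gather.json
sccl solve instance DGX1 Scatter --root 5 --steps 2 -o scatter.json
sccl distribute alltoall-gather-scatter gather.json scatter.json --copies 2 -o alltoall.json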

Pretty printout of registered algorithms

It would be convenient to be able to get a readable listing of the plans available to sccl.init, to help model developers tune their models to use size ranges supported by SCCL. Filtering by machine type and collective would be good. Both a command line tool and a library function would be nice.

Question about Multi-Machine Topology

msccl-tools can get the local topology by parsing the output of nvidia-smi topo -m. How do you get the topology across multiple machines? Does msccl-tools support that now?

Could you please offer me some help?

Flaky test with threaded Z3 usage

The following crash happens sometimes in the tests. Likely reason: the ncclize scratch remapping uses Z3 to solve mappings on a background thread, which may be interrupted by a timeout; the code includes an attempt to handle this case, but apparently it is not correct.

=================================== FAILURES ===================================
___________________ test_distribute_alltoall_scatter_gather ____________________
[gw1] linux -- Python 3.9.4 /opt/hostedtoolcache/Python/3.9.4/x64/bin/python

    def test_distribute_alltoall_scatter_gather():
        with in_tempdir():
            assert 0 == os.system('sccl solve instance DGX1 Gather --root 5 --steps 2 -o gather.json')
            assert 0 == os.system('sccl solve instance DGX1 Scatter --root 5 --steps 2 -o scatter.json')
            assert 0 == os.system('sccl distribute alltoall-gather-scatter gather.json scatter.json --copies 2 -o alltoall.json')
            assert os.path.exists('alltoall.json')
>           _check_ncclizes('alltoall.json')

tests/test_cli.py:99: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path = 'alltoall.json'

    def _check_ncclizes(path):
>       assert 0 == os.system(f'sccl ncclize {path} -o ncclized.sccl.xml')
E       AssertionError: assert 0 == 256
E        +  where 256 = <built-in function system>('sccl ncclize alltoall.json -o ncclized.sccl.xml')
E        +    where <built-in function system> = os.system

tests/test_cli.py:27: AssertionError
----------------------------- Captured stdout call -----------------------------
Solving instance steps=2... synthesized! (0.8s)
Wrote to gather.json
Solving instance steps=2... synthesized! (0.6s)
Wrote to scatter.json
Wrote to alltoall.json
Remapping scratch into input/output...
Optimizing scratch mapping on all GPUs: .....
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.4/x64/bin/sccl", line 33, in <module>
    sys.exit(load_entry_point('sccl', 'console_scripts', 'sccl')())
  File "/home/runner/work/sckl-core/sckl-core/sccl/__main__.py", line 32, in main
    if handler(args, args.command):
  File "/home/runner/work/sckl-core/sckl-core/sccl/cli/ncclize.py", line 27, in handle
    ncclized = ncclize(algo,
  File "/home/runner/work/sckl-core/sckl-core/sccl/ncclize.py", line 340, in ncclize
    _remap_scratch_into_input_output(liveness, gpus, logging)
  File "/home/runner/work/sckl-core/sckl-core/sccl/ncclize.py", line 145, in _remap_scratch_into_input_output
    s.add(remap(scratch_idx) != remap(other_idx))
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 967, in __ne__
    a, b = _coerce_exprs(self, other)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 1113, in _coerce_exprs
    a = s.cast(a)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 2181, in cast
    if self.eq(val_s):
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 381, in eq
    return Z3_is_eq_ast(self.ctx_ref(), self.as_ast(), other.as_ast())
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 514, in as_ast
    return Z3_sort_to_ast(self.ctx_ref(), self.ast)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3core.py", line 2558, in Z3_sort_to_ast
    _elems.Check(a0)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3core.py", line 1414, in Check
    raise self.Exception(self.get_error_message(ctx, err))
z3.z3types.Z3Exception: b'there is no current model'

Installation instruction might be incorrect

The readme says to run the following for installation

pip install git+https://github.com/microsoft/msccl.git

However, this is pointing to the other msccl repository which is not a python package.

Is it supposed to be the following?

pip install git+https://github.com/microsoft/msccl-tools.git

Problem in generating xml for the allreduce

Hi, first of all thanks for the quick response. I found that the examples and their generated .xml algorithms do have a significant impact on system performance. I moved one step further to test custom auto-generated algorithms. The following example script works well for allgather:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allgather(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (0.4s)
Wrote to test.xml

But the problem comes when I change allgather to allreduce as below:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce,scatter,gather,broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allreduce(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (1.4s)
Traceback (most recent call last):
  File "/miniconda3/envs/py38/bin/msccl", line 33, in <module>
    sys.exit(load_entry_point('msccl==2.3.0', 'console_scripts', 'msccl')())
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/__main__.py", line 34, in main
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/cli/ncclize.py", line 29, in handle
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/ncclize.py", line 548, in ncclize
RuntimeError: Encountered receive and send on the same buffer index on step 1 (gpu=5, buf=i, off=0)

Can you please help check and resolve this issue so that I can use the generated .xml? Thanks in advance.

How to generate the (C, S, R)=(24, 8, 8) alltoall algorithm for DGX-1 within an affordable time limit?

I am trying to synthesize the (C, S, R)=(24, 8, 8) algorithm for alltoall on my machine, but the synthesis does not finish even after a full day, while the SCCL paper shows it can be synthesized in 133.7s.
My script is:

TOPO="DGX1"
COLL="Alltoall"
STEPS="8"
ROUNDS=$STEPS
CHUNKS="3"

msccl solve instance ${TOPO} ${COLL} \
    --steps ${STEPS} \
    --rounds ${ROUNDS} \
    --chunks ${CHUNKS}

And my cpu config is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Model name:                      Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz

Performance results of Alltoall in the SCCL paper cannot be reproduced on msccl-rt

I have generated a (C,S,R)=(8,3,3) alltoall algorithm using the SCCL synthesizer on the DGX-1 topology. I used msccl ncclize <file.json> to generate the xml file for the msccl runtime, and then ran this scheme with the master branch of msccl as msccl-rt. On top of msccl, a fixed version of nccl-tests for alltoall is used.
My script looks like:

...
NCCL_DEBUG="INFO" \
NCCL_DEBUG_SUBSYS="INIT,ENV" \
MSCCL_XML_FILES="schemes/Alltoall.n8-DGX1-steps3.msccl.xml" \
NCCL_ALGO=MSCCL,RING,TREE \
LD_LIBRARY_PATH=${NCCL_BUILD_PATH}/lib:$LD_LIBRARY_PATH \
srun -N 1 --ntasks-per-node 8 \
nccl-tests/build_msccl/alltoall_perf \
   -b32M -e256M -f 2 -g 1

And the results are below:

[screenshot: alltoall_perf busBW results]

We only care about the out-of-place results, because nccl-tests doesn't support in-place mode for alltoall. We can see that there is a big gap in busBW between size=64MB and 128MB, which is extremely strange.

After reading the source code of msccl, I understand that this is because the ncclAlltoAll function in all_to_all.cc of msccl uses the msccl algorithm for sizes < 128MB and otherwise falls back to the naive alltoall implementation.

Thus the question is: in the SCCL paper, the performance of (8,3,3) is higher than the baseline (i.e. the naive implementation) for all message sizes. Why, in my experiments, does the (8,3,3) algorithm show even worse performance than the baseline?

[screenshot: comparison against the baseline]
In addition, in a theoretical analysis under the $\alpha$-$\beta$ point-to-point communication model, the (8,3,3) algorithm should be better than the baseline, at least for large message sizes.

Is this due to a poor CUDA kernel implementation of the MSCCL algorithm?

allreduce on generic.ring() with num_node >= 5 cannot be solved.

Problem

The following code cannot solve a least-step algo of allreduce on the ring topology:

topo = ring(5)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

The program seems to iterate forever.

Description

I found that solve_least_steps() works well with the topology generic.ring(4) and the collective allreduce(), when using the following code:

# only to change the ring(5) down to ring(4)
topo = ring(4)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

Terminal log is as below:

Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→3, 0:1→2, 0:2→1, 0:3→0
(step 2) 0:0→1, 0:1→0, 0:2→3, 0:3→2

However, if I run with ring(5) using the code snippet provided in the Problem section, no valid algorithm is synthesized in an acceptable time; the program keeps running, with the terminal logging as below:

Algorithms need at least 2 steps.
Solving instance steps=2... unsatisfiable. (0.1s)
Solving instance steps=3... unsatisfiable. (0.1s)
Solving instance steps=4... unsatisfiable. (0.1s)
...
Solving instance steps=100... unsatisfiable. (0.5s)
Solving instance steps=101... unsatisfiable. (0.5s)
Solving instance steps=102... unsatisfiable. (0.5s)
Solving instance steps=103... unsatisfiable. (0.5s)
Solving instance steps=104... unsatisfiable. (0.5s)
Solving instance steps=105... unsatisfiable. (0.5s)
Solving instance steps=106... unsatisfiable. (0.5s)
...

To the best of my knowledge, an AllReduce operator always has a reduce-broadcast implementation (maybe not the optimal one). And the code below works properly:

topo = ring(5)
coll = reduce(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
coll = broadcast(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

with output:

Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:2→1, 0:3→4
(step 2) 0:1→0, 0:4→0
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→1, 0:0→4
(step 2) 0:1→2, 0:4→3

indicating that there is at least a 4-step algorithm for allreduce on this ring(5) topology.

Could you please tell me how to synthesize a least-steps allreduce algorithm for a ring topology with 5 or more nodes?

Rounds vs extra rounds for solve least-steps

Currently solve least-steps accepts --rounds but fails when it is passed in. Possible desired behaviors:

  • Pass --rounds and iterate steps up until that bound is reached.
  • Pass the older style --extra-rounds, with no limit to how many steps can be considered.

The least-steps strategy is meant for just easily finding the latency-optimal algorithm, so it isn't very naturally compatible with either --rounds or --chunks, but having some reasonable behavior here would be nice. Removing both --rounds and --chunks is one option to consider.

Questioning the optimality and generation strategy of Allreduce

Hi, I have questions about Allreduce synthesis. In the SCCL paper, it is mentioned that the Allreduce algorithm is generated from the Allgather algorithm. I am curious how much this approach guarantees the optimality of the Allreduce algorithm.

In particular, is it possible that obtaining Allreduce directly from a constraint problem, as is done for Allgather, could yield a more efficient or optimal algorithm than the current method?

Thanks.

Synthesize on multiple-machine topology

Hi. I'm trying to synthesize some Pareto-optimal allreduce algorithms on several DGX-A100 machines, which are connected via a single switch. I didn't find this among the pre-defined topologies, and I also don't know how to write a proper topology file. Could you please help me with this?

For example, suppose there are three DGX-A100 machines, each with 8 A100 GPUs and 4 200Gbps NICs, and all 12 NICs are connected to one switch. What is the best practice for synthesizing allreduce algorithms on this topology?

Thanks.
