microsoft / msccl-tools
Synthesizer for optimal collective communication algorithms
License: MIT License
It would be convenient to be able to get a readable listing of the plans available to sccl.init, to help model developers tune their models to use size ranges supported by SCCL. Filtering by machine type and collective would be useful. Both a command-line tool and a library function would be nice.
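Since this is a feature request, no such function exists yet; the following is only a sketch of what the requested library function could look like. The `Plan` record, the `PLANS` registry, and `list_plans` are all hypothetical stand-ins for whatever data sccl.init actually consults.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    collective: str
    machine_type: str
    min_size: int  # inclusive lower bound of the supported byte range
    max_size: int  # inclusive upper bound of the supported byte range

# Hypothetical registry standing in for the plans sccl.init knows about.
PLANS = [
    Plan("alltoall", "ndv2", 1 << 20, 1 << 27),
    Plan("allreduce", "ndv2", 1 << 10, 1 << 25),
    Plan("alltoall", "ndv4", 1 << 22, 1 << 30),
]

def list_plans(machine_type=None, collective=None):
    """Return plans matching the optional machine-type and collective filters."""
    return [p for p in PLANS
            if (machine_type is None or p.machine_type == machine_type)
            and (collective is None or p.collective == collective)]

# A CLI could simply pretty-print the same filtered listing.
for p in list_plans(machine_type="ndv2", collective="alltoall"):
    print(f"{p.collective} on {p.machine_type}: {p.min_size}-{p.max_size} bytes")
```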
I am trying to synthesize a (C, S, R) = (24, 8, 8) algorithm for alltoall on my machine, but I cannot synthesize it successfully even after a full day, while the SCCL paper shows it can be synthesized in 133.7 s.
My script is:
TOPO="DGX1"
COLL="Alltoall"
STEPS="8"
ROUNDS=$STEPS
CHUNKS="3"
msccl solve instance ${TOPO} ${COLL} \
    --steps ${STEPS} \
    --rounds ${ROUNDS} \
    --chunks ${CHUNKS}
And my CPU config is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Model name: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Hi, first of all thanks for the quick response. I found that the examples and their generated .xml algorithms do have a significant impact on system performance. I moved one step further to test custom auto-generated algorithms. The following example script works well for allgather:
from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce, scatter, gather, broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allgather(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")
Output
[[0, 12, 12, 12, 12, 12, 12, 12],
[12, 0, 12, 12, 12, 12, 12, 12],
[12, 12, 0, 12, 12, 12, 12, 12],
[12, 12, 12, 0, 12, 12, 12, 12],
[12, 12, 12, 12, 0, 12, 12, 12],
[12, 12, 12, 12, 12, 0, 12, 12],
[12, 12, 12, 12, 12, 12, 0, 12],
[12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (0.4s)
Wrote to test.xml
But the problem comes if I change allgather to allreduce, as below:
from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce, scatter, gather, broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allreduce(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")
Output
[[0, 12, 12, 12, 12, 12, 12, 12],
[12, 0, 12, 12, 12, 12, 12, 12],
[12, 12, 0, 12, 12, 12, 12, 12],
[12, 12, 12, 0, 12, 12, 12, 12],
[12, 12, 12, 12, 0, 12, 12, 12],
[12, 12, 12, 12, 12, 0, 12, 12],
[12, 12, 12, 12, 12, 12, 0, 12],
[12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (1.4s)
Traceback (most recent call last):
File "/miniconda3/envs/py38/bin/msccl", line 33, in <module>
sys.exit(load_entry_point('msccl==2.3.0', 'console_scripts', 'msccl')())
File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/__main__.py", line 34, in main
File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/cli/ncclize.py", line 29, in handle
File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/ncclize.py", line 548, in ncclize
RuntimeError: Encountered receive and send on the same buffer index on step 1 (gpu=5, buf=i, off=0)
Can you please help check and resolve this issue so that I can use the generated .xml? Thanks in advance.
msccl-tools can get the local topology by parsing the output of nvidia-smi topo -m. How do you get the topology of multiple machines? Does msccl-tools support that now? Could you please offer me some help?
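For the single-machine side of this question, parsing the `nvidia-smi topo -m` matrix into per-pair link labels can be sketched in plain Python. The sample text below is a heavily trimmed, hypothetical 3-GPU matrix; real output has more columns (CPU/NUMA affinity, NIC rows) that this sketch ignores, and it says nothing about multi-machine topologies, which this tool does not cover.

```python
# Trimmed, hypothetical sample of `nvidia-smi topo -m` output:
# diagonal is "X" (self), off-diagonal entries are link types (NV#, SYS, ...).
SAMPLE = """\
\tGPU0\tGPU1\tGPU2
GPU0\t X \tNV1\tSYS
GPU1\tNV1\t X \tNV2
GPU2\tSYS\tNV2\t X
"""

def parse_topo_matrix(text):
    """Map each ordered (src, dst) GPU pair to its link label, skipping self-links."""
    lines = [l for l in text.splitlines() if l.strip()]
    names = lines[0].split()          # header row: GPU0, GPU1, ...
    links = {}
    for row in lines[1:]:
        cells = row.split()
        src = cells[0]
        for dst, label in zip(names, cells[1:len(names) + 1]):
            if label != "X":
                links[(src, dst)] = label
    return links

links = parse_topo_matrix(SAMPLE)
print(links[("GPU0", "GPU1")])  # NV1
```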
The following code cannot solve a least-steps algorithm for allreduce on the ring topology:
from msccl.topologies import ring
from msccl.collectives import allreduce
from msccl.strategies import solve_least_steps

topo = ring(5)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
The program seems to iterate forever.
I found that solve_least_steps() works well with topology generic.ring(4) and collective allreduce() when I use the following code:
# only to change the ring(5) down to ring(4)
topo = ring(4)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
Terminal log is as below:
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→3, 0:1→2, 0:2→1, 0:3→0
(step 2) 0:0→1, 0:1→0, 0:2→3, 0:3→2
However, if I run with ring(5) using the code snippet provided in the Problem section, no valid algorithm can be synthesized in acceptable time; the program keeps running, with the terminal logging as below:
Algorithms need at least 2 steps.
Solving instance steps=2... unsatisfiable. (0.1s)
Solving instance steps=3... unsatisfiable. (0.1s)
Solving instance steps=4... unsatisfiable. (0.1s)
...
Solving instance steps=100... unsatisfiable. (0.5s)
Solving instance steps=101... unsatisfiable. (0.5s)
Solving instance steps=102... unsatisfiable. (0.5s)
Solving instance steps=103... unsatisfiable. (0.5s)
Solving instance steps=104... unsatisfiable. (0.5s)
Solving instance steps=105... unsatisfiable. (0.5s)
Solving instance steps=106... unsatisfiable. (0.5s)
...
To the best of my knowledge, an AllReduce operator always has a reduce-then-broadcast implementation (maybe not the optimal one). And the code below works properly:
topo = ring(5)
coll = reduce(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
coll = broadcast(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
with output:
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:2→1, 0:3→4
(step 2) 0:1→0, 0:4→0
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→1, 0:0→4
(step 2) 0:1→2, 0:4→3
indicating that there is at least a 4-step algorithm for allreduce on this ring(5) topology.
Could you please kindly tell me how to synthesize a least-steps allreduce algorithm for a ring topology with 5 or more nodes?
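As a sanity check on the claim above, the two 2-step schedules printed by the solver can be composed and simulated in plain Python (no msccl needed), confirming that a 4-step allreduce does exist on ring(5). The send tuples below are transcribed directly from the reduce and broadcast outputs.

```python
# Simulate the solver's reduce + broadcast schedules on ring(5),
# confirming that their composition is a valid 4-step allreduce.
values = {i: float(i + 1) for i in range(5)}  # each rank's input value

acc = dict(values)  # running partial sums during the reduce phase
reduce_steps = [[(2, 1), (3, 4)], [(1, 0), (4, 0)]]  # (src, dst) per step
for step in reduce_steps:
    # Snapshot send values at step start so sends within a step don't interfere.
    sends = [(src, dst, acc[src]) for src, dst in step]
    for src, dst, v in sends:
        acc[dst] += v
assert acc[0] == sum(values.values())  # root 0 now holds the full sum

result = {0: acc[0]}
broadcast_steps = [[(0, 1), (0, 4)], [(1, 2), (4, 3)]]
for step in broadcast_steps:
    for src, dst in step:
        result[dst] = result[src]
assert all(result[r] == sum(values.values()) for r in range(5))
print("4-step allreduce via reduce+broadcast verified")
```

That the composition works while the direct search reports unsatisfiable at every step count suggests the issue lies in how the allreduce collective is encoded for the solver, not in the topology itself.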
Hi. I'm trying to synthesize some pareto-optimal allreduce algorithms on several DGX-A100 machines, which are connected via a single switch. I didn't find this among the pre-defined topologies, and I also don't know how to write a proper topology file. Could you please help me with this?
For example, there are three DGX-A100 machines. Each of them has 8 A100 GPUs and 4 200Gbps NICs. The total 12 NICs are connected to one switch. What's the best practice to synthesize allreduce algorithms on this topology?
Thanks.
Is there any documentation available for how to specify a custom topology for SCCL (--topology-file)?
Thanks
I used SCCL to synthesize a steps=3 alltoall algorithm on the DGX-1 topology, named 'Alltoall.n8-DGX1-steps3.msccl.json'. But how can I actually run it with NCCL? Can you give me an example or a document about how to run a synthesized scheme? Thanks a lot!
The following crash sometimes happens in the tests. Likely reason: the ncclize scratch remapping uses Z3 to solve mappings on a background thread, which may get interrupted by a timeout. The code includes an attempt to handle this, but apparently it is not correct.
=================================== FAILURES ===================================
___________________ test_distribute_alltoall_scatter_gather ____________________
[gw1] linux -- Python 3.9.4 /opt/hostedtoolcache/Python/3.9.4/x64/bin/python
def test_distribute_alltoall_scatter_gather():
    with in_tempdir():
        assert 0 == os.system('sccl solve instance DGX1 Gather --root 5 --steps 2 -o gather.json')
        assert 0 == os.system('sccl solve instance DGX1 Scatter --root 5 --steps 2 -o scatter.json')
        assert 0 == os.system('sccl distribute alltoall-gather-scatter gather.json scatter.json --copies 2 -o alltoall.json')
        assert os.path.exists('alltoall.json')
>       _check_ncclizes('alltoall.json')
tests/test_cli.py:99:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
path = 'alltoall.json'
def _check_ncclizes(path):
>       assert 0 == os.system(f'sccl ncclize {path} -o ncclized.sccl.xml')
E       AssertionError: assert 0 == 256
E        +  where 256 = <built-in function system>('sccl ncclize alltoall.json -o ncclized.sccl.xml')
E        +  where <built-in function system> = os.system
tests/test_cli.py:27: AssertionError
----------------------------- Captured stdout call -----------------------------
Solving instance steps=2... synthesized! (0.8s)
Wrote to gather.json
Solving instance steps=2... synthesized! (0.6s)
Wrote to scatter.json
Wrote to alltoall.json
Remapping scratch into input/output...
Optimizing scratch mapping on all GPUs: .....
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.4/x64/bin/sccl", line 33, in <module>
sys.exit(load_entry_point('sccl', 'console_scripts', 'sccl')())
File "/home/runner/work/sckl-core/sckl-core/sccl/__main__.py", line 32, in main
if handler(args, args.command):
File "/home/runner/work/sckl-core/sckl-core/sccl/cli/ncclize.py", line 27, in handle
ncclized = ncclize(algo,
File "/home/runner/work/sckl-core/sckl-core/sccl/ncclize.py", line 340, in ncclize
_remap_scratch_into_input_output(liveness, gpus, logging)
File "/home/runner/work/sckl-core/sckl-core/sccl/ncclize.py", line 145, in _remap_scratch_into_input_output
s.add(remap(scratch_idx) != remap(other_idx))
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 967, in __ne__
a, b = _coerce_exprs(self, other)
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 1113, in _coerce_exprs
a = s.cast(a)
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 2181, in cast
if self.eq(val_s):
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 381, in eq
return Z3_is_eq_ast(self.ctx_ref(), self.as_ast(), other.as_ast())
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 514, in as_ast
return Z3_sort_to_ast(self.ctx_ref(), self.ast)
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3core.py", line 2558, in Z3_sort_to_ast
_elems.Check(a0)
File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3core.py", line 1414, in Check
raise self.Exception(self.get_error_message(ctx, err))
z3.z3types.Z3Exception: b'there is no current model'
As in the title: when generating the XML file, some files have the 'numranks' and 'numnodes' parameters, but what is 'instances'?
I have generated a (C,S,R)=(8,3,3) alltoall algorithm using the SCCL synthesizer on the DGX-1 topology, and I used msccl ncclize <file.json> to generate an XML file for the msccl runtime. Then I used the master branch of msccl as msccl-rt to run this scheme. On top of msccl, a fixed version of nccl-tests for alltoall is used.
My script is just like:
...
NCCL_DEBUG="INFO" \
NCCL_DEBUG_SUBSYS="INIT,ENV" \
MSCCL_XML_FILES="schemes/Alltoall.n8-DGX1-steps3.msccl.xml" \
NCCL_ALGO=MSCCL,RING,TREE \
LD_LIBRARY_PATH=${NCCL_BUILD_PATH}/lib:$LD_LIBRARY_PATH \
srun -N 1 --ntasks-per-node 8 \
nccl-tests/build_msccl/alltoall_perf \
-b32M -e256M -f 2 -g 1
And the results are below:
We only care about the out-of-place results, because nccl-tests doesn't support in-place mode for alltoall. We can see that there is a big gap in busBW between size=64MB and 128MB, which is extremely strange.
After reading the source code of msccl, I see that this is because the ncclAlltoAll function in all_to_all.cc of msccl uses the MSCCL algorithm for sizes < 128MB, and otherwise falls back to the naive implementation of alltoall.
Thus the question is: in the SCCL paper, the performance of (8,3,3) is higher than the baseline (the naive implementation) at all message sizes. Why, in my experiments, does the (8,3,3) algorithm show worse performance than the baseline?
In addition, In theoretical analysis under
Thus, is it due to a poor implementation of the CUDA kernel for the MSCCL algorithm?
The alltoall-gather-scatter distributor takes a Gather and a Scatter algorithm for a local topology and creates an Alltoall by stitching them together for a distributed topology with copies of the local one.
Providing the Scatter is redundant, as the Scatter can be produced by reversing the Gather. This works when the topology is symmetric.
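The reversal can be sketched in plain Python. The schedule representation below (a list of steps, each a list of (src, dst, chunk) sends) and the toy gather are hypothetical, not msccl-tools' actual data model; the point is only that time-reversing a gather and swapping send directions yields a valid scatter on a symmetric topology.

```python
def reverse_algorithm(gather_steps):
    """Time-reverse a gather schedule and swap each send's direction to get a
    scatter. Valid when the topology is symmetric (every link has a reverse)."""
    return [[(dst, src, chunk) for src, dst, chunk in step]
            for step in reversed(gather_steps)]

# Toy gather to root 0 on a 3-node line 0-1-2 (hypothetical schedule).
gather = [
    [(2, 1, 2)],             # step 1: rank 2 sends chunk 2 to rank 1
    [(1, 0, 1), (1, 0, 2)],  # step 2: rank 1 forwards chunks 1 and 2 to root
]
scatter = reverse_algorithm(gather)

# Simulate the derived scatter: root 0 starts with all chunks.
have = {0: {0, 1, 2}, 1: set(), 2: set()}
for step in scatter:
    for src, dst, chunk in step:
        assert chunk in have[src]  # sender must already hold the chunk
        have[dst].add(chunk)
assert all(r in have[r] for r in range(3))  # each rank received its own chunk
```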
Currently solve least-steps takes --rounds and fails when it is passed in. Possible desired behaviors:
- Respect --rounds and iterate steps up until that bound is reached.
- Interpret it as --extra-rounds, with no limit to how many steps can be considered.
The least-steps strategy is meant for easily finding the latency-optimal algorithm, so it isn't very naturally compatible with either --rounds or --chunks, but having some reasonable behavior here would be nice. Removing both --rounds and --chunks is one option to consider.
Hi, I have questions about Allreduce synthesizing. In the SCCL paper, it is mentioned that the Allreduce algorithm is generated from the Allgather algorithm. I am curious about how much this approach guarantees the optimality of the Allreduce algorithm.
In particular, is there a possibility that obtaining Allreduce directly from a constraint problem, as is done for Allgather, could yield a more efficient or optimal algorithm than the current method?
Thanks.
The readme says to run the following for installation:
pip install git+https://github.com/microsoft/msccl.git
However, this points to the other msccl repository, which is not a Python package.
Is it supposed to be the following?
pip install git+https://github.com/microsoft/msccl-tools.git