
msccl-tools's Issues

Pretty printout of registered algorithms

It would be convenient to be able to get a readable listing of the plans available to sccl.init, to help model developers tune their models to use size ranges supported by SCCL. Filtering by machine type and collective would be good. Both a command-line tool and a library function would be nice.
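As a rough illustration of what such a listing could look like: none of these names (`Plan`, `format_plans`, the registry list) exist in msccl today; they are purely hypothetical placeholders for the requested feature.

```python
from dataclasses import dataclass

# Hypothetical plan record - msccl's real plan registry may differ.
@dataclass
class Plan:
    name: str
    collective: str
    machine_type: str
    min_size: int  # bytes, inclusive
    max_size: int  # bytes, inclusive

def format_plans(plans, collective=None, machine_type=None):
    """Return one readable line per registered plan, optionally
    filtered by collective and/or machine type."""
    lines = []
    for p in plans:
        if collective and p.collective != collective:
            continue
        if machine_type and p.machine_type != machine_type:
            continue
        lines.append(f"{p.name}: {p.collective} on {p.machine_type}, "
                     f"sizes {p.min_size}-{p.max_size} B")
    return lines

# Illustrative entries only.
plans = [
    Plan("ring-allreduce", "allreduce", "ndv2", 0, 2**20),
    Plan("hierarchical-alltoall", "alltoall", "ndv4", 2**20, 2**30),
]
print("\n".join(format_plans(plans, collective="allreduce")))
```

A command-line tool could then be a thin wrapper that parses `--collective` and `--machine-type` flags and calls the same function.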

How to generate an alltoall algorithm with (C, S, R) = (24, 8, 8) for DGX-1 within an affordable time limit?

I am trying to synthesize the (C, S, R) = (24, 8, 8) algorithm for alltoall on my machine, but synthesis does not finish even after a full day, while the SCCL paper reports that it can be synthesized in 133.7 s.
My script is

TOPO="DGX1"
COLL="Alltoall"
STEPS="8"
ROUNDS=$STEPS
CHUNKS="3"

msccl solve instance ${TOPO} ${COLL} \
    --steps ${STEPS} \
    --rounds ${ROUNDS} \
    --chunks ${CHUNKS}

And my CPU configuration is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Model name:                      Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz

Problem in generating xml for the allreduce

Hi, first of all, thanks for the quick response. I found that the example algorithms and their generated .xml files do have a significant impact on system performance, so I moved one step further to test custom auto-generated algorithms. The following example script works well for allgather:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce, scatter, gather, broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allgather(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (0.4s)
Wrote to test.xml

But the problem comes if I change allgather to allreduce as below:

from msccl.topologies import dgx_a100
from msccl.collectives import allgather, alltoall, reduce_scatter, allreduce
from msccl.collectives import reduce, scatter, gather, broadcast
from pprint import pprint
from msccl.strategies import solve_instance
from msccl.instance import Instance
from msccl.language import *
from msccl.topologies import *
from msccl.serialization import MSCCLEncoder
import os

topology = dgx_a100()
pprint(topology.links)
collective = allreduce(topology.num_nodes())
algo = solve_instance(topology, collective, Instance(steps=4), logging=True)
jsonfile = MSCCLEncoder().encode(algo)
with open("data.json", "w") as text_file:
    text_file.write(jsonfile)
os.system("msccl ncclize -f data.json -o test.xml")

Output

[[0, 12, 12, 12, 12, 12, 12, 12],
 [12, 0, 12, 12, 12, 12, 12, 12],
 [12, 12, 0, 12, 12, 12, 12, 12],
 [12, 12, 12, 0, 12, 12, 12, 12],
 [12, 12, 12, 12, 0, 12, 12, 12],
 [12, 12, 12, 12, 12, 0, 12, 12],
 [12, 12, 12, 12, 12, 12, 0, 12],
 [12, 12, 12, 12, 12, 12, 12, 0]]
Solving instance steps=4... synthesized! (1.4s)
Traceback (most recent call last):
  File "/miniconda3/envs/py38/bin/msccl", line 33, in <module>
    sys.exit(load_entry_point('msccl==2.3.0', 'console_scripts', 'msccl')())
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/__main__.py", line 34, in main
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/cli/ncclize.py", line 29, in handle
  File "/miniconda3/envs/py38/lib/python3.8/site-packages/msccl-2.3.0-py3.8.egg/msccl/ncclize.py", line 548, in ncclize
RuntimeError: Encountered receive and send on the same buffer index on step 1 (gpu=5, buf=i, off=0)

Could you please help check and resolve this issue so that I can use the generated .xml? Thanks in advance.

Question about Multi-Machine Topology

msccl-tools can obtain the local topology by parsing the output of nvidia-smi topo -m. How do you obtain the topology of multiple machines? Does msccl-tools support this now?

Could you please offer me some help?

allreduce on generic.ring() with num_nodes >= 5 cannot be solved.

Problem

The following code fails to find a least-steps allreduce algorithm on the ring topology:

from msccl.topologies import ring
from msccl.collectives import allreduce
from msccl.strategies import solve_least_steps

topo = ring(5)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

The program seems to iterate forever.

Description

I found that solve_least_steps() works well with the topology generic.ring(4) and the collective allreduce(), using the following code:

# only change ring(5) down to ring(4)
topo = ring(4)
coll = allreduce(topo.num_nodes())
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

Terminal log is as below:

Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→3, 0:1→2, 0:2→1, 0:3→0
(step 2) 0:0→1, 0:1→0, 0:2→3, 0:3→2

However, if I run ring(5) with the code snippet from the Problem section, no valid algorithm is synthesized in acceptable time; the program keeps running, with the terminal logging as below:

Algorithms need at least 2 steps.
Solving instance steps=2... unsatisfiable. (0.1s)
Solving instance steps=3... unsatisfiable. (0.1s)
Solving instance steps=4... unsatisfiable. (0.1s)
...
Solving instance steps=100... unsatisfiable. (0.5s)
Solving instance steps=101... unsatisfiable. (0.5s)
Solving instance steps=102... unsatisfiable. (0.5s)
Solving instance steps=103... unsatisfiable. (0.5s)
Solving instance steps=104... unsatisfiable. (0.5s)
Solving instance steps=105... unsatisfiable. (0.5s)
Solving instance steps=106... unsatisfiable. (0.5s)
...

To the best of my knowledge, an AllReduce operator always has a reduce-then-broadcast implementation (possibly not an optimal one), and the code below works properly:

topo = ring(5)
coll = reduce(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)
coll = broadcast(topo.num_nodes(), 0)
algo = solve_least_steps(topo, coll, logging=True)
print(algo)

with output:

Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:2→1, 0:3→4
(step 2) 0:1→0, 0:4→0
Algorithms need at least 2 steps.
Solving instance steps=2... synthesized! (0.0s)
(step 1) 0:0→1, 0:0→4
(step 2) 0:1→2, 0:4→3

indicating that there is at least a 4-step algorithm (a 2-step reduce followed by a 2-step broadcast) for allreduce on this ring(5) topology.

Could you please tell me how to synthesize a least-steps allreduce algorithm for ring topologies with 5 or more nodes?

Synthesize on multiple-machine topology

Hi. I'm trying to synthesize some Pareto-optimal allreduce algorithms for several DGX-A100 machines connected via a single switch. I didn't find this among the pre-defined topologies, and I don't know how to write a proper topology file. Could you please help me with this?

For example, there are three DGX-A100 machines. Each of them has 8 A100 GPUs and 4 200Gbps NICs. The total 12 NICs are connected to one switch. What's the best practice to synthesize allreduce algorithms on this topology?
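If no pre-defined topology fits, one option is to build the link-bandwidth matrix by hand. The sketch below constructs a 24-GPU matrix for the three-node setup described above; the bandwidth values (12 intra-node, 1 inter-node) are placeholders, not measured figures, and the exact form msccl-tools expects for a custom Topology (including how it models the switch) should be checked against the library source.

```python
# Sketch: link matrix for 3 nodes x 8 GPUs, fully connected across
# nodes through a shared switch. Placeholder relative bandwidths:
# 12 between GPUs in the same node (NVLink), 1 between nodes
# (the shared 4x200Gbps NICs). Real msccl-tools topologies can also
# model switches explicitly; this only captures point-to-point links.
num_machines = 3
gpus_per_node = 8
n = num_machines * gpus_per_node
intra_bw, inter_bw = 12, 1

links = [[0] * n for _ in range(n)]
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue  # no self-links
        same_node = src // gpus_per_node == dst // gpus_per_node
        links[dst][src] = intra_bw if same_node else inter_bw
```

It is also worth checking whether your version of msccl-tools already ships a distributed-topology helper (e.g. something like a fully-connected distributed wrapper around a local topology) before hand-writing the matrix.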

Thanks.

Flaky test with threaded Z3 usage

The following crash sometimes happens in the tests. Likely reason: the ncclize scratch remapping uses Z3 on a background thread to solve mappings; that thread may be interrupted by a timeout, and the logic for handling that case, although present, appears to be broken.

=================================== FAILURES ===================================
___________________ test_distribute_alltoall_scatter_gather ____________________
[gw1] linux -- Python 3.9.4 /opt/hostedtoolcache/Python/3.9.4/x64/bin/python

    def test_distribute_alltoall_scatter_gather():
        with in_tempdir():
            assert 0 == os.system('sccl solve instance DGX1 Gather --root 5 --steps 2 -o gather.json')
            assert 0 == os.system('sccl solve instance DGX1 Scatter --root 5 --steps 2 -o scatter.json')
            assert 0 == os.system('sccl distribute alltoall-gather-scatter gather.json scatter.json --copies 2 -o alltoall.json')
            assert os.path.exists('alltoall.json')
>           _check_ncclizes('alltoall.json')

tests/test_cli.py:99: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path = 'alltoall.json'

    def _check_ncclizes(path):
>       assert 0 == os.system(f'sccl ncclize {path} -o ncclized.sccl.xml')
E       AssertionError: assert 0 == 256
E        +  where 256 = <built-in function system>('sccl ncclize alltoall.json -o ncclized.sccl.xml')
E        +    where <built-in function system> = os.system

tests/test_cli.py:27: AssertionError
----------------------------- Captured stdout call -----------------------------
Solving instance steps=2... synthesized! (0.8s)
Wrote to gather.json
Solving instance steps=2... synthesized! (0.6s)
Wrote to scatter.json
Wrote to alltoall.json
Remapping scratch into input/output...
Optimizing scratch mapping on all GPUs: .....
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.4/x64/bin/sccl", line 33, in <module>
    sys.exit(load_entry_point('sccl', 'console_scripts', 'sccl')())
  File "/home/runner/work/sckl-core/sckl-core/sccl/__main__.py", line 32, in main
    if handler(args, args.command):
  File "/home/runner/work/sckl-core/sckl-core/sccl/cli/ncclize.py", line 27, in handle
    ncclized = ncclize(algo,
  File "/home/runner/work/sckl-core/sckl-core/sccl/ncclize.py", line 340, in ncclize
    _remap_scratch_into_input_output(liveness, gpus, logging)
  File "/home/runner/work/sckl-core/sckl-core/sccl/ncclize.py", line 145, in _remap_scratch_into_input_output
    s.add(remap(scratch_idx) != remap(other_idx))
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 967, in __ne__
    a, b = _coerce_exprs(self, other)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 1113, in _coerce_exprs
    a = s.cast(a)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 2181, in cast
    if self.eq(val_s):
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 381, in eq
    return Z3_is_eq_ast(self.ctx_ref(), self.as_ast(), other.as_ast())
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3.py", line 514, in as_ast
    return Z3_sort_to_ast(self.ctx_ref(), self.ast)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3core.py", line 2558, in Z3_sort_to_ast
    _elems.Check(a0)
  File "/opt/hostedtoolcache/Python/3.9.4/x64/lib/python3.9/site-packages/z3/z3core.py", line 1414, in Check
    raise self.Exception(self.get_error_message(ctx, err))
z3.z3types.Z3Exception: b'there is no current model'

Performance results of Alltoall in SCCL paper cannot be reproduced upon msccl-rt

I have generated the (C,S,R)=(8,3,3) alltoall algorithm with the SCCL synthesizer on the DGX-1 topology, and used msccl ncclize <file.json> to generate an XML file for the msccl runtime. Then I ran this scheme with the master branch of msccl as msccl-rt; on top of msccl, a fixed version of nccl-tests for alltoall is used.
My script is just like:

...
NCCL_DEBUG="INFO" \
NCCL_DEBUG_SUBSYS="INIT,ENV" \
MSCCL_XML_FILES="schemes/Alltoall.n8-DGX1-steps3.msccl.xml" \
NCCL_ALGO=MSCCL,RING,TREE \
LD_LIBRARY_PATH=${NCCL_BUILD_PATH}/lib:$LD_LIBRARY_PATH \
srun -N 1 --ntasks-per-node 8 \
nccl-tests/build_msccl/alltoall_perf \
   -b32M -e256M -f 2 -g 1

And the results are below:
[screenshot: alltoall_perf busBW results]
We only care about the out-of-place results, because nccl-tests does not support in-place mode for alltoall. There is a big gap in busBW between sizes 64 MB and 128 MB, which is extremely strange.

After reading the msccl source code, I found that this is because the function ncclAlltoAll in all_to_all.cc uses the MSCCL algorithm for sizes < 128 MB and falls back to the naive alltoall implementation otherwise.

Thus the question is: in the SCCL paper, the (8,3,3) algorithm outperforms the baseline (the naive implementation) at all message sizes, so why in my experiments does the (8,3,3) algorithm perform even worse than the baseline?
[screenshot: comparison with baseline]
In addition, under the $\alpha,\beta$ point-to-point communication model, the (8,3,3) algorithm should theoretically beat the baseline, at least for large message sizes.

Is this due to an inefficient CUDA kernel implementation for the MSCCL algorithm?
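For reference, the SCCL paper's alpha-beta analysis can be sketched with a cost of roughly S·α + (R/C)·n·β for an algorithm with S steps, R rounds, and C chunks on data size n. The baseline's parameters and the constants below are purely illustrative assumptions, not measured values; the sketch only shows why a lower bandwidth ratio (R/C = 3/8) should win at large sizes despite extra latency steps.

```python
def cost(steps, rounds, chunks, alpha, beta, n):
    """SCCL-style alpha-beta cost: a latency term (each step pays
    alpha) plus a bandwidth term (rounds/chunks of the data size n
    crosses the bottleneck link, each byte paying beta)."""
    return steps * alpha + (rounds / chunks) * n * beta

alpha, beta = 5e-6, 1e-9   # illustrative constants (s, s/byte)
n = 128 * 2**20            # 128 MB

# (C, S, R) = (8, 3, 3): 3 steps, bandwidth ratio 3/8.
t_833 = cost(steps=3, rounds=3, chunks=8, alpha=alpha, beta=beta, n=n)
# Hypothetical 1-step baseline with bandwidth ratio 1.
t_base = cost(steps=1, rounds=1, chunks=1, alpha=alpha, beta=beta, n=n)
print(t_833 < t_base)  # at large n the bandwidth term dominates
```

Under this model the crossover should favor (8,3,3) at large sizes, so a measured regression would indeed point at runtime overheads (e.g. kernel implementation) rather than the schedule itself.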

Provide just Gather to Gather-Scatter Alltoall distributor

The alltoall-gather-scatter distributor takes in a Gather and Scatter algorithm for a local topology and creates an Alltoall by stitching these together for a distributed topology with copies of the local one.

Providing the Scatter is redundant, as it can be produced by reversing the Gather. This works when the topology is symmetric.
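The reversal can be sketched on a plain send schedule: each gather send (src → dst at step t) becomes a scatter send (dst → src at step T−1−t). This is a standalone illustration using ad-hoc tuples, not msccl-tools' actual algorithm data structures.

```python
def reverse_to_scatter(gather_sends, num_steps):
    """Given gather sends as (chunk, src, dst, step) tuples, produce
    a scatter schedule by flipping each send's direction and
    reversing time. Valid when every reversed link also exists,
    i.e. the topology is symmetric."""
    return sorted((chunk, dst, src, num_steps - 1 - step)
                  for chunk, src, dst, step in gather_sends)

# Example: 4-node line 0-1-2-3 gathering to root 0 in 3 steps;
# each chunk hops toward node 0.
gather = [(3, 3, 2, 0), (2, 2, 1, 0), (3, 2, 1, 1), (2, 1, 0, 1),
          (1, 1, 0, 0), (3, 1, 0, 2)]
scatter = reverse_to_scatter(gather, num_steps=3)
```

In the reversed schedule, root 0 pushes each chunk back out along the same path, e.g. the gather send (3, 3, 2, 0) becomes the scatter send (3, 2, 3, 2).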

Rounds vs extra rounds for solve least-steps

Currently solve least-steps accepts --rounds but fails when it is passed. Possible desired behaviors:

  • Pass --rounds and iterate steps up until that bound is reached.
  • Pass the older style --extra-rounds, with no limit to how many steps can be considered.

The least-steps strategy is meant just for easily finding the latency-optimal algorithm, so it isn't naturally compatible with either --rounds or --chunks, but some reasonable behavior here would be nice. Removing both --rounds and --chunks is one option to consider.

Questioning the optimality and generation strategy of Allreduce

Hi, I have a question about Allreduce synthesis. The SCCL paper mentions that the Allreduce algorithm is generated from the Allgather algorithm. I am curious to what extent this approach guarantees the optimality of the Allreduce algorithm.

In particular, could obtaining Allreduce directly from a constraint problem, as is done for Allgather, yield a more efficient or optimal algorithm than the current method?

Thanks.
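For context, the composition in question builds allreduce as a reduce-scatter (the allgather schedule run in reverse, with reduction) followed by the allgather itself. A toy pure-Python ring version illustrates the construction; it is a sketch of the general technique, not msccl-tools' implementation, and the open question above is whether solving allreduce directly could beat such compositions.

```python
def ring_allreduce(chunks_per_node):
    """Toy ring allreduce: reduce-scatter then allgather.
    chunks_per_node[i][c] is node i's contribution to chunk c."""
    n = len(chunks_per_node)
    data = [list(row) for row in chunks_per_node]
    # Reduce-scatter: at step s, node i sends chunk (i - s) % n to
    # node (i + 1) % n, which adds it into its own copy. After n-1
    # steps node i holds the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            data[(i + 1) % n][c] += data[i][c]
    # Allgather: circulate each fully reduced chunk around the ring
    # for another n-1 steps, overwriting stale partial values.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            data[(i + 1) % n][c] = data[i][c]
    return data
```

The composition always costs the two phases' steps added together, which is one concrete reason a direct encoding might in principle find something cheaper on some topologies.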

Installation instruction might be incorrect

The readme says to run the following for installation

pip install git+https://github.com/microsoft/msccl.git

However, this points to the other msccl repository, which is not a Python package.

Is it supposed to be the following?

pip install git+https://github.com/microsoft/msccl-tools.git
