
facebookresearch / param


PArametrized Recommendation and Ai Model benchmark is a repository for the development of numerous micro-benchmarks as well as end-to-end networks for evaluating training and inference platforms.

License: MIT License

Python 100.00%

param's Introduction

PARAM

PARAM Benchmarks is a repository of communication and compute micro-benchmarks as well as full workloads for evaluating training and inference platforms.

PARAM complements two broad categories of commonly used benchmarks:

  1. C++ based stand-alone compute and communication benchmarks using cuDNN, MKL, NCCL, MPI libraries - e.g., NCCL tests (https://github.com/NVIDIA/nccl-tests), OSU MPI benchmarks (https://mvapich.cse.ohio-state.edu/benchmarks/), and DeepBench (https://github.com/baidu-research/DeepBench).
  2. Application benchmarks such as Deep Learning Recommendation Model (DLRM) and the broader MLPerf benchmarks. It's worth noting that while MLPerf is the de facto industry standard for benchmarking ML applications, we hope to complement this effort with broader workloads that are of more interest to Facebook, with more in-depth analysis of each within this branch of application benchmarks.

Our initial release of PARAM benchmarks focuses on AI training and comprises:

  1. Communication: PyTorch based collective benchmarks across arbitrary message sizes, effectiveness of compute-communication overlap, and DLRM communication patterns in the forward/backward passes
  2. Compute: PyTorch based GEMM, embedding lookup, and linear layer
  3. DLRM: tracks the ext_dist branch of Facebook's DLRM benchmark (https://github.com/facebookresearch/dlrm). In short, PARAM relies fully on the DLRM benchmark for end-to-end workload evaluation, with additional extensions as required for scale-out AI training platforms.
  4. PyTorch Execution Trace (ET) replay based tests: The PyTorch ET capturing capabilities, which have recently been introduced, allow for the recording of runtime information of a model at the operator level. This capability enables the creation of replay-based benchmarks (https://dl.acm.org/doi/abs/10.1145/3579371.3589072) to accurately reproduce the original performance.
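
Micro-benchmarks like the ones listed above commonly follow a warm-up phase with a number of timed iterations. The sketch below illustrates that pattern in plain Python; it is not PARAM's code, and `time_op` and `naive_gemm` are hypothetical names introduced only for this example. On a GPU, the timed region would also need a device synchronization before the clock is read.

```python
import statistics
import time


def time_op(fn, warmup=5, iters=20):
    """Run fn() `warmup` times untimed, then `iters` times timed (microseconds)."""
    for _ in range(warmup):
        fn()
    samples_us = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_us.append((time.perf_counter() - start) * 1e6)
    return {
        "iters": iters,
        "min_us": min(samples_us),
        "median_us": statistics.median(samples_us),
        "max_us": max(samples_us),
    }


def naive_gemm(a, b):
    """Multiply two matrices given as lists of rows (stand-in for a real GEMM)."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
stats = time_op(lambda: naive_gemm(a, b))
print(stats)
```

Reporting the median alongside the minimum and maximum makes one-off stalls visible without letting them dominate the headline number.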

In essence, PARAM bridges the gap between stand-alone C++ benchmarks and PyTorch/TensorFlow based application benchmarks. This enables us to gain deep insights into the inner workings of the system architecture as well as identify framework-level overheads by stressing all subcomponents of a system.

Version

0.1 : Initial release

Requirements

  • pytorch
  • future
  • numpy
  • apex

License

PARAM benchmarks is released under the MIT license. Please see the LICENSE file for more information.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

param's People

Contributors

aashaka, briancoutinho, bryanmr, facebook-github-bot, hiwotadese, jianyuh, joongunpark, jspark1105, kingchc, kiri11, kirteshpatil, kshiteejm, liangluofb, louisfeng, lw, mingyu-liang, mingyul-fb, minsii, nmacchioni, nrsatish, pranaykoka, r-barnes, rahulg, samiwilf, sergei-lebedev, shengbao-zheng, taekyungheo, tiagomantunes, wesbland, wfu-fb


param's Issues

recommended NCCL versions?

Hi,

I am running the Collective-Comms benchmark with 2 single-GPU hosts communicating through RoCEv2. I found the latency seems inconsistent between different NCCL versions.

There is always a constant additional latency of around 10 ms when I use NCCL 2.10 as the backend, no matter how large the tensor is. However, if I recompile PyTorch with NCCL 2.7 as the backend, the latency is much smaller. Are specific versions of NCCL required for this script?

I have attached the output when running the following command with NCCL 2.10:

~/conda_env/bin/mpirun -np 2 -N 1 --host 11.0.0.1,11.0.0.2 ~/conda_env/bin/python /data/param/train/comms/pt/comms.py --master-ip 11.0.0.1 --backend nccl --device cuda --b 8 --e 256M --n 20 --f 2 --z 1 --collective all_reduce
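
The sizes reported for this run start at 8 bytes, double at every step, and end at 268435456 bytes. That is consistent with reading `--b`, `--e`, and `--f` as begin size, end size, and multiplicative step. The sketch below reproduces such a sweep under that assumption; `parse_size` and `sweep_sizes` are hypothetical helpers, not functions from comms.py.

```python
def parse_size(text):
    """Parse '8', '256M' or '1G' into bytes, treating K/M/G as powers of 1024."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def sweep_sizes(begin, end, factor):
    """Return begin, begin*factor, ... up to and including end (in bytes)."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    sizes = []
    size = parse_size(begin)
    limit = parse_size(end)
    while size <= limit:
        sizes.append(size)
        size *= factor
    return sizes


sizes = sweep_sizes("8", "256M", 2)
print(len(sizes), sizes[0], sizes[-1])
```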

[Attached image: output]

I also compared the latency (in microseconds) of nccl-tests, PARAM with NCCL 2.7, and PARAM with NCCL 2.10. Should we expect performance similar to nccl-tests in normal cases?

Size (B)     nccl-tests (us)   PARAM, NCCL 2.7 (us)   PARAM, NCCL 2.10 (us)
8            150               47.1                   11991.1
16           158.3             46.1                   10393.1
32           143.7             45.6                   11396.5
64           113.6             45.5                   12362.4
128          112               46.3                   11889.3
256          114.7             47.3                   9837.3
512          113.8             46.5                   11046.1
1024         117.2             47.2                   12022.6
2048         132.7             47.9                   10952.3
4096         137               62.2                   11646.1
8192         121.3             53.5                   12466.5
16384        132               57.6                   10948.6
32768        121.5             62.9                   13337.2
65536        212.3             75                     11786
131072       229.8             80.9                   10927.1
262144       237.4             95.4                   14178.9
524288       275.3             137.1                  10649.8
1048576      352.5             212                    11185.1
2097152      525.9             364.3                  10624.6
4194304      780               1100.6                 13422.3
8388608      1418.6            2883.2                 12564.5
16777216     2659.3            7575.9                 14743.1
33554432     4795              18189.6                18961.8
67108864     9198.5            36459.1                35850.3
134217728    18129             73286.1                71411.7
268435456    36005             151070.3               154358.7
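
To make the table easier to compare, latencies can be turned into bandwidth and into the extra latency seen with NCCL 2.10. The sketch below does this for three rows copied from the table. The function names are hypothetical, 1 GB is taken as 1e9 bytes, and the all-reduce bus-bandwidth factor 2*(n-1)/n is the one described in the nccl-tests documentation.

```python
# (size in bytes, nccl-tests us, PARAM NCCL 2.7 us, PARAM NCCL 2.10 us)
ROWS = [
    (8, 150.0, 47.1, 11991.1),
    (1048576, 352.5, 212.0, 11185.1),
    (268435456, 36005.0, 151070.3, 154358.7),
]


def alg_bandwidth_gb_per_s(size_bytes, latency_us):
    """Algorithmic bandwidth: bytes moved divided by elapsed time, in GB/s."""
    return size_bytes / (latency_us * 1e-6) / 1e9


def allreduce_bus_bandwidth_gb_per_s(size_bytes, latency_us, ranks):
    """Bus bandwidth for all-reduce: algorithmic bandwidth times 2*(n-1)/n."""
    return alg_bandwidth_gb_per_s(size_bytes, latency_us) * 2 * (ranks - 1) / ranks


for size, t_tests, t_27, t_210 in ROWS:
    extra_us = t_210 - t_27  # added latency of NCCL 2.10 relative to NCCL 2.7
    print(
        size,
        round(alg_bandwidth_gb_per_s(size, t_tests), 4),
        round(alg_bandwidth_gb_per_s(size, t_27), 4),
        round(extra_us, 1),
    )
```

With two ranks the bus-bandwidth factor is 1, so algorithmic and bus bandwidth coincide for this setup.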

Our system information:

  • OS: Ubuntu 20.04
  • Network Interface: Mellanox mlx5

ImportError: cannot import name 'ExecutionTraceObserver' from 'torch.profiler'

Dear Authors,

Thank you for the benchmark. I am starting to use this tool to generate ETs for DNN models written in PyTorch. I installed and ran it using the following commands.
cd train/compute/python
nohup python3 setup.py install >setup.out 2>&1 &
nohup python3 -m pytorch.run_benchmark -c examples/pytorch/configs/alex_net.json -d cuda --eg --cuda-l2-cache on > bench.out 2>&1 &
However, I got the error "ImportError: cannot import name 'ExecutionTraceObserver' from 'torch.profiler'".
I am wondering whether something is wrong with my installation or with the way I run it. Could you provide some help?
My Python and PyTorch configuration is as follows.
(rppg-toolbox) [tangyue@v001 python]$ python3 -m torch.utils.collect_env
Collecting environment information...

PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 8.2.2004 (Core) (x86_64)
GCC version: (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Clang version: 9.0.1 (Red Hat 9.0.1-2.module_el8.2.0+309+0c7b6b03)
CMake version: version 3.11.4
Libc version: glibc-2.28

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-193.28.1.el8_2.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.0
[pip3] pytorch3d==0.7.1
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchinfo==1.7.1
[pip3] torchsampler==0.1.2
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.13.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.22.0 pypi_0 pypi
[conda] pytorch 1.12.1 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch3d 0.7.1 py38_cu102_pyt1121 pytorch3d
[conda] torchaudio 0.12.1 py38_cu102 pytorch
[conda] torchinfo 1.7.1 pypi_0 pypi
[conda] torchsampler 0.1.2 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.13.1 py38_cu102 pytorch
[3]+ Exit 1 nohup python3 -m pytorch.run_benchmark -c examples/pytorch/configs/alex_net.json -d cuda --eg --cuda-l2-cache on > bench.out 2>&1
Thank you.
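
The environment above reports PyTorch 1.12.1, and the ImportError suggests that this build does not export `ExecutionTraceObserver` from `torch.profiler`. Which release first provides it is not stated here. A minimal sketch for checking the installed build before launching the benchmark follows; `has_symbol` is a hypothetical helper built on the standard library only.

```python
import importlib


def has_symbol(module_name, symbol):
    """Return True if `module_name` imports cleanly and exposes `symbol`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module missing (or one of its own imports failed).
        return False
    return hasattr(module, symbol)


# Prints False when torch is absent or too old to export the name.
print(has_symbol("torch.profiler", "ExecutionTraceObserver"))
```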

commsTraceReplay failing due to missing import

I used torch.profiler.profile to collect some traces and replay them using commsTraceReplay. I encountered the following error:

Traceback (most recent call last):
  File "./commsTraceReplay.py", line 1247, in readTrace
    import commsTraceParser
ModuleNotFoundError: No module named 'commsTraceParser'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./commsTraceReplay.py", line 1329, in <module>
    main()
  File "./commsTraceReplay.py", line 1325, in main
    traceBench.runBench(commsParams)
  File "./commsTraceReplay.py", line 1000, in runBench
    self.readTrace(remotePath=self.trace_file, rank=global_rank)
  File "./commsTraceReplay.py", line 1250, in readTrace
    self.comms_trace = extractCommsInfo(self.comms_trace)
  File "./commsTraceReplay.py", line 1266, in extractCommsInfo
    newComm.comms = paramToCommName(curComm["comms"].lower())
TypeError: string indices must be integers
Traceback (most recent call last):
  File "./commsTraceReplay.py", line 1247, in readTrace
    import commsTraceParser
ModuleNotFoundError: No module named 'commsTraceParser'

Looks like there is no module named commsTraceParser in the entire repo.

Auto-shrink for alltoallv

Hello,

I am trying to run PARAM in replay mode with 2 ranks, using a trace that was generated on a larger number of ranks.
The trace contains different operations, including alltoallv.

By default PARAM reports the error "Number of tensor splits not equal to group size":

Traceback (most recent call last):
  File "./commsTraceReplay.py", line 708, in <module>
    main()
  File "./commsTraceReplay.py", line 704, in main
    traceBench.runBench(comms_world_info, commsParams)
  File "./commsTraceReplay.py", line 567, in runBench
    self.benchTime(commsParams)
  File "./commsTraceReplay.py", line 483, in benchTime
    self.collectiveArgs, retFlag=True
  File "./param/train/comms/pt/pytorch_dist_backend.py", line 183, in all_to_allv
    async_op=collectiveArgs.asyncOp,
  File "torch/distributed/distributed_c10d.py", line 2386, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: Number of tensor splits not equal to group size

If I set "--auto-shrink", PARAM no longer reports this error and seems to adjust the splits for the smaller scale, but now a failure occurs at the transport level:
tl_ucp_coll.c:42 TL_UCP ERROR failure in send completion Message truncated
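
The RuntimeError reflects a requirement of `all_to_all_single`: each split list needs one entry per rank in the group. In addition, what one rank sends to a peer has to equal what that peer expects to receive from it. The sketch below checks both conditions on plain lists. It is a diagnostic aid under stated assumptions, not PARAM's auto-shrink logic, and a mismatch of this kind is only one possible cause of the "Message truncated" failure. `check_alltoallv_splits` is a hypothetical name.

```python
def check_alltoallv_splits(input_splits, output_splits):
    """Validate all-to-all-v split sizes.

    input_splits[r][p]  = elements rank r sends to rank p
    output_splits[r][p] = elements rank r expects from rank p
    Returns a list of human-readable problems (empty if consistent).
    """
    world = len(input_splits)
    if len(output_splits) != world:
        return [f"{len(output_splits)} output split lists for {world} ranks"]
    problems = []
    for rank in range(world):
        for kind, splits in (("input", input_splits[rank]),
                             ("output", output_splits[rank])):
            if len(splits) != world:
                problems.append(
                    f"rank {rank}: {len(splits)} {kind} splits for group size {world}"
                )
    if problems:
        return problems
    for src in range(world):
        for dst in range(world):
            sent = input_splits[src][dst]
            expected = output_splits[dst][src]
            if sent != expected:
                problems.append(
                    f"rank {src} sends {sent} to rank {dst}, which expects {expected}"
                )
    return problems


# Two ranks replaying split lists recorded on four ranks: flagged as too long.
print(check_alltoallv_splits([[1, 1, 1, 1], [1, 1, 1, 1]],
                             [[1, 1, 1, 1], [1, 1, 1, 1]]))
```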

"Unrecognised argument(s): force" for comms/pt/comms_utils.py

https://github.com/facebookresearch/param/blame/main/train/comms/pt/comms_utils.py#L1734

Traceback (most recent call last):
  File "./comms.py", line 1405, in <module>
    main()
  File "./comms.py", line 1362, in main
    collBenchObj.checkArgs(args)
  File "./comms.py", line 239, in checkArgs
    super().checkArgs(args)
  File "/myworkspace/param/train/comms/pt/comms_utils.py", line 1734, in checkArgs
    force=True,
  File "/opt/conda/lib/python3.7/logging/__init__.py", line 1919, in basicConfig
    raise ValueError('Unrecognised argument(s): %s' % keys)
ValueError: Unrecognised argument(s): force

ProcessGroupNCCL alltoall error

Traceback (most recent call last):
  File "./comms.py", line 526, in <module>
    main()
  File "./comms.py", line 523, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "./comms.py", line 485, in runBench
    backendObj.benchmark_comms()
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 252, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "./comms.py", line 426, in benchTime
    comm_fn=collectiveFunc, compute_fn=computeFunc
  File "./comms.py", line 164, in runColl
    comm_fn(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 108, in all_to_all
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0

However, with NCCL_DEBUG=INFO set, I can see that I have an NCCL lib version >= 2.7.0:

ip-172-31-44-177:11401:11401 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-44-177:11404:11404 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.44.177<0>

However, if I remove the --collective argument altogether, it works.

How do I generate traces for replay?

Hi,
What is the expected trace format for commsTraceReplay.py, and how do I generate such traces?
I tried with traces captured by the latest PyTorch profiler (PyTorch at tag v1.9.0-rc2, commit d417a094f398f1c4efd7f818b14b8471a597fbcc), following the doc:

import torch

def trace_handler(prof):
    # Write one Chrome-format trace file per rank.
    prof.export_chrome_trace(f"rank{args.rank}.json")

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=10, active=2),
    on_trace_ready=trace_handler,
) as p:
    for _ in range(args.iterations):
        my_training_code_using_DistributedDataParallel()
        p.step()  # advance the profiler schedule

but got errors with those traces:

mpirun -np 2 -H ${HOST1_IP},${HOST2_IP} python commsTraceReplay.py --master-ip $HOST1_IP --backend nccl --device cuda --trace-path /path/to/rankR.json

Traceback (most recent call last):
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 708, in <module>
    main()
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 704, in main
    traceBench.runBench(comms_world_info, commsParams)
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 562, in runBench
    self.initTraceStat()
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 274, in initTraceStat
    for curComm in self.comms_trace[:self.max_msg_cnt]:
TypeError: unhashable type: 'slice'

Thank you!
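
The slice in the traceback suggests that the replay code expects the loaded trace to be a list, whereas a Chrome trace file is commonly a JSON object that holds its events under a "traceEvents" key. The sketch below inspects the top-level layout and unwraps the event list. It only handles the container; it does not convert profiler events into whatever per-collective fields the replay tool needs, which this thread does not specify. `load_event_list` and the sample event are hypothetical.

```python
import json


def load_event_list(text):
    """Return the list of events from a JSON trace, whatever its top-level layout."""
    data = json.loads(text)
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and isinstance(data.get("traceEvents"), list):
        return data["traceEvents"]
    raise ValueError(f"unsupported trace layout: top-level {type(data).__name__}")


# Hypothetical sample in the Chrome JSON-object layout.
chrome_style = '{"traceEvents": [{"name": "nccl:all_reduce", "ph": "X"}]}'
events = load_event_list(chrome_style)
print(type(events).__name__, len(events))
```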

DLRM-comms benchmark doesn't have device attribute

PyTorchNCCLBackend expects commsParams to have a device attribute (6e571fa). This is not the case for the DLRM benchmark: first, it has no such parameter, and second, it passes a dictionary to PyTorchNCCLBackend.

Traceback (most recent call last):
  File "./dlrm.py", line 1129, in <module>
    main()
  File "./dlrm.py", line 1125, in main
    dlrmBench.runBench(mpi_env_params, comms_world_info, args)
  File "./dlrm.py", line 1048, in runBench
    local_rank, global_rank, world_size, group, curDevice = comms_utils.get_rank_details(self.backendFuncs)
  File "/wspace/param/train/comms/pt/comms_utils.py", line 99, in get_rank_details
    curDevice = backendFuncs.get_device()
  File "/wspace/param/train/comms/pt/pytorch_nccl_backend.py", line 213, in get_device
    my_dev = torch.device(self.commsParams.device)
AttributeError: 'dict' object has no attribute 'device'
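
The AttributeError arises because attribute access (`commsParams.device`) is applied to a plain dict. One way to bridge the two styles is to wrap the dict in a namespace before it reaches the backend. The sketch below shows the idea; `as_params` and the example keys are hypothetical, and this is not the fix adopted in the repository.

```python
from types import SimpleNamespace


def as_params(obj):
    """Return an object with attribute access; dicts are wrapped, others pass through."""
    if isinstance(obj, dict):
        return SimpleNamespace(**obj)
    return obj


# Hypothetical parameters; the real benchmark would supply its own keys.
params = as_params({"device": "cuda", "backend": "nccl"})
print(params.device)
```

A wrapper like this only helps if the dict actually contains the expected keys; a missing key still fails, just with a clearer attribute name in the error.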

AttributeError

Hi. I am new to Python and can't find an answer to this. The following is my command:
mpirun -np 8 -N 8 ./dlrm.py --mini-batch-size 32 --num-batches 100 --arch-mlp-bot 256-256 --arch-sparse-feature-size 64 --arch-embedding-size "10000-10000-10000-10000-10000-10000-10000-10000"
And this is the error I get:
  File "/workspace/param/train/comms/pt/pytorch_dist_backend.py", line 709, in initialize_backend
    for pg_id, group_ranks in self.commsParams.groupRanks.items():
AttributeError: 'dict' object has no attribute 'groupRanks'

Enabling Kineto dumping on Pytorch

Hi,
I am interested in enabling Kineto traces in PyTorch. By looking at the PARAM code I was able to roughly understand how to enable Kineto in PyTorch. The problem is that the Kineto dump contains only a single operator. Could you give me some instructions on how to correctly enable Kineto for PyTorch models?

OOM for all gather comms tests

Hi,

I'm trying to benchmark multi-node all-gather performance using the PARAM tests for buffers up to 2G, but the test runs out of memory (OOM) at a buffer size of around 1G, while the same config works for nccl-tests. Any ideas or insight would be helpful. Thank you! The AR and RS tests are fine, and their results are very similar to nccl-tests. You can reproduce this on A100-40G/H100 clusters (p4d or p5 on AWS).

PyTorch nightly with cuda 12.1 or PyTorch 2.0.1 with CUDA 11.8

For PARAM, I'm launching it the following way:

mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
      --tag-output \
      --oversubscribe --allow-run-as-root \
      $MPI_OPTIONS /fsx/lawei/param/train/comms/pt/comms.py \
      --master-ip ip-172-31-49-213 \
      --b 32M \
      --e 2048M \
      --n 100 \
      --z 0 \
      --backend nccl \
      --device cuda \
      --collective all_gather

For nccl-tests, I'm using NCCL 2.18.3 + CUDA 12.1, but older versions also work.

mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
      --tag-output \
      --oversubscribe --allow-run-as-root \
      bash run_nccl_test.sh

and in the bash file

export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export FI_EFA_USE_DEVICE_RDMA=1
/usr/local/cuda-12.1/efa/test-cuda-12.1/all_gather_perf -b 32M -e 2048M  -n 100  -z 0 -f 2 -g 1
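
One thing worth checking is how each tool interprets the size argument. In an all-gather, every rank ends up holding the inputs of all ranks, so the output buffer is the rank count times the per-rank input. If one tool treats the size as the per-rank input and the other as the total output, the same "2G" means very different allocations. The sketch below works through the arithmetic. Which interpretation PARAM uses is not stated in this thread, and the rank count of 16 is a hypothetical example (2 nodes with 8 GPUs each).

```python
GIB = 1024 ** 3


def allgather_buffer_bytes(size_bytes, ranks, size_is_per_rank_input):
    """Return (input_bytes, output_bytes) held by each rank for one all-gather."""
    if size_is_per_rank_input:
        return size_bytes, size_bytes * ranks
    return size_bytes // ranks, size_bytes


ranks = 16  # hypothetical: 2 nodes x 8 GPUs
for label, per_rank in (("size = per-rank input", True), ("size = total output", False)):
    for size in (1 * GIB, 2 * GIB):
        inp, out = allgather_buffer_bytes(size, ranks, per_rank)
        print(f"{label}: size={size / GIB:.0f} GiB -> "
              f"input {inp / GIB:.3f} GiB, output {out / GIB:.3f} GiB per rank")
```

Under the per-rank-input reading, a 2 GiB size on 16 ranks already needs a 32 GiB output buffer on each GPU, before any other allocations.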
