facebookresearch / param

PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for evaluation of training and inference platforms.

License: MIT License

Python 99.95% Shell 0.05%

param's Introduction

PARAM

PARAM Benchmarks is a repository of communication and compute micro-benchmarks as well as full workloads for evaluating training and inference platforms.

PARAM complements two broad categories of commonly used benchmarks:

C++ based stand-alone compute and communication benchmarks using the cuDNN, MKL, NCCL, and MPI libraries, e.g., NCCL tests (https://github.com/NVIDIA/nccl-tests), OSU MPI benchmarks (https://mvapich.cse.ohio-state.edu/benchmarks/), and DeepBench (https://github.com/baidu-research/DeepBench).
Application benchmarks such as the Deep Learning Recommendation Model (DLRM) and the broader MLPerf benchmarks. It's worth noting that while MLPerf is the de facto industry standard for benchmarking ML applications, we hope to complement this effort with broader workloads that are of more interest to Facebook, with more in-depth analysis of each within this branch of application benchmarks.

Our initial release of PARAM benchmarks focuses on AI training and comprises:

Communication: PyTorch based collective benchmarks across arbitrary message sizes, effectiveness of compute-communication overlap, and DLRM communication patterns in fwd/bwd pass
Compute: PyTorch based GEMM, embedding lookup, and linear layer
DLRM: tracks the ext_dist branch of Facebook's DLRM benchmark (https://github.com/facebookresearch/dlrm). In short, PARAM fully relies on the DLRM benchmark for end-to-end workload evaluation, with additional extensions as required for scale-out AI training platforms.
PyTorch Execution Trace (ET) replay based tests: The PyTorch ET capturing capabilities, which have recently been introduced, allow for the recording of runtime information of a model at the operator level. This capability enables the creation of replay-based benchmarks (https://dl.acm.org/doi/abs/10.1145/3579371.3589072) to accurately reproduce the original performance.

In essence, PARAM bridges the gap between stand-alone C++ benchmarks and PyTorch/TensorFlow based application benchmarks. This enables us to gain deep insights into the inner workings of the system architecture as well as identify framework-level overheads by stressing all subcomponents of a system.

Version

0.1 : Initial release

Requirements

pytorch
future
numpy
apex

License

PARAM benchmarks is released under the MIT license. Please see the LICENSE file for more information.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

param's People

Contributors

manjugv jladd-mlnx shz0116 luo-liang kingchc pallab-zz louisfeng kshiteejm andrei-pokrovsky jianyuh sergei-lebedev moderato nrsatish sazanovd jspark1105 yeonan caogao christindbose ehsanardestani xbwgc amathews-amd findhao minsii cemberk adamweingram briancoutinho pavani-panakanti ethicalsecurity-agency jinlmsft sunghlin mingyu-liang q10 teutades tiagomantunes rashidi1saeed frankucas anageswa taekyungheo stayyule melodylail shawnshanksgui sryap shengfukevin venkatrag1 isyinun paklui azad-meta lpc0220 rohitdwivedula jamesjwu dageita amaslenn lihuibng sanrise

param's Issues

commsTraceReplay failing due to missing import

I used torch.profiler.profile to collect some traces and replay them using commsTraceReplay. I encountered the following error:

Traceback (most recent call last):
  File "./commsTraceReplay.py", line 1247, in readTrace
    import commsTraceParser
ModuleNotFoundError: No module named 'commsTraceParser'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./commsTraceReplay.py", line 1329, in <module>
    main()
  File "./commsTraceReplay.py", line 1325, in main
    traceBench.runBench(commsParams)
  File "./commsTraceReplay.py", line 1000, in runBench
    self.readTrace(remotePath=self.trace_file, rank=global_rank)
  File "./commsTraceReplay.py", line 1250, in readTrace
    self.comms_trace = extractCommsInfo(self.comms_trace)
  File "./commsTraceReplay.py", line 1266, in extractCommsInfo
    newComm.comms = paramToCommName(curComm["comms"].lower())
TypeError: string indices must be integers

Looks like there is no module named commsTraceParser in the entire repo.

OOM for all gather comms tests

Hi,

I'm trying to benchmark multi-node all_gather performance using the PARAM tests for buffers up to 2G, but the test runs out of memory (OOM) at a buffer size of around 1G, while the same config works for nccl-tests. The all_reduce (AR) and reduce_scatter (RS) tests are fine, and their results are very similar to nccl-tests. You can reproduce this on A100-40G or H100 clusters (p4d or p5 on AWS). Any ideas or insight would be helpful. Thank you!

Environment: PyTorch nightly with CUDA 12.1, or PyTorch 2.0.1 with CUDA 11.8.

For PARAM, I'm launching it the following way:

mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
      --tag-output \
      --oversubscribe --allow-run-as-root \
      $MPI_OPTIONS /fsx/lawei/param/train/comms/pt/comms.py \
      --master-ip ip-172-31-49-213 \
      --b 32M \
      --e 2048M \
      --n 100 \
      --z 0 \
      --backend nccl \
      --device cuda \
      --collective all_gather

For nccl-tests, I'm using NCCL 2.18.3 + CUDA 12.1, but older versions also work.

mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
      --tag-output \
      --oversubscribe --allow-run-as-root \
      bash run_nccl_test.sh

And in the bash file (run_nccl_test.sh):

export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export FI_EFA_USE_DEVICE_RDMA=1
/usr/local/cuda-12.1/efa/test-cuda-12.1/all_gather_perf -b 32M -e 2048M  -n 100  -z 0 -f 2 -g 1
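
One thing worth ruling out for a report like this is how each tool interprets the size argument. In nccl-tests, the all_gather size is the total gathered buffer, and each rank contributes size/nranks. If a benchmark instead treated the size as each rank's own contribution (an assumption here, not something confirmed for PARAM), the output buffer would grow with the rank count. The arithmetic can be sketched as follows; the function name and the 16-rank example are illustrative only.

```python
GIB = 1024 ** 3

def all_gather_bytes(size_bytes, world_size, size_is_per_rank):
    """Return (input_bytes, output_bytes) one rank must hold for all_gather.

    size_is_per_rank=False models the nccl-tests convention, where the size
    is the total gathered buffer. True models the hypothetical case where
    the size is each rank's own contribution.
    """
    if size_is_per_rank:
        return size_bytes, size_bytes * world_size
    return size_bytes // world_size, size_bytes

# 16 ranks (2 nodes x 8 GPUs) is an illustrative choice, not from the report.
for per_rank in (False, True):
    inp, out = all_gather_bytes(1 * GIB, 16, per_rank)
    print(f"per_rank={per_rank}: input={inp / GIB:.4f} GiB, output={out / GIB:.1f} GiB")
```

Under the per-rank interpretation, a 1 GiB size on 16 ranks already needs a 16 GiB output buffer per GPU, before counting any other allocations.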

torch.distributed.batch_isend_irecv is not recorded properly in ET

We are developing comm_replay and found a problem with torch.distributed.batch_isend_irecv, which is used in one of our test traces.
The p2p comm sequence of real training between rank 0 and rank 8 is:
rank 0: batch -> send -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> recv
rank 8: batch -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send
The p2p comm sequence in the Execution Trace for replay between rank 0 and rank 8 is:
rank 0: send -> send -> recv -> send -> recv -> send -> recv -> recv
rank 8: recv -> send -> recv -> send -> recv -> send -> recv -> send

The issue can be reproduced with the collected ET for https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py#L3846

The two attached files, batched-send-recv-0.json and batched-send-recv-1.json, are simpler versions of that unit test (with only one batch_isend_irecv call). You can find the nccl::coalesced node, which marks the end of the coalescing buffer. I think the trace is missing a node that marks the start of the coalescing buffer. Once that is added, all send/recv nodes between the start and the end of coalescing should be treated as one coalesced group for replay.
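
The grouping proposed above can be sketched in a few lines. This is a minimal illustration, not the et_replay implementation: the start marker name `coalesce_start` is hypothetical (the trace currently records only the `nccl::coalesced` end marker), and ops are reduced to plain strings.

```python
def group_coalesced(ops, start_marker="coalesce_start", end_marker="nccl::coalesced"):
    """Group p2p ops between a start and an end marker into one batch.

    `start_marker` is a hypothetical node name. Ops outside any marker
    pair are returned as single-element groups.
    """
    groups, current = [], None
    for op in ops:
        if op == start_marker:
            current = []
        elif op == end_marker:
            if current is None:
                raise ValueError("end marker without a matching start marker")
            groups.append(current)
            current = None
        elif current is not None:
            current.append(op)
        else:
            groups.append([op])
    if current is not None:
        raise ValueError("start marker without a matching end marker")
    return groups

trace = ["coalesce_start", "send", "recv", "nccl::coalesced", "send"]
print(group_coalesced(trace))
```

Each inner list would then be replayed as a single batched call rather than as independent sends and receives.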

et_replay fails to import torch._inductor.runtime

Describe the Bug

The ET replay module crashes with an import error:

  File "et_replay/tools/et_replay.py", line 30, in <module>
    from torch._inductor.runtime.triton_heuristics import grid, split_scan_grid
ModuleNotFoundError: No module named 'torch._inductor.runtime'

Please specify dependencies in requirement.txt.
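
Until the dependency is documented, a guard around the import can at least turn the crash into a clear message. The following is a minimal sketch using only the standard library; the helper name `module_available` is an assumption for illustration and is not part of et_replay.

```python
import importlib.util

def module_available(name):
    """Return True if module `name` can be located.

    find_spec imports parent packages but not the module itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. the top-level package) is missing.
        return False

REQUIRED = "torch._inductor.runtime"
if not module_available(REQUIRED):
    print(f"{REQUIRED} not found; the installed PyTorch may not provide it")
```

A script could run such a check at startup and exit with a hint about the required PyTorch version instead of failing on the import line.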

Steps to Reproduce

version: commit 884a1f0
https://github.com/facebookresearch/param/blob/main/et_replay/README.md
tested on torch 2.2.0

Expected Behavior

et_replay should replay the given execution trace.

Improve compute op coverage rate

et_replay skips some operators. We need to understand why these operators are skipped and how to fix that.

Some known ops:

_scaled_dot_product_efficient_attention_backward_cuda (the op schema does not support the optional attention_bias)
fbgemm for embedding tables
The current implementation tries to recover the embedding table lookup op from its forward and backward calls (the way the forward and backward calls are matched is also fragile). This is not reliable, as the op itself keeps changing.

recommended NCCL versions?

Hi,

I am running the Collective-Comms benchmark with 2 single-GPU hosts communicating through RoCEv2. I found the latency seems inconsistent between different NCCL versions.

There is always a constant additional latency of around 10 ms when I use NCCL 2.10 as the backend, no matter how large the tensor is. However, if I recompile PyTorch with NCCL 2.7 as the backend, the latency is much smaller. Are specific versions of NCCL required for this script?

I have attached the output when running the following command with NCCL 2.10:

~/conda_env/bin/mpirun -np 2 -N 1 --host 11.0.0.1,11.0.0.2 ~/conda_env/bin/python /data/param/train/comms/pt/comms.py --master-ip 11.0.0.1 --backend nccl --device cuda --b 8 --e 256M --n 20 --f 2 --z 1 --collective all_reduce

I also compared the latency (in microseconds) among nccl-tests, PARAM with NCCL 2.7, and PARAM with NCCL 2.10. Should we expect performance similar to nccl-tests in normal cases?

Size (B)	nccl-tests	PARAM (nccl 2.7)	PARAM (nccl 2.10)
8	150	47.1	11991.1
16	158.3	46.1	10393.1
32	143.7	45.6	11396.5
64	113.6	45.5	12362.4
128	112	46.3	11889.3
256	114.7	47.3	9837.3
512	113.8	46.5	11046.1
1024	117.2	47.2	12022.6
2048	132.7	47.9	10952.3
4096	137	62.2	11646.1
8192	121.3	53.5	12466.5
16384	132	57.6	10948.6
32768	121.5	62.9	13337.2
65536	212.3	75	11786
131072	229.8	80.9	10927.1
262144	237.4	95.4	14178.9
524288	275.3	137.1	10649.8
1048576	352.5	212	11185.1
2097152	525.9	364.3	10624.6
4194304	780	1100.6	13422.3
8388608	1418.6	2883.2	12564.5
16777216	2659.3	7575.9	14743.1
33554432	4795	18189.6	18961.8
67108864	9198.5	36459.1	35850.3
134217728	18129	73286.1	71411.7
268435456	36005	151070.3	154358.7

Our system information:

OS: Ubuntu 20.04
Network Interface: Mellanox mlx5
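
To compare the rows of the table on a common scale, the latencies can be converted to bandwidth. The sketch below uses the all_reduce bus-bandwidth factor 2*(n-1)/n documented by nccl-tests; the function name is illustrative, and the inputs are the 268435456-byte row of the table with 2 ranks.

```python
def bandwidth_gbps(size_bytes, latency_us, world_size):
    """Return (algbw, busbw) in GB/s for one all_reduce measurement.

    busbw applies the 2*(n-1)/n correction factor that nccl-tests
    documents for all_reduce.
    """
    algbw = size_bytes / (latency_us * 1e-6) / 1e9
    busbw = algbw * 2 * (world_size - 1) / world_size
    return algbw, busbw

# Largest message size from the table, measured with 2 ranks.
rows = [
    ("nccl-tests", 36005.0),
    ("PARAM (nccl 2.7)", 151070.3),
    ("PARAM (nccl 2.10)", 154358.7),
]
for label, latency_us in rows:
    algbw, busbw = bandwidth_gbps(268435456, latency_us, world_size=2)
    print(f"{label}: algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")
```

With two ranks the correction factor is 1, so algorithm bandwidth and bus bandwidth coincide.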

Request for BFloat16 support in comms.py

Currently, train/comms/pt/comms.py supports only the following data types, and BFloat16 is not among them:
['float32', 'int32', 'long', 'float16', 'float64', 'bool', 'Float', 'Int', 'Long', 'Double', 'Half', 'Bool', 'Byte']

cc @nrsatish

DLRM-comms benchmark doesn't have device attribute

PyTorchNCCLBackend expects commsParams to have a device attribute (6e571fa). This is not the case for the DLRM benchmark: first, it has no such parameter, and second, it passes a dictionary to PyTorchNCCLBackend.

Traceback (most recent call last):
  File "./dlrm.py", line 1129, in <module>
    main()
  File "./dlrm.py", line 1125, in main
    dlrmBench.runBench(mpi_env_params, comms_world_info, args)
  File "./dlrm.py", line 1048, in runBench
    local_rank, global_rank, world_size, group, curDevice = comms_utils.get_rank_details(self.backendFuncs)
  File "/wspace/param/train/comms/pt/comms_utils.py", line 99, in get_rank_details
    curDevice = backendFuncs.get_device()
  File "/wspace/param/train/comms/pt/pytorch_nccl_backend.py", line 213, in get_device
    my_dev = torch.device(self.commsParams.device)
AttributeError: 'dict' object has no attribute 'device'
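
The failure mode is generic Python behavior: a dict exposes its keys through indexing, not through attribute access. A minimal sketch, with illustrative keys and a standard-library wrapper as one possible workaround (not necessarily the fix adopted in PARAM):

```python
from types import SimpleNamespace

params_dict = {"device": "cuda", "backend": "nccl"}  # illustrative keys only

try:
    params_dict.device  # dicts expose keys via [], not via attributes
except AttributeError as err:
    print(err)

# Wrapping the dict gives attribute-style access without changing the caller's data.
params = SimpleNamespace(**params_dict)
print(params.device)
```

The alternative is for the caller to build the same parameter object type that the backend expects, which keeps the backend code unchanged.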

How to generate test data for an op that use index to access memory

Some operators use an input index tensor to read or write memory, for example the embedding table lookup op. Since the execution trace does not save the data for these tensors, a randomly generated tensor may contain invalid indices and cause a runtime error. We need to find a way to fix this.
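
One possible direction is to bound the generated indices by the size of the table they address. The sketch below assumes the number of rows is known (for example, recoverable from a recorded tensor shape, which is an assumption here) and uses plain Python lists in place of tensors; the function name is hypothetical.

```python
import random

def make_lookup_indices(num_rows, count, seed=0):
    """Generate `count` random indices valid for a table with `num_rows` rows."""
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    rng = random.Random(seed)
    return [rng.randrange(num_rows) for _ in range(count)]

indices = make_lookup_indices(num_rows=10000, count=8)
print(all(0 <= i < 10000 for i in indices))
```

This keeps the replayed op from reading out of bounds, although the access pattern will still differ from the original workload.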

AttributeError

Hi. I am new to Python and can't find an answer to this. The following is my command:
mpirun -np 8 -N 8 ./dlrm.py --mini-batch-size 32 --num-batches 100 --arch-mlp-bot 256-256 --arch-sparse-feature-size 64 --arch-embedding-size "10000-10000-10000-10000-10000-10000-10000-10000"
And this is the error I get:
  File "/workspace/param/train/comms/pt/pytorch_dist_backend.py", line 709, in initialize_backend
    for pg_id, group_ranks in self.commsParams.groupRanks.items():
AttributeError: 'dict' object has no attribute 'groupRanks'

"Unrecognised argument(s): force" for comms/pt/comms_utils.py

https://github.com/facebookresearch/param/blame/main/train/comms/pt/comms_utils.py#L1734

Traceback (most recent call last):
  File "./comms.py", line 1405, in <module>
    main()
  File "./comms.py", line 1362, in main
    collBenchObj.checkArgs(args)
  File "./comms.py", line 239, in checkArgs
    super().checkArgs(args)
  File "/myworkspace/param/train/comms/pt/comms_utils.py", line 1734, in checkArgs
    force=True,
  File "/opt/conda/lib/python3.7/logging/__init__.py", line 1919, in basicConfig
    raise ValueError('Unrecognised argument(s): %s' % keys)
ValueError: Unrecognised argument(s): force
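
The `force` keyword of `logging.basicConfig` was added in Python 3.8, and the traceback above comes from Python 3.7. A version-guarded call is one way to support both; this is a sketch with a hypothetical helper name, not the code in comms_utils.py.

```python
import logging
import sys

def configure_logging(level=logging.INFO):
    """Reset the root logger's handlers on any Python 3 version."""
    if sys.version_info >= (3, 8):
        # `force` was added to logging.basicConfig in Python 3.8.
        logging.basicConfig(level=level, force=True)
    else:
        # Emulate force=True by removing existing handlers first.
        root = logging.getLogger()
        for handler in list(root.handlers):
            root.removeHandler(handler)
            handler.close()
        logging.basicConfig(level=level)

configure_logging()
```

Upgrading to Python 3.8 or newer avoids the problem without any code change.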

Simplify ET replay and add Integration testing

Summary

Today two replay paths are supported: one for comms only and one for compute + comms. This has several drawbacks:

The code exists in two directories, compute and comms.
Changes on one side are not tested for compatibility with the other.
@TaekyungHeo has observed bugs and crashes with the replay when both compute and comms are enabled.

Crashes

@TaekyungHeo to add more info on how to reproduce issues

Code unification

The basic idea is to pull things out into a replay directory and unify the code.
Details TBD

Integration testing

Ensure changes are unit tested to avoid impact to external users.

ImportError: cannot import name 'ExecutionTraceObserver' from 'torch.profiler'

Dear Authors,

Thank you for the benchmark. I am starting to use this tool to generate ETs for DNN models written in PyTorch. I installed and ran it using the following commands:
cd train/compute/python
nohup python3 setup.py install >setup.out 2>&1 &
nohup python3 -m pytorch.run_benchmark -c examples/pytorch/configs/alex_net.json -d cuda --eg --cuda-l2-cache on > bench.out 2>&1 &
However, I got this error: ImportError: cannot import name 'ExecutionTraceObserver' from 'torch.profiler'
I am wondering whether there is a problem with my installation or with how I am running it. Would it be possible to provide some help?
My Python and PyTorch configuration is as follows.
(rppg-toolbox) [tangyue@v001 python]$ python3 -m torch.utils.collect_env
Collecting environment information...

PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 8.2.2004 (Core) (x86_64)
GCC version: (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Clang version: 9.0.1 (Red Hat 9.0.1-2.module_el8.2.0+309+0c7b6b03)
CMake version: version 3.11.4
Libc version: glibc-2.28

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.18.0-193.28.1.el8_2.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.0
[pip3] pytorch3d==0.7.1
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchinfo==1.7.1
[pip3] torchsampler==0.1.2
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.13.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.22.0 pypi_0 pypi
[conda] pytorch 1.12.1 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch3d 0.7.1 py38_cu102_pyt1121 pytorch3d
[conda] torchaudio 0.12.1 py38_cu102 pytorch
[conda] torchinfo 1.7.1 pypi_0 pypi
[conda] torchsampler 0.1.2 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.13.1 py38_cu102 pytorch
[3]+ Exit 1 nohup python3 -m pytorch.run_benchmark -c examples/pytorch/configs/alex_net.json -d cuda --eg --cuda-l2-cache on > bench.out 2>&1
Thank you.

Tensor memory allocation is based on tensor id, not tensor storage id

In et_replay, memory for a tensor is allocated based on its tensor id. However, tensors with different tensor ids may refer to the same memory storage. With an Ads production model, we saw et_replay run out of GPU memory while the original workload ran fine.

This request is to allocate tensor memory based on storage id instead, to improve memory allocation efficiency.
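
The idea can be illustrated with a small allocator keyed by storage id. This is a sketch under stated assumptions, not the et_replay implementation: `StoragePool` is a hypothetical name, and a `bytearray` stands in for device memory.

```python
class StoragePool:
    """Allocate one buffer per storage id and share it across tensor ids."""

    def __init__(self):
        self._buffers = {}            # storage_id -> bytearray
        self._tensor_to_storage = {}  # tensor_id -> storage_id

    def allocate(self, tensor_id, storage_id, nbytes):
        buf = self._buffers.get(storage_id)
        if buf is None:
            buf = bytearray(nbytes)
            self._buffers[storage_id] = buf
        elif len(buf) < nbytes:
            # A later tensor may need a larger extent of the same storage.
            buf.extend(bytes(nbytes - len(buf)))
        self._tensor_to_storage[tensor_id] = storage_id
        return buf

    def total_bytes(self):
        return sum(len(b) for b in self._buffers.values())

pool = StoragePool()
a = pool.allocate(tensor_id=1, storage_id=100, nbytes=1024)
b = pool.allocate(tensor_id=2, storage_id=100, nbytes=1024)  # same storage
print(a is b, pool.total_bytes())
```

Allocating per tensor id would have reserved 2048 bytes here; keying on storage id reserves 1024.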

ProcessGroupNCCL alltoall error

Traceback (most recent call last):
  File "./comms.py", line 526, in <module>
    main()
  File "./comms.py", line 523, in main
    collBenchObj.runBench(comms_world_info, commsParams)
  File "./comms.py", line 485, in runBench
    backendObj.benchmark_comms()
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 252, in benchmark_comms
    self.commsParams.benchTime(index, self.commsParams, self)
  File "./comms.py", line 426, in benchTime
    comm_fn=collectiveFunc, compute_fn=computeFunc
  File "./comms.py", line 164, in runColl
    comm_fn(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 108, in all_to_all
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0

However, with NCCL_DEBUG=INFO set, I can see that I have an NCCL lib version >= 2.7.0:

ip-172-31-44-177:11401:11401 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-44-177:11404:11404 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.44.177<0>

However, if I remove the --collective option altogether, it works.

Auto-shrink for alltoallv

Hello,

I am trying to run PARAM in replay mode with 2 ranks, using a trace that was generated on a larger number of ranks.
The trace contains various operations, including alltoallv.

By default, PARAM reports the error "Number of tensor splits not equal to group size":

Traceback (most recent call last):
  File "./commsTraceReplay.py", line 708, in <module>
    main()
  File "./commsTraceReplay.py", line 704, in main
    traceBench.runBench(comms_world_info, commsParams)
  File "./commsTraceReplay.py", line 567, in runBench
    self.benchTime(commsParams)
  File "./commsTraceReplay.py", line 483, in benchTime
    self.collectiveArgs, retFlag=True
  File "./param/train/comms/pt/pytorch_dist_backend.py", line 183, in all_to_allv
    async_op=collectiveArgs.asyncOp,
  File "torch/distributed/distributed_c10d.py", line 2386, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: Number of tensor splits not equal to group size

If I set "--auto-shrink", PARAM no longer reports this error and seems to make adjustments for the smaller scale, but now an error occurs at the transport level:
tl_ucp_coll.c:42 TL_UCP ERROR failure in send completion Message truncated
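
A "message truncated" error usually means a receiver posted a smaller buffer than the sender transmitted, so one thing worth checking after shrinking is whether the split sizes still agree across ranks. For an all-to-all with explicit splits, what rank i sends to rank j must equal what rank j expects from rank i. The sketch below checks that property on plain lists; the function name and the example numbers are illustrative, and it does not reproduce what --auto-shrink actually does.

```python
def find_split_mismatches(in_splits, out_splits):
    """Return (src, dst) pairs whose send and receive sizes disagree.

    in_splits[i][j]  = elements rank i sends to rank j
    out_splits[j][i] = elements rank j expects to receive from rank i
    """
    world_size = len(in_splits)
    for splits in list(in_splits) + list(out_splits):
        if len(splits) != world_size:
            raise ValueError("number of tensor splits not equal to group size")
    return [
        (i, j)
        for i in range(world_size)
        for j in range(world_size)
        if in_splits[i][j] != out_splits[j][i]
    ]

# Two ranks after a hypothetical shrink: rank 0 sends 4 elements to rank 1,
# but rank 1 only expects 2, so its receive buffer would be too small.
print(find_split_mismatches([[1, 4], [3, 1]], [[1, 3], [2, 1]]))
```

An empty result means every sender and receiver pair agrees on the message size.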

Enabling Kineto dumping on Pytorch

Hi,
I am interested in enabling Kineto traces in PyTorch. By looking at the PARAM code, I was able to roughly understand how to enable Kineto in PyTorch. The problem is that the Kineto dump generated by PyTorch contains only a single operator. Could you give me some instructions on how to correctly enable Kineto for PyTorch models?

How do I generate traces for replay?

Hi,
What is the expected trace format for commsTraceReplay.py, and how do I generate such traces?
I tried with traces captured with the latest PyTorch profiler (PyTorch at tag v1.9.0-rc2, commit d417a094f398f1c4efd7f818b14b8471a597fbcc), following the doc:

import torch.profiler

def trace_handler(prof):
    # Write one Chrome-format trace file per rank.
    prof.export_chrome_trace(f"rank{args.rank}.json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=10, active=2),
    on_trace_ready=trace_handler,
) as p:
    for _ in range(args.iterations):
        my_training_code_using_DistributedDataParallel()
        p.step()

but got errors with those traces:

mpirun -np 2 -H ${HOST1_IP},${HOST2_IP} python commsTraceReplay.py --master-ip $HOST1_IP --backend nccl --device cuda --trace-path /path/to/rankR.json

Traceback (most recent call last):
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 708, in <module>
    main()
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 704, in main
    traceBench.runBench(comms_world_info, commsParams)
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 562, in runBench
    self.initTraceStat()
  File "/home/param/train/comms/pt/commsTraceReplay.py", line 274, in initTraceStat
    for curComm in self.comms_trace[:self.max_msg_cnt]:
TypeError: unhashable type: 'slice'

Thank you!
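
The TypeError above is what Python raises when slice syntax is applied to a dict, which suggests the loaded JSON is an object rather than the list that the replay loop iterates over. The sketch below reproduces this with a stand-in trace; the "traceEvents" key and the event fields are assumptions for illustration, not a statement of the format commsTraceReplay.py expects.

```python
import json

# Minimal stand-in for a Chrome-format trace: a JSON object, not a list.
chrome_trace = json.loads('{"traceEvents": [{"name": "nccl:all_reduce"}]}')

def try_slice(trace, n):
    """Slice a trace the way a list-based replay loop would."""
    try:
        return trace[:n]
    except (TypeError, KeyError) as err:
        # TypeError (unhashable type: 'slice') on older Pythons;
        # KeyError on Python 3.12+, where slice objects are hashable.
        return f"{type(err).__name__}: {err}"

print(try_slice(chrome_trace, 10))                 # fails: dict
print(try_slice(chrome_trace["traceEvents"], 10))  # works: list of events
```

Even when the event list is extracted, the events would still need the fields the replay tool reads, so a dedicated comms trace is likely required.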

reason behind the error?

Hello,
I get the following error:

python3 ~/sim/astra-sim/extern/graph_frontend/param/train/compute/python/build/lib/param_bench/train/compute/python/tools/trace_link.py --et-file ./pytorch_et0.json --kineto-file ./kineto_trace_0.json --exact-match --annotation 'enumerate(DataLoader)#_MultiProcessingDataLoaderIter.__next__'
[2024-03-19 08:44:21,007] execution_trace.py:441 [INFO]: Iteration node ids list = [1, 2216, 24992, 43225, 61321, 79426, 97551, 115473, 133356, 151330, 169160, 187095, 205044, 222820, 240719, 258762, 276644, 294685, 312664, 330426, 348528, 366397, 384192, 402211, 420042, 438014, 455971, 473796, 491710, 509663, 527590, 545569, 563452, 581313, 599257, 617204, 635038, 653018, 670987, 688774, 706819, 724711, 742619, 760647, 778522, 796362, 814376, 832289, 850186, 868155, 886031, 903970, 922059, 939831, 957709, 975604, 993383, 1011351, 1029385, 1047274, 1065277, 1083188, 1101020, 1119059, 1137034, 1154930, 1172953, 1190866, 1208781, 1226717, 1244599, 1262451, 1280351, 1298218, 1316039, 1333961, 1351727, 1369657, 1387584, 1405377, 1423294, 1441239, 1459051, 1476974, 1494877, 1512689, 1530617, 1548550, 1566508, 1584505, 1602407, 1620294, 1638240, 1656171, 1674091, 1692104, 1710069, 1727971, 1745854, 1763599, 1781537, 1799552, 1817433, 1835357, 1853315, 1871241, 1889234, 1907100, 1924916, 1942829, 1960750, 1978610, 1996572, 2014462, 2032349, 2050281, 2068129, 2086085, 2104043, 2122001, 2139913, 2157763, 2175708, 2193640, 2211586, 2229573, 2247456, 2265399, 2283326, 2301269, 2319315, 2337145, 2355062, 2372970, 2390800, 2408827, 2426699, 2444776, 2462690, 2480552, 2498594, 2516523, 2534325, 2552341, 2570246, 2588170, 2606230, 2623963, 2641960, 2660003, 2677732, 2695721, 2713769, 2731570, 2749459, 2767443, 2785239, 2803315, 2821242, 2839064, 2857060, 2875030, 2892845, 2910959, 2928860, 2946651, 2964657, 2982412, 3000358, 3018399, 3036264, 3054200, 3072144, 3089919, 3107858, 3125706, 3143568, 3161515, 3179388, 3197226, 3215159, 3233033, 3250869, 3268808, 3286787, 3304756, 3322784, 3340609, 3358568, 3376624, 3394464, 3412407, 3430452, 3448347, 3466274, 3484100, 3501964, 3519975, 3537923, 3555846, 3573816, 3591713, 3609611, 3627567, 3645477, 3663444, 3681484, 3699271, 3717188, 3735100, 3752888, 3770905, 3788951, 3806776, 3824750, 3842662, 3860501, 3878434, 3896335, 3914229, 
3932164, 3950077, 3968058, 3985971, 4003828, 4021652, 4039554, 4057421, 4075268, 4093196, 4111000, 4128993, 4147054, 4164893, 4182943, 4201012, 4218959, 4236977, 4254862, 4272845, 4290882, 4308716, 4326641, 4344631, 4362496, 4380530, 4398424, 4416329, 4434292, 4452184, 4470078, 4487992, 4505864, 4523657, 4541664, 4559545, 4577465]
[2024-03-19 08:44:21,008] trace_link.py:294 [INFO]: Execution trace has 256 > 1 iterations.
[2024-03-19 08:44:21,008] execution_trace.py:682 [INFO]: Copying nodes for iter 2 for ids in the range [24992, 43225)
Traceback (most recent call last):
  File "/Users/sim/astra-sim/extern/graph_frontend/param/train/compute/python/build/lib/param_bench/train/compute/python/tools/trace_link.py", line 895, in <module>
    main()  # pragma: no cover
  File "/Users/sim/astra-sim/extern/graph_frontend/param/train/compute/python/build/lib/param_bench/train/compute/python/tools/trace_link.py", line 868, in main
    ) = trace_analysis(args.et_file, args.kineto_file, args.annotation)
  File "/Users/sim/astra-sim/extern/graph_frontend/param/train/compute/python/build/lib/param_bench/train/compute/python/tools/trace_link.py", line 297, in trace_analysis
    et_ = et.clone_one_iteration(trim_iter)
  File "/opt/homebrew/anaconda3/envs/astra-sim-new/lib/python3.8/site-packages/parambench_train_compute-1.0.0+git.1710448259-py3.8.egg/param_bench/train/compute/python/tools/execution_trace.py", line 704, in clone_one_iteration
    assert len(thread_nodes) > 0
AssertionError

Could someone please tell me what is going on?