Comments (13)
It seems to work now. After I restricted NCCL to the bond0 interface with NCCL_SOCKET_IFNAME=bond0, the job started up.
from libai.
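It seems to work now. After I restricted NCCL to the bond0 interface with NCCL_SOCKET_IFNAME=bond0, the job started up.
How was this problem located, and what is the principle behind the fix?
See this issue.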
from libai.
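On the question of why the fix works: NCCL_SOCKET_IFNAME is the NCCL environment variable that restricts which IP interfaces NCCL uses for socket communication. The following is a minimal sketch, assuming the variable has to be in the process environment before the NCCL communicator is created and that bond0 is the interface that routes between the nodes, as reported in this thread. The helper name `restrict_nccl_to_interface` and its parameters are hypothetical and not part of libai or OneFlow.

```python
import os
import socket


def restrict_nccl_to_interface(ifname, available=None, env=None):
    """Hypothetical helper: pin NCCL socket traffic to a single interface.

    `available` defaults to the interfaces found on this host and `env`
    defaults to os.environ; both can be passed explicitly for testing.
    """
    if available is None:
        # socket.if_nameindex() returns (index, name) pairs for local interfaces.
        available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        raise ValueError(f"{ifname!r} not found; available: {sorted(available)}")
    env = os.environ if env is None else env
    env["NCCL_SOCKET_IFNAME"] = ifname
    # The OneFlow error message in this thread suggests NCCL_DEBUG=INFO for detail.
    env["NCCL_DEBUG"] = "INFO"
    return env
```

Calling `restrict_nccl_to_interface("bond0")` at the top of the launch script, before any distributed initialization, would then have the same effect as exporting the variable in the shell on each node.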
The two machines can SSH into each other without a password, although the shell type changes after SSH.
from libai.
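The two machines can SSH into each other without a password, although the shell type changes after SSH.
First try specifying the same number of GPUs on each machine, then check whether the two machines can ping each other.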
from libai.
First check whether the two machines can ping each other, then check whether either machine has a proxy enabled. If one does, turn it off first:
unset http_proxy
unset https_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
It is also best for both machines to use the same number of GPUs, so use two GPUs on Node0 as well.
from libai.
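The `unset` commands above only affect the shell they are typed into. If the training processes are started from a Python launcher instead, the same cleanup could be sketched as below. This is an illustration only: the function name `clear_proxy_vars` is hypothetical, and it covers just the four variables named in this thread.

```python
import os

# The four proxy variables unset in the comment above.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def clear_proxy_vars(env=None):
    """Remove proxy variables from `env` and return the names that were set."""
    env = os.environ if env is None else env
    removed = [name for name in PROXY_VARS if name in env]
    for name in removed:
        del env[name]
    return removed
```

The returned list makes it easy to log which proxy settings were active on each node before the job started.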
This is also worth a look: https://github.com/Oneflow-Inc/libai/blob/dev_optimize_MT5/projects/MT5/readme_cn.md
from libai.
F20221031 10:52:42.193126 106986 eager_nccl_comm_manager.cpp:75] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
*** Check failure stack trace: ***
@ 0x7fded300c00a google::LogMessage::Fail()
@ 0x7fded300c2f2 google::LogMessage::SendToLog()
@ 0x7fded300bb77 google::LogMessage::Flush()
@ 0x7fded300e6e9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdecb89977d oneflow::(anonymous namespace)::CreateNcclComm()
@ 0x7fdecb89b401 oneflow::EagerNcclCommMgr::GetCommForDevice()
@ 0x7fdeccd3b220 oneflow::(anonymous namespace)::EagerNcclOpKernelCache::Init()
@ 0x7fdeccd2df7f oneflow::(anonymous namespace)::InitEagerNcclOpKernelCache()
@ 0x7fdecdc05fa0 oneflow::one::StatefulOpKernel::TryInitOpKernelStateAndCache()
@ 0x7fdec93acde5 oneflow::vm::OpCallInstructionType::Compute()
@ 0x7fdecc7f4c31 oneflow::vm::EventRecordedEpStreamType::Run()
@ 0x7fdecc7f8ab3 oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fdecc7faaf0 oneflow::(anonymous namespace)::WorkerLoop()
from libai.
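I have verified that the machines can ping each other, and the proxies are all turned off. After setting the same number of GPUs on each node, the error changed. It looks like NCCL initialization is failing?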
from libai.
Log with NCCL debug enabled:
m5-autorl-test01:47324:48509 [0] NCCL INFO Bootstrap : Using bond0:172.27.231.79<0>
m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
m5-autorl-test01:47324:48509 [0] NCCL INFO Failed to open libibverbs.so[.1]
m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Socket : Using [0]bond0:172.27.231.79<0> [1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>
m5-autorl-test01:47324:48509 [0] NCCL INFO Using network Socket
NCCL version 2.12.10+cuda11.2
m5-autorl-test01:47325:48508 [1] NCCL INFO Bootstrap : Using bond0:172.27.231.79<0>
m5-autorl-test01:47325:48508 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
m5-autorl-test01:47325:48508 [1] NCCL INFO Failed to open libibverbs.so[.1]
m5-autorl-test01:47325:48508 [1] NCCL INFO NET/Socket : Using [0]bond0:172.27.231.79<0> [1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>
m5-autorl-test01:47325:48508 [1] NCCL INFO Using network Socket
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47325:48508 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/02 : 0 1 2 3
m5-autorl-test01:47325:48508 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/02 : 0 1 2 3
m5-autorl-test01:47324:48509 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 3[38000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO Channel 00/0 : 1[3e000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 3[38000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[3e000] via direct shared memory
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[3e000] via direct shared memory
m5-autorl-test01:47325:48508 [1] NCCL INFO Channel 01/0 : 1[3e000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Connected all rings
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 2[37000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 2[37000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 0[3d000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 0[3d000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47325:48698 [1] NCCL INFO include/net.h:25 -> 2
m5-autorl-test01:47325:48698 [1] NCCL INFO transport/net.cc:515 -> 2
m5-autorl-test01:47325:48698 [1] NCCL INFO proxy.cc:914 -> 2
m5-autorl-test01:47325:48698 [1] proxy.cc:1042 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2
m5-autorl-test01:47325:48508 [1] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer m5-autorl-test01<54529>
m5-autorl-test01:47325:48508 [1] NCCL INFO misc/socket.cc:531 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO misc/socket.cc:543 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO proxy.cc:805 -> 2
m5-autorl-test01:47325:48508 [1] proxy.cc:808 NCCL WARN Proxy Call to rank 1 failed (Connect)
m5-autorl-test01:47325:48508 [1] NCCL INFO transport/net.cc:269 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO transport.cc:127 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:730 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:915 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:951 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:964 -> 2
from libai.
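The `NET/Socket : Using` lines in the log above show NCCL selecting two interfaces: bond0 and veth8739016, a virtual Ethernet device of the kind container runtimes typically create. That may be why restricting NCCL to bond0 resolved the connection failure, although the thread does not confirm the root cause. A small sketch for pulling the selected interfaces out of such a line; the function name and the regular expression are assumptions based only on the log format shown here:

```python
import re

# Matches entries such as "[0]bond0:172.27.231.79<0>" in a NET/Socket line.
_IFACE_RE = re.compile(r"\[(\d+)\](\S+?):(\S+?)<\d+>")


def socket_interfaces(log_line):
    """Return (index, interface, address) tuples from a 'NET/Socket : Using' line."""
    marker = "NET/Socket : Using"
    if marker not in log_line:
        return []
    _, _, tail = log_line.partition(marker)
    return [(int(i), name, addr) for i, name, addr in _IFACE_RE.findall(tail)]
```

Running this over the NCCL_DEBUG=INFO output of every node gives a quick way to spot interfaces that are not reachable from the other machines.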
OK. If the problem is solved, you can close this issue.
from libai.
One more question: is it possible to support different numbers of GPUs on different nodes?
from libai.
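It seems to work now. After I restricted NCCL to the bond0 interface with NCCL_SOCKET_IFNAME=bond0, the job started up.
How was this problem located, and what is the principle behind the fix?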
from libai.
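One more question: is it possible to support different numbers of GPUs on different nodes?
Not at the moment. Multi-node launch currently assumes that the number of GPUs per node is the same on every machine. Asymmetric setups could be supported in the future, and we could even consider heterogeneous clusters (different GPU models on different machines).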
from libai.
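To illustrate why a uniform GPU count per node simplifies multi-node launch, here is a sketch of the usual rank arithmetic under that assumption. It is not taken from the libai or OneFlow launcher; the function names are hypothetical.

```python
def world_size(num_nodes, gpus_per_node):
    """Total number of ranks when every node has the same GPU count."""
    return num_nodes * gpus_per_node


def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank of a process, assuming a uniform GPU count per node."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank out of range")
    return node_rank * gpus_per_node + local_rank
```

With two nodes and two GPUs each, this yields ranks 0 to 3, consistent with the `Channel 00/02 : 0 1 2 3` line in the NCCL log earlier in the thread. With unequal GPU counts, a single multiplication no longer works and the launcher would need a per-node offset table instead.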