
Multi-node training error about libai (CLOSED)

frankxyy commented on June 2, 2024
Multi-node training error

from libai.

Comments (13)

frankxyy commented on June 2, 2024

It seems to work now: after I restricted NCCL to one interface with NCCL_SOCKET_IFNAME=bond0, training started up.
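The fix reported above comes down to setting one environment variable before the training processes start. A minimal sketch, assuming the job is launched from a Python wrapper; `build_nccl_env` is a hypothetical helper (not part of libai or OneFlow), and `bond0` is this reporter's interface name, which will differ on other machines.

```python
import os


def build_nccl_env(ifname, base_env=None, debug=False):
    """Return a copy of the environment with NCCL pinned to one interface.

    Hypothetical helper for illustration; not a libai or OneFlow API.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_SOCKET_IFNAME restricts which network interfaces NCCL may use.
    env["NCCL_SOCKET_IFNAME"] = ifname
    if debug:
        # NCCL_DEBUG=INFO is the setting the OneFlow error message suggests.
        env["NCCL_DEBUG"] = "INFO"
    return env


env = build_nccl_env("bond0", base_env={"PATH": "/usr/bin"}, debug=True)
print(env["NCCL_SOCKET_IFNAME"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when starting the training command on each node; the variable only has an effect if it is set before NCCL initializes.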

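frankxyy commented on June 2, 2024

"It seems to work now: after I restricted NCCL to one interface with NCCL_SOCKET_IFNAME=bond0, training started up."

"How was this problem located, and what is the principle behind the fix?"

NVIDIA/nccl#697

See this issue.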

frankxyy commented on June 2, 2024

The two machines can SSH into each other without a password, although the shell type changes after SSH.

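xiezipeng-ML commented on June 2, 2024

"The two machines can SSH into each other without a password, although the shell type changes after SSH."

First try specifying the same number of GPUs on each machine, then check whether the two machines can ping each other.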

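CPFLAME commented on June 2, 2024

First check whether the two machines can ping each other, then check whether either machine has a proxy enabled. If so, turn it off first:

unset http_proxy
unset https_proxy
unset HTTPS_PROXY
unset HTTP_PROXY

It is also best for both machines to use the same number of GPUs, so use two GPUs on Node0 as well.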
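If the training job is started from a Python launcher instead of an interactive shell, the same four `unset` commands can be mirrored on the environment passed to the child process. A minimal sketch; `without_proxies` is a hypothetical helper name and the proxy URL is a made-up placeholder.

```python
import os

# The same four variables the shell snippet above unsets.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def without_proxies(env=None):
    """Return a copy of env with the four proxy variables removed."""
    cleaned = dict(os.environ if env is None else env)
    for name in PROXY_VARS:
        # pop with a default so missing variables are not an error
        cleaned.pop(name, None)
    return cleaned


sample = {"http_proxy": "http://proxy.example:3128", "PATH": "/usr/bin"}
print(without_proxies(sample))  # only PATH remains
```

Working on a copy leaves the launcher's own environment untouched, which makes it easy to compare a run with and without the proxy settings.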

xiezipeng-ML commented on June 2, 2024

This is also worth a look: https://github.com/Oneflow-Inc/libai/blob/dev_optimize_MT5/projects/MT5/readme_cn.md

frankxyy commented on June 2, 2024

F20221031 10:52:42.193126 106986 eager_nccl_comm_manager.cpp:75] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
*** Check failure stack trace: ***
@ 0x7fded300c00a google::LogMessage::Fail()
@ 0x7fded300c2f2 google::LogMessage::SendToLog()
@ 0x7fded300bb77 google::LogMessage::Flush()
@ 0x7fded300e6e9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdecb89977d oneflow::(anonymous namespace)::CreateNcclComm()
@ 0x7fdecb89b401 oneflow::EagerNcclCommMgr::GetCommForDevice()
@ 0x7fdeccd3b220 oneflow::(anonymous namespace)::EagerNcclOpKernelCache::Init()
@ 0x7fdeccd2df7f oneflow::(anonymous namespace)::InitEagerNcclOpKernelCache()
@ 0x7fdecdc05fa0 oneflow::one::StatefulOpKernel::TryInitOpKernelStateAndCache()
@ 0x7fdec93acde5 oneflow::vm::OpCallInstructionType::Compute()
@ 0x7fdecc7f4c31 oneflow::vm::EventRecordedEpStreamType::Run()
@ 0x7fdecc7f8ab3 oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fdecc7faaf0 oneflow::(anonymous namespace)::WorkerLoop()


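frankxyy commented on June 2, 2024

I have confirmed that the machines can ping each other, and the proxies are all turned off. After making the number of GPUs used on each node the same, the error changed. It now looks like NCCL initialization is failing?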

frankxyy commented on June 2, 2024

Log with NCCL debug enabled:

m5-autorl-test01:47324:48509 [0] NCCL INFO Bootstrap : Using bond0:172.27.231.79<0>
m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
m5-autorl-test01:47324:48509 [0] NCCL INFO Failed to open libibverbs.so[.1]
m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Socket : Using [0]bond0:172.27.231.79<0> [1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>
m5-autorl-test01:47324:48509 [0] NCCL INFO Using network Socket
NCCL version 2.12.10+cuda11.2
m5-autorl-test01:47325:48508 [1] NCCL INFO Bootstrap : Using bond0:172.27.231.79<0>
m5-autorl-test01:47325:48508 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
m5-autorl-test01:47325:48508 [1] NCCL INFO Failed to open libibverbs.so[.1]
m5-autorl-test01:47325:48508 [1] NCCL INFO NET/Socket : Using [0]bond0:172.27.231.79<0> [1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>
m5-autorl-test01:47325:48508 [1] NCCL INFO Using network Socket
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47325:48508 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/02 : 0 1 2 3
m5-autorl-test01:47325:48508 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/02 : 0 1 2 3
m5-autorl-test01:47324:48509 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 3[38000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47325:48508 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47325:48508 [1] NCCL INFO Could not enable P2P between dev 1(=3e000) and dev 0(=3d000)
m5-autorl-test01:47325:48508 [1] NCCL INFO Channel 00/0 : 1[3e000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 3[38000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[3e000] via direct shared memory
m5-autorl-test01:47324:48509 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
m5-autorl-test01:47324:48509 [0] NCCL INFO Could not enable P2P between dev 0(=3d000) and dev 1(=3e000)
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[3e000] via direct shared memory
m5-autorl-test01:47325:48508 [1] NCCL INFO Channel 01/0 : 1[3e000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Connected all rings
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 2[37000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 2[37000] -> 0[3d000] [receive] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 00/0 : 0[3d000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47324:48509 [0] NCCL INFO Channel 01/0 : 0[3d000] -> 2[37000] [send] via NET/Socket/0
m5-autorl-test01:47325:48698 [1] NCCL INFO include/net.h:25 -> 2
m5-autorl-test01:47325:48698 [1] NCCL INFO transport/net.cc:515 -> 2
m5-autorl-test01:47325:48698 [1] NCCL INFO proxy.cc:914 -> 2

m5-autorl-test01:47325:48698 [1] proxy.cc:1042 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2

m5-autorl-test01:47325:48508 [1] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer m5-autorl-test01<54529>
m5-autorl-test01:47325:48508 [1] NCCL INFO misc/socket.cc:531 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO misc/socket.cc:543 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO proxy.cc:805 -> 2

m5-autorl-test01:47325:48508 [1] proxy.cc:808 NCCL WARN Proxy Call to rank 1 failed (Connect)
m5-autorl-test01:47325:48508 [1] NCCL INFO transport/net.cc:269 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO transport.cc:127 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:730 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:915 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:951 -> 2
m5-autorl-test01:47325:48508 [1] NCCL INFO init.cc:964 -> 2
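The `NET/Socket : Using` lines in this log show NCCL's socket transport picking up two interfaces, `bond0` and a `veth…` virtual interface, and the reporter later got training running by restricting NCCL to `bond0`. The thread does not confirm that the `veth` interface was the cause, so the following is only a sketch of how such lines could be scanned to see which interfaces NCCL selected; `socket_interfaces` is a hypothetical helper and the regular expression assumes the line format shown above.

```python
import re


def socket_interfaces(log_line):
    """Extract interface names from an NCCL 'NET/Socket : Using' log line."""
    marker = "NET/Socket : Using "
    _, found, tail = log_line.partition(marker)
    if not found:
        return []
    # Each entry looks like "[index]ifname:address<port>".
    return re.findall(r"\[\d+\]([^:\s]+):", tail)


line = (
    "m5-autorl-test01:47324:48509 [0] NCCL INFO NET/Socket : Using "
    "[0]bond0:172.27.231.79<0> "
    "[1]veth8739016:fe80::5804:edff:fe5b:6570%veth8739016<0>"
)
print(socket_interfaces(line))  # ['bond0', 'veth8739016']
```

Any interface in the result that the other node cannot reach is a candidate for exclusion via `NCCL_SOCKET_IFNAME`.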


CPFLAME commented on June 2, 2024

OK. If this is resolved, you can close the issue.

frankxyy commented on June 2, 2024

One more question: could you support the case where different nodes have different numbers of GPUs?

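strint commented on June 2, 2024

"It seems to work now: after I restricted NCCL to one interface with NCCL_SOCKET_IFNAME=bond0, training started up."

How was this problem located, and what is the principle behind the fix?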

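chengtbf commented on June 2, 2024

"One more question: could you support the case where different nodes have different numbers of GPUs?"

Not yet. Multi-node launch currently assumes that the number of GPUs per node is the same on every machine. Asymmetric setups could be supported in the future, and we could even consider supporting heterogeneous clusters (different GPU models on different machines).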
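Since the launcher assumes the same GPU count on every node, a pre-flight check can catch a mismatch before NCCL gets involved. A minimal sketch under that assumption; `world_size` and the host names are hypothetical and not part of libai.

```python
def world_size(gpus_per_node):
    """Return the total rank count, or raise if nodes disagree on GPU count.

    gpus_per_node maps a host name to the number of GPUs used on that host.
    """
    counts = set(gpus_per_node.values())
    if len(counts) != 1:
        raise ValueError(f"GPU counts differ across nodes: {gpus_per_node}")
    return len(gpus_per_node) * counts.pop()


print(world_size({"node0": 2, "node1": 2}))  # 4
```

Two nodes with two GPUs each give four ranks, which matches the `0 1 2 3` ring in the NCCL log earlier in the thread.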
