
mpirun nccl-test hang (ompi issue, CLOSED)

913871734 commented on August 20, 2024
mpirun nccl-test hang

from ompi.

Comments (7)

913871734 commented on August 20, 2024

I have also submitted the same issue to the NCCL group:
NVIDIA/nccl-tests#26


wenduwan commented on August 20, 2024

Just realized that the original issue happened on AWS. I will take a look.


wenduwan commented on August 20, 2024

@913871734 Sorry for the delayed response. To provide more information, could you kindly modify the issue following Open MPI's template? https://github.com/open-mpi/ompi/issues/new?assignees=&labels=&projects=&template=bug_report.md&title=


bosilca commented on August 20, 2024

Is this output complete? According to it, only two processes (out of 8) are opening their BTLs. So either it is incomplete, or the other processes are blocked before that, or they are using a different PML. Add '--mca pml ob1 --mca pml_base_verbose 50' to the command line to see what you get.


913871734 commented on August 20, 2024

> Is this output complete? According to it, only two processes (out of 8) are opening their BTLs. So either it is incomplete, or the other processes are blocked before that, or they are using a different PML. Add '--mca pml ob1 --mca pml_base_verbose 50' to the command line to see what you get.

I followed your advice and got the following output:

root@ly-node-ip1-ip2-ip3-131:~# mpirun --mca pml ob1 --mca pml_base_verbose 50 --allow-run-as-root --hostfile hostfile -n 2 -N 1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 128M -f 2 -g 1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: registering framework pml components
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: found loaded component ob1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: component ob1 register function successful
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: opening pml components
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: found loaded component ob1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: component ob1 open function successful
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: registering framework pml components
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: found loaded component ob1
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: component ob1 register function successful
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: opening pml components
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: found loaded component ob1
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: component ob1 open function successful
[ly-node-ip1-ip2-ip3-132:3046324] select: initializing pml component ob1
[ly-node-ip1-ip2-ip3-132:3046324] select: init returned priority 20
[ly-node-ip1-ip2-ip3-132:3046324] selected ob1 best priority 20
[ly-node-ip1-ip2-ip3-132:3046324] select: component ob1 selected
[ly-node-ip1-ip2-ip3-131:3679492] select: initializing pml component ob1
[ly-node-ip1-ip2-ip3-131:3679492] select: init returned priority 20
[ly-node-ip1-ip2-ip3-131:3679492] selected ob1 best priority 20
[ly-node-ip1-ip2-ip3-131:3679492] select: component ob1 selected
[ly-node-ip1-ip2-ip3-132:3046324] check:select: checking my pml ob1 against process [[4120,1],0] pml ob1
[ly-node-ip1-ip2-ip3-131:3679492] check:select: PML check not necessary on self
[ly-node-ip1-ip2-ip3-132:3046324] *** An error occurred in MPI_Allgather
[ly-node-ip1-ip2-ip3-132:3046324] *** reported by process [270008321,1]
[ly-node-ip1-ip2-ip3-132:3046324] *** on communicator MPI_COMM_WORLD
[ly-node-ip1-ip2-ip3-132:3046324] *** MPI_ERR_INTERN: internal error
[ly-node-ip1-ip2-ip3-132:3046324] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ly-node-ip1-ip2-ip3-132:3046324] ***    and potentially your MPI job)
[ly-node-ip1-ip2-ip3-131:3679486] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 0

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[ly-node-ip1-ip2-ip3-131:3679492] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[ly-node-ip1-ip2-ip3-131:3679486] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[ly-node-ip1-ip2-ip3-131:3679486] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
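With more processes, verbose output like this gets hard to read by eye. As a rough sketch (not part of the original report), the output can be saved to a file and summarized with standard tools to answer the earlier question of how many processes actually got as far as selecting a PML. The helper names `pml_selected` and `error_lines` are made up for illustration, and the patterns are only guesses based on the lines shown above.

```shell
# Hypothetical helpers for summarizing a saved Open MPI verbose log.

# Print the unique "host:pid" tags of processes that logged
# "select: component <name> selected" (i.e. finished PML selection).
pml_selected() {
  grep -E 'select: component [A-Za-z0-9_]+ selected' "$1" \
    | sed -E 's/^\[([^]]+)\].*/\1/' \
    | sort -u
}

# Print lines that look like errors: MPI "***" banners, "ERROR",
# "Error:" or "failed". The pattern list is an assumption, not exhaustive.
error_lines() {
  grep -E '\*\*\*|ERROR|Error:|failed' "$1"
}

# Demo on a few lines copied from the output above.
log=$(mktemp)
cat > "$log" <<'EOF'
[ly-node-ip1-ip2-ip3-132:3046324] select: component ob1 selected
[ly-node-ip1-ip2-ip3-131:3679492] select: component ob1 selected
[ly-node-ip1-ip2-ip3-132:3046324] *** An error occurred in MPI_Allgather
[ly-node-ip1-ip2-ip3-131:3679486] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
EOF

echo "Processes that selected a PML:"
pml_selected "$log"
echo "Error lines:"
error_lines "$log"
rm -f "$log"
```

To use it on a real run, redirect both stdout and stderr of mpirun to a file (for example with `> run.log 2>&1`) and pass that file to the two helpers.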


wenduwan commented on August 20, 2024

AWS provides instructions for running the NCCL tests here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html

"Step 9: Test your EFA and NCCL configuration" gives example commands that use the Open MPI mpirun CLI. For example:

/opt/amazon/openmpi/bin/mpirun \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
--hostfile my-hosts -n 8 -N 8 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

You might need to change NCCL/nccl-tests paths depending on your installation.

Could you give that a try?
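Since the paths differ between installations, one way to try the command safely is a small dry-run wrapper that prints the assembled mpirun line before anything is launched. This is only a sketch: the function name `nccl_allreduce` and the variables `OMPI_HOME`, `NCCL_LIB`, `NCCL_TESTS`, `HOSTFILE` and `RUN` are invented for illustration, and their defaults are simply the paths from the command above.

```shell
# Hypothetical wrapper around the mpirun command above.
# Override any of these variables to match your installation.
OMPI_HOME="${OMPI_HOME:-/opt/amazon/openmpi}"
NCCL_LIB="${NCCL_LIB:-/opt/nccl/build/lib}"
NCCL_TESTS="${NCCL_TESTS:-$HOME/nccl-tests/build}"
HOSTFILE="${HOSTFILE:-my-hosts}"

nccl_allreduce() {
  # Build the argument list as positional parameters so quoting is preserved.
  set -- "$OMPI_HOME/bin/mpirun" \
    -x FI_EFA_USE_DEVICE_RDMA=1 \
    -x "LD_LIBRARY_PATH=$NCCL_LIB:/usr/local/cuda/lib64:/opt/amazon/efa/lib:$OMPI_HOME/lib:/opt/aws-ofi-nccl/lib:${LD_LIBRARY_PATH:-}" \
    -x NCCL_DEBUG=INFO \
    --hostfile "$HOSTFILE" -n 8 -N 8 \
    --mca pml '^cm' --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    "$NCCL_TESTS/all_reduce_perf" -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

  if [ "${RUN:-0}" = "1" ]; then
    "$@"                  # RUN=1: actually launch the job
  else
    printf '%s\n' "$*"    # default: dry run, print the command for review
  fi
}

nccl_allreduce
```

By default this only prints the command. Setting `RUN=1` in the environment launches it, so the paths can be checked first.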


github-actions commented on August 20, 2024

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
