
Comments (14)

ndellingwood commented on June 20, 2024

@brian-kelley I just hopped on and checked modules on MI210 and rocm/5.6.0 was available:

[ndellin@caraway ~]$ salloc -N 1 -p MI210
salloc: Granted job allocation 1009777
[ndellin@lean1 ~]$ module spider rocm

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  rocm:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        rocm/5.2.0
        rocm/5.2.3
        rocm/5.3.3
        rocm/5.4.3
        rocm/5.5.1
        rocm/5.6.0
        rocm/5.6.1
     Other possible modules matches:
...

[ndellin@lean1 ~]$ module load rocm/5.6.0
[ndellin@lean1 ~]$ module list

Currently Loaded Modules:
  1) rocm/5.6.0

I also manually launched a cm_test_all_sandia build with rocm/5.6.0, and the build is proceeding without issue:

[ndellin@lean1 Caraway-rocm560-MI210]$ ../../scripts/cm_test_all_sandia rocm/5.6.0 --with-hip
Running on machine: vega90a_caraway
KokkosKernels Repository Status:  8f2945d0c99791345053fc839b1ea453354e03f9 Kokkos Kernels: update version guards to drop old version of Kokkos (#2133)

Kokkos Repository Status:  d78a7d4383786359ee8692af5b30aac973fca0da Added in the explicit deduction guides for RangePolicy: • Correctness when passing in an execution space • Workaround for nvcc as RangePolicy<...> doesn't have any template parameters that can be deduced, so gcc/clang assume that a matching ctor in the primary template deduces to RangePolicy<> while nvcc assumes it is a bug.


Going to test compilers:  rocm/5.6.0
Testing compiler rocm/5.6.0
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
  Starting job rocm-5.6.0-Hip_Serial-release
Hip IS THE KOKKOS DEVICE
kokkos devices: Hip,Serial
kokkos arch: VEGA90A
kokkos options: 
kokkos cuda options: 
kokkos cxxflags: -O3  
extra_args: 
kokkoskernels scalars: 'double,complex_double'
kokkoskernels ordinals: int
kokkoskernels offsets: int,size_t
kokkoskernels layouts: LayoutLeft
kokkoskernels tpls list: 
...

Maybe there was an update in progress that temporarily disrupted the modules? Let's keep an eye on whether this occurs again; there may be a change coming soon once rocm/6.0 is available.

from kokkos-kernels.

ndellingwood commented on June 20, 2024

I just checked the MI250 queue and it looks like rocm/5.6.0 is not available there:

[ndellin@caraway Caraway-rocm560-MI250]$ salloc -N 1 -p MI250
salloc: Granted job allocation 1009778
[ndellin@fat2 Caraway-rocm560-MI250]$ module spider rocm

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  rocm:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        rocm/5.2.0
        rocm/5.6.1
        rocm/6.0.0
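
To see exactly which versions differ between two nodes or queues, one option is to save each version list to a file (one trimmed module name per line) and compare the files. A minimal sketch; the function name `missing_modules` and the file names in the usage note are hypothetical:

```shell
# missing_modules A B: print module names listed in file A but absent from file B.
# Both files are assumed to hold one module name per line, e.g. "rocm/5.6.0".
missing_modules() {
  # -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from file B
  grep -vxFf "$2" "$1"
}
```

For example, `missing_modules mi210.txt mi250.txt` would list the versions available on MI210 but not on MI250. With the two version lists pasted in this thread, that would be rocm/5.2.3, rocm/5.3.3, rocm/5.4.3, rocm/5.5.1, and rocm/5.6.0.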


ndellingwood commented on June 20, 2024

I relaunched one of the Jenkins PR jobs running on MI210, and it looks like it is proceeding without issue with rocm/5.6.0. We'll still need to test and then update the jobs to use rocm/5.6.1, to hopefully avoid any bumps if the modules on MI210 are permanently changed to match those on MI250.


ndellingwood commented on June 20, 2024

Hm, looks like there is some issue with the rocm/5.6.1 module on MI250: configure fails just trying to build Kokkos.

-- Check for working CXX compiler: /usr/bin/hipcc
-- Check for working CXX compiler: /usr/bin/hipcc - broken
CMake Error at /projects/x86-64-zen-rocky8/utilities/cmake/3.27.4/gcc/8.5.0/base/4wmpm4r/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
  The C++ compiler

    "/usr/bin/hipcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
    
    Run Build Command(s): /projects/x86-64-zen-rocky8/utilities/cmake/3.27.4/gcc/8.5.0/base/4wmpm4r/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_31082/fast
    /usr/bin/gmake  -f CMakeFiles/cmTC_31082.dir/build.make CMakeFiles/cmTC_31082.dir/build
    gmake[1]: Entering directory '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
    Building CXX object CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o
    /usr/bin/hipcc    -o CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o -c /home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP/testCXXCompiler.cxx
    sh: /opt/rocm-5.6.1/llvm/bin/clang: No such file or directory
    Can't exec "/opt/rocm-5.6.1/bin/rocm_agent_enumerator": No such file or directory at /usr/bin//hipcc.pl line 488.
    Use of uninitialized value $targetsStr in substitution (s///) at /usr/bin//hipcc.pl line 489.
    Use of uninitialized value $targetsStr in split at /usr/bin//hipcc.pl line 495.
    sh: /opt/rocm-5.6.1/llvm/bin/clang: No such file or directory
    gmake[1]: *** [CMakeFiles/cmTC_31082.dir/build.make:78: CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o] Error 127
    gmake[1]: Leaving directory '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
    gmake: *** [Makefile:127: cmTC_31082/fast] Error 2
    
    

  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:121 (PROJECT)
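
The log above shows `hipcc` failing because `/opt/rocm-5.6.1/llvm/bin/clang` and `/opt/rocm-5.6.1/bin/rocm_agent_enumerator` do not exist. A quick pre-configure check along these lines could surface that before CMake runs. This is only a sketch: the function name `check_rocm_install` is hypothetical, and the two relative paths are taken from the error messages above, not from any official list of required files.

```shell
# check_rocm_install ROOT: verify that the tools hipcc tried to run exist under ROOT.
# Prints one "missing: ..." line per absent file to stderr; returns non-zero if any is absent.
check_rocm_install() {
  root="$1"
  rc=0
  for f in llvm/bin/clang bin/rocm_agent_enumerator; do
    if [ ! -x "$root/$f" ]; then
      echo "missing: $root/$f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_rocm_install /opt/rocm-5.6.1` on the affected node would be expected to report both files as missing, given the errors in the log.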


ndellingwood commented on June 20, 2024

Kokkos configures fine with rocm/5.2.0 and rocm/6.0.0 on MI250. I'll open an issue with the sys admins regarding the rocm/5.6.1 problems.


ndellingwood commented on June 20, 2024

So my MI210 test was on lean1, where I was able to load rocm/5.6.0, but a nightly just failed on lean2 because it could not find rocm/5.6.0:

22:12:22 Hostname:
22:12:22 lean2
22:12:24 Lmod has detected the following error: The following module(s) are unknown:
22:12:24 "rocm/5.6.0"

I'll follow up with sys admins tomorrow


brian-kelley commented on June 20, 2024

OK, I see: the modules are just different on different nodes of the MI210 queue. Hopefully the admins make them consistent soon. I know they were still testing 6.0.0 on just one of the nodes before applying it to the others.
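
While the nodes disagree, a job script could fall back to whichever acceptable version is actually present on the node it lands on. A minimal sketch, assuming a listing with one full module name per line; the function name `pick_module` is hypothetical:

```shell
# pick_module PREFERRED...: read a module listing (one name per line) on stdin and
# print the first PREFERRED name that appears in it. Returns non-zero if none match.
pick_module() {
  listing=$(cat)
  for want in "$@"; do
    # -q: quiet, -x: whole-line match, -F: fixed string (no regex surprises from ".")
    if printf '%s\n' "$listing" | grep -qxF "$want"; then
      printf '%s\n' "$want"
      return 0
    fi
  done
  return 1
}
```

One possible use is `module -t avail rocm 2>&1 | pick_module rocm/5.6.0 rocm/5.6.1`. That assumes Lmod's terse listing prints one full module name per line and writes it to stderr, which is worth confirming on the target system before relying on it.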


ndellingwood commented on June 20, 2024

Yeah, I opened an issue. Hopefully it can get sorted out quickly. There are problems with the rocm/5.6.1 install, so for the time being, shifting to that rocm version unfortunately isn't a helpful option.


brian-kelley commented on June 20, 2024

Can we restrict the Jenkins job to run on lean1 for now?


lucbv commented on June 20, 2024

Yeah, I believe we can request a specific node list with salloc when launching the job in the Jenkins script?
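
As a sketch of what that could look like: Slurm's `salloc` accepts `--nodelist` (short form `-w`) to request specific nodes. The partition and node names below are taken from this thread, the variable names are hypothetical, and the snippet only prints the command so it can be inspected before being wired into the Jenkins script.

```shell
# Hypothetical values: adjust to the partition and node that still has the module.
partition=MI210
node=lean1

# --nodelist (-w) asks Slurm to allocate the named node(s) within the partition.
cmd="salloc -N 1 -p $partition --nodelist=$node"

# Print the command; run it directly (without echo) once it looks right.
echo "$cmd"
```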


ndellingwood commented on June 20, 2024

@brian-kelley they're rebooting lean1, which will update it to the recent image the other nodes are using. That leaves rocm/5.6.1 as the closest replacement for 5.6.0, but that module is problematic (hipcc fails during the cmake check).


ndellingwood commented on June 20, 2024

@lucbv @brian-kelley lots of progress with the updated rocm modules; it sounds like one image update on the nodes may have us in a good state. I'll put in a PR with cm_test_all_sandia updates and modify the PR jobs to use rocm/5.6.1 once I confirm tests are passing.


ndellingwood commented on June 20, 2024

@lucbv @brian-kelley I updated the Caraway CI jobs to test with rocm/5.6.1, and testing of #2142 confirmed it all worked. I merged the cm_test_all_sandia updates, so CI should be good to go again (though PRs may need to rebase on top of develop to ensure the cm_test_all_sandia updates are present).


ndellingwood commented on June 20, 2024

Nightly and Jenkins CI are running properly again using rocm/5.6.1. Closing.
