Comments (14)
@brian-kelley I just hopped on and checked modules on MI210 and rocm/5.6.0 was available:
[ndellin@caraway ~]$ salloc -N 1 -p MI210
salloc: Granted job allocation 1009777
[ndellin@lean1 ~]$ module spider rocm
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rocm:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Versions:
rocm/5.2.0
rocm/5.2.3
rocm/5.3.3
rocm/5.4.3
rocm/5.5.1
rocm/5.6.0
rocm/5.6.1
Other possible modules matches:
...
[ndellin@lean1 ~]$ module load rocm/5.6.0
[ndellin@lean1 ~]$ module list
Currently Loaded Modules:
1) rocm/5.6.0
I also manually launched a cm_test_all_sandia build with rocm/5.6.0, and the build is proceeding without issue:
[ndellin@lean1 Caraway-rocm560-MI210]$ ../../scripts/cm_test_all_sandia rocm/5.6.0 --with-hip
Running on machine: vega90a_caraway
KokkosKernels Repository Status: 8f2945d0c99791345053fc839b1ea453354e03f9 Kokkos Kernels: update version guards to drop old version of Kokkos (#2133)
Kokkos Repository Status: d78a7d4383786359ee8692af5b30aac973fca0da Added in the explicit deduction guides for RangePolicy: • Correctness when passing in an execution space • Workaround for nvcc as RangePolicy<...> doesn't have any template parameters that can be deduced, so gcc/clang assume that a matching ctor in the primary template deduces to RangePolicy<> while nvcc assumes it is a bug.
Going to test compilers: rocm/5.6.0
Testing compiler rocm/5.6.0
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Unrecognized compiler rocm/5.6.0 when looking for Spack variants
Starting job rocm-5.6.0-Hip_Serial-release
Hip IS THE KOKKOS DEVICE
kokkos devices: Hip,Serial
kokkos arch: VEGA90A
kokkos options:
kokkos cuda options:
kokkos cxxflags: -O3
extra_args:
kokkoskernels scalars: 'double,complex_double'
kokkoskernels ordinals: int
kokkoskernels offsets: int,size_t
kokkoskernels layouts: LayoutLeft
kokkoskernels tpls list:
...
Maybe there was an update in progress that temporarily disrupted the modules? Let's keep an eye on whether this occurs again; a change may be coming soon once rocm/6.0 is available.
from kokkos-kernels.
I just checked the MI250 queue and it looks like rocm/5.6.0 is not available there:
[ndellin@caraway Caraway-rocm560-MI250]$ salloc -N 1 -p MI250
salloc: Granted job allocation 1009778
[ndellin@fat2 Caraway-rocm560-MI250]$ module spider rocm
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rocm:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Versions:
rocm/5.2.0
rocm/5.6.1
rocm/6.0.0
from kokkos-kernels.
I relaunched one of the Jenkins PR jobs running on MI210, and it looks like it is proceeding without issue with rocm/5.6.0. Still, we'll need to test with rocm/5.6.1 and then update the jobs to use it, to hopefully avoid any bumps if the modules on MI210 are permanently changed to match those on MI250.
from kokkos-kernels.
Hm, looks like there is some issue with the rocm/5.6.1 module on MI250. Configure fails when just trying to build Kokkos:
-- Check for working CXX compiler: /usr/bin/hipcc
-- Check for working CXX compiler: /usr/bin/hipcc - broken
CMake Error at /projects/x86-64-zen-rocky8/utilities/cmake/3.27.4/gcc/8.5.0/base/4wmpm4r/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
The C++ compiler
"/usr/bin/hipcc"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
Run Build Command(s): /projects/x86-64-zen-rocky8/utilities/cmake/3.27.4/gcc/8.5.0/base/4wmpm4r/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_31082/fast
/usr/bin/gmake -f CMakeFiles/cmTC_31082.dir/build.make CMakeFiles/cmTC_31082.dir/build
gmake[1]: Entering directory '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
Building CXX object CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o
/usr/bin/hipcc -o CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o -c /home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP/testCXXCompiler.cxx
sh: /opt/rocm-5.6.1/llvm/bin/clang: No such file or directory
Can't exec "/opt/rocm-5.6.1/bin/rocm_agent_enumerator": No such file or directory at /usr/bin//hipcc.pl line 488.
Use of uninitialized value $targetsStr in substitution (s///) at /usr/bin//hipcc.pl line 489.
Use of uninitialized value $targetsStr in split at /usr/bin//hipcc.pl line 495.
sh: /opt/rocm-5.6.1/llvm/bin/clang: No such file or directory
gmake[1]: *** [CMakeFiles/cmTC_31082.dir/build.make:78: CMakeFiles/cmTC_31082.dir/testCXXCompiler.cxx.o] Error 127
gmake[1]: Leaving directory '/home/ndellin/kokkos/testing/Caraway-MI250/CMakeFiles/CMakeScratch/TryCompile-O4sGgP'
gmake: *** [Makefile:127: cmTC_31082/fast] Error 2
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
CMakeLists.txt:121 (PROJECT)
from kokkos-kernels.
Kokkos configures fine with rocm/5.2.0 and rocm/6.0.0 on MI250. I'll open an issue with the sysadmins regarding the rocm/5.6.1 problems.
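For anyone wanting to triage a module like this before running a full configure, here is a minimal sketch of a pre-check. It assumes the only things worth checking are the two executables named in the hipcc errors above (`llvm/bin/clang` and `bin/rocm_agent_enumerator`). The function name `check_rocm_install` is hypothetical and not part of ROCm or the Kokkos scripts.

```shell
# Hypothetical helper: check_rocm_install is an illustrative name only.
# Checks that the two executables hipcc failed to find in the log above
# exist (and are executable) under the given ROCm root directory.
check_rocm_install() {
  local root="$1"
  local status=0
  local tool
  for tool in llvm/bin/clang bin/rocm_agent_enumerator; do
    if [ ! -x "$root/$tool" ]; then
      echo "missing: $root/$tool"
      status=1
    fi
  done
  return "$status"
}

# Example, using the path from the error log:
#   check_rocm_install /opt/rocm-5.6.1 || echo "rocm/5.6.1 install looks incomplete"
```

A non-zero return here would only say that the install tree is incomplete. It would not explain why, so the issue with the admins is still needed.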
from kokkos-kernels.
So my MI210 test was on lean1, where I was able to load rocm/5.6.0, but a nightly just failed on lean2 because it could not find rocm/5.6.0:
22:12:22 Hostname:
22:12:22 lean2
22:12:24 Lmod has detected the following error: The following module(s) are unknown:
22:12:24 "rocm/5.6.0"
I'll follow up with the sysadmins tomorrow.
from kokkos-kernels.
OK, I see: the modules are just different on different nodes of the MI210 queue. Hopefully the admins make them consistent soon. I know they were still testing 6.0.0 on just one of the nodes before applying it to the others.
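A rough sketch of how the per-node difference could be spotted automatically, assuming the `module spider rocm` output from each node has been saved to a text file. The helper names `rocm_versions` and `missing_versions` are hypothetical, and the `srun` invocations in the comments are an untested assumption about how the listings might be captured.

```shell
# Hypothetical helpers; the names are illustrative only.

# Extract sorted, unique rocm/<version> entries from `module spider rocm`
# output supplied on stdin.
rocm_versions() {
  grep -oE 'rocm/[0-9]+(\.[0-9]+)+' | sort -u
}

# Print versions that appear in the first listing file but not in the second.
missing_versions() {
  local other
  other=$(mktemp)
  rocm_versions < "$2" > "$other"
  rocm_versions < "$1" | grep -vxF -f "$other"
  rm -f "$other"
}

# Assumed capture step (Lmod usually writes its listing to stderr, hence 2>&1):
#   srun -p MI210 -w lean1 bash -lc 'module spider rocm' > lean1.txt 2>&1
#   srun -p MI210 -w lean2 bash -lc 'module spider rocm' > lean2.txt 2>&1
#   missing_versions lean1.txt lean2.txt
```

Any output from `missing_versions` lists modules a job could load on the first node but not on the second, which is the situation the nightly ran into.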
from kokkos-kernels.
Yeah, I opened an issue. Hopefully it can get sorted out quickly. There are problems with the rocm/5.6.1 install, so for the time being, switching to that ROCm version unfortunately isn't a viable option.
from kokkos-kernels.
Can we restrict the Jenkins job to run on lean1 for now?
from kokkos-kernels.
Yeah, I believe we can request a specific node list with salloc when launching the job in the Jenkins script.
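As a sketch of what that might look like: Slurm's `salloc` accepts `-w` / `--nodelist` to request specific hosts. The wrapper name `pin_node_cmd` is hypothetical, and how the Jenkins script actually launches its allocation is an assumption here. The function only prints the command, so it can be reviewed before being wired into a job.

```shell
# Hypothetical wrapper; pin_node_cmd is an illustrative name and is not
# taken from the actual Jenkins scripts.
# Prints the salloc command for a one-node allocation pinned to one host.
pin_node_cmd() {
  local partition="$1"
  local node="$2"
  echo "salloc -N 1 -p $partition --nodelist=$node"
}

# Example: the allocation from earlier in this thread, pinned to lean1.
#   pin_node_cmd MI210 lean1
```

Pinning to one node is only a stopgap, since jobs would queue behind each other on lean1 and break again if that node is re-imaged.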
from kokkos-kernels.
@brian-kelley They're rebooting lean1, which will update it to the recent image the other nodes are using. That leaves rocm/5.6.1 as the closest replacement for 5.6.0, but that module is problematic (hipcc fails during the CMake compiler check).
from kokkos-kernels.
@lucbv @brian-kelley Lots of progress with the updated rocm modules; it sounds like one image update on the nodes may have us in a good state. I'll put in a PR with cm_test_all_sandia updates and modify the PR jobs to use rocm/5.6.1 once I confirm tests are passing.
from kokkos-kernels.
@lucbv @brian-kelley I updated the Caraway CI jobs to test with rocm/5.6.1, and testing of #2142 confirmed it all worked. I merged the cm_test_all_sandia updates, so CI should be good to go again (though PRs may need to rebase on top of develop to ensure the cm_test_all_sandia updates are present).
from kokkos-kernels.
Nightly and Jenkins CI are running properly again using rocm/5.6.1. Closing.
from kokkos-kernels.