<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Can be considered fixed by <a class="issue-link js-issue-link" data-error-text="Failed

Time wise: Use my config with cmake v3.15.4 (no CMAKE_CUDA_ARC

Prevent kernel launch failures on CMake 3.15,about celeritas-project/celeritas

Comments (13)

pcanal commented on May 27, 2024 1

Can be considered fixed by #124

from celeritas.

sethrj commented on May 27, 2024

@pcanal Even your suggestion won't be guaranteed to work... because (for example) with cmake you can theoretically set different CUDA_ARCHITECTURES property targets which will lead to different architectures. I think the most robust forward-looking solution would be this:

Allow older versions of CMake but error if CMAKE_CUDA_ARCHITECTURES is defined. It's on the user to get the right CUDA flags, as it always was.
In the installed VecGeomConfig.cmake file, define VECGEOM_CUDA_ARCHITECTURES with the same syntax as CMAKE_CUDA_ARCHITECTURES, and export the value for the latter. If the user manually inputs CUDA flags for -arch= and such, you could either try to translate those flags (not easy, or as you put it, "somehow") or ignore them
If both VECGEOM_CUDA_ARCHITECTURES and CMAKE_CUDA_ARCHITECTURES are set, assert that they're identical.

I recommend we also add to the "readme" or somewhere that this situation is one possible interpretation of the "kernel launch failure" message you saw.

@drbenmorgan As the VecGeom CMake guru, do you have an opinion? It's going to be impossible to test all the edge cases (unless we add a configure-time test for linking a test kernel against vecgeom and running, but even that won't be foolproof and might not work on heterogeneous systems where login and compute nodes have different GPU hardware).

from celeritas.

drbenmorgan commented on May 27, 2024

@sethrj, I'd have to look at the detail, but I think your three points make sense. It feels very much like the C++ standard compile features target property, in that a consuming target should be able to validate/set the CUDA arch based on that of the consumed target.

Would VECGEOM_CUDA_ARCHITECTURES of the consumed vecgeom and CMAKE_CUDA_ARCHITECTURES for the client lib/app have to be exactly identical? Or is it just required that they overlap in the case that multiple arches are configured?

from celeritas.

sethrj commented on May 27, 2024

They wouldn't have to be identical -- I think there's substantial overlap allowed between different sm/compute flags. It really shouldn't be our responsibility to make sure the compiler flags are consistent and valid. I think the best we could do is make sure that the CUDA flags are associated with the VecGeom targets we link against, so that if some invalid flags are specified they cause nvcc to fail. It makes sense as a "courtesy" (to prevent issues like what @pcanal encountered) that we check for CMAKE_CUDA_ARCHITECTURES being unset for CMake version < 3.18, but I think doing any more than that will prevent legitimate configurations and/or be a rabbit hole.

from celeritas.

pcanal commented on May 27, 2024

that we check for CMAKE_CUDA_ARCHITECTURES being unset for CMake version < 3.18,

Note (I am not sure it was clear), that in the failing case I had

cmake 3.15
NO CMAKE_CUDA_ARCHITECTURES
local make file "properly/as-expected" populated with sm_70 (did not check if it was really everywhere it needed to be though).

So the check won't help avoid the bad build I had.

from celeritas.

sethrj commented on May 27, 2024

@pcanal I'm confused because your earlier comment said:

I was using v3.15.4 which does not support yet the flag CMAKE_CUDA_ARCHITECTURES (Noticed that to the warning from cmake "this variable was set but not used").

I don't think the failing configurations (and error messages) were ever completely posted. Can you collate those here for posterity and so I can understand the failure modes?

from celeritas.

pcanal commented on May 27, 2024

Time wise:

Use my config with cmake v3.15.4 (no CMAKE_CUDA_ARCHITECTURES) -> get message about invalid kernel at run-time:
Noticed that the CI worked properly
Copy/paste configuration from the CI (includes CMAKE_CUDA_ARCHITECTURES) and get the warning message: CMAKE_CUDA_ARCHITECTURES set but not used -> but the same result at run-time.
Deduced :) that the CI uses a newer version of cmake.
Try cmake v3.19 without CMAKE_CUDA_ARCHITECTURES -> build fails due to a missing symbol (then noticed local makefiles now contains sm_52)
Try cmake v3.19 with CMAKE_CUDA_ARCHITECTURES -> Success.

from celeritas.

sethrj commented on May 27, 2024

Aha, ok that makes more sense. So in the first case you ran, did you have CMAKE_CUDA_FLAGS set, and was it consistent with the VecGeom installation?

from celeritas.

pcanal commented on May 27, 2024

All the --arch where sm_70, i.e. consistent with VecGeom (the one present where correct, there might (or not) have been some completely missing -- technically I don't know exactly "why" cmake v3.19 works and v3.15 does not).

from celeritas.

sethrj commented on May 27, 2024

Ok, that's super weird then. For what it's worth, on emmet, I rebuilt with cmake 3.17.4, using the manual -arch=sm_35, and it worked. I'm going to try on wildstyle with CMake 3.15 and sm_70...

from celeritas.

pcanal commented on May 27, 2024

and it worked.

Meaning that ./app/demo-rasterizer input.good.json with the kernel success check run correctly, right? (that executable/kernel is the only one that fails as far as I know).

from celeritas.

sethrj commented on May 27, 2024

Meaning that ./app/demo-rasterizer input.good.json with the kernel success check run correctly, right? (that executable/kernel is the only one that fails as far as I know).

Correct on emmet -- however with CMake 3.15.7 and sm_70 on wildstyle it failed, replicating the result you saw. It passes for CMake 3.18. I'm trying now on emmet to see if it's the cmake version, and what on earth the difference in the build could be...

from celeritas.

sethrj commented on May 27, 2024

Well, it's definitely CMake 3.15 vs 3.17 that's causing it -- the former fails on emmet with sm_35, the latter doesn't. One difference I can see is that -lcudadevrt -lcudart_static is added to the newer device link flags but not the failing one; though I don't know how that could possibly cause the error we're seeing.

Regardless, this is definitely enough to say "let's prevent cmake < 3.17 from configuring with vecgeom+cuda".

from celeritas.

Prevent kernel launch failures on CMake 3.15 about celeritas HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent