Giter Site home page Giter Site logo

Comments (13)

pcanal avatar pcanal commented on May 27, 2024 1

Can be considered fixed by #124

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

@pcanal Even your suggestion won't be guaranteed to work... because (for example) with cmake you can theoretically set different CUDA_ARCHITECTURES property targets which will lead to different architectures. I think the most robust forward-looking solution would be this:

  • Allow older versions of CMake but error if CMAKE_CUDA_ARCHITECTURES is defined. It's on the user to get the right CUDA flags, as it always was.
  • In the installed VecGeomConfig.cmake file, define VECGEOM_CUDA_ARCHITECTURES with the same syntax as CMAKE_CUDA_ARCHITECTURES, and export the value for the latter. If the user manually inputs CUDA flags for -arch= and such, you could either try to translate those flags (not easy, or as you put it, "somehow") or ignore them
  • If both VECGEOM_CUDA_ARCHITECTURES and CMAKE_CUDA_ARCHITECTURES are set, assert that they're identical.

I recommend we also add to the "readme" or somewhere that this situation is one possible interpretation of the "kernel launch failure" message you saw.

@drbenmorgan As the VecGeom CMake guru, do you have an opinion? It's going to be impossible to test all the edge cases (unless we add a configure-time test for linking a test kernel against vecgeom and running, but even that won't be foolproof and might not work on heterogeneous systems where login and compute nodes have different GPU hardware).

from celeritas.

drbenmorgan avatar drbenmorgan commented on May 27, 2024

@sethrj, I'd have to look at the detail, but I think your three points make sense. It feels very much like the C++ standard compile features target property, in that a consuming target should be able to validate/set the CUDA arch based on that of the consumed target.

Would VECGEOM_CUDA_ARCHITECTURES of the consumed vecgeom and CMAKE_CUDA_ARCHITECTURES for the client lib/app have to be exactly identical? Or is it just required that they overlap in the case that multiple arches are configured?

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

They wouldn't have to be identical -- I think there's substantial overlap allowed between different sm/compute flags. It really shouldn't be our responsibility to make sure the compiler flags are consistent and valid. I think the best we could do is make sure that the CUDA flags are associated with the VecGeom targets we link against, so that if some invalid flags are specified they cause nvcc to fail. It makes sense as a "courtesy" (to prevent issues like what @pcanal encountered) that we check for CMAKE_CUDA_ARCHITECTURES being unset for CMake version < 3.18, but I think doing any more than that will prevent legitimate configurations and/or be a rabbit hole.

from celeritas.

pcanal avatar pcanal commented on May 27, 2024

that we check for CMAKE_CUDA_ARCHITECTURES being unset for CMake version < 3.18,

Note (I am not sure it was clear), that in the failing case I had

  • cmake 3.15
  • NO CMAKE_CUDA_ARCHITECTURES
  • local make file "properly/as-expected" populated with sm_70 (did not check if it was really everywhere it needed to be though).

So the check won't help avoid the bad build I had.

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

@pcanal I'm confused because your earlier comment said:

I was using v3.15.4 which does not support yet the flag CMAKE_CUDA_ARCHITECTURES (Noticed that to the warning from cmake "this variable was set but not used").

I don't think the failing configurations (and error messages) were ever completely posted. Can you collate those here for posterity and so I can understand the failure modes?

from celeritas.

pcanal avatar pcanal commented on May 27, 2024

Time wise:

  • Use my config with cmake v3.15.4 (no CMAKE_CUDA_ARCHITECTURES) -> get message about invalid kernel at run-time:
  • Noticed that the CI worked properly
  • Copy/paste configuration from the CI (includes CMAKE_CUDA_ARCHITECTURES) and get the warning message: CMAKE_CUDA_ARCHITECTURES set but not used -> but the same result at run-time.
  • Deduced :) that the CI uses a newer version of cmake.
  • Try cmake v3.19 without CMAKE_CUDA_ARCHITECTURES -> build fails due to a missing symbol (then noticed local makefiles now contains sm_52)
  • Try cmake v3.19 with CMAKE_CUDA_ARCHITECTURES -> Success.

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

Aha, ok that makes more sense. So in the first case you ran, did you have CMAKE_CUDA_FLAGS set, and was it consistent with the VecGeom installation?

from celeritas.

pcanal avatar pcanal commented on May 27, 2024

All the --arch where sm_70, i.e. consistent with VecGeom (the one present where correct, there might (or not) have been some completely missing -- technically I don't know exactly "why" cmake v3.19 works and v3.15 does not).

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

Ok, that's super weird then. For what it's worth, on emmet, I rebuilt with cmake 3.17.4, using the manual -arch=sm_35, and it worked. I'm going to try on wildstyle with CMake 3.15 and sm_70...

from celeritas.

pcanal avatar pcanal commented on May 27, 2024

and it worked.

Meaning that ./app/demo-rasterizer input.good.json with the kernel success check run correctly, right? (that executable/kernel is the only one that fails as far as I know).

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

Meaning that ./app/demo-rasterizer input.good.json with the kernel success check run correctly, right? (that executable/kernel is the only one that fails as far as I know).

Correct on emmet -- however with CMake 3.15.7 and sm_70 on wildstyle it failed, replicating the result you saw. It passes for CMake 3.18. I'm trying now on emmet to see if it's the cmake version, and what on earth the difference in the build could be...

from celeritas.

sethrj avatar sethrj commented on May 27, 2024

Well, it's definitely CMake 3.15 vs 3.17 that's causing it -- the former fails on emmet with sm_35, the latter doesn't. One difference I can see is that -lcudadevrt -lcudart_static is added to the newer device link flags but not the failing one; though I don't know how that could possibly cause the error we're seeing.

Regardless, this is definitely enough to say "let's prevent cmake < 3.17 from configuring with vecgeom+cuda".

from celeritas.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.