celeritas-project / celeritas Goto Github PK

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.

Home Page: https://celeritas-project.github.io/celeritas/user/index.html

License: Other

Python 0.62% Shell 0.20% CMake 3.24% C++ 92.43% Cuda 2.84% C 0.09% Dockerfile 0.10% Groovy 0.05% SWIG 0.10% Jupyter Notebook 0.33%

hep cuda computational-physics monte-carlo detector-simulation particle-transport

celeritas's Introduction

Celeritas

The Celeritas project implements HEP detector physics on GPU accelerator hardware with the ultimate goal of supporting the massive computational requirements of the HL-LHC upgrade.

Documentation

Most of the Celeritas documentation is readable through the codebase through a combination of static RST documentation and Doxygen-markup comments in the source code itself. The full Celeritas user documentation (including selected code documentation incorporated by Breathe) and the Celeritas code documentation are mirrored on our GitHub pages site. You can generate these yourself (if the necessary prerequisites are installed) by setting the CELERITAS_BUILD_DOCS=ON configuration option and running ninja doc (user) or ninja doxygen (developer). A continuously updated version of the static Celeritas user documentation (without API documentation) is hosted on readthedocs.io.

Installation for applications

The easiest way to install Celeritas as a library/app is with Spack:

Follow the first two steps above to install Spack and set up its CUDA usage.
Install Celeritas with spack install celeritas
Use spack load celeritas to add the installation to your PATH.

Then see the "Downstream usage as a library" section of the installation documentation for how to use Celeritas in your application or framework.

Installation for developers

Since Celeritas is still under heavy development and is not yet full-featured for downstream integration, you are likely installing it for development purposes. The installation documentation has a complete description of the code's dependencies and installation process for development.

As an example, if you have the Spack package manager installed and want to do development on a CUDA system with Volta-class graphics cards, execute the following steps from within the cloned Celeritas source directory:

# Set up CUDA (optional)
$ spack external find cuda
# Install celeritas dependencies
$ spack env create celeritas scripts/spack.yaml
$ spack env activate celeritas
$ spack config add packages:all:variants:"cxxstd=17 +cuda cuda_arch=70"
$ spack install
# Configure, build, and test
$ ./build.sh base

If you don't use Spack but have all the dependencies you want (Geant4, googletest, VecGeom, etc.) in your CMAKE_PREFIX_PATH, you can configure and build Celeritas as you would any other project:

$ mkdir build && cd build
$ cmake ..
$ make && ctest

Celeritas guarantees full compatibility and correctness only on the combinations of compilers and dependencies tested under continuous integration:

Compilers:
- GCC 8.4, 12.3
- Clang 10.0, 15.0
- GCC 11.3 + NVCC 11.8
- HIP-Clang 15.0
Dependencies:
- Geant4 11.0.3
- VecGeom 1.2.5

Partial compatibility and correctness is available for an extended range of Geant4:

10.5-10.7: no support for tracking manager offload
11.0: no support for fast simulation offload
11.1-11.2: [no support for default Rayleigh scattering cross section](see #1091)

Since we compile with extra warning flags and avoid non-portable code, most other compilers should work. The full set of configurations is viewable on CI platforms (Jenkins and GitHub Actions). Compatibility fixes that do not cause newer versions to fail are welcome.

Development

See the contribution guide for the contribution process, the development guidelines for further details on coding in Celeritas, and the administration guidelines for community standards and roles.

Directory structure

Directory	Description
app	Source code for installed executable applications
cmake	Implementation code for CMake build configuration
doc	Code documentation and manual
example	Example applications and input files
interface	Wrapper interfaces to Celeritas library functions
scripts	Development and continuous integration helper scripts
src	Library source code
test	Unit tests

Citing Celeritas

If using Celeritas in your work, we ask that you cite the code using its DOECode registration:

Seth R. Johnson, Amanda Lund, Soon Yung Jun, Stefano Tognini, Guilherme Lima, Paul Romano, Philippe Canal, Ben Morgan, and Tom Evans. “Celeritas,” July 2022. https://doi.org/10.11578/dc.20221011.1.

A continually evolving list of works authored by (or with content authored by) core team members is available in our citation file.

celeritas's People

Contributors

Stargazers

Watchers

celeritas's Issues

Refactor physics to have a single combined energy loss/range "process"

In Geant4, the dE/dx table is created by summing the contribution from each process that provides energy loss for a given particle. The range table is then calculated from this total energy loss. Our current implementation assumes per-process dE/dx and range tables, though what we are actually importing is a combination of process and total dE/dx and range.

End-of-Q3 code review

Add single-element test problems

Currently our GDML test inputs all define materials with multiple elements. Until #182 is resolved (which we can worry about in the 3-6 month time frame) we'll not be able to model materials that are are alloys or mixtures.

For #54 let's define a test problem with 3-4 single-element materials in an "interesting" but simple model (something with just planes and cylinders) that roughly approximates a detector of interest in CMS/ATLAS.

Define size_type optimized for device

The performance analysis showed substantial speedups using 32-bit ints over 64. We should probably default to use 32-bits for size in general; use std::size_t or auto when we need to work with standard library classes; and use custom large integers for the rare array that might have more than 4 billion elements.

Add process/element cross section calculation and sampling

Many of the physics interactions have to sample an interacting atom (by cross section) because the physics depends on the nuclear charge, etc. Both Shift and Geant4 do this, but with notable differences:

Geant4 precalculates energy-dependent cross sections per process and integrates over elements in each material. In contrast, Shift's AMPX data (entirely precalculated) is integrated over processes (MT numbers) and only stored per element.
Geant4 integrates over processes on the fly to get the "total" cross section (even though really each independent process is independently sampling an interaction distance), whereas Shift integrates over elements on the fly and samples a single interaction length.
During an interaction, Geant4 recalculates all the elemental cross sections for the selected process and samples the element from those. Shift caches the elemental cross sections used to calculate the total XS so it doesn't have to recalculate those; it then samples the process from the selected element using the cached total cross section for that element. (This requires the cross sections to balance correctly at every energy, which can be a problem if the process- and total- grids are different or different interpolation schemes are used.)
Geant4's element-integrated cross sections are tabulated on a fixed log-spaced grid, whereas Shift's cross section grids are all chosen by AMPX and can be pretty fine (and include discontinuities from shell structure in the low-energy regime).

So related to #3 we need to make some decisions.

Do we assume that Geant's pre-tabulation of per-process element-integrated cross sections are sufficiently accurate? (Do we lose any features or edges by approximating with an interpolation and an arbitrary log grid.)
Can we then assume that per-process per-element cross sections can also be pre-tabulated equally accurately and sampled from?
Can we also assume the pre-process grids are equal, and construct a unified per-material total cross section to sample from (for discrete processes) with interpolation, then on the fly sample from processes, and then sample the element from the selected process? (This is basically what @whokion suggested.)
If any/all of these assumptions are invalid, do we cache material- and process-dependent cross sections as we transport, like Shift does?

Unless we can pre-tabulate and ensure the cross sections balance across elements and processes, we will have to (at least temporarily) cache per-element cross sections during interactions, which Geant4 does for EM models by filling up the thread-private xsec array. There are a few ways we can approach this:

For sampling an element after a process is selected, if the element-integrated value is exact we can just accumulate per-element values on the fly to select an element.
We could preallocate a single "element xs cache" for each particle track, sized to the maximum number of elements in any material. This cache would be used only as an on-the-fly thing for a single process at a time.
We could cache cross sections for each particle track and each process and each element. This could be memory intensive. The plus side is potentially saving cross section reevaluations as a particle moves between volumes.
We could use an on-the-fly approach where each process uses a stack allocator to reserve storage as needed for sampling. This runs the risk of overflowing if too many particles are in a particularly complicated material.

Add GDML files into existing tests

More GDML files should be added to some of the existing tests, where it makes sense to widen test coverage.
Part of this issue is to create a few more GDML files, including more exotic shapes and features, e.g.

translations + rotations
boolean operations on shapes
less used / exotic shapes like torus, cut tubes, elliptical shapes, scaled shapes, and so forth

End-of-Q2 code review

This issue is for revisiting code design decisions and conventions and discussing whether or how to change them. Please add to the list, prefixing with your parenthesized initials. We can start working through these at the infrastructure meeting, or wait until the "collocation day" on october 14.

Naming conventions

(SRJ) The FooPointers.hh files sometimes need define more than just a single class with pointers: sometimes it's useful to define types and other quantities needed by both the device kernel/view and the host management code. Perhaps call the header FooInterface.hh indicating that it's the interface common to both the host and device classes? Likewise, the FooPointers class often contains POD data that aren't just pointers: perhaps we could name those FooInterface instead.

(SRJ) The naming convention for device-compatible utility classes (View vs TrackView vs no suffix) needs refinement. It's weird that the RNG and stack allocator and other classes have View suffixes: they are used like functors even though they point to externally managed data. Small helper classes like Interpolator don't have any such suffix (Interpolator is constructed with local data). The distinction between TrackView (size num_tracks) and View (some other arbitrary size) should be discussed.

(SRJ) We need to be careful about ThreadId being "position of this track inside the track vector", and TrackId being "index of this track in the list of all tracks that are tied to a particular event." Also confusing is that some kernels may launch with a number of threads that isn't the same as the number of track vectors (e.g. when we process secondaries).

(SRJ) I've been inconsistent about the using-aliases for shared pointers (UPNavStatePool, DeviceUniquePtr, SPConstGeo, SptrConstParams) so we need to decide on a uniform usage.

Device/host pointers

(PC) There's no type safety for the pointer system used by kernel or host code: only the accessor names for Params classes (e.g. device_pointers) hints as to whether the pointer is a CPU or GPU pointer. #34 is a first attempt at addressing this.

Units

(SRJ) The coexistence of natural units (for particle physics calculations) and SI/CGS units (for macro-scale quantities such as materials and spatial positions) is still confusing to me.

MeV=1 for mass/energy/momentum, e=1 for charge, and c=1 aren't consistent with formally defined natural unit systems.
The "constant" values for c etc. are defined as 1 and thus can't be used to transform from natural units to macro-scale units (needed for calculating distance traveled by a particle track)
The value of physical constants such as $e$, $c$, and $\epsilon_0$ depend on the unit system, unless they're being used to define the unit system, so the dependence between Units and Constants may be considered ambiguous

(SRJ) Possible solution: class wrapper (celeritas::natural::Energy, celeritas::natural::Mass, celeritas::natural::Momentum) for quantities in natural units classes? Internal to classes (or for state values or whatever) they don't have to use them, but for class interfaces (such as ParticleTrackView) use them to avoid errors when coupling to other classes.

Add MaterialParams/State

This should be a pretty simple class for managing a track's material.

Params are just an array of MaterialDefId, one per VolumeId
Params are to be constructed (except in unit tests as needed) from the geant4-exported data
State is a MaterialDefId, with the "unassigned" case meaning it needs to be updated from the geometry volume, and implying that during that update physics interaction lengths (and other quantities as needed) need to be invalidated

Add interface for common RNGs

The RNG in random/ was based (on my suggestion) on what Shift does, using Nvidia-provided GPUs. To test RNG-based code on CPU as well, and if we want to build on HIP or another platform that doesn't provide curng adapters, we'll need something more portable. @whokion

Move RNG classes in random/ into random/cuda (see #2)
Add similar interfaces to one or more appropriate vecmath RNGs -- either porting vecmath code, or (probably better as a first step) just providing an interface to the vecmath

As the physics develops we'll want to add additional distributions (besides "uniform"). @amandalund you might be the best person to add a random/distributions/IsotropicDistribution class ported from the Nemesis/numerics/Uniform_Unit_Vector_Distribution class. Would you be up for that or shall I?

Sample isotropic direction on device

Complete internals for multiprocess physics kernels

This demo app is an opportunity to sketch out the physics and stepping kernels, before focusing on the "big picture" app described in #28 . I'd like to have multiple materials, but due to time constraints I this app can potentially be made a little more foolproof by skipping the secondary/track initialization code and vecgeom integration, perhaps using a simple mock geometry that divides the universe by the plane x=0 or something (material ID 0 above the plane, 1 below).

Those simplifications aside, there are a few outstanding pieces in the physics that need to be implemented (function names refer to src/physics/base/PhysicsStepUtils.hh):

Physics array construction from host data (ValueGridBuilder) #128
Mapping ProcessId -> {energy grid, ModelId} for each particle type #143
Assembling all the different data structures of the physics interface #141
calc_tabulated_physics_step (not done: on-the-fly processes and perhaps range calculation) #154
calc_energy_loss (not done: "long step" energy loss limiter #160
select_model (not done) @pcanal

Prevent kernel launch failures on CMake 3.15

@pcanal and @mrguilima found that on wc.fnal.gov, the demo-rasterizer app fails with a launch error:

invalid device function

See #119 for some debugging discussion.

The build works fine before adding separable compilation to the main library, see #52 (comment) .

@pcanal found the error:

So the difference between the CI and my environment turned out to be ... the version of cmake.

I was using v3.15.4 which does not support yet the flag CMAKE_CUDA_ARCHITECTURES (Noticed that to the warning from cmake "this variable was set but not used").

When I tried with cmake v3.19.3 without adding that flag, the build failed (missing symbol). This is likely due to the fact that this version of cmake no longer propagate the -arch=sm_70 set by (I think VecGeom) and instead use default value (which turns out to be sm_52). Once I added -DCMAKE_CUDA_ARCHITECTURES:STRING=70, the build succeeded and the process succeeds ....

And I offered further explanation of the difference:

Took me a second, but this makes sense -- with a "default value" (where does that come from? the login node on which celeritas was built, rather than the compute node?) of sm_52 the kernels that don't call vecgeom are fine, since cuda recompiles for the higher architecture at runtime based on the PTX. For the kernels that link against vecgeom I guess separable compilation gets in the way and causes the two different sets of flags to misbehave.

The CMAKE_CUDA_ARCHITECTURES flag is new as of 3.18. It seems that this error might crop up in other circumstances.

@pcanal suggested preventing CMake versions older than 3.18 from even building celeritas with cuda+vecgeom.

VecGeom verification for CMS model

Plan for validation

Visual inspection of slices with raytrace
Visual inspection of (not yet extant) x-ray tool
Numeric comparison of volume estimates, loaded through VGDML/ROOT in VecGeom, vs Geant4/G4GDML and possibly even ROOT tracking: simulate reflecting boundaries, uniform isotropic source with isotropic scattering and some absorption fraction

Tracking

General VecGeom issues:

Global initialization not implemented on device #38 vecgeom!760 vecgeom!768
~~Other initialization crash~~ (@sethrj error, not VecGeom's!) #201

VecGeom issues with CMS:

Transforms aren't properly applied in VGDML
Transforms and boolean volumes don't mix cms images vecgeom#565
ROOT units are loaded incorrectly into VecGeom #192

VecGeom issues with ATLAS:

Hang during load (@vrpascuzzi see below)

Implement portable in-house geometry

There are several reasons to support an alternative geometry implementation to VecGeom:

Exploring performance portability (VecGeom is very closely tied to CUDA)
Evaluating how the app's overall performance changes when using a less expensive geometry implementation
Reducing build/CI times

To do this we'll want to:

Refine geometry propagation/tracking interface. Look at the current use cases for the geometry track view interface, simplify and/or document which functions are part of the "public" interface.
Possibly template key parts of the code on geometry type, and/or use virtual geometry interfaces for other parts of the CPU code (such as cell labels and high-level kernel launch functionality)

Then we'll add a new surrogate geometry implementation (possibly based on RTK from Shift) for simple geometries with planes and cylinders. Maybe define a new JSON (rather than Teuchos XML) for on-disk representation.

ORANGE:

Surface intersections (#291)
Static polymorphism and operator (#306)
JSON I/O (#307)
Volumes and initialization (#312)
Distance-to-boundary (#335)
Params and TrackView interface (#336)
Compile-time switching between VecGeom and ORANGE for main codebase (#337)
Enable previously vecgeom-only tests (#339 )
Enable previously vecgeom-only demo (#343)

Implement value/xs interpolation ("physics tables") on GPU

The physics tables contain energy-dependent, initialization-time physics data in Geant used during the event loop. It includes properties such as mean free paths (inverse of macroscopic cross section, used for sampling distance-to-interaction for "discrete" processes), energy deposition rates (dE/dx, for "continuous" processes), etc.

The dimensions on the data are ragged:

Physics model (e.g., Bremsstrahlung, Klein–Nishina)
Quantity (mean free path $\lambda$, microscopic cross section $\sigma$, energy loss $dE/dx$)
Particle type (? for dE/dx)
Material
Energy grid point

If a spline interpolation option is given (on a per-model basis), geant4 calculates and caches derivatives (but geantv calculates them on-the-fly) from the adjacent points.

The stored data in geant have different acceleration methods to speed lookup:

Some "physics vectors" have a uniform logarithmic grid, even though interpolation is still linear-linear
A couple of the EM process classes have grids that are split into "X" and "XPrim" where the split energy is typically 1 MeV (but perhaps subject to user optionss). For the case of mean free paths for compton
(others... ???)

I'd like to implement the "physics vector" with the more generic "value grid" since none of the methods are specific to physics; it's just about grid lookup and value interpolation.

This should be similar to some of the stuff we've done in shift (physica/sce_cuda/xs_calculator) as well as geant4/geantv.

CPU demo loop

Although most of the core Celeritas components are platform-agnostic, a lot of the glue code (e.g. Model::interact and detail::rayleigh_interact) are GPU/CUDA-specific. To compare CPU apples to GPU apples, we should support a full event loop on the CPU with celeritas. At first blush, this will mostly be copying code from e.g. KleinNishina.cu to .cc, taking host pointers instead of device.

Add muon physics

Moved from #1 .

Add CUDA CI

Create a docker image with CUDA and celeritas spack toolchain
Create Jenkins file fr launching
Integrate into https://cloud.cees.ornl.gov/jenkins-ci/

Add runtime program diagnostics on host and device

Programmatic access to runtime diagnostics such as memory usage (on both host and device) and device occupancy will help us understand the relationship between changes to the code and changes in performance.

CUDA attributes:

Kernel register usage and attributes
Possibly kernel occupancy statistics if not redundant with register usage?
Free and total memory with cudaMemGetInfo #716
Possibly some limited device info from cudaDeviceProp

Host attributes:

Memory usage (from /proc/self/status on linux machines)

Some of these data will be global (constant over per program run), some will scale with the number of functions (register usage), and some may be stored at specific checkpoints (memory usage).

Add Doxygen/Sphinx to codebase and CI

See Sphinx/Breathe/C++ intro.

Add doxyfile and cmake
Add sphinx/breathe and cmake (#437)
Set up ReadTheDocs (http://celeritas.readthedocs.io/)

As the code design components take shape as part of the permanent code base, transition portions of that document to the in-code documentation. Basically, let the Wiki page be for sketching out ideas, and let the code documentation reflect the actual state of the code.

Implement initial electromagnetic physics

G4EmStandardPhysics

(Particle / Process / Model / energy range): see
https://github.com/celeritas-project/celeritas/wiki/Contents-of-test-data for
exported standard EM physics table data.

Sub-components for EM interaction implementation

Bernoulli distribution (rejection sampling)
Material and elemental data (see below)
Constexpr functions for constructing "static" physics constants (see below)
Elemental target sampling (uses number density and xs calculation)

Material data

Number densities of elements
Radiation length
Electron density

Elemental data

Z and various precomputed math operations on it (logs and such)
Coulomb correction factor

External datasets

Livermore or Sandia (Photoelectric, Rayleigh): [element][incident energy]
SB (see #176 )

Constants

These constants depend on experimental physics data (proton mass, electron mass)

Bohr radius
electron compton length (for Bremss)

Add MPI/ensemble support to distribute events across GPUs and compute nodes

To run large jobs on leadership-class machines, we'll need to parallelize across nodes by distributing events using MPI. A first attempt could just equally partition among tasks, but we should also consider investigating whether any MPI framework has a queue-like model that can dispatch events to waiting nodes (or "rebalance" events in case some processors get unlucky).

Split events among Runner to avoid replicating input
Write JSON output to one file per process, rather than stdout

Support multiple elements per material

Currently we're using a simplification of allowing only one element per material, so that we don't have to calculate/tabulate/sample elemental cross sections in a given material.

Our CERN collaborators informed us that many of the Geant4 physics models actually use tabulated-by-element cross section data rather than calculating on the fly, so we should be able to build element sampling (in most cases) into the process/model selection kernel.

Export tabulated micro cross sections from Geant4 (#326 )
Preprocess micros into CDFs/alias tables during Celeritas setup
Sample elements after model selection in the "discrete" kernel (for those that have tabulated cross sections)
On-the-fly element selection for OTF processes (photoelectric, annihilation) inside the interaction kernels

Add tool to import preprocessed data from Geant4

Given a GDML input definition file and physics list (and possibly other problem input parameters like energy cuts), export a single preprocessed data file with:

Particle data (for particles in use based on the physics lists)
Physics tables (for needed particles and models based on the physics lists, and energy based on cutoffs, and materials in the input file)
Material definitions (for each material) with names and corresponding index in the G4MaterialTable (i.e. corresponding index in physics table)
Map of logical volume names to material names

See https://github.com/celeritas-project/celeritas-docs/blob/master/graphs/geant-exporter.pdf

Required to populate physics tables (#3)
Required to convert volume IDs from VecGeom to material IDs in physics tables
Required to render materials instead of volumes (#22)

Regenerate exported geant data as part of CI

This would prevent older (incompatible) data versions of the root import file being committed to the repository @stognini @amandalund

Performance test one or more process Interactors

Add a GPU (and CPU for reference?) test kernel for KNInteractor or some other suitable class. Run a million (maybe billion?) tracks through it, purging the secondary buffer when it fills.

Implement propagation

Geometry @mrguilima

Initialize from an "event" (which you can mock up for now as just a position and direction), which for now (according to @pcanal) requires doing the state location on the CPU, then copying over to GPU
Replace find_next_step and its straight-line solve with the necessary mechanics for calculating the safety and next distance
Update move_next_step so that it crosses the surface as necessary for working with propagation

Maybe you and @pcanal can work together on pulling in some of the necessary code from geantx, and setting up the propagation classes (https://github.com/celeritas-project/celeritas/wiki/Code-design#propagation)? A prototype propagation class would be an excellent task.

Add rasterizer helper python scripts

Pull out the rasterizing and other helper scripts from Exnihilo/Omnibus (export control cleared under workbench/AMPX agreement, and there's no nuclear code in the utilities that will be extracted) and create a github repo with the raytracing scripts that work with the JSON files exported from the ray tracer. For bonus points, hook up the material processing so we can pull in the compositions and use the omnutils heuristics to map colors.

Basic loop demo integration

Propagate through a VecGeom geometry with multiple materials. Include at least one each of continuous and discrete processes. Compare simulation results and timing against Geant4.

Requirements:

Physics process management (#28)
Linear propagator interface (#66)
Material<->volume mapping (#74)
(Possibly) initialization (#52)

Add the following kernels:

Pre-step kernel (sample and set remaining mfp if unset, call calc_tabulated_physics_step)
Along-step kernel (determine actual step distance with geometry vs physics; propagate; call calc_energy_loss)
Post-step kernel (decrement distance-to-interaction, call select_model)
Interactor launches. @sethrj
- Get direction from geometry states
- Template ModelInterface on MemSpace and overload model interact method to work with host pointers
- Loop over all model IDs and fire off kernels
Postprocessing secondaries and interaction results
Initializing tracks from primaries/secondaries @amandalund

Missing pieces:

Geometry and physics input files
Source definition (use HepMC3 input) @stognini
Missing Process classes (Rayleigh?, bremsstrahlung)

Implement spline interpolation

Spline interpolation (or some other nonlinear interpolation) is used in Geant4 by :

Energy loss rate calculation (see calc_energy_loss)
LivermorePE cross sections above K-shell energy (see LivermoreElement)
Cross section (mean free path) for multiple scattering models (Urban and WenzelVI)

Research why it's necessary (the energy loss one provides an explanation of the consequences of linear interpolation, although it seems to me an alternate interpolation scheme such as log-log might have the same effect with less overheard) and compare spline interpolation with alternatives.

Physics mega task list

Interaction length sampling
- Material properties (can probably be deferred since we're pulling the physics tables from geant4
- Physics tables: #3 @amandalund
Implement individual processes: (see #1 for complete list)

More to be added...

See https://github.com/celeritas-project/celeritas/wiki/Code-design

Examine RNG quality and performance

@whokion makes the excellent point that the CUDA default XORWOW is not of high statistical quality, even though it is fast. @pcanal also points out that it's not the same RNG as we're using for our CPU comparisons (currently using the standard library's 32-bit Mersenne Twister).

So we should experiment with choices of our GPU (and possibly CPU) generator:

How sensitive are the results to the RNG quality? Is it only when we have a very simple problem (monoenergetic, monodirectional point source) where such correlations may be evident, or will it also show up in problems with a wider initialization phase space?
How sensitive is the overall application performance to the choice of RNG? Are we being fair in our CPU/GPU comparisons?

Related Results

Performance of random number generators: time [ms] for generating 1e+9 random numbers

Generator	Host (Intel Xeon E5-2650)	Device (Tesla V100-PCIE-32GB)
XORWOW	2331.88 +- 17.56	17.45 +- 0.61
MRG32k3a	22719.91 +- 66.59	15.18 +- 0.46
Philox4_32_10	3702.18 +- 17.43	14.48 +- 0.50
MTGP32	14792.21 +- 64.35	43.30 +- 0.76

Results of Statistical Tests with Test01 quoted from the reference; L’ecuyer, P. and Simard, R. 2007. TestU01: A C library for empirical testing of random number gen- erators. ACM Trans. Math. Softw. 33, 4, Article 22 (August 2007).

Generator	Small Crush	Crush	Big Crush
XORWOW (32 bit)	5	59	-
XORWOW (64 bit)	1	8	7
MRG32k3a	0	0	0
Philox4_32_10	0	0	0
MT19937	0	2	2

Add magnetic field drivers/integrators

Implement magnetic filed integration with a realistic geometry and a non-uniform magnetic field, and test different integrators on GPU.

Runge-Kutta stepper #173
Adaptive step solver #214
Integrate with geometry/navigator (safety distances, high-level Propagator, etc.) #221
Nonuniform magnetic field #250

Possible Quantized State System (QSS) implementation?

Possibly implement simpler steppers (e.g. helix) for uniform magnetic fields

Ensure material/volume ID consistency between Geant and VecGeom

The geant-exporter constructs ImportVolume objects using logical volume ID (aka "unplaced" volumes in VecGeom nomenclature). In contrast, GeoParams/GeoTrackView use physical IDs. It's also unclear how the IDs depend on the front-end and parser, whether they're consistent across readers and versions. For example, I think #192 demonstrated that the physical IDs change depending on whether ROOT or VGDML is constructing the geometry.

We need to be able to guarantee that the material/cutoff IDs are consistent between the geant exporter and the VecGeom definition, and that we're consistently using physical/logical IDs.

Do we only need logical volume IDs? Or will detector applications require physical IDs as part of the interface?
Can we use the volume names to guarantee the correct mapping between applications/libraries?
Should we rely on https://gitlab.cern.ch/VecGeom/VecGeom/-/merge_requests/805 for pulling in the vecgeom data?

@stognini @mrguilima @pcanal

Add production cuts and energy limits

The PhysicsParams (?) should take per-material (or maybe just use global for now?) "production cuts", which are particle- and material-dependent lower energy bounds for creating secondaries. Also update the Moller/Bhabha scattering so that its lower energy limits are based on these production cuts rather than a hard-coded lower energy.

Since there's an interplay between the process cross sections, the cutoffs, and even some models, perhaps the host data needs a separate helper class during setup.

Actually the cutoff management is probably more appropriate as a separate helper class, since during interaction the particle properties being interrogated (at least in the M/B case) are not the incident particle's state.

Add CUDA random number generator

Add and test a GPU-friendly random number state class. Possibly use the Container/View/Ref for the state storage but that may be unnecessarily verbose. Since the RNG shouldn't have any persistent data (just per-thread state)

End goal being to have an on-device generator that acts like the standard library pseudorandom generators.

Add compatibility with multi-threaded geant4

Currently the Geant logger adapter uses G4coutbuf which isn't compatible with multithreaded geant4. Either update the code to be compatible, or just disable the logging redirect when geant is multithreaded.

Add generic hit/step structure for Geant4 integration

Sensitive detectors are tally regions: when a particle is inside a user-selected region of space that represents the "sensitive" component of a detector apparatus, we need to collect information about "hits" (particle travel inside a detector) for later postprocessing (digitization, generating a simulated digital signal from the detector array from an Event based on the timing and energy deposition of track "hits").

Associated data

Parameters (shared, read-only for a problem):

Hash table (or for initial implementation, linear search) of VolumeId for sensitive regions, constructed by querying geometry's name-to-id method for user-given strings
(?) Cutoffs for relevant particles, energy ranges?
(?) Tolerance for spatial resolution on output

Shared

Buffer of "hits" (detector tally events) waiting to be flushed

Each hit is a step of travel in the detector:

Position
Direction
Volume ID
Event ID
Track ID
Particle type
Distance traveled over the step
Initial and final energy on the step
Local energy deposition (since energy loss could include emission of secondaries ?)
Start time
Derivative quantities:
End time over the step

Questions:

Can these segments be combined if they're not less than the distance cut?
The draft requirements document says we also need information about starting/ending process for the track and step. Is that necessary for digitization or just for "mc truth"-like diagnostics?

Hit output

Questions:

What's the output bandwidth requirement at GPU saturation?
Related, what are the memory requirements of the detector output buffer? Will we need to have two buffers (one active, one being transferred to CPU asynchronously for output)?
To run on DOE leadership systems we will need to execute a single process in parallel, requiring some sort of parallel I/O method. Can ROOT support parallel I/O? Can we do "poor man's" I/O, creating a separate output file on each process?
Should we consider basing I/O off ADIOS2, a data transfer and I/O framework that might enable high-performance exascale hit processing/digitization?

Implement Bremsstrahlung

We're going to use the "simple" SB table (G4SeltzerBergerModel::SampleEnergyTransfer rather than G4SBBremTable:: SampleEnergyTransfer) for now: more complex sampler was introduced in Geant4 10.5.

Import 2D physics vector (one 2D vector per element given to reader) #186
Storing 2D vectors on device #186
Sampling from 2D table for a given element (SBSimpleSampler) #204
Sampling exiting brem photon direction #241
Putting the pieces together: interactor (including additional constants) everyone including #241
Integrate positron correction to cross section sampling (>10 GeV) #247

Defer:

LPM correction (>10 GeV)

Add initial diagnostic output to event loop

This is a precursor to the more general #29 . The most generic way to support interaction with experimental frameworks (or our own internal diagnostics) is to allow callbacks inside every step. This might be through a virtual inheritance hierarchy:

class Diagnostic
{
  public:
    using DeviceRefs = ModelInteractRefs;
    virtual begin_simulation() {}
    virtual begin_event(EventId, const DeviceRefs&) {}
    virtual begin_step(const DeviceRefs& data) {}
    virtual end_step(const DeviceRefs& data) {}
    virtual end_event(EventId, const DeviceRefs&) {}
    virtual end_simulation() {}
};

or perhaps a set of enums and callbacks:

void add_callback(CallbackTrigger, std::function<void(const DeviceRefs&>);

Although we won't need it in the first iteration, for the sake of completeness we'll need to have the code related to secondary initialization to be able to detect when the last secondary produced from given event has been transported (so that analysis code can process the information, e.g. in Acceleritas send the accumulated data back to Geant4).

First pass at diagnostics:

Number of tracks per step #287
Processes and particle types #290
Z bin of energy deposition #289
Distribution of number of steps per track #309
Hit creator for acceleritas (@whokion )

Add multiple scattering

Moved from #1 .

Urban (or Goudsmit-Saunderson) model: E < 100 MeV
WentzelVI: 100 MeV < E < 100 TeV

Multiple scattering (MSC) will require additional integration into the stepping algorithm, since it reduces the "true" path length traveled to a "geometric" path length.

Implementation details for MSC models (different path from other discrete-continuous EM processes and modes):

Initialization

import MSC cross section (lambda) tables to evaluate the mean free path
build MSC data including material dependent MSC parameters
access the range (and inverse range) and dedx tables through PhysicsTrackView

Model components (for each model)

a. step limitation (including conversion from the true path length to the geometry path length)
b. conversion from the geometry path length to the true path length after transportation (linear or field propagation)
c. sampling angular distribution and evaluating the lateral displacement which requires information from the step limiter (true path length and its minimum limit, alpha parameter, flag for lateral displacement)

Integration

need safety from GeoTrackView for the step limitation and the angular sampling
test for the candidate selection with the true step length against the physics step length after the step limitation by the energy loss process with (2a) and pass the geometry path length to the propagator.
do the lateral displacement based on the geometry condition (i.e., near boundary) and the size of lateral swift after (2c)

Validation (with respect to Geant4)

model data and parameters
step limitation
angular distribution

Performance evaluation
Extension for muons, charged hadrons and ions

Simulation mega task list

The simulation is responsible for choosing which action a particle undergoes. It maintains a count of the accumulated simulation time since the event and the accumulated path length of a track.

Simulation:

Initialize tracks from events ("event generator")
- New Primary class
- New event primaries will have to be injected into arbitrary empty slots (indices) in a vector of tracks
- Add HEPMC3 support for constructing "event" (vector of primaries)
Initialize tracks from secondaries
Buffer and manage secondaries waiting to be turned into tracks
Stepping loop

The "simulation" stuff is definitely more nebulous than the other components.

Add physics process management

Background

Geant4 implements three methods to apply a process: at-rest (decay/annihilation), along-step (continuous), and post-step (discrete) "GPIL (Get Physical Interaction Length)" and "DoIt".

Discrete: calculates mean-free-path distance and samples an interaction from an exponential distribution. Only one discrete process is selected per step per track.
Continuous: deposits energy and/or emits secondaries at every step. Multiple processes may happen to a single track in a step.
Rest: always happens after a particle's kinetic energy drops to (near) zero.

It seems like a number of "continuous" processes (e.g. cerenkov, synchrotron) are implemented as discrete processes.
(x-ray/optical photon physics are not competing with the (Geant4) standard EM physics processes for a given step,
but rather treated as additional processes top on the standard EM physics. For example, cerenkov/scintillation processes
can be added to the ionization process if the high energy charged particle passes through a scintillation material.)

Requirements

The process interface should:

Set a model (to be implemented) for calculating cross sections or energy deposition rates as a function of energy.
Sample the physical interaction length for the next step and update the interaction length left for a given track
Map a model to Interactor device classes
Apply model cutoffs before the interactor is called.
Apply cutoffs to produced secondaries.

Analyze performance impact of masking vs partitioning

In some of our kernels (namely, the Model::interact kernels in physics) only an arbitrary subset of the threads will be active in that particular kernel. There are two main ways to operate such kernels:

Mask out threads that are inactive -- return immediately (if not alive, or if the model ID for the kernel isn't the model ID of the thread). The kernel size is the total number of threads.
Map applicable indices in the state vector to consecutive thread indices, the kernel size is the number of active threads.

There are a few variants on this that I can think of:

When masking: instead of returning immediately, continue through the kernel but do null-ops on all the data (more like a traditional SIMD model than SIMT)
When mapping: sort the thread IDs so that closer memory locations are more likely to have close data access

A potential masking+mapping hybrid scheme would avoid the need to get the size of the active vector to launch the kernel, and the need to maintain one large vector per model, by executing an argsort to sort the threads by applicable model ID. The kernel would launch with the maximum number of threads, but it would exit immediately for the inapplicable threads. Since the "inactive" warps are grouped together because the ranges are sorted, the entire warps/blocks finish early.

These different methods should be easily testable in the demo-interactor, which currently "masks" based on not being alive.

Add CPU rasterizer

The demo rasterizer is currently CUDA-only. It shouldn't be difficult to extend so that the trace kernel is compatible with CPU as well. This would allow the CPU and GPU straight-line navigation for VecGeom to be compared easily.

Integration with the Geant4 Tasking

As Geant4 provides the C++11 tasking capability since the 10.7.beta release, demonstrate a hybrid (concurrent) computing application that offloads EM particle transport on GPU using celeritas and simulates the rest of simulation (i.e., hadronic physics processes) on the host CPU for a typical HEP detector simulation at LHC:

build a generic HEP application with the Geant4 tasking manager and connect specialized tasks to celeritas (or vice versa)
collect EM particles from the Geant4 tracking loop (mostly photons and electrons from hadronic interactions)
trigger celeritas kernels for specialized GPU tasks for EM particles
merge sensitive hits from Geant4 and celeritas
optimize the load-balancing between the CPU host and devices and evaluate performance

Add simulation state, propagation, event loop

(work in progress)

Simulation state contains:

Relationships between parent and daughter IDs and event IDs
Simulation clock, the time since the event started (for detector timing information)
Cumulative path length traveled in a track (for dE/dx in detectors)
What event ID is to be applied to the track: possibly split into event group ID (e.g., physics interaction/geometry/decay) and sub-ID (physics process) to allow multiple event types to be launched within a single kernel

Event loop:

Abstract "event" base class for launching kernels
Each class instance gets assigned an event ID
Implemented event function calls a kernel that operates only if the event ID matches

Implement verification scheme

Background

As new physics models and processes are now included or underway, a validation scheme is necessary. Geant4 is being constantly validated against data and previous versions, along with independent validations carried forward by experiments. As such, the initial plan is to validate Celeritas with Geant4 only (no external data or theory).

Requirements

Geant4 simulation app

Interface should provide an easy selection of:
- One or more EM Standard Physics processes.
- Primary type, vertex location, direction, and starting energy.
- Geometry may be loaded either i) programmatically, so that updates to material/element/geometry can be done quickly (this option is mostly for hardcoded demo apps); or ii) parsing gdml files, for validating against a given geometry that is loaded into Celeritas.
- Provide option for i) a smaller output (no step data) that could handle millions of tracks, and ii) a full output for detailed inspection.

Validation

Provide analysis scripts to be used as a starting point for each individual validation.

A standardized checklist for simulations input parameters and validation results is still to be determined, although accumulated quantities are preferred at this point.

The final goal is a generic validation system that applies to each and every model, both individually and combined, resulting in a final set of standardized plots and tables that can be automatically generated for future Celeritas updates.

Add demo app for visualization/GPU validation

Add an app to trace slices of input models, operating in parallel over tracks traveling along the x axis, writing volume IDs (and/or material IDs) to an HDF5 file compatible with the Omnibus ray trace visualization tools.

~~Complete straight-line propagator~~
Add rasterizer kernel for converting tracks to pixels
~~Add HDF5 output~~
Add easier-to-implement output as proof of concept

Fix geometry track initialization on device

In #38 we tested and added support for initializing tracks on device using the GlobalLocator, as of this VecGeom PR. Unfortunately our testing wasn't thorough enough, and even though initialization seems to work in the very simple tests we added, initialization fails catastrophically for the CMS models, and it also fails when trying to initialize the two-box test case outside of the world volume.

Add logging infrastructure

Requirements:

MPI-friendly (only output on one processor if all processors hit warning condition)
Customizable for users or experiments can silence/redirect

celeritas-project / celeritas Goto Github PK

celeritas's Introduction

Celeritas

Documentation

Installation for applications

Installation for developers

Development

Directory structure

Citing Celeritas

celeritas's People

Contributors

Stargazers

Watchers

Forkers

celeritas's Issues

Naming conventions

Device/host pointers

Units

Plan for validation

Tracking

G4EmStandardPhysics

Sub-components for EM interaction implementation

Material data

Elemental data

External datasets

Constants

Associated data

Hit output

Background

Requirements

Background

Requirements

Geant4 simulation app

Validation

Recommend Projects

Recommend Topics

Recommend Org