Comments (6)
I did some CPU profiling using callgrind/cachegrind with the following setup:
- System: Perlmutter login node
- Problem: testem3-orange-field-msc, 1 event, 64 primaries/event
- Single thread (OpenMP disabled)
- CMake: RelWithDebInfo build, CELERITAS_DEBUG=OFF
The graph below shows the estimated cycles spent in each function, weighting instruction fetches and L1 and LL cache misses.
I noticed that axpy leads to many instruction cache misses, but that could be because I didn't pass march/mtune compiler options.
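For reference, a sketch of a comparable setup (the source/build paths and the input-file name are assumptions; the valgrind and callgrind_annotate options are standard, and CELERITAS_DEBUG is the project's CMake switch):

```shell
# Configure an optimized build with debug symbols and explicit tuning flags
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_CXX_FLAGS="-march=native -mtune=native" \
      -DCELERITAS_DEBUG=OFF \
      ..
make -j

# Profile with call-graph collection plus cache simulation
valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes \
    ./bin/celer-sim testem3-orange-field-msc.json

# Summarize estimated cycles, instruction fetches, and L1/LL misses per function
callgrind_annotate --auto=yes callgrind.out.*
```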
Looking at the L1 read misses, most of them come from XsCalculator::get calls within XsCalculator::operator().
It'd be interesting to see the cache misses in a multithreaded scenario.
from celeritas.
@esseivaju Is this with one track slot or the usual number (65k)? I guess the reason I wondered about single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot. Since the many-slot case is not really optimal (in terms of state cache locality and loop iterations skipped due to masking), I wonder whether the call graph would look any different...
This is with 4k track slots.

> single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot

Do you mean that in the single-thread case, you saw better performance with one track slot?
OK, 4k track slots, different from our usual regression CPU setting. What does the performance graph look like if you have a single track slot? (Make sure OpenMP is disabled! 😅) I would imagine that with a single track slot you'd get better cache performance for the particle state, even though the cache performance for the "params" data might go down.
OK, I have some data with a single track slot. I had to set max_steps=-1, and OpenMP is disabled at build time. Without profiling, just running the regression problem takes ~3x longer with one track slot.
Repeatedly calling ActionSequence::execute has a large overhead because of dynamic_cast and freeing memory. I haven't located what is being freed, but it's called exactly 20x per ActionSequence::execute, so each action is doing it at some point.
Regarding cache efficiency, it isn't helping that much. Below, I'm showing the L1 cache misses per call to AlongStepUniformMscAction::execute (aggregate of instruction misses plus read/write misses), which is where most cache misses happen.
The first picture is the single-track-slot scenario; the second is 65k track slots. As expected, there are far fewer misses per call, since you process one track at a time; however, multiplied by how many times you have to call the function, it becomes far worse.
In both cases, ~80% of L1 misses are for instruction fetches.
@esseivaju Looks like the allocation is coming from actions()->label being passed into ScopedProfiling. I'm opening a PR to use string_view for the action labels/descriptions and to delay string allocation in the scoped profiling implementation.