Comments (6)
I did some CPU profiling using callgrind/cachegrind with the following setup:
- System: Perlmutter login node
- Problem: testem3-orange-field-msc, 1 event, 64 primaries/event
- Single thread (OpenMP disabled)
- CMake: RelWithDebInfo build, CELERITAS_DEBUG=OFF
The graph below shows the estimated cycles spent in each function, weighting instruction fetches and L1 and LL cache misses.
I noticed that axpy leads to many instruction cache misses, but that could be because I didn't pass march/mtune compiler options.
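For reference, a sketch of a comparable setup (the source/build paths and the input-file name are assumptions; the valgrind and callgrind_annotate options are standard, and CELERITAS_DEBUG is the project's CMake switch):

```shell
# Configure an optimized build with debug symbols and explicit tuning flags
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_CXX_FLAGS="-march=native -mtune=native" \
      -DCELERITAS_DEBUG=OFF \
      ..
make -j

# Profile with call-graph collection plus cache simulation
valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes \
    ./bin/celer-sim testem3-orange-field-msc.json

# Summarize estimated cycles, instruction fetches, and L1/LL misses per function
callgrind_annotate --auto=yes callgrind.out.*
```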
Looking at the L1 read misses, most of them come from XsCalculator::get calls within XsCalculator::operator().
It'd be interesting to see the cache misses in a multithreaded scenario.
from celeritas.
@esseivaju Is this with one track slot or the usual number (65k)? I guess the reason I wondered about single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot. Since the many-slot case is not really optimal (in terms of state cache locality and loop iterations skipped due to masking), I wonder whether the call graph would look any different...
This is with 4k track slots.

> single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot

Do you mean that in the single-thread case, you saw better performance with one track slot?
OK, 4k track slots, different from our usual regression CPU setting. What does the performance graph look like if you have a single track slot? (Make sure OpenMP is disabled! 😅) I would imagine that with a single track slot you'd get better cache performance for the particle state, even though the cache performance for the "params" data might go down.
OK, I have some data with a single track slot. I had to set max_steps=-1, and OpenMP is disabled at build time. Without profiling, just running the regression problem takes ~3x longer with one track slot.
Repeatedly calling ActionSequence::execute has a large overhead because of dynamic_cast and freeing memory. I haven't located what is being freed, but it's called exactly 20x per ActionSequence::execute, so each action is doing it at some point.
Regarding cache efficiency, it isn't helping that much. Below, I'm showing the L1 cache misses per call to AlongStepUniformMscAction::execute (aggregate of instruction misses plus read/write misses), which is where most cache misses happen.
The first picture is the single-track-slot scenario; the second is 65k track slots. As expected, there are far fewer misses per call, since you process one track at a time; however, multiplied by how many times you have to call the function, it becomes far worse.
In both cases, ~80% of L1 misses are for instruction fetches.
@esseivaju Looks like the allocation is coming from actions()->label being passed into ScopedProfiling. I'm opening a PR to use string_view for the action labels/descriptions and to delay string allocation in the scoped profiling implementation.