
Comments (6)

esseivaju commented on July 22, 2024

I did some CPU profiling using callgrind/cachegrind with the following setup:

  • System: Perlmutter login node
  • Problem: testem3-orange-field-msc, 1 event, 64 primaries/event
  • single thread (OMP disabled)
  • CMake RelWithDebInfo build, CELERITAS_DEBUG=OFF

The graph below shows the estimated cycles spent in each function, weighting instruction fetches and L1 and LL cache misses.

[Image: testem3_fm_64p]

I noticed that axpy leads to many instruction cache misses, but that could be because I didn't pass march/mtune compiler options.

Looking at the L1 read misses, most of them come from XsCalculator::get calls within XsCalculator::operator().

It'd be interesting to see the cache misses in a multithreaded scenario.
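
The L1 read misses in a lookup like XsCalculator::operator() are easy to picture with a minimal sketch (this is an illustrative stand-in, not Celeritas's actual XsCalculator): each call searches an energy grid and then reads two adjacent table entries at a data-dependent offset, so consecutive calls for different particles touch scattered cache lines.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of a cross-section lookup in the spirit of
// XsCalculator::operator(): find the grid interval containing log(E),
// then read two neighboring table values to interpolate. The grid and
// value reads land at data-dependent offsets, which is where the L1
// read misses come from.
struct XsTable
{
    std::vector<double> log_energy;  // ascending log(E) grid
    std::vector<double> xs;          // cross section at each grid point

    double operator()(double energy) const
    {
        double loge = std::log(energy);
        // Find the upper bound of the interval containing loge
        std::size_t hi = 1;
        while (hi + 1 < log_energy.size() && log_energy[hi] < loge)
            ++hi;
        std::size_t lo = hi - 1;
        // Linear interpolation in log(E)
        double frac = (loge - log_energy[lo])
                      / (log_energy[hi] - log_energy[lo]);
        return xs[lo] + frac * (xs[hi] - xs[lo]);
    }
};
```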

sethrj commented on July 22, 2024

@esseivaju Is this with one track slot or the usual number (65K)? I guess the reason I wondered about single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot, and since the many-slot case is not really optimal (in terms of state cache locality and skipped loops due to masking) I wonder whether the call graph would look any different...
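
The "skipped loops due to masking" point can be made concrete with a toy loop (illustrative only, not Celeritas code): with many slots and a sparse occupancy mask, an action still streams through the whole state arrays, fetching cache lines even for slots that do no work.

```cpp
#include <vector>

// Illustrative sketch of a masked track-slot loop: the action visits
// every slot but only operates on occupied ones. With a sparse mask the
// loop still touches all per-slot state, so cache lines are pulled in
// for slots whose iterations are skipped.
struct TrackStates
{
    std::vector<bool> alive;     // mask: is this slot occupied?
    std::vector<double> energy;  // per-slot particle state
};

// Apply a (dummy) energy loss to each active slot; return count processed.
int execute_along_step(TrackStates& states, double eloss)
{
    int processed = 0;
    for (std::size_t i = 0; i < states.alive.size(); ++i)
    {
        if (!states.alive[i])
            continue;  // the masked (skipped) iteration
        states.energy[i] -= eloss;
        ++processed;
    }
    return processed;
}
```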

esseivaju commented on July 22, 2024

This is with 4k track slots.

> single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot,

Do you mean that in the single thread case, you saw better performance with one track slot?

sethrj commented on July 22, 2024

OK, 4k track slots is different from our usual regression CPU setting. What does the performance graph look like if you have a single track slot? (Make sure OpenMP is disabled! 😅) I would imagine that with a single track slot you'd get better cache performance for the particle state, even though cache performance for the "params" data might go down.

esseivaju commented on July 22, 2024

OK, I have some data with a single track slot. I had to set max_steps=-1, and OpenMP is disabled at build time. Without profiling, just running the regression problem takes ~3x longer with one track slot.

[Image: callgrind_estimate_singletrack]

Repeatedly calling ActionSequence::execute has a large overhead because of dynamic_cast and freeing memory. I haven't located what is being freed, but it's called exactly 20x per ActionSequence::execute, so each action is doing it at some point.
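
One common shape for this kind of overhead, sketched below under that assumption (hypothetical code, not how Celeritas actually structures its sequence): if a downcast happens per action on every execute call, the RTTI lookup recurs on every step iteration, whereas caching the downcast result once at construction removes it from the hot loop.

```cpp
#include <memory>
#include <vector>

// Hypothetical sketch: cache the result of dynamic_cast when the
// sequence is built, so the per-step execute loop runs over plain
// pointers with no RTTI lookup or temporary allocation.
struct ActionInterface
{
    virtual ~ActionInterface() = default;
};

struct ExplicitAction : ActionInterface
{
    void execute(int& counter) const { ++counter; }
};

class ActionSequence
{
  public:
    explicit ActionSequence(std::vector<std::shared_ptr<ActionInterface>> actions)
    {
        for (auto const& a : actions)
        {
            // One dynamic_cast per action at construction, not per step
            if (auto const* ea = dynamic_cast<ExplicitAction const*>(a.get()))
                explicit_actions_.push_back(ea);
        }
        owned_ = std::move(actions);
    }

    void execute(int& counter) const
    {
        // Hot loop: no dynamic_cast, no allocation
        for (auto const* ea : explicit_actions_)
            ea->execute(counter);
    }

  private:
    std::vector<std::shared_ptr<ActionInterface>> owned_;
    std::vector<ExplicitAction const*> explicit_actions_;
};
```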

Regarding cache efficiency, it isn't helping that much. Below, I'm showing the L1 cache misses per call to AlongStepUniformMscAction::execute (aggregating instruction misses and read/write misses), which is where most cache misses happen.

The first picture is the single-track-slot scenario; the second is 65k track slots. As expected, there are far fewer misses per call since you process one track at a time; however, multiplied by how many times you have to call the function, the total becomes much worse.

[Images: L1 misses per call for one track slot vs. 65k track slots]

In both cases, ~80% of the L1 misses are instruction fetches.
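
The tradeoff above is just a product, which a toy calculation (with invented numbers, not values from these profiles) makes explicit: a large drop in misses per call can still lose to a larger rise in call count.

```cpp
// Toy model of the per-call vs. call-count tradeoff. The numbers used
// in the comparison are made up for illustration only.
long total_misses(long misses_per_call, long calls)
{
    return misses_per_call * calls;
}
```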

sethrj commented on July 22, 2024

@esseivaju Looks like the allocation is coming from the actions()->label and passing into ScopedProfiling. I'm opening a PR to use string_view for the action labels/descriptions and to delay string allocation in the scoped profiling implementation.
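
The string_view fix can be sketched roughly like this (a minimal stand-in, not the actual Celeritas ScopedProfiling interface): taking the label by std::string_view avoids constructing and freeing a std::string temporary on every call, and an owning copy is made only if profiling is actually enabled.

```cpp
#include <string>
#include <string_view>

// Hedged sketch of delayed string allocation in a profiling scope guard:
// the caller passes a non-owning view, and the heap allocation happens
// only on the enabled path.
class ScopedProfiling
{
  public:
    explicit ScopedProfiling(std::string_view label)
    {
        if (enabled)
        {
            // Allocation is delayed to here: only when profiling is on
            label_ = std::string{label};
        }
    }

    std::string const& label() const { return label_; }

    static bool enabled;

  private:
    std::string label_;
};

bool ScopedProfiling::enabled = false;
```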
