
Invalidations about cpusummary.jl (open)

timholy commented on August 19, 2024
Invalidations from cpusummary.jl.

Comments (20)

chriselrod commented on August 19, 2024

which may explain why I'm seeing those invalidations and others may not?

You shouldn't when starting Julia with -t4.

If we want to fix these, the solution is to have those libraries / functions stop using num_threads().
I don't think the invalidations of num_threads are avoidable if we want it to maintain the current behavior.
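For concreteness, a rough sketch of why the two differ (illustrative only, not CPUSummary's actual source):

using Static  # provides StaticInt, which CPUSummary uses for compile-time constants

# CPUSummary-style: the thread count is encoded in the return *type*, so callers can
# constant-fold it -- but updating the count means redefining the method, which
# invalidates everything compiled against the old definition.
my_num_threads() = StaticInt{8}()

# Threads.nthreads()-style: an ordinary runtime query; nothing is ever redefined,
# so compiled callers are never invalidated (at the cost of losing the constant).
my_num_threads_runtime() = Threads.nthreads()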

chriselrod commented on August 19, 2024

Again, I don't think this is an issue of primary importance, but it does seem worth keeping track of for a rainy day. The perform_step! recompilation is ~0.5s, so not entirely cheap (but not catastrophic either).

I strongly suspect this is enough to favor Threads.nthreads() there.
If you or @ChrisRackauckas have a benchmark I can run to confirm negligible runtime difference, I'll do that. I'll also try a few microbenchmarks.
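Something like this would be a starting point for those microbenchmarks (a sketch; assumes BenchmarkTools is installed):

using BenchmarkTools, CPUSummary

@btime Threads.nthreads()          # ordinary runtime query
@btime CPUSummary.num_threads()    # returns a compile-time StaticInt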

chriselrod commented on August 19, 2024

Ah, yes. It doesn't actually use Sys.CPU_THREADS, instead preferring to use information from Hwloc (with adjustments for ARM Macs -- I think Hwloc might have more details to handle things better).
Presumably you won't get invalidations when starting with 12 threads.

Perhaps, in the case of disagreement, I should favor Sys.CPU_THREADS over the actual number of threads, assuming the disagreement is because of a deliberate user choice.

ChrisRackauckas commented on August 19, 2024

There's https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/.github/workflows/Downstream.yml which is a quick way to set up a bunch of integration tests on subsets of downstream package tests.

chriselrod commented on August 19, 2024

For things like setting the number of threads, can LV basically do the same thing we do with LLVM multiversioning? @turbo could emit a block that starts with

@tturbo or @turbo threads=true could do something like that, but we probably only need the check vs 1.
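i.e., roughly (a sketch of the emitted check, not the actual macro expansion):

if Threads.nthreads() == 1
    # plain single-threaded @turbo loop
else
    # threaded loop, splitting the iteration space across the available threads
end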

timholy commented on August 19, 2024

See the good news in SciML/DifferentialEquations.jl#786 (emergency is over 🙂). We should still poke at this, but it isn't urgent.

timholy commented on August 19, 2024

I did just start Julia with julia -t4 and still saw those invalidations. I'm wondering if there's some compilation-state dependence, and in particular whether it matters whether you build CPUSummary via ] precompile or via using SomePackageThatForcesItToBuild.

With master for Julia and ] dev SnoopCompile SnoopCompileCore (or just "regular" when 2.8 comes out) it's pretty easy to check:

using SnoopCompileCore
invalidations = @snoopr using OrdinaryDiffEq, ModelingToolkit;
using SnoopCompile
trees = invalidation_trees(invalidations)

and then look for a num_threads tree.

Again, I don't think this is an issue of primary importance, but it does seem worth keeping track of for a rainy day. The perform_step! recompilation is ~0.5s, so not entirely cheap (but not catastrophic either).

chriselrod commented on August 19, 2024

using SnoopCompileCore
invalidations = @snoopr using OrdinaryDiffEq, ModelingToolkit;
using SnoopCompile, CPUSummary
trees = invalidation_trees(invalidations);
ctrees = filtermod(CPUSummary, trees)

I get

julia> ctrees = filtermod(CPUSummary, trees)
1-element Vector{SnoopCompile.MethodInvalidations}:
 inserting convert(S::Type{<:Union{Number, T}}, p::MultivariatePolynomials.AbstractPolynomialLike{T}) where T in MultivariatePolynomials at /home/chriselrod/.julia/packages/MultivariatePolynomials/vqcb5/src/conversion.jl:65 invalidated:
   mt_backedges: 1: signature Tuple{typeof(convert), Type{Hwloc.Attribute}, Any} triggered MethodInstance for CPUSummary.safe_topology_load!() (1 children)


julia> Threads.nthreads(), Sys.CPU_THREADS
(8, 8)

In another Julia session

julia> ctrees = filtermod(CPUSummary, trees)
2-element Vector{SnoopCompile.MethodInvalidations}:
 inserting convert(S::Type{<:Union{Number, T}}, p::MultivariatePolynomials.AbstractPolynomialLike{T}) where T in MultivariatePolynomials at /home/chriselrod/.julia/packages/MultivariatePolynomials/vqcb5/src/conversion.jl:65 invalidated:
   mt_backedges: 1: signature Tuple{typeof(convert), Type{Hwloc.Attribute}, Any} triggered MethodInstance for CPUSummary.safe_topology_load!() (1 children)

 deleting num_threads() in CPUSummary at /home/chriselrod/.julia/packages/CPUSummary/dEmFX/src/topology.jl:42 invalidated:
   backedges: 1: superseding num_threads() in CPUSummary at /home/chriselrod/.julia/packages/CPUSummary/dEmFX/src/topology.jl:42 with MethodInstance for CPUSummary.num_threads() (2 children)


julia> Threads.nthreads(), Sys.CPU_THREADS
(1, 8)

(ode) pkg> st CPUSummary
      Status `~/Documents/progwork/julia/env/ode/Project.toml`
  [2a0fbf3d] CPUSummary v0.1.2

So it appears to be working as intended for me.

timholy commented on August 19, 2024

I have

$ env | grep -i thread
JULIA_CPU_THREADS=4

Is that possibly problematic? This is on an Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz (6 physical cores).
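For reference, a quick way to compare what each source reports (a sketch; Hwloc.jl's num_physical_cores is the only non-Base call assumed):

using Hwloc

@show Sys.CPU_THREADS              # respects JULIA_CPU_THREADS (4 here)
@show Hwloc.num_physical_cores()   # queried from the hardware (6 on this machine)
@show Threads.nthreads()           # set by -t / JULIA_NUM_THREADS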

timholy commented on August 19, 2024

At least on nightly, and when starting with -t4, the only thing that seems to be holding back good precompilation of LV-generated code is the redefinition of cache_size

@eval cache_size(::Union{Val{3},StaticInt{3}}) = $(static(cache_l3_per_core * nc))
which was formerly defined as
@eval cache_size(::Union{Val{$i},StaticInt{$i}}) = $(static(csi))

Not easy for me to fix because method redefinition like this is not "typical" (I recognize you do amazing, atypical things) and I don't know the motivations well enough to offer an alternative.

CC @Tokazama

timholy commented on August 19, 2024

I'd be happy to offer an integration test that checks for new inference when running the precompiled workload for a demo consumer of LoopVectorization. If you want it, just let me know which repo I should submit it to.
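Roughly, I'm imagining something like this (a sketch; run_demo_workload is a hypothetical stand-in for the precompiled demo workload):

using SnoopCompileCore
tinf = @snoopi_deep run_demo_workload()   # hypothetical demo-workload function
using SnoopCompile, Test

# If precompilation covered the workload, fresh inference should be (nearly) absent:
@test isempty(inference_triggers(tinf))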

chriselrod commented on August 19, 2024

Not easy for me to fix because method redefinition like this is not "typical" (I recognize you do amazing, atypical things) and I don't know the motivations well enough to offer an alternative.

We should stop doing that.

chriselrod commented on August 19, 2024

I'd be happy to offer an integration test that checks for new inference when running the precompiled workload for a demo consumer of LoopVectorization. If you want it, just let me know which repo I should submit it to.

Which repo do you think would be best? LoopVectorization.jl itself, or something that depends on it like TriangularSolve.jl or RecursiveFactorization.jl?

timholy commented on August 19, 2024

Probably LV itself. The only issue to be aware of is that tracking down the origin of breakage might require a bit of hunting: if a PR to, say, this package breaks the integration test, then you won't know you've broken it until you next run the tests of LoopVectorization.jl. Unless you like the idea of running that specific test in several of LV's dependencies? You can see something similar to what I mean in CodeTracking, which exists to serve Revise:

Tokazama commented on August 19, 2024

Apologies, I'm not sure what the original motivation for redefining it like this was.

timholy commented on August 19, 2024

method redefinition

We should stop doing that.

One thing to check: are you aware that you can use the precompilation process to your advantage? Your package can contain

const some_value_or_type_that_must_be_known_to_inference = begin
    # Some complicated computation, calling lots of functions, which may not be inferrable
end

and the only thing that gets written to the .ji cache file is some_value_or_type_that_must_be_known_to_inference itself. In other words, that block only runs at precompile time; it doesn't run when you load the package.

Of course, if you need to do some things in __init__, then this won't help.

For things like setting the number of threads, can LV basically do the same thing we do with LLVM multiversioning? @turbo could emit a block that starts with

if Threads.nthreads() == 1
    # single-threaded implementation
elseif Threads.nthreads() == 6 # my laptop has 6 physical cores
    # 6-thread implementation
else
    @debug "Non-optimized implementation"
    # fallback
end

For users who might want to customize the default number (I typically use 4 threads to reserve a couple for something besides Julia), we could use Preferences.
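A sketch of the Preferences idea (the preference name here is made up):

# Inside the package module:
using Preferences

# User-configurable default, falling back to all available threads.
const DEFAULT_LV_THREADS = @load_preference("nthreads", Threads.nthreads())

Users could then override it in their LocalPreferences.toml without touching the package.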

timholy commented on August 19, 2024

@ChrisRackauckas, do you have a link to whatever sits on the opposite side of that workflow? It looks useful but I wasn't sure how to trigger it.

ChrisRackauckas commented on August 19, 2024

https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/test/runtests.jl#L5-L17

It just grabs the group and runs the subset of the tests.
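For anyone skimming, the pattern boils down to something like this (group names here are illustrative, not the actual OrdinaryDiffEq groups):

# test/runtests.jl
const GROUP = get(ENV, "GROUP", "All")

if GROUP == "All" || GROUP == "Interface"
    include("interface_tests.jl")
end
if GROUP == "All" || GROUP == "Downstream"
    include("downstream_tests.jl")
end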

chriselrod commented on August 19, 2024

Not easy for me to fix because method redefinition like this is not "typical" (I recognize you do amazing, atypical things) and I don't know the motivations well enough to offer an alternative.

You give me too much credit!
I'd meant to start working on cache-based blocking in LoopVectorization, but started working on the rewrite instead.

This was added for that, under the theory it's unlikely to change normally.
Then, more recently, I decided to start redefining L3 cache sizes based on how many threads we have, so code using it won't try to use more than its "share".
This causes invalidations, but is maybe helpful for packages like Octavian.
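In other words, roughly (hypothetical numbers; the real values come from Hwloc):

total_l3_bytes     = 12 * 2^20   # e.g. a 12 MiB shared L3
num_physical_cores = 6

cache_l3_per_core = total_l3_bytes ÷ num_physical_cores
nc = min(Threads.nthreads(), num_physical_cores)   # cores this session will actually use
effective_l3 = cache_l3_per_core * nc              # roughly what cache_size(StaticInt(3)) gets redefined to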

All that said, one fix was to remove it from LoopVectorization:
JuliaSIMD/LoopVectorization.jl@def5ad1
A second fix was to define the cache as cache per core:
e6f6461

Long term, I'm not overly concerned about this library.
The rewrite will get cache sizes via
https://llvm.org/doxygen/classllvm_1_1TargetTransformInfo.html#a11e8f29aef00ec6b5ffe4bfcc9e965f4
and should hopefully play well with whatever multi-versioning scheme we're using. But we'll see what issues arise when we get there, and that's still a long ways off at the moment.
