Versions FTL: 5.25.2 Compiling FTL

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Overzealous inline opitimizations? about ftl HOT 7 OPEN

oliv3r commented on June 30, 2024

Overzealous inline opitimizations?

from ftl.

Comments (7)

DL6ER commented on June 30, 2024

Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

No, because "best" is not a well-defined term here. You see that -O3 and -Os define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os is the MCU world where you may really still be restricted to a few KB of program memory.

gravityDB_finalize_client_statements(*client) involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.

At this point, FTL isn't meant to be compiled with -Os and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.

TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.

*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable

from ftl.

oliv3r commented on June 30, 2024

Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

No, because "best" is not a well-defined term here. You see that -O3 and -Os define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os is the MCU world where you may really still be restricted to a few KB of program memory.

I beg to differ, but your point does stand. My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is. Also, -Os has an more important impact, which is caching. Caches is still limited in a lot of CPU's because space for cache is expensive. Also being more effective in cache usage does improve speed and performance. Benchmarks would need to prove this of course.

So there certainly is still usage for Os in certain scenario's. However as I said, your point does stand and I do agree with it for the most part. ;)

gravityDB_finalize_client_statements(*client) involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.

I'm more then happy to accept that you know what you are doing :) no need to prove anything :p

At this point, FTL isn't meant to be compiled with -Os and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.

And that's perfectly fine! So the answer to my question is 'no, we know what we are doing'. And that's great :) I should have potentially given more context, and that is I was trying to use the CMake target MinSizeRel because for the some parts, (openwrt, alpine, etc) it's the commonly picked target. This lead me down a path of compilation errors, where compilation was no longer possible.

That means, that

FTL/src/CMakeLists.txt

Line 106 in 8943e26

set(CMAKE_C_FLAGS_MINSIZEREL "-Os -DNDEBUG")

needs to be either removed, as it's not even valid, or (a bit better imo) is to either disable -Winline for that target (could we put -Wnonline there? or increase the inline warning limit (to which I had no success with so far).

TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.

Well then you conclude say we should actually remove MinSizeRel!

*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable

👍

from ftl.

DL6ER commented on June 30, 2024

My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is.

Ah, yes. We do not "support" any embedded devices out of the box so that's why I haven't thought about them. Sure, there are so many different devices out there that not everyone will have a USB port or whatsover with which you could extend storage for the hundreds GBs with only couple of dollars. This context indeed sheds another light on your endeavor - one which I can easily follow.

Well then you conclude say we should actually remove MinSizeRel!

I have no objections. A quick look on git blame on this line reveals it was an external contribution and (assumption ahead!) probably originated from some default template just as the other two (DEBUG and RELEASE).

The Pi-hole team itself solely uses the RELWITHDEBINFO target both with our internal tooling as well as with the recommended compiling steps packed together into build.sh. I'd be ready to say that MinSizeRel was probably tried back when this was added but never since again.

from ftl.

kocburak commented on June 30, 2024

Hi, (newbiew here)
Why aren't we using -Ofast instead of -O3 ?

from ftl.

oliv3r commented on June 30, 2024

@DL6ER so for me, I'm just building with Release and I'm done; but I suppose you guys have to make a choice here. Or at least have test targets that build them ;)

So I'll leave this issue open and you can close it with whatever solution you feel fits best.

from ftl.

Overzealous inline opitimizations? about ftl HOT 7 OPEN

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent