Comments (12)
the optimizer . . . can accumulate/calulcate the
NumericAggState
once for all three aggregates
That is a cool trick! I can't find any notes about it in the documentation or source code though. Can you send me a reference where you're reading that?
Would it be possible to essentially create a copy of the
FunctionCallInfo
This seems like it should be possible. It makes me a bit uneasy though. It seems like something that could easily break between Postgres versions. But if it is really a big performance boost, it might still be worth it. I'd be happy to at least see an experiment taking that approach.
If you're interested in speeding up these functions, making them support partial aggregation would probably have a big payoff, and that would stay within something officially supported.
A few years ago I also experimented with SIMD instructions, and in my rough benchmarks it was roughly halving the time. I just never got around to implementing all the functions. I don't think we could do that for NUMERIC
though since we rely on Postgres's own arithmetic functions.
from aggs_for_vecs.
the optimizer . . . can accumulate/calulcate the NumericAggState once for all three aggregates
I was actually only basing that statement on what the Timescale folks wrote, here. I don't actually know how that translates to the code, I made a guess that the NumericAggState
structure was used because I saw that being used in aggregate functions like numeric_avg_accum.
I was thinking partial aggregation was a slightly different optimization that what I was thinking of, and both could be applied! Am I correct in understanding that partial aggregation would allow parallel workers to compute a subset of the aggregation, followed by a combining function to merge all the worker results into one final result? So what I'd like to do is more optimize the time spent computing multiple aggregates on the same input data set. In my use case, I want to calculate the average, min, max, count, and sum for sets of data. Here's an example query (it uses the Timescale time_bucket() function):
SELECT
time_bucket('5 minutes'::interval, d.ts) AS ts_start
, vec_to_mean(d.data_i) AS data_i
, vec_to_count(d.data_i) AS count_i
, vec_to_min(d.data_i) AS min_i
, vec_to_max(d.data_i) AS max_i
FROM d
GROUP BY time_bucket('5 minutes'::interval, d.ts)
What I was hoping to be able to do was something like this:
SELECT
time_bucket('5 minutes'::interval, d.ts) AS ts_start
, vec_to_mean(vec_stat_calc(d.data_i)) AS data_i
, vec_to_count(vec_stat_calc(d.data_i)) AS count_i
, vec_to_min(vec_stat_calc(d.data_i)) AS min_i
, vec_to_max(vec_stat_calc(d.data_i)) AS max_i
FROM d
GROUP BY time_bucket('5 minutes'::interval, d.ts)
with the hope that the optimizer would be smart enough to execute vec_stat_calc(d.data_i)
once into something like a VecNumericAggState
aggregate state structure, and then all the vec_to_X()
functions turn into "simple" accessors for that aggregate out of the shared state. The VecNumericAggState
could support all numeric data types, like what the pgnum
union provides.
This seems like it should be possible. It makes me a bit uneasy though. It seems like something that could easily break between Postgres versions. But if it is really a big performance boost, it might still be worth it. I'd be happy to at least see an experiment taking that approach.
I wondered about that. I thought the things in fmgr.h
like fmgr_info_copy()
or InitFunctionCallInfoData
followed by FunctionCallInvoke
would be fairly stable APIs (if those are even appropriate here, I'm make a lot of guesses as I'm inexperienced in this stuff).
from aggs_for_vecs.
Ah, I read the Timestamp article you linked before, but it didn't click that you could do the same thing in stock Postgres. That's cool!
Yes, the optimizer can identify repeated expressions and re-use the results. For example this will only call vec_to_max
once:
select vec_to_max(nums), (vec_to_max(nums))[1], array_length(vec_to_max(nums), 1) from measurements;
I like that idea of exposing an aggregate accumulator function and then separate accessor functions to pull off the results. You'll have to define a custom type I believe, but I don't think it will need much user-facing stuff.
I'd like to still keep the simple versions of our vec_to_foo
functions, so that people don't need to do the two-step thing if they don't want to. But having a two-step option would be really cool.
from aggs_for_vecs.
Oh I meant to add: you're probably right about fmgr being pretty stable.
from aggs_for_vecs.
Am I correct in understanding that partial aggregation would allow parallel workers to compute a subset of the aggregation, followed by a combining function to merge all the worker results into one final result?
Exactly!
from aggs_for_vecs.
OK! So the two-step aggregate is in some ways separate from trying to re-use those numeric_*
functions as originally outlined in this issue. I could try to implement a two-step aggregate out of the existing code, it's just that in the NUMERIC
case there is so much already sitting there in Postgres, I thought it would be most excellent to re-use it. Do you have any tips/pointers on using the things in fmgr.h
to make the sort of lower-level calls I'm trying to do?
from aggs_for_vecs.
I've been trying to hack my way into understanding how to make this work. At this point, I'm able to look up numeric_avg_accum
function and I think I'm invoking it once per array element, saving the returned state value and passing it back each time.
Then in the final function I look up the numeric_avg
function and invoke it, but it is returning NULL
and I haven't been able to figure out why.
Here's what I get when I try to use the vec_to_mean
function, with all my excessive logging:
SELECT vec_to_mean(nums) FROM measurements WHERE sensor_id IN (5,6);
NOTICE: cached numeric_avg_accum FmgrInfo
NOTICE: allocated vec_accum_fcinfo memory 64
NOTICE: init vec_accum_fcinfo struct
NOTICE: invoking trans 0
NOTICE: invoking trans 1
NOTICE: invoking trans 2
NOTICE: invoking trans 0
NOTICE: invoking trans 1
NOTICE: invoking trans 2
NOTICE: about to do create proxy_fcinfo
NOTICE: created proxy_fcinfo
NOTICE: init proxy_fcinfo
NOTICE: invoking final 0
ERROR: numeric_avg returned NULL
I've been stuck on this for a while now, maybe you have an idea what's going wrong?
from aggs_for_vecs.
I took a look through your code & the built-in functions you're using. I didn't see anything obvious. But since numeric_avg
does this . . .
/* If there were no non-null inputs, return NULL */
if (state == NULL || NA_TOTAL_COUNT(state) == 0)
PG_RETURN_NULL();
I think I'd start with inspecting the state before you call it to make sure NA_TOTAL_COUNT
is not zero.
I don't think I'll have a chance to look more closely until the weekend. One thing you could do is build Postgres from source then install the extension against it so you can add debugging calls to the Postgres functions too.
from aggs_for_vecs.
Ah, I needed a break to get my head cleared to see the logic bug I fixed in SolarNetwork/aggs_for_vecs@0a84ca5. After that, the function runs. I did take your advice and locally build Postgres and learn how to debug it, so that was a nice learning experience and quite interesting to step through the code. Now I can clean the code up and do some actual testing!
from aggs_for_vecs.
I've cleaned up the code a bit in the latest feature/numeric-stats-agg and the function is still not quite right because of some memory handling issue I haven't been able to figure out yet. The function will eventually crash after a few invocations, and looking in the debugger it looks like some memory addresses are invalid.
from aggs_for_vecs.
Ha ha I stared and stared at that line and never saw the problem. :-) It sounds like you're making some great progress!
from aggs_for_vecs.
OK, I was able to fix the bugs and get everything working in feature/numeric-stats-agg. I think I should start over in a new branch, however, so the original single-step implementation remains as-is, and I can work on two-step aggregation to really get the most benefit out of this approach. I'll close this issue as I think its served its (exploratory) purpose now.
from aggs_for_vecs.
Related Issues (8)
- Support for NUMERIC[] type HOT 32
- Two-step aggregation HOT 24
- Add vec_to_first and vec_to_last aggregates. HOT 2
- Debian package support HOT 2
- Support for variable-length input arrays HOT 6
- Allow processing more input rows than veccounts currently allows HOT 1
- Array math utilities like vec_add, vec_sub, vec_mul, vec_div HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from aggs_for_vecs.