Having now worked a little bit with the Postgres code to support <code class="notransl

the optimizer . . . can accumulate/calulcate the <code class="notranslate

I've been trying to <a href="https://github.com/SolarNetwork/aggs_for_vecs/blob/68a464

I've cleaned up the code a bit in the latest <a href="https://github.com/SolarNetwork/

Possible to use existing Postgres aggregate functions like numeric_avg_accum, towards two-step aggregation about aggs_for_vecs HOT 12 CLOSED

pjungwir commented on July 19, 2024

Possible to use existing Postgres aggregate functions like numeric_avg_accum, towards two-step aggregation

from aggs_for_vecs.

Comments (12)

pjungwir commented on July 19, 2024

the optimizer . . . can accumulate/calulcate the NumericAggState once for all three aggregates

That is a cool trick! I can't find any notes about it in the documentation or source code though. Can you send me a reference where you're reading that?

Would it be possible to essentially create a copy of the FunctionCallInfo

This seems like it should be possible. It makes me a bit uneasy though. It seems like something that could easily break between Postgres versions. But if it is really a big performance boost, it might still be worth it. I'd be happy to at least see an experiment taking that approach.

If you're interested in speeding up these functions, making them support partial aggregation would probably have a big payoff, and that would stay within something officially supported.

A few years ago I also experimented with SIMD instructions, and in my rough benchmarks it was roughly halving the time. I just never got around to implementing all the functions. I don't think we could do that for NUMERIC though since we rely on Postgres's own arithmetic functions.

from aggs_for_vecs.

msqr commented on July 19, 2024

the optimizer . . . can accumulate/calulcate the NumericAggState once for all three aggregates

I was actually only basing that statement on what the Timescale folks wrote, here. I don't actually know how that translates to the code, I made a guess that the NumericAggState structure was used because I saw that being used in aggregate functions like numeric_avg_accum.

I was thinking partial aggregation was a slightly different optimization that what I was thinking of, and both could be applied! Am I correct in understanding that partial aggregation would allow parallel workers to compute a subset of the aggregation, followed by a combining function to merge all the worker results into one final result? So what I'd like to do is more optimize the time spent computing multiple aggregates on the same input data set. In my use case, I want to calculate the average, min, max, count, and sum for sets of data. Here's an example query (it uses the Timescale time_bucket() function):

SELECT 
	  time_bucket('5 minutes'::interval, d.ts) AS ts_start
	, vec_to_mean(d.data_i) AS data_i
	, vec_to_count(d.data_i) AS count_i
	, vec_to_min(d.data_i) AS min_i
	, vec_to_max(d.data_i) AS max_i
FROM d
GROUP BY time_bucket('5 minutes'::interval, d.ts)

What I was hoping to be able to do was something like this:

SELECT 
	  time_bucket('5 minutes'::interval, d.ts) AS ts_start
	, vec_to_mean(vec_stat_calc(d.data_i)) AS data_i
	, vec_to_count(vec_stat_calc(d.data_i)) AS count_i
	, vec_to_min(vec_stat_calc(d.data_i)) AS min_i
	, vec_to_max(vec_stat_calc(d.data_i)) AS max_i
FROM d
GROUP BY time_bucket('5 minutes'::interval, d.ts)

with the hope that the optimizer would be smart enough to execute vec_stat_calc(d.data_i) once into something like a VecNumericAggState aggregate state structure, and then all the vec_to_X() functions turn into "simple" accessors for that aggregate out of the shared state. The VecNumericAggState could support all numeric data types, like what the pgnum union provides.

This seems like it should be possible. It makes me a bit uneasy though. It seems like something that could easily break between Postgres versions. But if it is really a big performance boost, it might still be worth it. I'd be happy to at least see an experiment taking that approach.

I wondered about that. I thought the things in fmgr.h like fmgr_info_copy() or InitFunctionCallInfoData followed by FunctionCallInvoke would be fairly stable APIs (if those are even appropriate here, I'm make a lot of guesses as I'm inexperienced in this stuff).

from aggs_for_vecs.

pjungwir commented on July 19, 2024

Ah, I read the Timestamp article you linked before, but it didn't click that you could do the same thing in stock Postgres. That's cool!

Yes, the optimizer can identify repeated expressions and re-use the results. For example this will only call vec_to_max once:

select vec_to_max(nums), (vec_to_max(nums))[1], array_length(vec_to_max(nums), 1) from measurements;

I like that idea of exposing an aggregate accumulator function and then separate accessor functions to pull off the results. You'll have to define a custom type I believe, but I don't think it will need much user-facing stuff.

I'd like to still keep the simple versions of our vec_to_foo functions, so that people don't need to do the two-step thing if they don't want to. But having a two-step option would be really cool.

from aggs_for_vecs.

pjungwir commented on July 19, 2024

Oh I meant to add: you're probably right about fmgr being pretty stable.

from aggs_for_vecs.

pjungwir commented on July 19, 2024

Am I correct in understanding that partial aggregation would allow parallel workers to compute a subset of the aggregation, followed by a combining function to merge all the worker results into one final result?

Exactly!

from aggs_for_vecs.

msqr commented on July 19, 2024

OK! So the two-step aggregate is in some ways separate from trying to re-use those numeric_* functions as originally outlined in this issue. I could try to implement a two-step aggregate out of the existing code, it's just that in the NUMERIC case there is so much already sitting there in Postgres, I thought it would be most excellent to re-use it. Do you have any tips/pointers on using the things in fmgr.h to make the sort of lower-level calls I'm trying to do?

from aggs_for_vecs.

msqr commented on July 19, 2024

I've been trying to hack my way into understanding how to make this work. At this point, I'm able to look up numeric_avg_accum function and I think I'm invoking it once per array element, saving the returned state value and passing it back each time.

Then in the final function I look up the numeric_avg function and invoke it, but it is returning NULL and I haven't been able to figure out why.

Here's what I get when I try to use the vec_to_mean function, with all my excessive logging:

SELECT vec_to_mean(nums) FROM measurements WHERE sensor_id IN (5,6);

NOTICE:  cached numeric_avg_accum FmgrInfo
NOTICE:  allocated vec_accum_fcinfo memory 64
NOTICE:  init vec_accum_fcinfo struct
NOTICE:  invoking trans 0
NOTICE:  invoking trans 1
NOTICE:  invoking trans 2
NOTICE:  invoking trans 0
NOTICE:  invoking trans 1
NOTICE:  invoking trans 2
NOTICE:  about to do create proxy_fcinfo
NOTICE:  created proxy_fcinfo
NOTICE:  init proxy_fcinfo
NOTICE:  invoking final 0
ERROR:  numeric_avg returned NULL

I've been stuck on this for a while now, maybe you have an idea what's going wrong?

from aggs_for_vecs.

pjungwir commented on July 19, 2024

I took a look through your code & the built-in functions you're using. I didn't see anything obvious. But since numeric_avg does this . . .


    /* If there were no non-null inputs, return NULL */
    if (state == NULL || NA_TOTAL_COUNT(state) == 0)
        PG_RETURN_NULL();

I think I'd start with inspecting the state before you call it to make sure NA_TOTAL_COUNT is not zero.

I don't think I'll have a chance to look more closely until the weekend. One thing you could do is build Postgres from source then install the extension against it so you can add debugging calls to the Postgres functions too.

from aggs_for_vecs.

msqr commented on July 19, 2024

Ah, I needed a break to get my head cleared to see the logic bug I fixed in SolarNetwork/aggs_for_vecs@0a84ca5. After that, the function runs. I did take your advice and locally build Postgres and learn how to debug it, so that was a nice learning experience and quite interesting to step through the code. Now I can clean the code up and do some actual testing!

from aggs_for_vecs.

msqr commented on July 19, 2024

I've cleaned up the code a bit in the latest feature/numeric-stats-agg and the function is still not quite right because of some memory handling issue I haven't been able to figure out yet. The function will eventually crash after a few invocations, and looking in the debugger it looks like some memory addresses are invalid.

from aggs_for_vecs.

pjungwir commented on July 19, 2024

Ha ha I stared and stared at that line and never saw the problem. :-) It sounds like you're making some great progress!

from aggs_for_vecs.

msqr commented on July 19, 2024

OK, I was able to fix the bugs and get everything working in feature/numeric-stats-agg. I think I should start over in a new branch, however, so the original single-step implementation remains as-is, and I can work on two-step aggregation to really get the most benefit out of this approach. I'll close this issue as I think its served its (exploratory) purpose now.

from aggs_for_vecs.

Possible to use existing Postgres aggregate functions like numeric_avg_accum, towards two-step aggregation about aggs_for_vecs HOT 12 CLOSED

Comments (12)

Related Issues (8)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent