
Two-step aggregation · aggs_for_vecs (closed)

pjungwir commented on July 19, 2024
Two-step aggregation

Comments (24)

pjungwir commented on July 19, 2024

I can't think of any concrete reason using a Datum is bad. I suppose it just feels like Datum is for external interfaces and that struct is something internal. But yeah since you're using DirectFunctionCall a lot it's convenient to keep them that way. And it makes for some nice polymorphism.

Btw in case you haven't noticed, there are special-purpose InputFunctionCall and OutputFunctionCall functions you can use here instead of DirectFunctionCalling the component in/out functions.
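E.g., a minimal sketch of that pattern (datum_to_cstring is just a hypothetical helper, not something in the tree):

#include "postgres.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/* hypothetical helper: render a Datum of a known type as a cstring
 * by looking up and calling that type's own output function */
static char *
datum_to_cstring(Datum value, Oid typeId)
{
    Oid      outFuncId;
    bool     isVarlena;
    FmgrInfo flinfo;

    getTypeOutputInfo(typeId, &outFuncId, &isVarlena);
    fmgr_info(outFuncId, &flinfo);
    return OutputFunctionCall(&flinfo, value);
}

InputFunctionCall works the same way via getTypeInputInfo, plus the typioparam and typmod arguments.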

pjungwir commented on July 19, 2024

Yeah, I think a custom type is the way to go. Since it's intended as an intermediate value, the in/out functions are not too important. At first you could even just return "hello world" for out and something initialized to all zeros for in. Of course doing something "real" would be better.

Those functions are not really that hard to write. Postgres has some nice helper functions. Here is the in for multiranges, something I wrote a year or two ago. It might be a good example because it's not completely trivial, but it's still pretty easy to understand. A mini finite-state machine is a good way to go. And then even for something complicated, the out function should be very easy.

I'm not sure it makes sense to have Datum fields in this struct instead of the internal types---but I understand how that keeps it generic. Perhaps it's okay.

All these Datums will be of type elementType, right? So then for your _out you can call that type's own _out. And for your _in, you can use a separator that is guaranteed not to be in those types (which should be easy since they're all numbers). Then you can read until that separator and call the appropriate _in.
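As a rough sketch of that _out idea (assuming the struct keeps Datum stats like sum/min/max and you've already looked up the element type's output function into elemOut; ',' is safe as the separator since the elements are all numbers):

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"

/* sketch: serialize one element's stats by delegating to the element
 * type's own output function, joined by a separator */
static void
append_element_stats(StringInfo buf, FmgrInfo *elemOut,
                     Datum sum, Datum min, Datum max)
{
    appendStringInfoString(buf, OutputFunctionCall(elemOut, sum));
    appendStringInfoChar(buf, ',');
    appendStringInfoString(buf, OutputFunctionCall(elemOut, min));
    appendStringInfoChar(buf, ',');
    appendStringInfoString(buf, OutputFunctionCall(elemOut, max));
}

The _in then just scans up to each separator and hands every token to InputFunctionCall.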

msqr commented on July 19, 2024

Great, thanks for the link to that example and confirmation on the approach.

> I'm not sure it makes sense to have Datum fields in this struct instead of the internal types---but I understand how that keeps it generic. Perhaps it's okay.

> All these Datums will be of type elementType, right? So then for your _out you can call that type's own _out. And for your _in, you can use a separator that is guaranteed not to be in those types (which should be easy since they're all numbers). Then you can read until that separator and call the appropriate _in.

Yes, all the Datums will be of elementType so using that type's _in and _out was exactly what I was hoping to be able to do. Alternatively I thought these could be pgnum values, if you think that would be a better approach? I figured I'd be trying to convert them to Datum frequently enough that maybe Datum was better?

msqr commented on July 19, 2024

I've been working through the custom type, defined currently as

CREATE TYPE vecaggstats (
    internallength = variable,
    input = vecaggstats_in,
    output = vecaggstats_out
);

and thanks for your suggestion of using placeholder values for the _in and _out functions to start. I was thinking it would be useful to look up & cache the Oid of the vecaggstats type in the _PG_init function; however, if I do this:

// cache our custom VecAggElementStats type Oid
static Oid VecAggElementStatsTypeOid;

void
_PG_init(void)
{
  // FIXME: this returns error "Unknown type: vecaggstats"... how to do this?
  VecAggElementStatsTypeOid = typenameTypeId(NULL, typeStringToTypeName("vecaggstats"));
}

it throws an error when running CREATE EXTENSION aggs_for_vecs. I suppose that the vecaggstats type is not available until after that _PG_init callback then? Is there another way to lookup/cache the Oid just once after the extension is loaded? Or is this typeStringToTypeName() call not worth caching anyway?

pjungwir commented on July 19, 2024

I don't think caching the type oid is correct. As you say the type doesn't exist yet, but also it seems like there may be problems if the type gets dropped/restored. Anyway Postgres has its own type cache so just using the typenameTypeId function should be very fast (and cached after the first time).
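So something like this sketch, resolving the oid wherever it's needed (using the same call from your snippet):

#include "postgres.h"
#include "parser/parse_type.h"

/* sketch: look the type up on demand; typenameTypeId goes through the
 * syscache, so repeated calls after the first are cheap */
static Oid
vecaggstats_type_oid(void)
{
    return typenameTypeId(NULL, typeStringToTypeName("vecaggstats"));
}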

msqr commented on July 19, 2024

I see, it was the typeStringToTypeName function I thought might be worth caching, but I'll just call that as needed then, thanks.

msqr commented on July 19, 2024

I am to the point in feature/two-step-aggs where this SQL runs:

SELECT vec_agg_count(vec_stat_agg(nums)) FROM measurements;
 vec_agg_count 
---------------
 {4,3,4}

However I have so many doubts/questions that I'm unsure of the correctness of most of it. First up, the vec_stat_agg function is returning an array of VecAggElementStatsType as previously discussed:

CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, numeric[])
RETURNS vecaggstats[]
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;

typedef struct VecAggElementStatsType {
  char    vl_len_[4];
  Oid     elemTypeId;
  uint32  count;
  Datum   sum;
  Datum   min;
  Datum   max;
  Datum   mean;
} VecAggElementStatsType;

I was thinking that elemTypeId is really for the entire array, not each element. Do you think a single value should instead be returned from vec_stat_agg, something like:

CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, numeric[])
RETURNS vecaggstats    -- no longer array here
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;

typedef struct VecAggStatsType {
  char   vl_len_[4];
  Oid    elemTypeId; // type of Datum used below
  uint32 elemCount;  // number of elements in Datum arrays below
  Datum *counts;
  Datum *sums;
  Datum *mins;
  Datum *maxes;
  Datum *means;
} VecAggStatsType;

If so, is that even the way to define Datum arrays here, or would they need to be defined as ArrayType *, like

typedef struct VecAggStatsType {
  char       vl_len_[4];
  Oid        elemTypeId; // type of Datum used below
  uint32     elemCount;  // number of elements in Datum arrays below
  ArrayType *counts;
  ArrayType *sums;
  ArrayType *mins;
  ArrayType *maxes;
  ArrayType *means;
} VecAggStatsType;

pjungwir commented on July 19, 2024

I think returning a single value instead of an array makes more sense.

I think I'd make elemTypeId the type of the input vector's elements, not the type of the array.

I don't think I'd include means at all, since the accessor method can just do sums/counts.
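For instance, a rough sketch of that accessor (mean_from_sum_count is hypothetical; int8_numeric and numeric_div are the built-in conversion and division functions):

#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgrprotos.h"

/* sketch: derive the mean on demand instead of storing it */
static Datum
mean_from_sum_count(Datum sum, uint32 count)
{
    Datum count_num;

    if (count == 0)
        elog(ERROR, "cannot compute mean of zero elements");
    count_num = DirectFunctionCall1(int8_numeric,
                                    Int64GetDatum((int64) count));
    return DirectFunctionCall2(numeric_div, sum, count_num);
}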

I think I'd do Datum *foo rather than ArrayType *foo. Since these are just internal intermediate values, it doesn't seem worth it to build Postgres arrays for them.

msqr commented on July 19, 2024

> I think returning a single value instead of an array makes more sense.

Great, I'll give that a go.

> I think I'd make elemTypeId the type of the input vector's elements, not the type of the array.

Yes you are correct, I was glossing over the fact that most of the array types are the same as the input element type, but not necessarily so for things like counts which will be fixed to int8 even when the input is numeric.

> I don't think I'd include means at all, since the accessor method can just do sums/counts.

The reason for this is that the mean value would actually be calculated by a function call like numeric_avg on that element's state object. I thought that would be preferable (at least in the numeric case) because of all the edge cases numeric_avg takes care of. It is a bit of extra work if you don't want the mean, but getting the mean from this aggregate is, I think, a primary use case. If you didn't want all the mean/sum/count values this aggregate calculates, I figure you'd just use a non-two-step function. What do you think about that?

> I think I'd do Datum *foo rather than ArrayType *foo. Since these are just internal intermediate values, it doesn't seem worth it to build Postgres arrays for them.

Right-o, I'll go with Datum * then.

Thanks for all the feedback!

msqr commented on July 19, 2024

I converted to that VecAggStatsType now, and cleaned up and implemented vec_agg_sum and vec_agg_mean functions so the following SQL runs:

SELECT vec_agg_count(vec_stat_agg(nums)) AS counts
	, vec_agg_sum(vec_stat_agg(nums)) AS sums
	, vec_trim_scale(vec_agg_mean(vec_stat_agg(nums))) AS means
FROM measurements;
 counts  |       sums        |      means      
---------+-------------------+-----------------
 {4,3,4} | {4.92,5.91,14.80} | {1.23,1.97,3.7}
(1 row)

And I verified via a debugger that indeed the vec_stat_agg(nums) part is executed just once! 🎉

pjungwir commented on July 19, 2024

Awesome!

msqr commented on July 19, 2024

I've been testing out the performance of the new functions on a larger data set, and have a surprising result where the two-step aggregate is performing about the same, or even slightly slower than, the one-step aggregates. For example here is what I'm comparing:

-- one step variation
SELECT vec_to_count(data_i) AS counts
	, vec_to_sum(data_i) AS sums
	, vec_trim_scale(vec_to_mean(data_i)) AS means
	, vec_to_min(data_i) AS mins
	, vec_to_max(data_i) AS maxes
FROM measurements2;

-- two step variation; expected to be faster
SELECT vec_agg_count(vec_stat_agg(data_i)) AS counts
	, vec_agg_sum(vec_stat_agg(data_i)) AS sums
	, vec_trim_scale(vec_agg_mean(vec_stat_agg(data_i))) AS means
	, vec_agg_min(vec_stat_agg(data_i)) AS mins
	, vec_agg_max(vec_stat_agg(data_i)) AS maxes
FROM measurements2;

When run on this test data set, the two-step variation performs about 2x as fast, but on my larger data set it is about break-even and usually slightly slower. An example one-step plan looks like

HashAggregate  (cost=291202.49..291263.99 rows=200 width=72) (actual time=331.386..2979.986 rows=20885 loops=1)
  Output: (time_bucket(('900 seconds'::cstring)::interval, d.ts)), vec_trim_scale(vec_to_mean(d.data_i)), solarcommon.array_transpose2d(ARRAY[(vec_to_count(d.data_i))::numeric[], vec_to_min(d.data_i), vec_to_max(d.data_i)])
  Group Key: time_bucket(('900 seconds'::cstring)::interval, d.ts)
  Buffers: shared hit=1747

and a two-step plan like this:

HashAggregate  (cost=291202.49..291265.99 rows=200 width=72) (actual time=385.891..3263.576 rows=20885 loops=1)
  Output: (time_bucket(('900 seconds'::cstring)::interval, d.ts)), vec_trim_scale(vec_agg_mean(vec_stat_agg(d.data_i))), solarcommon.array_transpose2d(ARRAY[(vec_agg_count(vec_stat_agg(d.data_i)))::numeric[], vec_agg_min(vec_stat_agg(d.data_i)), vec_agg_max(vec_stat_agg(d.data_i))])
  Group Key: time_bucket(('900 seconds'::cstring)::interval, d.ts)
  Buffers: shared hit=1747

I tried experimenting with using DirectFunctionCallX style invocations instead of the FunctionCallInvoke style where possible, but that didn't make any noticeable difference.

I'm a bit stumped! Have any ideas here?

msqr commented on July 19, 2024

I took a look at the benchmark support you created and threw in some benchmarks to profile the numeric support. The results I get look like:

vec_to_mean
=============
SQL rows 20355.5 ms
C array 2247.86 ms

vec_to_mean_numeric
===================
SQL rows 21820.2 ms
C array 4162.31 ms

vec_agg_stat                   <-- this is the two-step variation for count/min/max/avg
============
SQL rows 23994.9 ms
C array 3767.61 ms

vec_agg_stats                  <-- this is the one-step variation for count/min/max/avg
=============
SQL rows 23695 ms
C array 5827.55 ms

So in this query the two-step variation is actually faster. Do you have any tips on approaching some low-level profiling of the functions? I've not done anything like that in C before.

Also, I discovered that if the aggregate spools to temp files, the output of the two-step aggregation is incorrect and sometimes the server even crashes. Would that be from the lack of real _in and _out functions? Or some other fault in the code?

msqr commented on July 19, 2024

I am still curious about the two-step approach with the custom type; however, I also realized I could achieve a more limited "two-step" benefit the same way Postgres does for the avg() and sum() aggregates: they share the same transition function and just use different final functions. Given I'm interested in count/min/max support as well, I started a feature/accum branch and threw those in a shared transition function vec_stat_accum, then added count, min, max, sum, and average final functions. For example:

CREATE AGGREGATE vec_agg_count(numeric[]) (
  sfunc     = vec_stat_accum,
  stype     = internal,
  finalfunc = vec_agg_count_finalfn
);

CREATE AGGREGATE vec_agg_mean(numeric[]) (
  sfunc     = vec_stat_accum,
  stype     = internal,
  finalfunc = vec_agg_mean_finalfn
);

Then Postgres still only creates a single state object for all these columns, and I achieve that "two-step" performance benefit I was seeking:

SELECT vec_agg_count(data_i) AS counts
	, vec_agg_sum(data_i) AS sums
	, vec_trim_scale(vec_agg_mean(data_i)) AS means
	, vec_agg_min(data_i) AS mins
	, vec_agg_max(data_i) AS maxes
FROM measurements2;
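For reference, the shared transition function itself follows the usual internal-state aggregate pattern; a minimal sketch, with a stand-in for the state struct discussed earlier:

#include "postgres.h"
#include "fmgr.h"

/* stand-in for the real state struct discussed above */
typedef struct VecAggStatsState
{
    Oid    elemTypeId;
    uint32 elemCount;
    /* ... per-element count/min/max/sum accumulators ... */
} VecAggStatsState;

PG_FUNCTION_INFO_V1(vec_stat_accum);

/* sketch: one transition function shared by every vec_agg_X aggregate */
Datum
vec_stat_accum(PG_FUNCTION_ARGS)
{
    MemoryContext     aggContext;
    VecAggStatsState *state;

    /* must run as an aggregate transition function */
    if (!AggCheckCallContext(fcinfo, &aggContext))
        elog(ERROR, "vec_stat_accum called in non-aggregate context");

    if (PG_ARGISNULL(0))
    {
        /* first call: allocate state in the aggregate's memory context
         * so it survives across input rows */
        state = (VecAggStatsState *)
            MemoryContextAllocZero(aggContext, sizeof(VecAggStatsState));
    }
    else
        state = (VecAggStatsState *) PG_GETARG_POINTER(0);

    /* ... accumulate each array element into the state ... */

    PG_RETURN_POINTER(state);
}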

This is working for me even against the data sets that cause the two-step approach to fail as mentioned previously. Also the benchmarks are similar:

vec_agg_stat                   <-- this is the new vec_agg_X variation for count/min/max/avg/sum
============
SQL rows 24211.1 ms
C array 3841.45 ms

vec_agg_stats                  <-- this is the original vec_to_X variation for count/min/max/avg/sum
=============
SQL rows 24292.2 ms
C array 7910.46 ms

What isn't quite as nice with this approach is that other statistics like the var_samp ones would require additional or different delegate function invocations, which support calculating things like the sum-of-squares. For example var_samp(numeric) calls numeric_accum instead of numeric_avg_accum like avg(numeric) does, the only difference between the two being that numeric_avg_accum does not calculate the sum-of-squares.

It is nice not to have to perform the extra calculations for sum-of-squares when not needed (especially for numerics, which are expensive), so I thought an aggregate argument to toggle that support on/off could be used, like

SELECT vec_agg_var_samp(data_i, variance => true)
     , vec_agg_mean(data_i, variance => true)
FROM measurements2;

But passing variance => true to things like vec_agg_mean just so the optimizer can create just one state instance makes the SQL confusing, I think. This seems clearer to me:

SELECT vec_agg_var_samp(vec_agg_stat(data_i, variance => true))
     , vec_agg_mean(vec_agg_stat(data_i, variance => true))
FROM measurements2;

But, the two-step branch does not fully work so I'm missing something.

pjungwir commented on July 19, 2024

I didn't get a chance to dig into these questions on the weekend. I agree having variance => true on vec_agg_{mean,var_samp} isn't really something I want, but it does make sense on vec_agg_stat I think.

Is that an hstore argument? I haven't seen that used for named arguments, but I like it!

Your benchmarks are really surprising to me. I'm interested in exploring that some myself, but I probably won't be able to right away. Same with the corruption when spooling to disk (but yeah it's possibly because of the in/out functions).

I'm surprised that re-using the same transfn means it doesn't get called twice. I thought it was just a means of code re-use, not a potential for optimization. I suppose it's true that Postgres shouldn't need to compute it twice though. Nice!

That does make me wonder why the Timescale folks aren't just relying on that. Perhaps their own intermediate result needs to do something in a finalfn.

msqr commented on July 19, 2024

> Is that an hstore argument? I haven't seen that used for named arguments, but I like it!

I was just copying something I remembered seeing the Timescale folks doing, but actually I hadn't thought much about what it actually meant, sorry. I thought it was useful for documenting the example, is all.

> Same with the corruption when spooling to disk (but yeah it's possibly because of the in/out functions).

This is where I'm really stumped. I've been trying all sorts of different approaches. It seems like the "numeric" quality gets lost when passed to another function before returning with the final query results. Here's the most concise example I have come up with, with the vec_agg_mean aggregate. I have a log that prints out the final average value, and the log lines are correct, and the results in psql are correct. However, if I instead pass the results to vec_trim_scale the output turns into what appear to be memory addresses. Here's my query:

SELECT vec_agg_mean(nums) FROM measurements;
                          vec_agg_mean                          
----------------------------------------------------------------
 {1.23000000000000000000,1.9700000000000000,3.7000000000000000}
(1 row)

SELECT vec_trim_scale(vec_agg_mean(nums)) FROM measurements;
                 vec_trim_scale                 
------------------------------------------------
 {94239562249968,94239562249980,94239562249992}
(1 row)

I have yet to be able to figure out what's going on here.

pjungwir commented on July 19, 2024

This seems related to memory contexts to me, since the Datum is by-reference. It could also be TOAST-related, where we are failing to deTOAST or the Datums in our struct are getting serialized but then later point to something else. But if I'm reading your last message correctly, it doesn't actually depend on spooling to disk after all, and an intermediate value like that shouldn't be getting TOASTed (I think). The way we're calling built-in numeric functions to do the real work is suspicious to me too, since we don't have as much control over how things are being allocated. Perhaps try making a copy of each Datum before you store it in your struct?
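Concretely, the copy might look like this sketch, assuming you can get at the aggregate memory context (e.g. from AggCheckCallContext) at the point where the Datum is produced:

#include "postgres.h"
#include "utils/datum.h"

/* sketch: copy a by-reference result into the aggregate's context;
 * numeric is a varlena, so typByVal = false and typLen = -1 */
static Datum
copy_into_agg_context(MemoryContext aggContext, Datum value)
{
    MemoryContext old = MemoryContextSwitchTo(aggContext);
    Datum         copy = datumCopy(value, false, -1);

    MemoryContextSwitchTo(old);
    return copy;
}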

msqr commented on July 19, 2024

Yeah, it doesn't depend on spooling to disk. And this is based solely on invoking numeric_avg on the numeric element's state object. I can even do this to trigger it:

SELECT unnest(vec_agg_mean(nums)) FROM measurements;
NOTICE:  avg 0 = 1.23000000000000000000
NOTICE:  avg 1 = 1.9700000000000000
NOTICE:  avg 2 = 3.7000000000000000
     unnest     
----------------
 94777532913296
 94777532913308
 94777532913320
(3 rows)

The NOTICE logs are in the aggregate final function, showing the result of calling numeric_avg. They produce the correct output there.

I have tried making a copy of the result of that call to numeric_avg, e.g.

state->astate->dvalues[i] = datumCopy(DirectFunctionCall1(numeric_avg, state->vec_states[i]), state->astate->typbyval, state->astate->typlen);

I'm afraid I get the same results as before, however.

msqr commented on July 19, 2024

I tried asking for help on the pgsql-general mailing list but didn't get any takers. I've organized what I feel has culminated in the best performing approach in a feature/accum-cached-fcinfo branch. I cache and re-use FunctionCallInfo instances for the numeric_avg_accum and numeric_cmp functions on the aggregate state, and then create another final-function-local FunctionCallInfo to loop over the final array results with in vec_agg_mean and vec_agg_sum.
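The setup/invoke mechanics look roughly like this sketch (assuming the PG 12+ FunctionCallInfo layout; in the branch the cached fcinfo lives on the aggregate state rather than in a static):

#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgroids.h"
#include "utils/memutils.h"

/* sketch: initialize the FmgrInfo for numeric_cmp once, then reuse
 * one FunctionCallInfo for every comparison */
static int32
numeric_cmp_cached(Datum a, Datum b)
{
    static FmgrInfo cmp_flinfo;
    static bool     initialized = false;
    LOCAL_FCINFO(fcinfo, 2);
    Datum           result;

    if (!initialized)
    {
        fmgr_info_cxt(F_NUMERIC_CMP, &cmp_flinfo, TopMemoryContext);
        initialized = true;
    }
    InitFunctionCallInfoData(*fcinfo, &cmp_flinfo, 2, InvalidOid, NULL, NULL);
    fcinfo->args[0].value = a;
    fcinfo->args[0].isnull = false;
    fcinfo->args[1].value = b;
    fcinfo->args[1].isnull = false;
    result = FunctionCallInvoke(fcinfo);
    return DatumGetInt32(result);
}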

Still, the same problem persists, and it affects both the vec_agg_mean and vec_agg_sum final functions. You can see correct and incorrect results in a query like

SELECT vec_agg_mean(nums), vec_trim_scale(vec_agg_mean(nums)) FROM measurements;
-[ RECORD 1 ]--+---------------------------------------------------------------
vec_agg_mean   | {1.23000000000000000000,1.9700000000000000,3.7000000000000000}
vec_trim_scale | {94597491308128,94597491308140,94597491308152}

There the vec_agg_mean result is correct, while the same result passed through vec_trim_scale produces incorrect results. So I'm stuck for now without any new ideas. 🤔

If you think of anything, let me know!

msqr commented on July 19, 2024

Tom Lane weighed in and found the issue: a dumb copy/paste problem in the SQL definition 😂. I fixed that, and now I think it's working finally! I have to get back in and test again, and add test cases, and then add support for other types...

pjungwir commented on July 19, 2024

Ha, that's great! I'm glad he was able to help figure it out.

I'm still interested in exploring the weird benchmark results, and I may have some opportunity the next couple days or this weekend. Also I'm still curious about vec_to_var_samp getting a different result from the built-in version. Is there anything else you need help with in this issue's line of work?

msqr commented on July 19, 2024

I think I've been able to plow through everything on this issue, albeit not with true two-step aggregates but with "shared state" aggregates that achieve my immediate goal of a fast-performing min/max/sum/mean numeric array aggregate. For example this query computes a single state value that is shared by all output columns:

SELECT vec_agg_count(pad_vec(data_a, 3)) AS counts
	, vec_agg_sum(pad_vec(data_a, 3)) AS sums
	, vec_trim_scale(vec_agg_mean(pad_vec(data_a, 3))) AS means
	, vec_agg_min(pad_vec(data_a, 3)) AS mins
	, vec_agg_max(pad_vec(data_a, 3)) AS maxes
FROM measurements2;

-[ RECORD 1 ]------------------------------
counts | {184,0,180}
sums   | {1164885081,NULL,11117.700}
means  | {6330897.179347826087,NULL,61.765}
mins   | {6255484,NULL,61.765}
maxes  | {6477419,NULL,61.765}

And that query seems to be about 6x as fast as the no-extension unnest() SQL:

vec_agg_stat
============
SQL rows 24083.8 ms
C array 3595.71 ms

And it is about 2x faster than the equivalent SQL using all the vec_to_X aggregates.

I ended up following the approach Postgres takes with its own sum() and avg() aggregates, in that for vec_agg_sum()

  • on int2, int4 input, int8 is output
  • on int8 and numeric input, numeric is output
  • on float4 and float8 input, float8 is output

For vec_agg_mean()

  • on float4 and float8 input, float8 is output
  • on other input, numeric is output
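In code that mapping is just a switch on the element type oid; a hypothetical sketch for the sum case (sum_result_type is an illustration, not a function in the branch):

#include "postgres.h"
#include "catalog/pg_type.h"

/* hypothetical helper: output element type for vec_agg_sum,
 * mirroring how Postgres types its own sum() */
static Oid
sum_result_type(Oid elemTypeId)
{
    switch (elemTypeId)
    {
        case INT2OID:
        case INT4OID:
            return INT8OID;
        case INT8OID:
        case NUMERICOID:
            return NUMERICOID;
        case FLOAT4OID:
        case FLOAT8OID:
            return FLOAT8OID;
        default:
            elog(ERROR, "vec_agg_sum: unsupported element type %u", elemTypeId);
            return InvalidOid;      /* keep the compiler quiet */
    }
}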

I discovered this as I looked up how to call the appropriate delegate transition/final functions for each type.

Everything is now pushed up in the feature/accum-cached-fcinfo branch. I've added tests as well. Feedback welcome! If you think this is worthy of a PR, I'm happy to submit one.

pjungwir commented on July 19, 2024

Completed by #13.
