Comments (24)
I can't think of any concrete reason using a `Datum` is bad. I suppose it just feels like `Datum` is for external interfaces and that struct is something internal. But yeah, since you're using `DirectFunctionCall` a lot it's convenient to keep them that way. And it makes for some nice polymorphism.

Btw in case you haven't noticed, there are special-purpose `InputFunctionCall` and `OutputFunctionCall` functions you can use here instead of `DirectFunctionCall`ing the component in/out functions.
from aggs_for_vecs.
Yeah, I think a custom type is the way to go. Since it's intended as an intermediate value, the in/out functions are not too important. At first you could even just return "hello world" for `out` and something initialized to all zeros for `in`. Of course doing something "real" would be better.

Those functions are not really that hard to write. Postgres has some nice helper functions. Here is the `in` for multiranges, something I wrote a year or two ago. It might be a good example because it's not completely trivial, but it's still pretty easy to understand. A mini finite-state machine is a good way to go. And then even for something complicated, the `out` function should be very easy.
I'm not sure it makes sense to have `Datum` fields in this struct instead of the internal types---but I understand how that keeps it generic. Perhaps it's okay.

All these `Datum`s will be of type `elementType`, right? So then for your `_out` you can call that type's own `_out`. And for your `_in`, you can use a separator that is guaranteed not to be in those types (which should be easy since they're all numbers). Then you can read until that separator and call the appropriate `_in`.
Great, thanks for the link to that example and confirmation on the approach.

> I'm not sure it makes sense to have Datum fields in this struct instead of the internal types---but I understand how that keeps it generic. Perhaps it's okay.
>
> All these Datums will be of type elementType, right? So then for your _out you can call that type's own _out. And for your _in, you can use a separator that is guaranteed not to be in those types (which should be easy since they're all numbers). Then you can read until that separator and call the appropriate _in.

Yes, all the `Datum`s will be of `elementType`, so using that type's `_in` and `_out` was exactly what I was hoping to be able to do. Alternatively, I thought these could be `pgnum` values, if you think that would be a better approach? I figured I'd be trying to convert them to `Datum` frequently enough that maybe `Datum` was better?
I've been working through the custom type, defined currently as

```sql
CREATE TYPE vecaggstats (
  internallength = variable,
  input = vecaggstats_in,
  output = vecaggstats_out
);
```

and thanks for your suggestion of using placeholder values for the `_in` and `_out` functions to start. I was thinking it would be useful to look up & cache the Oid of the `vecaggstats` type in the `_PG_init` function, however if I do this:

```c
// cache our custom VecAggElementStats type Oid
static Oid VecAggElementStatsTypeOid;

void
_PG_init(void)
{
    // FIXME: this returns "Unknown type: vecaggstats"... how to do this?
    VecAggElementStatsTypeOid = typenameTypeId(NULL, typeStringToTypeName("vecaggstats"));
}
```

it throws an error when running `CREATE EXTENSION aggs_for_vecs`. I suppose the `vecaggstats` type is not available until after that `_PG_init` callback, then? Is there another way to look up/cache the Oid just once after the extension is loaded? Or is this `typeStringToTypeName()` call not worth caching anyway?
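For reference (an assumption about the ordering, since my function definitions aren't shown above), the type and its I/O functions depend on each other, so the usual sequence is the documented shell-type pattern:

```sql
-- Declare a placeholder "shell" type first, so the I/O functions
-- can reference it before the full definition exists.
CREATE TYPE vecaggstats;

CREATE FUNCTION vecaggstats_in(cstring) RETURNS vecaggstats
  AS 'aggs_for_vecs', 'vecaggstats_in' LANGUAGE c IMMUTABLE STRICT;

CREATE FUNCTION vecaggstats_out(vecaggstats) RETURNS cstring
  AS 'aggs_for_vecs', 'vecaggstats_out' LANGUAGE c IMMUTABLE STRICT;

-- Now fill in the shell type with the real definition.
CREATE TYPE vecaggstats (
  internallength = variable,
  input = vecaggstats_in,
  output = vecaggstats_out
);
```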
I don't think caching the type oid is correct. As you say, the type doesn't exist yet, but also it seems like there may be problems if the type gets dropped/restored. Anyway, Postgres has its own type cache, so just using the `typenameTypeId` function should be very fast (and cached after the first time).
I see, it was the `typeStringToTypeName` function I thought might be worth caching, but I'll just call that as needed then, thanks.
I am to the point in feature/two-step-aggs where this SQL runs:

```sql
SELECT vec_agg_count(vec_stat_agg(nums)) FROM measurements;
 vec_agg_count
---------------
 {4,3,4}
```

However, I have so many doubts/questions that I'm unsure of the correctness of most of it. First up, the `vec_stat_agg` function is returning an array of `VecAggElementStatsType` as previously discussed:

```sql
CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, numeric[])
RETURNS vecaggstats[]
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;
```

```c
typedef struct VecAggElementStatsType {
    char vl_len_[4];
    Oid elemTypeId;
    uint32 count;
    Datum sum;
    Datum min;
    Datum max;
    Datum mean;
} VecAggElementStatsType;
```

I was thinking that actually that `elemTypeId` really is for the entire array, not each element. Do you think that instead a single value should be returned from `vec_stat_agg`, something like:

```sql
CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, numeric[])
RETURNS vecaggstats -- no longer array here
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;
```

```c
typedef struct VecAggStatsType {
    char vl_len_[4];
    Oid elemTypeId;    // type of the Datums used below
    uint32 elemCount;  // number of elements in the Datum arrays below
    Datum *counts;
    Datum *sums;
    Datum *mins;
    Datum *maxes;
    Datum *means;
} VecAggStatsType;
```

If so, is that even the way to define Datum arrays here, or would they need to be defined as `ArrayType *`, like

```c
typedef struct VecAggStatsType {
    char vl_len_[4];
    Oid elemTypeId;    // type of the Datums used below
    uint32 elemCount;  // number of elements in the Datum arrays below
    ArrayType *counts;
    ArrayType *sums;
    ArrayType *mins;
    ArrayType *maxes;
    ArrayType *means;
} VecAggStatsType;
```
I think returning a single value instead of an array makes more sense.

I think I'd make `elemTypeId` the type of the input vector's elements, not the type of the array.

I don't think I'd include `means` at all, since the accessor method can just do `sums/counts`.

I think I'd do `Datum *foo` rather than `ArrayType *foo`. Since these are just internal intermediate values, it doesn't seem worth it to build Postgres arrays for them.
> I think returning a single value instead of an array makes more sense.

Great, I'll give that a go.

> I think I'd make elemTypeId the type of the input vector's elements, not the type of the array.

Yes, you are correct; I was glossing over the fact that most of the array types are the same as the input element type, but not necessarily so for things like `counts`, which will be fixed to `int8` even when the input is `numeric`.

> I don't think I'd include means at all, since the accessor method can just do sums/counts.

The reason for this is that the mean value would actually be calculated by a function call like `numeric_avg` on that element's state object. I thought that would be preferable (at least in the `numeric` case) because of all the edge cases `numeric_avg` takes care of. It is a bit of extra work if you don't want the mean, but getting the mean from this aggregate is, I think, a primary use case. If you didn't want all the mean/sum/count values this aggregate calculates, I figure you'd just use a non-two-step function. What do you think about that?

> I think I'd do Datum *foo rather than ArrayType *foo. Since these are just internal intermediate values, it doesn't seem worth it to build Postgres arrays for them.

Right-o, I'll go with `Datum *` then.

Thanks for all the feedback!
I converted to that `VecAggStatsType` now, and cleaned up and implemented `vec_agg_sum` and `vec_agg_mean` functions so the following SQL runs:

```sql
SELECT vec_agg_count(vec_stat_agg(nums)) AS counts
     , vec_agg_sum(vec_stat_agg(nums)) AS sums
     , vec_trim_scale(vec_agg_mean(vec_stat_agg(nums))) AS means
FROM measurements;
 counts  |       sums        |      means
---------+-------------------+-----------------
 {4,3,4} | {4.92,5.91,14.80} | {1.23,1.97,3.7}
(1 row)
```

And I verified via a debugger that indeed the `vec_stat_agg(nums)` part is executed just once! 🎉
Awesome!
I've been testing out the performance of the new functions on a larger data set, and have a surprising result: the two-step aggregate is performing about the same as, or even slightly slower than, the one-step aggregates. For example, here is what I'm comparing:

```sql
-- one-step variation
SELECT vec_to_count(data_i) AS counts
     , vec_to_sum(data_i) AS sums
     , vec_trim_scale(vec_to_mean(data_i)) AS means
     , vec_to_min(data_i) AS mins
     , vec_to_max(data_i) AS maxes
FROM measurements2;

-- two-step variation; expected to be faster
SELECT vec_agg_count(vec_stat_agg(data_i)) AS counts
     , vec_agg_sum(vec_stat_agg(data_i)) AS sums
     , vec_trim_scale(vec_agg_mean(vec_stat_agg(data_i))) AS means
     , vec_agg_min(vec_stat_agg(data_i)) AS mins
     , vec_agg_max(vec_stat_agg(data_i)) AS maxes
FROM measurements2;
```

When run on this test data set, the two-step variation performs about 2x as fast, but on my larger data set it is about break-even and usually slightly slower. An example one-step plan looks like

```
HashAggregate (cost=291202.49..291263.99 rows=200 width=72) (actual time=331.386..2979.986 rows=20885 loops=1)
  Output: (time_bucket(('900 seconds'::cstring)::interval, d.ts)), vec_trim_scale(vec_to_mean(d.data_i)), solarcommon.array_transpose2d(ARRAY[(vec_to_count(d.data_i))::numeric[], vec_to_min(d.data_i), vec_to_max(d.data_i)])
  Group Key: time_bucket(('900 seconds'::cstring)::interval, d.ts)
  Buffers: shared hit=1747
```

and a two-step plan like this:

```
HashAggregate (cost=291202.49..291265.99 rows=200 width=72) (actual time=385.891..3263.576 rows=20885 loops=1)
  Output: (time_bucket(('900 seconds'::cstring)::interval, d.ts)), vec_trim_scale(vec_agg_mean(vec_stat_agg(d.data_i))), solarcommon.array_transpose2d(ARRAY[(vec_agg_count(vec_stat_agg(d.data_i)))::numeric[], vec_agg_min(vec_stat_agg(d.data_i)), vec_agg_max(vec_stat_agg(d.data_i))])
  Group Key: time_bucket(('900 seconds'::cstring)::interval, d.ts)
  Buffers: shared hit=1747
```

I tried experimenting with using `DirectFunctionCallX`-style invocations instead of the `FunctionCallInvoke` style where possible, but that didn't make any noticeable difference.

I'm a bit stumped! Have any ideas here?
I took a look at the benchmark support you created and threw in some benchmarks to profile the numeric support. The results I get look like:

```
vec_to_mean
===========
SQL rows 20355.5 ms
C array   2247.86 ms

vec_to_mean_numeric
===================
SQL rows 21820.2 ms
C array   4162.31 ms

vec_agg_stat   <-- this is the two-step variation for count/min/max/avg
============
SQL rows 23994.9 ms
C array   3767.61 ms

vec_agg_stats  <-- this is the one-step variation for count/min/max/avg
=============
SQL rows 23695 ms
C array   5827.55 ms
```

So in this query the two-step variation is actually faster. Do you have any tips on approaching some low-level profiling of the functions? I've not done anything like that in C before.

Also, I discovered that if the aggregate spools to temp files, the output of the two-step aggregation is incorrect and sometimes the server even crashes. Would that be from the lack of real `_in` and `_out` functions? Or some other fault in the code?
I am still curious about the two-step approach with the custom type; however, I also realized I could achieve a more limited "two-step" benefit the same way Postgres does for the `avg()` and `sum()` aggregates: they share the same transition function and just use different final functions. Given I'm interested in count/min/max support as well, I started a feature/accum branch and threw those in a shared transition function `vec_stat_accum`, then added count, min, max, sum, and average final functions. For example:

```sql
CREATE AGGREGATE vec_agg_count(numeric[]) (
  sfunc = vec_stat_accum,
  stype = internal,
  finalfunc = vec_agg_count_finalfn
);

CREATE AGGREGATE vec_agg_mean(numeric[]) (
  sfunc = vec_stat_accum,
  stype = internal,
  finalfunc = vec_agg_mean_finalfn
);
```

Then Postgres still only creates a single state object for all these columns, and I achieve that "two-step" performance benefit I was seeking:

```sql
SELECT vec_agg_count(data_i) AS counts
     , vec_agg_sum(data_i) AS sums
     , vec_trim_scale(vec_agg_mean(data_i)) AS means
     , vec_agg_min(data_i) AS mins
     , vec_agg_max(data_i) AS maxes
FROM measurements2;
```
This is working for me even against the data sets that cause the two-step approach to fail as mentioned previously. Also the benchmarks are similar:

```
vec_agg_stat   <-- this is the new vec_agg_X variation for count/min/max/avg/sum
============
SQL rows 24211.1 ms
C array   3841.45 ms

vec_agg_stats  <-- this is the original vec_to_X variation for count/min/max/avg/sum
=============
SQL rows 24292.2 ms
C array   7910.46 ms
```

What isn't quite as nice with this approach is how other statistics like the `var_samp` ones would require additional or different delegate function invocations, which support calculating stuff like the sum-of-squares. For example, `var_samp(numeric)` calls `numeric_accum` instead of `numeric_avg_accum` like `avg(numeric)` does, the only difference between the two being that `numeric_avg_accum` does not calculate the sum-of-squares.

It is nice not to have to perform the extra calculations for sum-of-squares when not needed (especially for numerics, which are expensive), so I thought an aggregate argument to toggle that support on/off could be used, like

```sql
SELECT vec_agg_var_samp(data_i, variance => true)
     , vec_agg_mean(data_i, variance => true)
FROM measurements2;
```

But passing `, variance => true` to things like `vec_agg_mean` just so the optimizer can create just one state instance makes the SQL confusing, I think. This seems more clear to me:

```sql
SELECT vec_agg_var_samp(vec_agg_stat(data_i, variance => true))
     , vec_agg_mean(vec_agg_stat(data_i, variance => true))
FROM measurements2;
```

But the two-step branch does not fully work, so I'm missing something.
I didn't get a chance to dig into these questions on the weekend. I agree having `variance => true` on `vec_agg_{mean,var_samp}` isn't really something I want, but it does make sense on `vec_agg_stat`, I think.

Is that an hstore argument? I haven't seen that used for named arguments, but I like it!

Your benchmarks are really surprising to me. I'm interested in exploring that some myself, but I probably won't be able to right away. Same with the corruption when spooling to disk (but yeah, it's possibly because of the in/out functions).

I'm surprised that re-using the same transfn means it doesn't get called twice. I thought it was just a means of code re-use, not a potential for optimization. I suppose it's true that Postgres shouldn't need to compute it twice, though. Nice!

That does make me wonder why the Timescale folks aren't just relying on that. Perhaps their own intermediate result needs to do something in a finalfn.
> Is that an hstore argument? I haven't seen that used for named arguments, but I like it!

I was just copying something I remembered seeing the Timescale folks doing, but actually I hadn't thought much about what it actually meant, sorry. I thought it was useful for documenting the example, is all.

> Same with the corruption when spooling to disk (but yeah, it's possibly because of the in/out functions).

This is where I'm really stumped. I've been trying all sorts of different approaches. It seems like the "numeric" quality gets lost when passed to another function before returning with the final query results. Here's the most concise example I have come up with, with the `vec_agg_mean` aggregate. I have a log that prints out the final average value, and the log lines are correct, and the results in psql are correct. However, if I instead pass the results to `vec_trim_scale`, the output turns into what appear to be memory addresses. Here's my query:

```sql
SELECT vec_agg_mean(nums) FROM measurements;
                          vec_agg_mean
----------------------------------------------------------------
 {1.23000000000000000000,1.9700000000000000,3.7000000000000000}
(1 row)

SELECT vec_trim_scale(vec_agg_mean(nums)) FROM measurements;
                 vec_trim_scale
------------------------------------------------
 {94239562249968,94239562249980,94239562249992}
(1 row)
```

I have yet to be able to figure out what's going on here.
This seems related to memory contexts to me, since the Datum is by-reference. It could also be TOAST-related, where we are failing to deTOAST, or the Datums in our struct are getting serialized but then later point to something else. But if I'm reading your last message correctly, it doesn't actually depend on spooling to disk after all, and an intermediate value like that shouldn't be getting TOASTed (I think). The way we're calling built-in numeric functions to do the real work is suspicious to me too, since we don't have as much control over how things are being allocated. Perhaps try making a copy of each `Datum` before you store it in your struct?
Yeah, it doesn't depend on spooling to disk. And this is based solely on invoking `numeric_avg` on the numeric element's state object. I can even do this to trigger it:

```sql
SELECT unnest(vec_agg_mean(nums)) FROM measurements;
NOTICE: avg 0 = 1.23000000000000000000
NOTICE: avg 1 = 1.9700000000000000
NOTICE: avg 2 = 3.7000000000000000
     unnest
----------------
 94777532913296
 94777532913308
 94777532913320
(3 rows)
```

The `NOTICE` logs are in the aggregate final function, showing the result of calling `numeric_avg`. They produce the correct output there.

I have tried making a copy of the result of that call to `numeric_avg`, e.g.

```c
state->astate->dvalues[i] = datumCopy(DirectFunctionCall1(numeric_avg, state->vec_states[i]),
                                      state->astate->typbyval, state->astate->typlen);
```

I'm afraid I get the same results as before, however.
I tried asking for help on the pgsql-general mailing list but didn't get any takers. I've organized what I feel has culminated in the best-performing approach in a feature/accum-cached-fcinfo branch. I cache and re-use `FunctionCallInfo` instances for the `numeric_avg_accum` and `numeric_cmp` functions on the aggregate state, and then create another final-function-local `FunctionCallInfo` to loop over the final array results with in `vec_agg_mean` and `vec_agg_sum`.

Still, the same problem persists, and it affects both the `vec_agg_mean` and `vec_agg_sum` final functions. You can see correct and incorrect results in a query like

```sql
SELECT vec_agg_mean(nums), vec_trim_scale(vec_agg_mean(nums)) FROM measurements;
-[ RECORD 1 ]--+---------------------------------------------------------------
vec_agg_mean   | {1.23000000000000000000,1.9700000000000000,3.7000000000000000}
vec_trim_scale | {94597491308128,94597491308140,94597491308152}
```

There the `vec_agg_mean` result is correct, while the same result passed through `vec_trim_scale` produces incorrect results. So I'm stuck for now without any new ideas. 🤔

If you think of anything, let me know!
Tom Lane weighed in and found the issue: a dumb copy/paste problem in the SQL definition 😂. I fixed that, and now I think it's working finally! I have to get back in and test again, and add test cases, and then add support for other types...
Ha, that's great! I'm glad he was able to help figure it out.

I'm still interested in exploring the weird benchmark results, and I may have some opportunity the next couple days or this weekend. Also I'm still curious about `vec_to_var_samp` getting a different result from the built-in version. Is there anything else you need help with on this issue's line of work?
I think I've been able to plow through everything on this issue, albeit not with true two-step aggregates but with "shared state" aggregates that achieve my immediate goals of having a fast-performing min/max/sum/mean numeric array aggregate. For example, this query computes a single state value that is shared by all output columns:

```sql
SELECT vec_agg_count(pad_vec(data_a, 3)) AS counts
     , vec_agg_sum(pad_vec(data_a, 3)) AS sums
     , vec_trim_scale(vec_agg_mean(pad_vec(data_a, 3))) AS means
     , vec_agg_min(pad_vec(data_a, 3)) AS mins
     , vec_agg_max(pad_vec(data_a, 3)) AS maxes
FROM measurements2;
-[ RECORD 1 ]------------------------------
counts | {184,0,180}
sums   | {1164885081,NULL,11117.700}
means  | {6330897.179347826087,NULL,61.765}
mins   | {6255484,NULL,61.765}
maxes  | {6477419,NULL,61.765}
```

And that query seems to be about 6x as fast as the no-extension `unnest()` SQL:

```
vec_agg_stat
============
SQL rows 24083.8 ms
C array   3595.71 ms
```

And it is about 2x faster than the equivalent SQL using all the `vec_to_X` aggregates.

I ended up following the approach Postgres takes with its own `sum()` and `avg()` aggregates, in that for `vec_agg_sum()`:

- on `int2`, `int4` input, `int8` is output
- on `int8` and `numeric` input, `numeric` is output
- on `float4` and `float8` input, `float8` is output

For `vec_agg_mean()`:

- on `float4` and `float8` input, `float8` is output
- on other input, `numeric` is output

I discovered this as I looked up how to call the appropriate delegate transition/final functions for each type.

Everything is now pushed up in the feature/accum-cached-fcinfo branch. I've added tests as well. Feedback welcome! If you think this is worthy of a PR, I'm happy to submit one.
Completed by #13.