pjungwir / aggs_for_vecs
Postgres C extension with aggregate functions for array inputs
License: MIT License
Thank you for this extension!
Indexing isn't mentioned in the docs in any shape or form, hence this issue. Should we expect better performance on properly indexed array columns? If so, which type of index is preferred?
How would you feel about a PR to support core math utility functions like addition, subtraction, multiplication, and division between two array arguments? I have a use case where I'd like to compute the difference between array column values in adjacent rows using a window function. For example:
SELECT sensor_id, nums, vec_sub(nums, lag(nums) OVER (ORDER BY sensor_id)) AS diff FROM measurements
sensor_id | nums | diff
-----------+------------------+-------------------
1 | |
2 | {NULL,NULL,NULL} |
3 | {1.23,2.34,3.45} | {NULL,NULL,NULL}
4 | {1.23,NULL,2.34} | {0.00,NULL,-1.11}
5 | {1.23,2.34,4.45} | {0.00,NULL,2.11}
6 | {1.23,1.23,4.56} | {0.00,-1.11,0.11}
(6 rows)
As a proof of concept I implemented a vec_sub function in SolarNetwork/aggs_for_vecs@a214882. I'd be happy to implement the other operations (add, mul, div) if this is something you'd like to include.
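To make the NULL semantics in the table above concrete, here is a minimal self-contained sketch of element-wise subtraction outside Postgres. It uses plain C arrays with separate null flags; the representation and names are illustrative, not the extension's actual code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Element-wise subtraction of two equal-length "arrays with null flags",
 * mirroring SQL semantics: if either input element is NULL, the result
 * element is NULL. */
static void vec_sub_sketch(const double *a, const bool *a_null,
                           const double *b, const bool *b_null,
                           double *out, bool *out_null, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out_null[i] = a_null[i] || b_null[i];
        out[i] = out_null[i] ? 0.0 : a[i] - b[i];
    }
}
```

The per-position null propagation is what produces rows like {0.00,NULL,-1.11} in the example output above.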
I noticed that VecArrayBuildState.veccounts is an int array, which can vary in size across platforms and, assuming a size of 32 bits, allows up to about 2 billion input rows before overflowing. What about switching this to uint32 so the type is consistent across platforms and can handle up to about 4 billion input rows? I noticed cases where the int values were being cast to int32 already (example).
I did a proof of concept here that passes all tests. If you like the look of it, I could create a PR?
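To illustrate the width argument, here is a hedged sketch (not the extension's actual code) of a fixed-width uint32_t row counter. Unlike plain int, uint32_t is the same size on every platform, and checking before adding detects the roughly 4.29 billion row ceiling instead of wrapping silently:

```c
#include <stdint.h>

/* Hypothetical per-element row counter using a fixed-width unsigned type. */
typedef struct {
    uint32_t count;
} VecCounter;

/* Add rows to the counter; return -1 if the addition would wrap past
 * UINT32_MAX (4294967295), 0 on success. */
static int counter_add(VecCounter *c, uint32_t rows)
{
    if (rows > UINT32_MAX - c->count)
        return -1;  /* would overflow ~4.29 billion rows */
    c->count += rows;
    return 0;
}
```

A signed 32-bit int tops out at 2147483647 rows, and overflowing it is undefined behavior in C, so the unsigned type both doubles the headroom and makes the wraparound well-defined.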
Thanks for your consideration.
Hello, I would really like to try this extension, as it could really help one particular application of mine; however, I have NUMERIC[] columns. Would it be possible for this extension to support that? I've never looked into writing a Postgres extension myself, so I thought I'd ask here to see how hard you thought it might be to do. Thanks for your time!
Having now worked a little bit with the Postgres code to support NUMERIC, I noticed that internally Postgres provides a great set of functions for aggregating the NUMERIC type. For example, the avg(numeric) definition has:
{ aggfnoid => 'avg(numeric)', aggtransfn => 'numeric_avg_accum',
aggfinalfn => 'numeric_avg', aggcombinefn => 'numeric_avg_combine',
aggserialfn => 'numeric_avg_serialize',
aggdeserialfn => 'numeric_avg_deserialize',
aggmtransfn => 'numeric_avg_accum', aggminvtransfn => 'numeric_accum_inv',
aggmfinalfn => 'numeric_avg', aggtranstype => 'internal',
aggtransspace => '128', aggmtranstype => 'internal',
aggmtransspace => '128' },
Internally the aggregates make use of state objects like NumericAggState. That same state object is shared between many of Postgres' aggregate functions, and (apparently) the optimizer knows that something like
SELECT count(numeric), sum(numeric), avg(numeric) FROM table_of_numerics
can accumulate/calculate the NumericAggState once for all three aggregates, and then just execute different final functions on that shared state to produce the desired count, sum, and average results.
I was thinking this approach would be really great applied here, on arrays of numbers, and the performance benefit could be substantial. I also use the Timescale extension, and they are working on a set of two-step aggregates that work great on single-value columns; but as I am working with array columns, I want the same sort of result applied to vectors, like this extension provides.
That all leads me to my first question: would it be possible/advisable to try to re-use the existing Postgres functions like numeric_avg_accum, but passing them elements of our input array and storing their result state in this extension's vecvalues array? Here's what I'm trying to express, in pseudocode:
typedef union pgnum {
  int16 i16;
  int32 i32;
  int64 i64;
  float4 f4;
  float8 f8;
  Datum datum; // to store the state returned from something like numeric_avg_accum
} pgnum;

Datum
vec_to_numeric_avg_accum(PG_FUNCTION_ARGS)
{
  // (skipping a bunch of declarations)
  get_typlenbyvalalign(elemTypeId, &elemTypeWidth, &elemTypeByValue, &elemTypeAlignmentCode);
  deconstruct_array(currentArray, elemTypeId, elemTypeWidth, elemTypeByValue, elemTypeAlignmentCode,
                    &currentVals, &currentNulls, &currentLength);
  for (i = 0; i < currentLength; i++) {
    if (currentNulls[i]) {
      // do nothing: nulls can't change the result.
    } else {
      if (state->state.dnulls[i]) {
        state->state.dnulls[i] = false;
      }
      // forward our invocation to numeric_avg_accum, swapping the passed value
      // argument from our array argument to the array element at index i
      state->vecvalues[i].datum = DirectFunctionCall2(numeric_avg_accum, state->vecvalues[i].datum, currentVals[i]);
    }
  }
  PG_RETURN_POINTER(state);
}
I tried to hack something together to see if this works, but it does not, reporting the error ERROR: aggregate function called in non-aggregate context. I think I understand why that happens: I'm calling DirectFunctionCall2, which creates a new FunctionCallInfo structure that does not include the same settings as the one passed to vec_to_numeric_avg_accum, especially whatever part makes the AggCheckCallContext() call succeed.
Would it be possible to essentially create a copy of the FunctionCallInfo passed to vec_to_numeric_avg_accum, but with the array argument swapped to an element of that array, and pass that to numeric_avg_accum somehow, maybe via FunctionCallInvoke?
Per our discussion in #8, I started work on two-step aggregation support. The first goal is to support count/min/max/sum/mean access functions after a vec_stat_agg(numeric[]) aggregate. Note I use numeric[] here but hope the other number types can be supported as well. Essentially this would support a SQL statement like this:
SELECT vec_agg_count(vec_stat_agg(nums))
, vec_agg_min(vec_stat_agg(nums))
, vec_agg_max(vec_stat_agg(nums))
, vec_agg_sum(vec_stat_agg(nums))
, vec_agg_mean(vec_stat_agg(nums))
FROM measurements
I started exploring this on a feature/two-step-aggs branch. I have a lot to grok still... was hoping you might be able to provide feedback.
Originally I thought I could just return an arbitrary struct that all the vec_to_X() functions would know how to use, and I could define the aggregate like:
CREATE OR REPLACE FUNCTION
vec_stat_agg_transfn(internal, numeric[])
RETURNS internal
AS 'aggs_for_vecs', 'vec_stat_agg_transfn'
LANGUAGE c;
CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, numeric[])
RETURNS internal
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;
CREATE AGGREGATE vec_stat_agg(numeric[]) (
sfunc = vec_stat_agg_transfn,
stype = internal,
finalfunc = vec_stat_agg_finalfn,
finalfunc_extra
);
-- vec_agg_count
CREATE OR REPLACE FUNCTION
vec_agg_count(internal)
RETURNS ANYARRAY
AS 'aggs_for_vecs', 'vec_agg_count'
LANGUAGE c;
However Postgres does not like this: it complains
ERROR: unsafe use of pseudo-type "internal" DETAIL: A function returning "internal" must have at least one "internal" argument.
The only way I could get Postgres to accept this was by using anyarray and anyelement everywhere, like:
CREATE OR REPLACE FUNCTION
vec_stat_agg_transfn(internal, anyarray)
RETURNS internal
AS 'aggs_for_vecs', 'vec_stat_agg_transfn'
LANGUAGE c;
CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, anyarray)
RETURNS anyelement
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;
CREATE AGGREGATE vec_stat_agg(anyarray) (
sfunc = vec_stat_agg_transfn,
stype = internal,
finalfunc = vec_stat_agg_finalfn,
finalfunc_extra
);
-- vec_agg_count
CREATE OR REPLACE FUNCTION
vec_agg_count(anyelement)
RETURNS ANYARRAY
AS 'aggs_for_vecs', 'vec_agg_count'
LANGUAGE c;
But I started seeing how perhaps the vec_stat_agg aggregate's final function has to return an actual registered custom type? Sort of like the Complex example in the Postgres docs? Do you think that is how this needs to work, by defining a custom type?
I started looking into this, in a new stats.c file with this:
// Aggregate element statistics to use in two-step aggregation functions.
typedef struct VecAggElementStats {
char vl_len_[4];
Oid elementType;
uint32 count;
Datum sum;
Datum min;
Datum max;
Datum mean;
} VecAggElementStats;
The idea here is that I'd need a variable-length type (as Numeric is variable length), and I'd store the aggregate state values as Datum so that perhaps all the number types could be encoded there, even though they are not all variable length.
If this is to be an actual custom type, then I see I'd have to implement *_in and *_out functions to de/serialize the object from/to text. But then I was getting a bit lost in how to actually approach that, given the variable-length aspect involved with Numeric values. Do you have any pointers here, or can you comment on the overall approach with regards to the need for a custom type like this?
I'd like to be able to include this extension in environments where a Debian binary package would be helpful, instead of having to build from source.
I researched a bit how to do this, and got the build working using debhelper
in a feature/debian branch. See the debian/README for how I set up the build environment.
I added a deb recipe to the Makefile like this:
deb:
	make clean
	git archive --format tar --prefix=aggs-for-vecs-$(EXTENSION_VERSION)/ feature/debian | gzip >../aggs-for-vecs_$(EXTENSION_VERSION).orig.tar.gz
	pg_buildext updatecontrol
	make -f debian/rules debian/control
	dh clean
	make all
	dpkg-buildpackage -us -uc
You can see it's hard-coded to build from the feature/debian branch at the moment. I think that would eventually change to a tag, or perhaps master.
I originally created the debian build files via:
dh_make_pgxs -v 1.3.0 -h https://github.com/pjungwir/aggs_for_vecs
This turned the underscores in the project name to dashes, so the build packages are named like postgresql-12-aggs-for-vecs_1.3.0-1_amd64.deb
(as Debian package names aren't allowed to have underscores, I think).
In any case, I'm new to building Debian packages like this, so I'm not sure if everything is configured correctly. If you have feedback, or would like me to submit a PR, I'm happy to do that.
I discovered another set of aggregates that would be useful to me: vec_to_first and vec_to_last. These return an array containing the first/last non-null values seen in the input arrays, per position. They are expected to be used with an ORDER BY clause, e.g.
SELECT vec_to_first(data_a ORDER BY ts)
FROM measurements2
I have implemented these on a feature/accum-firstlast branch and included tests, and the functions work with any element type (not just numbers). I've built off the feature/accum-cached-fcinfo branch as discussed in #10, but these changes do not depend on those, so they could be included without them if needed (although I'd prefer to include those as well!).
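The per-position semantics can be summarized with a tiny self-contained sketch for one array position (plain C, names illustrative): vec_to_first keeps the first non-null value it sees and ignores everything after; vec_to_last would overwrite instead of skipping.

```c
#include <stdbool.h>

/* State for one array position in a vec_to_first-style aggregate. */
typedef struct {
    bool   seen;   /* has a non-null value been recorded yet? */
    double value;
} FirstState;

/* Accumulate one input element in row order (hence the ORDER BY). */
static void first_accum(FirstState *s, double v, bool is_null)
{
    if (is_null || s->seen)
        return;        /* nulls never count; later values never replace */
    s->seen = true;
    s->value = v;
}
```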
My application has array columns that can vary in length. Mostly they are the same length, but they can differ. When I query them, however, I know up front the maximum length of all rows for the column. I was wondering how feasible it would be to add support for passing an array length into the aggregate functions, like
SELECT vec_to_sum(array_column, 8)
FROM foo
If the length is provided, then the aggregates would process each row and only handle that many elements, always returning an array of the given length (8 in my example). If no length is provided it would operate as it currently does, taking the length from the first input row.
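The proposed behavior for rows shorter than the given length could work like this sketch (plain C, not the extension's code): accumulate into a caller-chosen target length, treating missing trailing elements as NULL and ignoring any extras. Whether short rows should instead raise an error is an open design question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fold one input row into fixed-length running sums.
 * target_len plays the role of the "8" in vec_to_sum(array_column, 8). */
static void sum_accum_fixed(double *sums, bool *has_value, size_t target_len,
                            const double *row, size_t row_len)
{
    for (size_t i = 0; i < target_len; i++) {
        if (i >= row_len)
            continue;            /* shorter row: treat element as NULL */
        sums[i] += row[i];
        has_value[i] = true;     /* at least one non-null input at i */
    }
}
```

Positions that never receive a value stay flagged false, so the final function could emit NULL there, matching how the aggregates already treat all-null positions.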