aggs_for_vecs

This is a C-based Postgres extension offering various aggregate functions like min, max, avg, and var_samp that operate on arrays instead of scalars. It treats each array as a "vector" and handles each element independently. So suppose you have 3 rows each with a 4-element array like so:

 id | vals
----+------------
  1 | {1,2,3,4}
  2 | {5,0,-5,0}
  3 | {3,6,0,9}

Then SELECT vec_to_min(vals) will pick the minimum item in each array position, giving you {1,0,-5,0}.
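The element-wise behavior can be sketched in Python (an illustrative model of the documented semantics, not the extension's C implementation), using lists for arrays and None for SQL NULL:

```python
def vec_to_min(rows):
    """Aggregate equal-length lists, taking the minimum in each position.
    None models SQL NULL: NULL rows and NULL elements are skipped."""
    result = None
    for row in rows:
        if row is None:
            continue  # a NULL input row is ignored entirely
        if result is None:
            result = list(row)  # first non-NULL row seeds the state
            continue
        for i, v in enumerate(row):
            if v is None:
                continue  # NULL elements can't change the result
            if result[i] is None or v < result[i]:
                result[i] = v
    return result

print(vec_to_min([[1, 2, 3, 4], [5, 0, -5, 0], [3, 6, 0, 9]]))
# → [1, 0, -5, 0]
```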

Note that the functions here are true aggregate functions. If you want something that provides aggregate-like behavior by computing stats from a single array, take a look at my other extension aggs_for_arrays. You could say that this extension follows a row-based format and the other a column-based one.

Details

Functions support arrays of any numeric type: SMALLINT, INTEGER, BIGINT, REAL, or DOUBLE PRECISION (aka FLOAT). They either return an array of the same type (e.g. vec_to_min) or an array of FLOAT (e.g. vec_to_mean).

All input arrays must be the same length, or you get an error. The output array will have the same length as the inputs.

NULL arrays are ignored, and NULL elements within arrays are also skipped. Basically you get the same result as if you could run MIN on all the first elements, then MIN on all the second elements, etc. If all your inputs are simply NULL, then you'll get NULL in return. But if the inputs are arrays of NULLs, then you'll get an array of NULLs (of the same length) in return.

Note that when input arrays have NULL in some positions but not others, you still get correct results for things like mean. That is, we keep a count for each position separately and divide by the appropriate amount.
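That per-position bookkeeping can be sketched as follows (again an illustrative Python model, not the extension's C code):

```python
def vec_to_mean(rows):
    """Per-position mean over equal-length lists, with None as SQL NULL.
    Each position keeps its own sum and count, so a NULL element in one
    position doesn't skew the mean of another."""
    sums = counts = None
    for row in rows:
        if row is None:
            continue
        if sums is None:
            sums = [0.0] * len(row)
            counts = [0] * len(row)
        for i, v in enumerate(row):
            if v is not None:
                sums[i] += v
                counts[i] += 1
    if sums is None:
        return None  # all inputs were NULL
    return [s / c if c else None for s, c in zip(sums, counts)]

print(vec_to_mean([[1, None], [3, 10]]))
# → [2.0, 10.0]  (the second position divides by 1, not 2)
```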

Installing

This package installs like any Postgres extension. First say:

make && sudo make install

You will need to have pg_config in your path, but normally that is already the case. You can check with which pg_config. (If you have multiple Postgres installations, make sure pg_config points to the right one.)

Then in the database of your choice say:

CREATE EXTENSION aggs_for_vecs;

You can also run tests and benchmarks yourself with make test and make bench, respectively, but first you'll have to set up databases for those to use. If you run the commands and they can't find a database, they'll give you instructions on how to make one.

Aggregate functions

vec_to_count(ANYARRAY) RETURNS BIGINT[]

Returns the count of non-nulls in each array position.

vec_to_sum(ANYARRAY) RETURNS ANYARRAY

Returns the sum of non-nulls in each array position.

vec_to_min(ANYARRAY) RETURNS ANYARRAY

Returns the minimum in each array position.

vec_to_max(ANYARRAY) RETURNS ANYARRAY

Returns the maximum in each array position.

vec_to_mean(ANYARRAY) RETURNS FLOAT[]

Returns the average (mean) in each array position.

vec_to_weighted_mean(ANYARRAY, ANYARRAY) RETURNS FLOAT[]

Returns the weighted average (mean) in each array position, using the first parameter for the values and the second for the weights. The two arrays should have the same length.
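A Python sketch of the per-position weighted mean (illustrative only; how a NULL weight is treated, skipped here along with its value, is an assumption of this sketch):

```python
def vec_to_weighted_mean(rows):
    """rows: iterable of (values, weights) pairs of equal-length lists.
    Positions where either the value or the weight is None are skipped
    (an assumption, not confirmed against the C implementation)."""
    num = den = None
    for vals, wts in rows:
        if vals is None or wts is None:
            continue
        if num is None:
            num = [0.0] * len(vals)
            den = [0.0] * len(vals)
        for i, (v, w) in enumerate(zip(vals, wts)):
            if v is not None and w is not None:
                num[i] += v * w
                den[i] += w
    if num is None:
        return None
    return [n / d if d else None for n, d in zip(num, den)]

print(vec_to_weighted_mean([([1, 4], [1, 1]), ([3, 8], [3, 1])]))
# → [2.5, 6.0]  since (1*1 + 3*3)/(1+3) = 2.5 and (4*1 + 8*1)/2 = 6.0
```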

vec_to_var_samp(ANYARRAY) RETURNS FLOAT[]

Returns the sample variance in each array position. The code is very similar to the built-in var_samp function, so if it works there it should work here (or it's a bug).
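The per-position sample variance can be modeled like this (a Python sketch of the semantics, not the C implementation):

```python
def var_samp(xs):
    """Sample variance of the non-None values; None if fewer than two."""
    xs = [x for x in xs if x is not None]
    n = len(xs)
    if n < 2:
        return None
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def vec_to_var_samp(rows):
    rows = [r for r in rows if r is not None]
    if not rows:
        return None
    # zip(*rows) transposes the rows into per-position columns.
    return [var_samp(col) for col in zip(*rows)]

print(vec_to_var_samp([[1, 2], [3, 2], [5, None]]))
# → [4.0, 0.0]
```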

hist_2d(x ANYELEMENT, y ANYELEMENT, x_bucket_start ANYELEMENT, y_bucket_start ANYELEMENT, x_bucket_width ANYELEMENT, y_bucket_width ANYELEMENT, x_bucket_count INTEGER, y_bucket_count INTEGER)

Aggregate function that takes a bunch of x and y values, and plots them on a 2-D histogram. The other parameters determine the shape of the histogram (number of buckets on each axis, start of the buckets, width of each bucket).
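The bucketing can be sketched as below. Note that how out-of-range points are handled (clamped to the edge buckets in this sketch) is an assumption, so check the extension's tests for the actual edge behavior:

```python
def hist_2d(points, x_start, y_start, x_width, y_width, x_count, y_count):
    """Bin (x, y) points into a y_count-by-x_count grid of counts."""
    grid = [[0] * x_count for _ in range(y_count)]
    for x, y in points:
        xi = int((x - x_start) // x_width)
        yi = int((y - y_start) // y_width)
        # Clamping out-of-range points is an assumption of this sketch.
        xi = max(0, min(x_count - 1, xi))
        yi = max(0, min(y_count - 1, yi))
        grid[yi][xi] += 1
    return grid

print(hist_2d([(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)], 0, 0, 1, 1, 2, 2))
# → [[1, 1], [0, 1]]
```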

hist_md(vals ANYARRAY, indexes INTEGER[], bucket_starts ANYARRAY, bucket_widths ANYARRAY, bucket_counts INTEGER[])

Aggregate function to compute an n-dimensional histogram. It takes a vector of values, and it uses indexes to pick one or more elements from that vector and treat them as x, y, z, etc. If you want 2 dimensions, there should be two values for indexes, two for bucket_starts, two for bucket_widths, and two for bucket_counts. Or if you want 3 dimensions, you need three values for each of those.

The values in indexes follow Postgres's convention of 1-indexed arrays, so if indexes is {1,4}, then we will use vals[1] and vals[4] as the histogram x and y.

vec_to_first(ANYARRAY) RETURNS ANYARRAY

Returns an array containing the first non-null value in each array position. Designed to be used as an ordered set aggregate like vec_to_first(nums ORDER BY ts).

vec_to_last(ANYARRAY) RETURNS ANYARRAY

Returns an array containing the last non-null value in each array position. Designed to be used as an ordered set aggregate like vec_to_last(nums ORDER BY ts).
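Both can be modeled in Python (an illustrative sketch; here the rows are assumed to already be in the ORDER BY order):

```python
def vec_to_first(rows):
    """First non-None value seen in each position, scanning rows in order."""
    result = None
    for row in rows:
        if row is None:
            continue
        if result is None:
            result = [None] * len(row)
        for i, v in enumerate(row):
            if result[i] is None and v is not None:
                result[i] = v
    return result

def vec_to_last(rows):
    # The last non-None per position is the first non-None scanning backwards.
    return vec_to_first(list(reversed(list(rows))))

print(vec_to_first([[None, 2], [1, 5], [9, 9]]))   # → [1, 2]
print(vec_to_last([[None, 2], [1, 5], [9, None]]))  # → [9, 5]
```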

Non-aggregate math functions

The following functions are not aggregate functions, and accept two arguments (l, r) in the following forms:

  1. array, array
  2. array, number
  3. number, array

If number is provided instead of array, it is treated as if it were an array of the same length as the other argument, where every element has this value. In all cases the argument value types must be the same (for example both the integer type), and when two arrays are provided they must be of the same length. Each function returns an array of the same length as the input array(s) of the same type.
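These rules can be sketched with a generic helper (a Python model, with op standing in for +, -, *, or /; treating a None element in either input as producing None in that position is an assumption of this sketch):

```python
def vec_binop(l, r, op):
    """Element-wise op with scalar broadcasting; None models SQL NULL.
    A None element in either input yields None in that position."""
    if not isinstance(l, list):
        l = [l] * len(r)  # broadcast a scalar left argument
    if not isinstance(r, list):
        r = [r] * len(l)  # broadcast a scalar right argument
    if len(l) != len(r):
        raise ValueError("arrays must be the same length")
    return [None if a is None or b is None else op(a, b)
            for a, b in zip(l, r)]

print(vec_binop([1, 2, None], 10, lambda a, b: a + b))  # → [11, 12, None]
print(vec_binop(10, [2, 5], lambda a, b: a - b))        # → [8, 5]
```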

vec_add(l, r) RETURNS ANYARRAY

Returns each array position in the first argument added to the same position in the second argument.

vec_div(l, r) RETURNS ANYARRAY

Returns each array position in the first argument divided by the same position in the second argument.

vec_mul(l, r) RETURNS ANYARRAY

Returns each array position in the first argument multiplied by the same position in the second argument.

vec_sub(l, r) RETURNS ANYARRAY

Returns each array position in the second argument subtracted from the same position in the first argument.

Non-aggregate utility functions

The following are other non-aggregate functions that are useful in combination with the other functions provided by this extension.

pad_vec(ANYARRAY, INTEGER) RETURNS ANYARRAY

Returns an array with the same elements as the given array, but extended if necessary to the length given by the second argument. Any added elements are set to NULL. If the given array is already the given length, it is returned directly.

vec_coalesce(ANYARRAY, ANYELEMENT) RETURNS ANYARRAY

Returns an array with the same elements as the given array, except all NULL elements are replaced by the given second argument.
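A Python sketch of these two utilities (illustrative only; the behavior when the input is longer than the requested pad length is an assumption here):

```python
def pad_vec(arr, length):
    """Extend arr to the given length, filling new positions with None."""
    if len(arr) >= length:
        return arr  # already long enough; returned as-is (assumption)
    return arr + [None] * (length - len(arr))

def vec_coalesce(arr, default):
    """Replace each None element with the given default."""
    return [default if v is None else v for v in arr]

print(vec_coalesce(pad_vec([1, None], 4), 0))  # → [1, 0, 0, 0]
```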

vec_trim_scale(NUMERIC[]) RETURNS NUMERIC[]

Trims trailing zeros from NUMERIC elements, for example on the results of a vec_to_mean() operation. In Postgres 13 or later the built-in trim_scale function will be applied to each array element, which adjusts the scale of each numeric such that trailing zeros are dropped. For Postgres 12 or older a polyfill implementation of that function is used.

vec_without_outliers(ANYARRAY, ANYARRAY, ANYARRAY) RETURNS ANYARRAY

Useful to trim down the inputs to the other functions here. You pass it three arrays all of the same length and type. The first array has the actual values. The second array gives the minimum amount allowed in each position; the third array, the maximum. The function returns an array where each element is either the input value (if within the min/max) or NULL (if an outlier). You can include NULLs in the min/max arrays to indicate an unbounded limit there, or pass a simple NULL for either to indicate no bounds at all.
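The filtering logic can be sketched like so (a Python model of the described behavior):

```python
def vec_without_outliers(vals, mins, maxs):
    """Replace out-of-range elements with None (SQL NULL).
    A None bound, per element or as a whole argument, means unbounded."""
    n = len(vals)
    mins = mins if mins is not None else [None] * n
    maxs = maxs if maxs is not None else [None] * n
    out = []
    for v, lo, hi in zip(vals, mins, maxs):
        if v is None:
            out.append(None)
        elif lo is not None and v < lo:
            out.append(None)  # below the minimum: an outlier
        elif hi is not None and v > hi:
            out.append(None)  # above the maximum: an outlier
        else:
            out.append(v)
    return out

print(vec_without_outliers([1, 50, 3], [0, 0, 0], [10, 10, None]))
# → [1, None, 3]
```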

Testing

These tests follow the PGXS and pg_regress framework used for Postgres extensions, including Postgres's own contrib package. To run the tests, first install the extension somewhere then say make installcheck. You can use standard libpq envvars to control the database connection, e.g. PGPORT=5436 make installcheck.

Limitations/TODO

  • Lots of functions are still left to implement:

    • vec_to_min_max
    • vec_to_median
    • vec_to_mode
    • vec_to_percentile
    • vec_to_percentiles
    • vec_to_skewness
    • vec_to_kurtosis

Author

Paul A. Jungwirth

Benchmarks

You can get the same behavior as this extension by using UNNEST to break up the input arrays, and then array_agg to put the results back together. But these benchmarks show that aggs_for_vecs functions are 9-10 times faster:

 function        | UNNEST + array_agg | aggs_for_vecs
-----------------+--------------------+---------------
 vec_to_min      |         14150.7 ms |    1468.14 ms
 vec_to_max      |         14062.4 ms |    1549.66 ms
 vec_to_mean     |         14341.5 ms |    1586.62 ms
 vec_to_var_samp |         14196.7 ms |    1578.92 ms

Contributors

msqr, pjungwir

Issues

Allow processing more input rows than veccounts currently allows

I noticed that VecArrayBuildState.veccounts is an int array, which can vary in size across platforms and, assuming a size of 32 bits, allows up to about 2 billion input rows before overflowing. What about switching this to uint32 so the type is consistent across platforms and can handle up to about 4 billion input rows? I noticed cases where the int values were being cast to int32 already (example).

I did a proof of concept here that passes all tests. If you like the look of it, I could create a PR.

Thanks for your consideration.

Debian package support

I'd like to be able to include this extension in environments where a Debian binary package would be helpful, instead of having to build from source.

I researched a bit how to do this, and got the build working using debhelper in a feature/debian branch. See the debian/README for how I set up the build environment.

I added a deb recipe to the Makefile like this:

deb:
	make clean
	git archive --format tar --prefix=aggs-for-vecs-$(EXTENSION_VERSION)/ feature/debian |gzip >../aggs-for-vecs_$(EXTENSION_VERSION).orig.tar.gz
	pg_buildext updatecontrol
	make -f debian/rules debian/control
	dh clean
	make all
	dpkg-buildpackage -us -uc

You can see it's hard-coded to build from the feature/debian branch at the moment. I think that would eventually change to a tag or perhaps master.

I originally created the debian build files via

dh_make_pgxs -v 1.3.0 -h https://github.com/pjungwir/aggs_for_vecs

This turned the underscores in the project name to dashes, so the build packages are named like postgresql-12-aggs-for-vecs_1.3.0-1_amd64.deb (as Debian package names aren't allowed to have underscores, I think).

In any case, I'm new to building Debian packages like this, so I'm not sure if everything is configured correctly. If you have feedback, or would like me to submit a PR, I'm happy to do that.

Two-step aggregation

Per our discussion in #8, I started work on two-step aggregation support. The first goal is to support count/min/max/sum/mean access functions after a vec_stat_agg(numeric[]) aggregate. Note I use numeric[] here but hope the other number types can be supported as well. Essentially this would support a SQL statement like this:

SELECT vec_agg_count(vec_stat_agg(nums))
     , vec_agg_min(vec_stat_agg(nums))
     , vec_agg_max(vec_stat_agg(nums))
     , vec_agg_sum(vec_stat_agg(nums))
     , vec_agg_mean(vec_stat_agg(nums))
FROM measurements

I started exploring this on a feature/two-step-aggs branch. I have a lot to grok still... was hoping you might be able to provide feedback.

Originally I thought I could just return an arbitrary struct that all the vec_to_X() functions would know how to use, and I could define the aggregate like

CREATE OR REPLACE FUNCTION
vec_stat_agg_transfn(internal, numeric[])
RETURNS internal
AS 'aggs_for_vecs', 'vec_stat_agg_transfn'
LANGUAGE c;

CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, numeric[])
RETURNS internal
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;

CREATE AGGREGATE vec_stat_agg(numeric[]) (
  sfunc = vec_stat_agg_transfn,
  stype = internal,
  finalfunc = vec_stat_agg_finalfn,
  finalfunc_extra
);

-- vec_agg_count

CREATE OR REPLACE FUNCTION
vec_agg_count(internal)
RETURNS ANYARRAY
AS 'aggs_for_vecs', 'vec_agg_count'
LANGUAGE c;

However Postgres does not like this: it complains

ERROR:  unsafe use of pseudo-type "internal"
DETAIL:  A function returning "internal" must have at least one "internal" argument.

The only way I could get Postgres to accept this was by using anyarray and anyelement everywhere, like:

CREATE OR REPLACE FUNCTION
vec_stat_agg_transfn(internal, anyarray)
RETURNS internal
AS 'aggs_for_vecs', 'vec_stat_agg_transfn'
LANGUAGE c;

CREATE OR REPLACE FUNCTION
vec_stat_agg_finalfn(internal, anyarray)
RETURNS anyelement
AS 'aggs_for_vecs', 'vec_stat_agg_finalfn'
LANGUAGE c;

CREATE AGGREGATE vec_stat_agg(anyarray) (
  sfunc = vec_stat_agg_transfn,
  stype = internal,
  finalfunc = vec_stat_agg_finalfn,
  finalfunc_extra
);

-- vec_agg_count

CREATE OR REPLACE FUNCTION
vec_agg_count(anyelement)
RETURNS ANYARRAY
AS 'aggs_for_vecs', 'vec_agg_count'
LANGUAGE c;

But I started seeing how perhaps the vec_stat_agg aggregate's final function has to return an actual registered custom type? Sort of like their Complex example? Do you think that is how this needs to work, by defining a custom type?

I started looking into this, in a new stats.c file with this:

// Aggregate element statistics to use in two-step aggregation functions.
typedef struct VecAggElementStats {
  char    vl_len_[4];
  Oid     elementType;
  uint32  count;
  Datum   sum;
  Datum   min;
  Datum   max;
  Datum   mean;
} VecAggElementStats;

The idea here is that I'd need a variable-length type (as Numeric is variable length), and I'd store the aggregate state values as Datum so that perhaps all the number types could be encoded there, even though they are not variable length.

If this is to be an actual custom type, then I see I'd have to implement *_in and *_out functions to de/serialize the object from/to text. But then I was getting a bit lost in how to actually approach that, with the variable-length aspect involved using Numeric types. Do you have any pointers here, or can comment on the approach overall with regards to the need for a custom type like this?

Possible to use existing Postgres aggregate functions like numeric_avg_accum, towards two-step aggregation

Having now worked a little bit with the Postgres code to support NUMERIC, I noticed that internally Postgres provides a great set of functions for aggregating the NUMERIC type. For example, the avg(numeric) definition has:

{ aggfnoid => 'avg(numeric)', aggtransfn => 'numeric_avg_accum',
  aggfinalfn => 'numeric_avg', aggcombinefn => 'numeric_avg_combine',
  aggserialfn => 'numeric_avg_serialize',
  aggdeserialfn => 'numeric_avg_deserialize',
  aggmtransfn => 'numeric_avg_accum', aggminvtransfn => 'numeric_accum_inv',
  aggmfinalfn => 'numeric_avg', aggtranstype => 'internal',
  aggtransspace => '128', aggmtranstype => 'internal',
  aggmtransspace => '128' },

Internally the aggregates make use of state objects like NumericAggState. That same state object is shared between many of Postgres' aggregate functions, and (apparently) the optimizer knows that something like

SELECT count(numeric), sum(numeric), avg(numeric) FROM table_of_numerics

can accumulate/calculate the NumericAggState once for all three aggregates, and then just execute different final functions on that shared state to produce the desired count, sum, and average results.

I was thinking this approach would be really great applied here, on arrays of numbers, and the performance benefit could be substantial. I also use the Timescale extension, and they are working on a set of two-step aggregates that work great on single-value columns, but as I am working with array columns I want the same sort of result applied like this extension would, as vectors.

That all leads me to my first question: would it be possible/advisable to try to re-use the existing Postgres functions like numeric_avg_accum, but passing them elements of our input array and storing their result state in this extension's vecvalues array. Here's what I'm trying to express, in pseudo code:

typedef union pgnum {
  int16 i16;
  int32 i32;
  int64 i64;
  float4 f4;
  float8 f8;
  Datum datum; // to store the value returned from something like numeric_avg_accum
} pgnum;

Datum
vec_to_numeric_avg_accum(PG_FUNCTION_ARGS)
{
  // skipping a bunch of declarations
  get_typlenbyvalalign(elemTypeId, &elemTypeWidth, &elemTypeByValue, &elemTypeAlignmentCode);
  deconstruct_array(currentArray, elemTypeId, elemTypeWidth, elemTypeByValue, elemTypeAlignmentCode,
      &currentVals, &currentNulls, &currentLength);
  for (i = 0; i < currentLength; i++) {
    if (currentNulls[i]) {
      // do nothing: nulls can't change the result.
    } else {
      if (state->state.dnulls[i]) {
        state->state.dnulls[i] = false;
      }
      // forward our invocation to the numeric_avg_accum function, changing the passed value argument from
      // our array argument to the array element at index i
      state->vecvalues[i].datum = DirectFunctionCall2(numeric_avg_accum, state->vecvalues[i].datum, currentVals[i]);
    }
  }
  PG_RETURN_POINTER(state);
}

I tried to hack something together to see if this works, but it does not, reporting the error ERROR: aggregate function called in non-aggregate context. I think I understand why that happens: I'm calling DirectFunctionCall2, which will create a new FunctionCallInfo structure that does not include the same settings as the one passed to vec_to_numeric_avg_accum, especially whatever part makes the AggCheckCallContext() call succeed.

Would it be possible to essentially create a copy of the FunctionCallInfo passed to vec_to_numeric_avg_accum but with the array argument swapped to an element of that array, and pass that to the numeric_avg_accum somehow, maybe via FunctionCallInvoke?

Array math utilities like vec_add, vec_sub, vec_mul, vec_div

How would you feel about a PR to support core math utility functions like addition, subtraction, multiplication, and division between two array arguments? For example I have a use case where I'd like to be able to compute the difference between two array column row values using a window function. For example:

SELECT sensor_id, nums, vec_sub(nums, lag(nums) OVER (ORDER BY sensor_id)) AS diff FROM measurements
 sensor_id |       nums       |       diff        
-----------+------------------+-------------------
         1 |                  | 
         2 | {NULL,NULL,NULL} | 
         3 | {1.23,2.34,3.45} | {NULL,NULL,NULL}
         4 | {1.23,NULL,2.34} | {0.00,NULL,-1.11}
         5 | {1.23,2.34,4.45} | {0.00,NULL,2.11}
         6 | {1.23,1.23,4.56} | {0.00,-1.11,0.11}
(6 rows)

As a proof-of-concept I implemented a vec_sub function in SolarNetwork/aggs_for_vecs@a214882. I'd be happy to implement the other operations (add, mul, div) if this is something you'd like to include.

Support for variable-length input arrays

My application has array columns that can vary in length. Mostly they are the same length, but they can differ. When I query them, however, I know up front the maximum length of all rows for the column. I was wondering how feasible it would be to add support for passing an array length into the aggregate functions, like

SELECT vec_to_sum(array_column, 8)
FROM foo

If the length is provided, then the aggregates would process each row and only handle that many elements, always returning an array of the given length (8 in my example). If no length is provided it would operate as it currently does, taking the length from the first input row.

Add vec_to_first and vec_to_last aggregates.

I discovered another set of aggregates that would be useful to me: vec_to_first and vec_to_last. These return an array containing the first/last non-null values seen in the input arrays, per position. They are expected to be used with an ORDER BY clause, e.g.

SELECT vec_to_first(data_a ORDER BY ts)
FROM measurements2

I have implemented these in a feature/accum-firstlast branch and included tests, and the functions work with any element type (not just numbers). I've built off the feature/accum-cached-fcinfo branch as discussed in #10, but these changes don't depend on it, so they could be included without it if needed (although I'd prefer to include those as well!).

Support for NUMERIC[] type

Hello, I would really like to try this extension, as it could really help one particular application of mine, however I have NUMERIC[] columns. Would it be possible for this extension to support that? I've never looked into writing a Postgres extension myself, so thought I'd ask here to see how hard you thought it might be to do. Thanks for your time!
