Comments (32)
Btw, a conversion-free way to debug numeric values would be to use the `numeric_out` function:

```c
ereport(NOTICE, (errmsg("numeric = %s", DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(x))))));
```
from aggs_for_vecs.
I don't think it would be too hard to add numeric support. Are you interested in taking a stab at it? This extension is extremely simple and doesn't really go deep into Postgres stuff. If you are comfortable in C it should be easy to do. Or I'm happy to do it myself, but I'm unsure when I'll get spare time for it.
Thanks for your reply. I'm only so-so comfortable in C, but I could make an attempt. I poked around and found the `postgresql/server/utils/numeric.h` header, which I'm guessing is what I can use to get a `Numeric` pointer and perform manipulations on it?
I've started having a go at SolarNetwork/aggs_for_vecs@878ab04. That commit just includes support for the `vec_to_count` function accepting `NUMERIC[]` input. I'll start working on support in the other functions next. Comments are welcome, as this is new stuff for me.
I added support for `NUMERIC[]` in `vec_to_max`, but I am unsure if this is the correct way to handle the type, because it is a copy-by-reference type. In particular I'm not sure I've correctly handled copying the `Numeric` into the `state.dvalues` array. The method for comparing `Numeric` values I adapted from what I found in the btree_gist contrib module.
I'd appreciate it if you could review the change and see if it looks OK to you, before I continue working on the other functions.
Thanks for your work on this already!
Yeah, handling by-reference Datums can be tricky because of who owns the memory. But I think this is the right general approach. I did roughly the same thing for `vec_to_sum.c` today, but I'm getting a wrong result from the tests, so something is broken in my code. I may get a chance to look at it some more tomorrow.
A few small bits of feedback for your commits:
- I'd rather bump the version as a separate commit after all these changes.
- `DatumGetNumericCopy` is probably what we want from `DirectFunctionCall*` results, to get something into the aggregate memory context.
- I'd like to keep this extension compatible with older versions of Postgres (9.3+), unless we hit something that's too annoying. So not everything in the latest `numeric.h` is available. I don't see any problems in what you've written so far, but it's just something to keep in mind. For example, instead of calling `numeric_add_opt_error` we need to use `DirectFunctionCall2(numeric_add, ...)`. The numeric-related functions are in `src/include/catalog/pg_proc.dat`. (Search for "numeric".)
Okay, I pushed up my work on a branch called `numeric` so you can see what I did for `vec_to_sum`. Actually `DatumGetXXCopy` was not sufficient, because that puts the copy into the current memory context, and that was still just a row-by-row context. So whatever we put there would get overwritten on the next call to the transfn. Instead we need to change to the aggregate context before allocating `Numeric`s. And then actually we don't need a `...Copy`.
Oh great, thank you for taking the time to work through this so quickly! I can start anew based on your branch for the other functions.
I updated the `vec_to_count` function based on your work here. In SolarNetwork/aggs_for_vecs@d229c1a I tweaked the numeric values in the tests to be decimal numbers, as I'd like to verify the numeric operations, especially in `vec_to_mean`. I hope this is OK to include?
That looks great! And I agree that testing with not-whole numbers is better.
Awesome! OK, what do you think about these lines? If I didn't call `DatumGetNumericCopy(currentVals[i])` it was not working for me. This is working in my limited testing, but do you think it's correct?
This looks pretty good too, but `DatumGetNumericCopy` will still allocate the `Numeric` in the current memory context, which is a row-by-row context unless we change it with `MemoryContextSwitchTo` (like in `vec_to_sum.c`). Otherwise the next row can clobber what the Datum in `dvalues[i]` points to. So outside the loop I'd switch contexts if we're dealing with numerics.

I think we do still want `Copy` here, since we're just keeping the tuple's own value. But if you `#include <utils/datum.h>` (in `aggs_for_vecs.c`) then you can say `state->dvalues[i] = datumCopy(currentVals[i], elemTypeByValue, elemTypeWidth)` and it's not type-specific.
OK, that `datumCopy()` does make things easier, thanks for that.

I'm trying to implement `vec_to_mean` but can't seem to get it to work. If I add some debug logs, I see that the calculations are being performed and producing the expected result, but then the `NUMERIC[]` result returned from the function has garbage values in the non-`NULL` elements.
For example, with that debug logging included one of the tests runs like this:
```
✗ numeric mean lots
  (in test file test/vec_to_mean.bats, line 69)
    `[ "$result" = "{1.23,2.34,2.895}" ]' failed
  NOTICE: Before count: 2; f: 2.460000
  NOTICE: Final mean: 1.230000
  NOTICE: Before count: 1; f: 2.340000
  NOTICE: Final mean: 2.340000
  NOTICE: Before count: 2; f: 5.790000
  NOTICE: Final mean: 2.895000
  {6.9263787025407e-310,6.92637870254345e-310,6.9263787025462e-310}
```
The debug logs are converting the Numeric to a float8 and printing that, so the `NOTICE: Final mean: 1.230000` is the result of that conversion... and the expected result. But then you can see the output from the function is garbage.
The code in question, with the debug logs, looks like:
```c
for (i = 0; i < state->state.nelems; i++) {
  if (state->state.dnulls[i]) continue;
  if (state->inputElementType != NUMERICOID) {
    state->state.dvalues[i] = Float8GetDatum(state->vecvalues[i].f8);
  } else {
    ereport(NOTICE, (errmsg("Before count: %d; f: %f", state->veccounts[i], DatumGetFloat8(DirectFunctionCall1(numeric_float8, NumericGetDatum(state->vecvalues[i].num))))));
    Datum count = DirectFunctionCall1(int8_numeric, UInt32GetDatum(state->veccounts[i]));
    Datum div = DirectFunctionCall2(numeric_div, NumericGetDatum(state->vecvalues[i].num), count);
    ereport(NOTICE, (errmsg("Final mean: %f", DatumGetFloat8(DirectFunctionCall1(numeric_float8, div)))));
    state->state.dvalues[i] = div;
  }
}
```
Any ideas on this one?
I think the problem here is you are storing Numeric Datums in `dvalues`, but this function always returns `FLOAT[]`. So the bits get misinterpreted by the caller. Does this give correct values?

```c
state->state.dvalues[i] = DirectFunctionCall1(numeric_float8, div);
```
Btw, the reason I'm doing the sub/div/add thing is to avoid overflow/underflow problems. Certainly it would be faster to divide only once! Stack Overflow has a reference about that. I think `NUMERIC` will still have the same problem, where your total sum exceeds the available precision.
From a higher level, do you think it's desirable to return `FLOAT[]` when your inputs are `NUMERIC[]`? For other input types it makes sense (e.g. you don't want an `INT[]` result when you average `INT[]`s), but perhaps if you start with numerics you'd like to keep numerics?

In that case I think we can accommodate the user by having a separate `vec_to_mean(NUMERIC[])` function. I believe Postgres will prefer that over `vec_to_mean(anyarray)` when the input type matches. (We should double-check though.)
Ah, I wondered about the approach. Numeric math is just so much slower that I thought it would be better to do as few operations as possible.
> From a higher level, do you think it's desirable to return FLOAT[] when your inputs are NUMERIC[]?

Hmm, I thought not so much, from my perspective, which is to try to maintain the `NUMERIC` type during processing as much as possible. I admit that for something like `mean` perhaps this is overkill in practice. Even so, I know that in my application the results of this function will be going back into a `NUMERIC[]` array, so it does still seem ideal to me to have `NUMERIC[]` output for `NUMERIC[]` input.
> In that case I think we can accommodate the user by having a separate vec_to_mean(NUMERIC[]) function.

Does this imply we have a totally separate source file like `vec_to_mean_numeric.c` as well?
> Does this imply we have a totally separate source file like vec_to_mean_numeric.c as well?

I think that would be simplest, since we are branching on type for all the interesting parts. In theory a single C function can implement more than one Postgres function, but it doesn't often seem like the best approach.
OK, I started looking into this. I came across a bunch of helpful-looking stuff like `NumericAggState`, which I'm guessing is what Postgres uses internally for normal `NUMERIC` column-based aggregates. Is there any way to make use of these structures and associated functions?
Unfortunately that is private (since it's declared in the `.c` file, not the `.h` file).

If you are worried about performance, probably the best approach is to use a float8 internally like with the other types (even if we convert back to a `NUMERIC[]` for the final result). I don't think that will sacrifice any accuracy. In fact it may be better if it avoids intermediate rounding. (I'm not sure what numeric will do there.) But we should definitely not sum and divide once at the end, since that can give incorrect results.
I read a bit more about `NUMERIC`. I didn't realize it was that arbitrary-precision. :-) More importantly, the precision & scale settings don't "follow" the value; they are only applied when saving a value to a column. So I think your sum-everything-and-divide-at-the-end is fine actually. Sorry for asking for a different approach!
> Unfortunately that is private (since it's declared in the .c file not the .h file).

Drat, that's a shame, but I'll carry on with the current approach. Yes, `NUMERIC` has uber-precision, so it seems somewhat reasonable to add and then divide at the end. I managed to get this working now. I also incorporated `trim_scale` for the final output, which is what I would like to see, to remove the trailing zeros from the division operation, and I feel it's a reasonable default position to take. I made the change to `aggs_for_vecs--1.3.0.sql` in its own commit, to make it easy to deal with merge-wise later on if need be.
One test fails, however, and I wasn't able to see how to fix it. It is because of the overloaded `vec_to_mean()` aggregate function:
```
✗ int16 mean
  (in test file test/vec_to_mean.bats, line 6)
    `[ "$result" = "{1,2,2.5}" ]' failed
  ERROR:  function vec_to_mean(integer[]) is not unique
  LINE 1: SELECT vec_to_mean(ints) FROM measurements WHERE sensor_id I...
                 ^
  HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
```
I'm not sure why all the other tests pass while this one doesn't though.
OK, thanks for all that. I've pulled in your `separate-signatures` branch now. At the moment I still have the `vec_to_mean_numeric` code separate and put to the side, because as I've been testing this out on some real-world data (for my application) I'm finding something is not right in the other functions when handling NUMERIC: the function is crashing, and I'm not sure how to track down why. I've tried narrowing the area down with debug logs, but it hasn't revealed anything to me.

To help illustrate, I pushed up a change with some more test data to work with, in a new `measurements2` test table. I've added one test for the `vec_to_max` function that gets the crash to happen:
```
✗ numeric max data 01
  (in test file test/vec_to_max.bats, line 99)
    `[ "$result" = "{1.23,NULL,2.34}" ]' failed
  ERROR:  compressed data is corrupted
```
I'm guessing this is a memory-handling issue, as all the other tests pass (against the small set of data in the `measurements` table). See anything in `vec_to_max` that might be causing this?
I pushed up a fix to my own repo (same branch name: `feature/numeric2`). The "compressed data is corrupted" error was because you were missing a `datumCopy` in the loop that sets a new max value. The `currentValues` array is allocated in the tuple-by-tuple memory context, so you need to copy it into the aggregate context to make it last beyond the current call to our transfn.
Sorry I never really explained how Postgres memory contexts work: It is an arena system where there are nested "scopes" of finer & finer duration, and when a scope ends all the memory is considered freed, and it gets used again somewhere else later. There is a context for the database connection, for the current transaction, the current query, the life of an aggregate group, the current row, etc. To allocate from a context, you use palloc instead of malloc. Things like datumCopy call palloc. And palloc allocates from the current memory context. Some functions take an alternate context to allocate from, and you can always switch the current context if you like. The arena system means freeing memory is extremely fast. Also there isn't much fragmentation. And it's hard to leak memory. You usually don't even free things; you just let the whole arena get thrown away.
I also fixed the "numeric max data 01" test, because the expected values were incorrect. I did this to find the correct ones:
```
aggs_for_vecs_test=# select max(data_i[1]), max(data_i[2]), max(data_i[3]), max(data_i[4]), max(data_i[5]), max(data_i[6]), max(data_i[7]), max(data_i[8]), max(data_i[9]) from measurements2;
   max    |     max      |    max    | max  |   max    | max |    max    |   max    |   max
----------+--------------+-----------+------+----------+-----+-----------+----------+----------
 23200619 | 48587.514387 | 234.99962 | 50.5 | 13161880 |   1 | 332.20728 | 23730754 | 23623545
(1 row)
```
🎉 Thank you for the clear explanation of the memory contexts; I am slowly learning the conventions used in Postgres. Thanks for fixing that test expectation, I mistakenly put in the expectation for `vec_to_mean`! In any case, I pulled in your fixes and applied the same to `vec_to_min`, and added tests using the `measurements2` table for that and `vec_to_mean`. All are passing now, and I will continue testing on more of my application tomorrow.
Also FYI, the SQL I am using to generate those test expectations is modeled after what my application currently does, using `unnest()`, and it nicely outputs an array that can be copied/pasted right into the test:
```
aggs_for_vecs_test=# WITH di AS (
  SELECT idx_i, max(val_i) AS max_i
  FROM measurements2 d, unnest(d.data_i) WITH ORDINALITY AS a(val_i, idx_i)
  GROUP BY idx_i
)
SELECT array_agg(max_i ORDER BY idx_i) AS data_i_max FROM di;
                                  data_i_max
-------------------------------------------------------------------------------
 {23200619,48587.514387,234.99962,50.5,13161880,1,332.20728,23730754,23623545}
(1 row)
```
I used a similar approach for the `min`, `mean`, and `add` tests.
I've added support for NUMERIC in `vec_to_var_samp` and `vec_without_outliers`. I also tweaked the implementations to make use of the `numeric.h` functions added in Postgres 12, e.g.
```c
#if PG_VERSION_NUM < 120000
  state->vecvalues[i].num = DatumGetNumeric(DirectFunctionCall2(numeric_add, NumericGetDatum(state->vecvalues[i].num), currentVals[i]));
#else
  state->vecvalues[i].num = numeric_add_opt_error(state->vecvalues[i].num, DatumGetNumeric(currentVals[i]), NULL);
#endif
```
In order to share the `trim_scale` function between the different aggregate functions, I moved the code into a new function in `util.c`:

```c
Datum
trimScaleNumeric(Datum num);
```
There is a slight difference in the calculated results of `vec_to_var_samp` from what Postgres's `var_samp` does when applied to individual elements of the test data (example here). I tried to track down what Postgres is doing, and I think it comes down to how it handles the decimal scale in the internal numeric operations. It looked like it would require a fair bit of effort to actually figure out and then match the algorithm exactly, so I bailed and left the code as-is, figuring the small differences were acceptable as a possible "known limitation" for now?
That's great progress!
Why do you want to trim the scale before returning a result? It seems to me it might be better to let users choose when to do that by having our own vec_trim_scale function. If you think it's always desirable, can you explain why?
A few more fine details:
- In `trimScaleNumeric` I believe you want `num` not `div` on line 116.
- The `#if` right before that should not have a space before it.
- In `vec_to_mean_numeric_finalfn` we don't need the `#if` because it's repeated inside the `trimScaleNumeric` function body.
I have a couple other commits I'd like to include in 1.3.0: one adding `vec_to_weighted_mean` (received from someone via email) and another refactoring the tests to use the PGXN standard testing framework. (That commit also adds tests for all supported datatypes.) I know you've added some tests too, so I'm happy to convert those myself to spare you the trouble. What is the best commit/branch to pull down your tests so I can convert them for you?
I also want to add some "upgrade" files like `aggs_for_vecs--1.2.1--1.3.0.sql` to support `ALTER EXTENSION`. That shouldn't have any conflicts with your work though.
It's been fun collaborating on this together!
Thanks, and yes, I think I caught those oversights in a few later commits shortly after my main one. All my latest code is in the `feature/numeric2` branch. I did also verify/fix compiling for Postgres 9.6, but I don't have 9.3 readily available to confirm it supports that far back.
As for the `trim_scale` stuff, I had been thinking something like a `vec_trim_scale` function would be a great option. I was figuring that because Postgres quickly expands the decimal scale when doing things like multiplication/division, you often end up with trailing zeros where none existed in the input data. I agree that giving folks the option to trim via an explicit function like `vec_trim_scale` would preserve the choice to include it or not. I could give that a go, if you like, using `vec_without_outliers` as an example.
And yes, this has been a great learning experience for me, and quite fun!
I worked on the `trim_scale` function now in a `feature/trim-scale` branch:

- added a `vec_trim_scale` function via `vec_trim_scale.c`
- removed the automatic `trim_scale` calls from `vec_to_mean` and `vec_to_var_samp`
- moved the `trimScaleNumeric` function out of `util.c` and into `vec_trim_scale.c`; I kept that routine as a function just because I find the code cleaner being able to use `return` to exit the inner loop early. If you prefer not to have that, I could move the code into the body of the `vec_trim_scale` function.
If you're happy with the `feature/trim-scale` branch, I could merge that back into `feature/numeric2`.
Your `vec_trim_scale` work looks good to me. I just see two minor things to fix:

- On line 72 there is still an extra space before `#if PG_VERSION_NUM >= 130000`.
- On line 44 you don't need `valsLength = (ARR_DIMS(valsArray))[0];` because you set it with `deconstruct_array`. (Line 47 passes `&valsLength`.)
After that let's go ahead and merge it!
Excellent, thanks for those. I fixed those items and merged everything back into the `feature/numeric2` branch now.
I think #6 achieved this.