
Comments (32)

pjungwir commented on July 19, 2024

Btw a conversion-free way to debug numeric values would be to use the numeric_out function:

ereport(NOTICE, (errmsg("numeric = %s", DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(x))))));

from aggs_for_vecs.

pjungwir commented on July 19, 2024

I don't think it would be too hard to add numeric support. Are you interested in taking a stab at it? This extension is extremely simple and doesn't really go deep into Postgres stuff. If you are comfortable in C it should be easy to do. Or I'm happy to do it myself, but I'm unsure when I'll get spare time for it.

msqr commented on July 19, 2024

Thanks for your reply. I'm only so-so comfortable in C, so I could make an attempt. I poked around and found the postgresql/server/utils/numeric.h header which I'm guessing is what I can use to get a Numeric pointer and perform manipulations on it?

msqr commented on July 19, 2024

I've started having a go at SolarNetwork/aggs_for_vecs@878ab04. That commit just includes support for the vec_to_count function accepting NUMERIC[] input. I'll start working on support in the other functions next. Comments are welcome, as this is new stuff for me.

msqr commented on July 19, 2024

I added support for NUMERIC[] in vec_to_max, but I am unsure if this is the correct way to handle the type, because it is a pass-by-reference type. In particular I'm not sure I've correctly handled copying the Numeric into the state.dvalues array. I adapted the method for comparing Numeric values from the btree_gist contrib module.

I'd appreciate it if you could review the change and see if it looks OK to you, before I continue working on the other functions.

pjungwir commented on July 19, 2024

Thanks for your work on this already!

Yeah handling by-reference Datums can be tricky because of who owns the memory. But I think this is the right general approach. I did roughly the same thing for vec_to_sum.c today, but I'm getting a wrong result from the tests, so something is broken with my code. I may get a chance to look at it some more tomorrow.

A few small bits of feedback for your commits:

  • I'd rather bump the version as a separate commit after all these changes.
  • DatumGetNumericCopy is probably what we want from DirectFunctionCall* results, to get something into the aggregate memory context.
  • I'd like to keep this extension compatible with older versions of Postgres (9.3+), unless we hit something that's too annoying. So not everything in the latest numeric.h is available. I don't see any problems in what you've written so far, but it's just something to keep in mind. For example instead of calling numeric_add_opt_error we need to use DirectFunctionCall2(numeric_add, ...). The numeric-related functions are in src/include/catalog/pg_proc.dat. (Search for "numeric".)

pjungwir commented on July 19, 2024

Okay I pushed up my work on a branch called numeric so you can see what I did for vec_to_sum. Actually DatumGetXXCopy was not sufficient because that puts the copy into the current memory context, and that was still just a row-by-row context. So whatever we put there would get overwritten on the next call to the transfn. Instead we need to change to the aggregate context before allocating Numerics. And then actually we don't need a ...Copy.

msqr commented on July 19, 2024

Oh great, thank you for taking the time to work through this so quickly! I can start anew based on your branch for the other functions.

msqr commented on July 19, 2024

I updated the vec_to_count function based on your work here. In SolarNetwork/aggs_for_vecs@d229c1a I tweaked the numeric values in the tests to be decimal numbers, as I'd like to verify the numeric operations, especially in vec_to_mean. I hope this is OK to include?

pjungwir commented on July 19, 2024

That looks great! And I agree that testing with not-whole numbers is better.

msqr commented on July 19, 2024

Awesome! OK, what do you think about these lines? If I didn't call DatumGetNumericCopy(currentVals[i]), it didn't work for me. It works in my limited testing, but do you think it's correct?

pjungwir commented on July 19, 2024

This looks pretty good too, but DatumGetNumericCopy will still allocate the Numeric in the current memory context, which is a row-by-row context unless we change it with MemoryContextSwitchTo (like in vec_to_sum.c). Otherwise the next row can clobber what the Datum in dvalues[i] points to. So outside the loop I'd switch contexts if we're dealing with numerics.

I think we do still want Copy here since we're just keeping the tuple's own value. But if you #include <utils/datum.h> (in aggs_for_vecs.c) then you can say state->dvalues[i] = datumCopy(currentVals[i], elemTypeByValue, elemTypeWidth) and it's not type-specific.

msqr commented on July 19, 2024

OK, that datumCopy() does make things easier, thanks for that.

I'm trying to implement vec_to_mean but can't seem to get it to work. If I add some debug logs, I see that the calculations are being performed and producing the expected result, but then the resulting NUMERIC[] returned from the function has garbage values in the non-NULL elements.

For example, with that debug logging included one of the tests runs like this:

 ✗ numeric mean lots
   (in test file test/vec_to_mean.bats, line 69)
     `[ "$result" = "{1.23,2.34,2.895}" ]' failed
   NOTICE:  Before count: 2; f: 2.460000
   NOTICE:  Final mean: 1.230000
   NOTICE:  Before count: 1; f: 2.340000
   NOTICE:  Final mean: 2.340000
   NOTICE:  Before count: 2; f: 5.790000
   NOTICE:  Final mean: 2.895000
   {6.9263787025407e-310,6.92637870254345e-310,6.9263787025462e-310}

The debug logs are converting the Numeric to a float8 and printing that, so the NOTICE: Final mean: 1.230000 is the result of that conversion... and the expected result. But then you can see the output from the function is garbage.

The code in question, with the debug logs, looks like:

  for (i = 0; i < state->state.nelems; i++) {
    if (state->state.dnulls[i]) continue;
    if (state->inputElementType != NUMERICOID) {
      state->state.dvalues[i] = Float8GetDatum(state->vecvalues[i].f8);
    } else {
      ereport(NOTICE, (errmsg("Before count: %d; f: %f", state->veccounts[i], DatumGetFloat8(DirectFunctionCall1(numeric_float8, NumericGetDatum(state->vecvalues[i].num))))));
      Datum count = DirectFunctionCall1(int8_numeric, UInt32GetDatum(state->veccounts[i]));
      Datum div = DirectFunctionCall2(numeric_div, NumericGetDatum(state->vecvalues[i].num), count);
      ereport(NOTICE, (errmsg("Final mean: %f", DatumGetFloat8(DirectFunctionCall1(numeric_float8, div)))));
      state->state.dvalues[i] = div;
    }
  }

Any ideas on this one?

pjungwir commented on July 19, 2024

I think the problem here is you are storing Numeric Datums in dvalues, but this function always returns FLOAT[]. So the bits get misinterpreted by the caller. Does this give correct values?:

state->state.dvalues[i] = DirectFunctionCall1(numeric_float8, div);

Btw the reason I'm doing the sub/div/add thing is to avoid overflow/underflow problems. Certainly it would be faster to divide only once! Stack Overflow has a reference about that. I think NUMERIC will still have the same problem, where your total sum exceeds the available precision.

From a higher level, do you think it's desirable to return FLOAT[] when your inputs are NUMERIC[]? For other input types it makes sense (e.g. you don't want an INT[] result when you average INT[]s), but perhaps if you start with numerics you'd like to keep numerics?

In that case I think we can accommodate the user by having a separate vec_to_mean(NUMERIC[]) function. I believe Postgres will prefer that over vec_to_mean(anyarray) when the input type matches. (We should double check though.)

msqr commented on July 19, 2024

Ah, I wondered about the approach. Numeric math is just so much slower that I thought it would be better to do as few operations as possible.

From a higher level, do you think it's desirable to return FLOAT[] when your inputs are NUMERIC[]?

Hmm, I thought not so much from my perspective, which is to try to maintain the NUMERIC type during processing as much as possible. I admit that for something like mean perhaps this is overkill in practice. Even so, I know that in my application the results of this function will be going back into a NUMERIC[] array, so it does still seem ideal to me to have NUMERIC[] output for NUMERIC[] input.

In that case I think we can accommodate the user by having a separate vec_to_mean(NUMERIC[]) function.

Does this imply we have a totally separate source file like vec_to_mean_numeric.c as well?

pjungwir commented on July 19, 2024

Does this imply we have a totally separate source file like vec_to_mean_numeric.c as well?

I think that would be simplest since we are branching on type for all the interesting parts. In theory a single C function can implement more than one Postgres function, but it doesn't often seem like the best approach.

msqr commented on July 19, 2024

OK, I started looking into this. I came across a bunch of helpful-looking stuff like NumericAggState that I'm guessing is what Postgres uses internally for normal aggregates over NUMERIC columns. Is there any way to make use of these structures and associated functions?

pjungwir commented on July 19, 2024

Unfortunately that is private (since it's declared in the .c file not the .h file).

If you are worried about performance, probably the best approach is to use a float8 internally like with the other types (even if we convert back to a NUMERIC[] for the final result). I don't think that will sacrifice any accuracy. In fact it may be better if it avoids intermediate rounding. (I'm not sure what numeric will do there.) But we should definitely not sum and divide once at the end since that can give incorrect results.

pjungwir commented on July 19, 2024

I read a bit more about NUMERIC. I didn't realize it was that arbitrary precision. :-) More importantly, the precision & scale settings don't "follow" the value; they are only applied when saving a value to a column. So I think your sum-everything-and-divide-at-the-end is fine actually. Sorry for asking for a different approach!

msqr commented on July 19, 2024

Unfortunately that is private (since it's declared in the .c file not the .h file).

Drat, that's a shame, but I will carry on with the current approach. Yes, NUMERIC has uber-precision, so it seems somewhat reasonable to add and then divide at the end. I managed to get this working now. I also incorporated trim_scale for the final output, which is what I would like to see to remove the trailing zeros from the division operation, and I feel it's a reasonable default position to take. I made the change to aggs_for_vecs--1.3.0.sql in its own commit, to make it easy to deal with merge-wise later on if need be.

One test fails, however, and I wasn't able to see how to fix it. It is because of the overloaded vec_to_mean() aggregate function:

 ✗ int16 mean
   (in test file test/vec_to_mean.bats, line 6)
     `[ "$result" = "{1,2,2.5}" ]' failed
   ERROR:  function vec_to_mean(integer[]) is not unique
   LINE 1: SELECT vec_to_mean(ints) FROM measurements WHERE sensor_id I...
                  ^
   HINT:  Could not choose a best candidate function. You might need to add explicit type casts.

I'm not sure why all the other tests pass while this one doesn't though.

pjungwir commented on July 19, 2024

pjungwir commented on July 19, 2024

msqr commented on July 19, 2024

OK, thanks for all that. I've pulled in your separate-signatures branch now. At the moment I still have the vec_to_mean_numeric code separate and put to the side, because as I've been testing this out on some real-world data (for my application) I'm finding something is not right in the other functions when handling NUMERIC: the function is crashing, and I'm not sure how to track down why. I've tried narrowing the area down with debug logs, but it hasn't revealed anything to me.

To help illustrate, I pushed up a change with some more test data to work with in a new test measurements2 table. I've added one test for the vec_to_max function that gets the crash to happen:

 ✗ numeric max data 01
   (in test file test/vec_to_max.bats, line 99)
     `[ "$result" = "{1.23,NULL,2.34}" ]' failed
   ERROR:  compressed data is corrupted

I'm guessing this is a memory-handling issue, as all the other tests pass (against the small set of data in the measurements table). See anything in vec_to_max that might be causing this?

pjungwir commented on July 19, 2024

I pushed up a fix to my own repo (same branch name: feature/numeric2). The "compressed data is corrupted" error was because you were missing a datumCopy in the loop to set a new max value. The currentValues array is allocated in the tuple-by-tuple memory context, so you need to copy it into the aggregate context to make it last beyond the current call to our transfn.

Sorry I never really explained how Postgres memory contexts work: It is an arena system where there are nested "scopes" of finer & finer duration, and when a scope ends all the memory is considered freed, and it gets used again somewhere else later. There is a context for the database connection, for the current transaction, the current query, the life of an aggregate group, the current row, etc. To allocate from a context, you use palloc instead of malloc. Things like datumCopy call palloc. And palloc allocates from the current memory context. Some functions take an alternate context to allocate from, and you can always switch the current context if you like. The arena system means freeing memory is extremely fast. Also there isn't much fragmentation. And it's hard to leak memory. You usually don't even free things; you just let the whole arena get thrown away.

I also fixed the "numeric max data 01" test, because the expected values were incorrect. I did this to find the correct ones:

aggs_for_vecs_test=# select max(data_i[1]), max(data_i[2]), max(data_i[3]), max(data_i[4]), max(data_i[5]), max(data_i[6]), max(data_i[7]), max(data_i[8]), max(data_i[9]) from measurements2;
   max    |     max      |    max    | max  |   max    | max |    max    |   max    |   max    
----------+--------------+-----------+------+----------+-----+-----------+----------+----------
 23200619 | 48587.514387 | 234.99962 | 50.5 | 13161880 |   1 | 332.20728 | 23730754 | 23623545
(1 row)

msqr commented on July 19, 2024

🎉 Thank you for the clear explanation on the memory contexts, I am slowly learning the conventions used in Postgres. Thanks for fixing that test expectation, I mistakenly put in the expectation for vec_to_mean! In any case, I pulled in your fixes and applied the same to vec_to_min, and added tests using the measurements2 table for that and vec_to_mean. All are passing now, and I will continue testing on more of my application tomorrow.

Also FYI, the SQL I am using to generate those test expectations is modeled after what my application currently does, using unnest(), and nicely outputs an array that can be copied/pasted right into the test:

aggs_for_vecs_test=# WITH di AS (
  SELECT idx_i, max(val_i) AS max_i
  FROM measurements2 d, unnest(d.data_i) WITH ORDINALITY AS a(val_i, idx_i)
  GROUP BY idx_i
)
SELECT array_agg(max_i ORDER BY idx_i) AS data_i_max FROM di;
                                  data_i_max                                   
-------------------------------------------------------------------------------
 {23200619,48587.514387,234.99962,50.5,13161880,1,332.20728,23730754,23623545}
(1 row)

I used a similar approach for the min, mean, and add tests.

msqr commented on July 19, 2024

I've added support for NUMERIC in vec_to_var_samp and vec_without_outliers. I also tweaked the implementations to make use of the numeric.h functions added in Postgres 12, e.g.

#if PG_VERSION_NUM < 120000
      state->vecvalues[i].num = DatumGetNumeric(DirectFunctionCall2(numeric_add, NumericGetDatum(state->vecvalues[i].num), currentVals[i]));
#else
      state->vecvalues[i].num = numeric_add_opt_error(state->vecvalues[i].num, DatumGetNumeric(currentVals[i]), NULL);
#endif

In order to share the trim_scale logic between the different aggregate functions, I moved the code into a new function in util.c:

Datum
trimScaleNumeric(Datum num);

There is a slight difference between the results of vec_to_var_samp and what Postgres's var_samp produces when applied to individual elements of the test data (example here). I tried to track down what Postgres is doing, and I think it comes down to how it handles the decimal scale in the internal numeric operations. It looked like it would require a fair bit of effort to figure out and then match the algorithm exactly, so I bailed and left the code as-is, figuring the small differences were acceptable as a possible "known limitation" for now?

pjungwir commented on July 19, 2024

That's great progress!

Why do you want to trim the scale before returning a result? It seems to me it might be better to let users choose when to do that by having our own vec_trim_scale function. If you think it's always desirable, can you explain why?

A few more fine details:

  • In trimScaleNumeric I believe you want num not div on line 116.
  • The #if right before that should not have a space before it.
  • In vec_to_mean_numeric_finalfn we don't need the #if because it's repeated inside the trimScaleNumeric function body.

I have a couple other commits I'd like to include in 1.3.0: one adding vec_to_weighted_mean (received from someone via email) and another refactoring the tests to use the PGXN standard testing framework. (That commit also adds tests for all supported datatypes.) I know you've added some tests too, so I'm happy to convert those myself to spare you the trouble. What is the best commit/branch to pull down your tests so I can convert them for you?

I also want to add some "upgrade" files like aggs_for_vecs--1.2.1--1.3.0.sql to support ALTER EXTENSION. That shouldn't have any conflicts with your work though.

It's been fun collaborating on this together!

msqr commented on July 19, 2024

Thanks, and yes I think I caught those oversights in a few later commits shortly after my main one. All my latest code is in the feature/numeric2 branch. I did also verify/fix compiling for Postgres 9.6, but I don't have 9.3 readily available to confirm it supports that far back.

As for the trim_scale stuff I had been thinking something like a vec_trim_scale would be a great option. I was figuring that because Postgres quickly expands the decimal scale when doing things like multiplication/division, you often end up with trailing zeros where none existed on the input data. I agree that giving folks the option to trim via an explicit function like vec_trim_scale would preserve the choice to include it or not. I could give that a go, if you like, having the vec_without_outliers as an example.

And yes, this has been a great learning experience for me, and quite fun!

msqr commented on July 19, 2024

I worked on the trim_scale function now in a feature/trim-scale branch:

  1. added a vec_trim_scale function via vec_trim_scale.c
  2. removed the automatic trim_scale calls from vec_to_mean and vec_to_var_samp
  3. moved the trimScaleNumeric function out of util.c and into vec_trim_scale.c; I kept that routine as a function just because I find the code cleaner being able to use return to exit the inner loop early. If you prefer not to have that, I could move the code into the body of the vec_trim_scale function.

If you're happy with the feature/trim-scale branch, I could merge that back into feature/numeric2.

pjungwir commented on July 19, 2024

Your vec_trim_scale work looks good to me. I just see two minor things to fix:

  • On line 72 there is still an extra space before #if PG_VERSION_NUM >= 130000.
  • On line 44 you don't need valsLength = (ARR_DIMS(valsArray))[0]; because you set it with deconstruct_array. (Line 47 passes &valsLength.)

After that let's go ahead and merge it!

msqr commented on July 19, 2024

Excellent, thanks for those. I fixed those items and merged everything back into the feature/numeric2 branch now.

pjungwir commented on July 19, 2024

I think #6 achieved this.
