Return an list of percentiles for a given column.

Stumbled upon this, which looks like a good starting point: <a href="http://stackoverf

Ah, that's a good find, as my next question after looking at this: <a href="http://ana

NumberColumn.percentiles,about wireservice/agate

Comments (36)

jheasly commented on July 18, 2024

Full disclosure: Never calculated a percentile in my life, but you said 'Newbs welcome.'
Is there a vision for the structure of the result? Like mebbe a list of dicts ? i.e., ({10: xx}, {20: xx}, {30: xx}, {40: xx}, {50: xx}, {60: xx}, {70: xx}, {80: xx}, {90: xx})
(I like to start with the result, establish target, then work backward.)
p.s.: In the contrib docs, 'receive' in point 9 is misspelled.

from agate.

onyxfish commented on July 18, 2024

Typo fixed!

I've imagined this just returns a list of 100 values: [1.2, 3.4, 4.5, ...]. I can see the temptation to make it a dict, but a list allows you to say "gimme the first percentile" by just doing percentiles[0], which seems right. I'd like to follow the same pattern for quartiles (#45) and quintiles (#46), so that they are all consistent.

Given that the "10", "20", etc is implied by the order and the operation, I don't think they need to be specified as dictionary keys..

from agate.

onyxfish commented on July 18, 2024

Oh and newbs are certainly welcome. Will mark this ticket as "in-progress."

from agate.

onyxfish commented on July 18, 2024

Stumbled upon this, which looks like a good starting point: http://stackoverflow.com/a/2753343/24608

#45 and #46 can be implemented by reference to this.

from agate.

jheasly commented on July 18, 2024

Ah, that's a good find, as my next question after looking at this: Quantiles, Percentiles: Why so many ways to calculate them?, which references (at least) nine methods here was going to be which one to use? Thanks!

from agate.

onyxfish commented on July 18, 2024

Ha, yeah. I honestly don't know a good answer to that question. Best thing to do (I think) would be to implement one that seems straightforward and then check it against other implementations (R? Excel?) Docstring should probably make note of what algorithm we chose / where we got it from.

from agate.

onyxfish commented on July 18, 2024

Btw, another hint: I think the right way to structure this is:

Column.percentile(n) returns the nth percentile
Column.percentiles() returns an array of 100 percentiles by calling the former 100 times

That way if you only need one you don't have to compute them all.

from agate.

jheasly commented on July 18, 2024

As to a good answer for the question of which method to use; Yeah, you and me both! FWIW, according to this, Excel and R both use the R-7 method. How those map to the stackoverflow one, I haven't quite figured out ...
Will note the algorithm source in a docstring.
And your structure hint seems a good one.
One thing I'm noticing immediately is the stackoverflow method doesn't return a member of the original set. i.e., if you've got this list of values:

(1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9)

the stackoverflow percentile() method returns:

percentile  result
==========  ======
0.01        1.22
0.02        1.44
0.03        1.66
0.04        1.88
0.05        2.1
0.06        2.32
0.07        2.54
0.08        2.76
 ...

rather than:

percentile  result
==========  ======
0.01        1
0.02        1
0.03        1
0.04        1
0.05        3
0.06        3
0.07        3
0.08        3
 ...

Don't we want the second result set rather than the first?

from agate.

jheasly commented on July 18, 2024

Hrmm. Looking at how OpenOffice does it (I don't have Excel), it returns values like the first list:
With a column A:

	A
1	1
2	3
3	4
4	5
5	5
6	5
7	5
8	6
9	7
10	8
11	8
12	9

=PERCENTILE(A1:A12; 0.25) returns 4.75 (which is the same valute returned by our stackoverflow method, yay!) so I guess this is a long way of saying go with the stackoverflow formula, no?

from agate.

onyxfish commented on July 18, 2024

Yes, def. We want the exact percentile, not whatever value in the list happens to fall closest.

from agate.

jheasly commented on July 18, 2024

OT: Is there in the docs a use case for a NumberColumn over an IntegerColumn?
Or is it just more of a case that a NumberColumn has characteristics of/is a base class for both an IntegerColumn and DecimalColumn?
Just curious ...

from agate.

onyxfish commented on July 18, 2024

It's the latter, though, as documented here:
#64 I'm probably going to be
eliminating both IntegerColumn and DecimalColumn. I think it makes more
sense to just have NumberColumn treat everything as Decimal type. I've been
putting off making the changes because it'll be a hard thing to undo and I
want to be certain I'm not shutting out any useful cases by doing that.
Thoughts.

Chris

On Wed, Apr 30, 2014 at 10:29 AM, John Heasly [email protected]:

OT: Is there in the docs a use case for an IntergerColumn over a
NumberColumn?
Or was it more of a case that a NumberColumn has characteristics of/is a
base class for both an IntegerColumn and DecimalColumn?
Just curious ...

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41810845
.

from agate.

jheasly commented on July 18, 2024

Hrm, well, I can't really speak from a statistics perspective, as I don't reallly have one(!).
The only thing that comes to mind is somtimes folk from the not-so-computational side on NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.
I guess at the end of the day, it's a really more of an issue of what does the Major Spreadsheet™ (Excel) and its Knock-offs (OpenOffice, Google) offer?

from agate.

jheasly commented on July 18, 2024

Now that I'm spelunking around, this is some finely crafted/workmanlike stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!

from agate.

onyxfish commented on July 18, 2024

Thanks! I appreciate that! :)

I am sensitive to the "no unnecessary rounding/conversion" issues, however,
I'm less worried about that with a purely analytical library than I would
be if there was a presentation component to this. Eliminating the extra
type has one major benefit: you couldn't specify the wrong type when
computing a new column. So for instance, if you create a column by dividing
two integers, you need to make sure you specify the new column is Decimal.
I can see people getting this wrong and it creating errors.

As best I can discern, Excel, OO, Google, etc all treat numbers internally
as decimals and it's only at presentation time that something is
"int-ifed". I know from writing csvkit that the Excel file formats only
represent numbers one way, so I presume that's true and runtime as well.

Chris

On Wed, Apr 30, 2014 at 11:03 AM, John Heasly [email protected]:

Now that I'm spelunking around, this is some finely crafted/workmanlike
stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41815141
.

from agate.

jheasly commented on July 18, 2024

Kind of like storing datetimes as UTC and converting to local at presentation time; a tried-and-true approach.

from agate.

onyxfish commented on July 18, 2024

Exactly. I've been mulling this for a week and the only reason I've been
able to come up with not to eliminate IntColumn is that sometimes it might
be more performant. That's not a good enough reason, so I'll probably pull
the trigger on this today.

On Wed, Apr 30, 2014 at 11:13 PM, John Heasly [email protected]:

Kind of like storing datetimes as UTC and converting to local at
presentation time; a tried-and-true approach.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41878639
.

from agate.

jheasly commented on July 18, 2024

Got something (finally) to test locally. Tonight I'd like to merge your latest from today into my fork and give my stuff a whirl ...

from agate.

onyxfish commented on July 18, 2024

Great! I made the changes to kill IntColumn and FloatColumn this morning,
which ended up being messier than I had hoped, but I think it's sorted now.
The cast and validate args when creating a table are gone now too. Data is
always cast the first time a table is created.

On Thu, May 1, 2014 at 11:08 AM, John Heasly [email protected]:

Got something (finally) to test locally. Tonight I'd like to merge your
latest from today into my fork and give my stuff a whirl ...

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/35#issuecomment-41924645
.

from agate.

onyxfish commented on July 18, 2024

(Just putting this here for reference later.)

Numpy's implementation of percentile:
https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/function_base.py#L2720

(It's predictably unintelligible.)

from agate.

onyxfish commented on July 18, 2024

TIL: the generalized form is actually the quantile.

http://en.wikipedia.org/wiki/Quantile

from agate.

jheasly commented on July 18, 2024

The numpy implementation unintelligibility did not disappoint.
The wikipedia Quantile explanation was a little more friendly/grok-able.

from agate.

jheasly commented on July 18, 2024

I've got the argumentless, return-a-list-of-100 percentile values working pretty good.

But in executing the

Column.percentile(n) returns the nth percentile

bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator. I get this:

Traceback (most recent call last):
  File "./test_script.py", line 68, in <module>
    pct = states.columns['total'].percentile(5)
  File "/Users/jheasly/Development/journalism/journalism/columns.py", line 73, in check
    return func(c)
TypeError: percentile() takes exactly 2 arguments (1 given)

(The objectionable, referenced line 73 is here.)

If I comment out the decorator, it works fine. What to do?

Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?

And on a housekeeping note, I won't be able to hit this again until sometime Saturday. It's been a bit more demanding than I'd anticipated when I raised my hand, but I'm learning and it's fun. Thanks fer puttin' up with my nonsense.

from agate.

onyxfish commented on July 18, 2024

Awesome! Really excited to have this included.

bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator.

That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.

Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?

You know I did this same sort of thing for Table.__init__ last night and was then reminded (after some Googling) that this is actually considered bad juju in Python. Due to it's "duck-typed" nature, you generally shouldn't test for type. Even though you expect an integer, a float, Decimal, or something other invented type could also be valid.

That being said, it's perfectly valid to test for value, so I'd check that it's a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.

No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.

from agate.

onyxfish commented on July 18, 2024

On a whim did some more reading about this. Bugger, this stuff is a lot more complex / less standard than I realized. Didn't realize there was so much disagreement about how to calculate percentiles!

from agate.

jheasly commented on July 18, 2024

Awesome! Really excited to have this included.

Cool! Happy to have been of use!

so I'd check that it's a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.

Sounds good. Nifty integer check!

No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.

Awesomeness. Happy to do what I can.

a lot more complex / less standard than I realized.

Yeah, no kidding. Who knew stats could be such a untamed wilderness?

from agate.

onyxfish commented on July 18, 2024

Hi John, is this ready for a pull request?

from agate.

jheasly commented on July 18, 2024

Hi Chris!
I was waiting for this:

That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.

But I can omit the decorator, clean-up and run against my local little test script.

As for real testing, is there a recommended bit of test_columns.py to use as a model?

from agate.

onyxfish commented on July 18, 2024

Doh! That's my fault. I forgot all about it. I've made myself a high-pri ticket and will get to it soon:

#122

For your testing I'd probably look at something like test_counts. That seems like the most similar method.

from agate.

jheasly commented on July 18, 2024

High-pri ticket(!). Whoa.

I'll muck about with some test-making tonight.

from agate.

onyxfish commented on July 18, 2024

Blocker resolved by @mickaobrien!

from agate.

jheasly commented on July 18, 2024

Excellent! Diving into the unittest module ...

from agate.

jheasly commented on July 18, 2024

Coupla questions:
• Right now, percentile() returns a list (running the unittests revealed this to me!), whether it's delivering a requested percentile or the whole shebang. A list is okay, right?
• Speaking of the whole shebang (all 100 percentiles), should I make a test for that case too?

from agate.

onyxfish commented on July 18, 2024

HI John! I merged this on a flight today so I didn't have access to your comments. I changed it return a single value when a single value is requested and also made a few other small changes. And I added a few unit tests. I've opened two new tickets for minor outstanding issues, #129 and #130, but I wanted to go ahead and merge it.

Thanks again for the contribution! I also added you to the AUTHORS file.

from agate.

jheasly commented on July 18, 2024

Hey Chris! Thanks for the improvements in both the code and the tests. They're educational for me to look at/grok/study. I appreciate that.

And thanks for the AUTHORS addition. Next time I visit, I'm going to show my Mom.

from agate.

onyxfish commented on July 18, 2024

Absolutely! Please let me know if any of the changes don't make sense. Mostly I just pruned back things we weren't using from the original source (lambda x: x) and tried to clean up the variable names, etc. (Also there was one issue where you were casting float to Decimal, which is bad juju and fails on Python 2.6).

Thanks again!

from agate.

NumberColumn.percentiles about agate HOT 36 CLOSED

Comments (36)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent