Comments (36)
Full disclosure: Never calculated a percentile in my life, but you said 'Newbs welcome.'
Is there a vision for the structure of the result? Like mebbe a list of dicts? i.e., ({10: xx}, {20: xx}, {30: xx}, {40: xx}, {50: xx}, {60: xx}, {70: xx}, {80: xx}, {90: xx})
(I like to start with the result, establish target, then work backward.)
p.s.: In the contrib docs, 'receive' in point 9 is misspelled.
from agate.
Typo fixed!
I've imagined this just returns a list of 100 values: [1.2, 3.4, 4.5, ...]. I can see the temptation to make it a dict, but a list allows you to say "gimme the first percentile" by just doing percentiles[0], which seems right. I'd like to follow the same pattern for quartiles (#45) and quintiles (#46), so that they are all consistent.
Given that the "10", "20", etc. is implied by the order and the operation, I don't think they need to be specified as dictionary keys.
from agate.
Oh and newbs are certainly welcome. Will mark this ticket as "in-progress."
from agate.
Stumbled upon this, which looks like a good starting point: http://stackoverflow.com/a/2753343/24608
#45 and #46 can be implemented by reference to this.
from agate.
Ah, that's a good find. After looking at "Quantiles, Percentiles: Why so many ways to calculate them?", which references (at least) nine methods, my next question was going to be: which one to use? Thanks!
from agate.
Ha, yeah. I honestly don't know a good answer to that question. Best thing to do (I think) would be to implement one that seems straightforward and then check it against other implementations (R? Excel?). The docstring should probably note which algorithm we chose and where we got it from.
from agate.
Btw, another hint: I think the right way to structure this is:
Column.percentile(n) returns the nth percentile
Column.percentiles() returns an array of 100 percentiles by calling the former 100 times
That way if you only need one you don't have to compute them all.
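A minimal sketch of that structure, for illustration only. The `Column` class below is a hypothetical stand-in rather than the library's actual class, and the formula (linear interpolation between the two closest ranks) is an assumption, since the thread had not settled on an algorithm at this point.

```python
from decimal import Decimal
from math import ceil, floor


class Column:
    """Hypothetical stand-in for a numeric column (not the library's class)."""

    def __init__(self, values):
        # Convert via str() so float inputs are not expanded to their
        # exact binary representation.
        self._sorted = sorted(Decimal(str(v)) for v in values)

    def percentile(self, n):
        """Return the nth percentile (1 <= n <= 100) by linear interpolation."""
        k = (len(self._sorted) - 1) * Decimal(n) / 100
        lower, upper = floor(k), ceil(k)
        if lower == upper:
            return self._sorted[lower]
        return self._sorted[lower] * (upper - k) + self._sorted[upper] * (k - lower)

    def percentiles(self):
        """Return all 100 percentiles by calling percentile() 100 times."""
        return [self.percentile(n) for n in range(1, 101)]


column = Column([1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9])
print(column.percentile(50))      # a single value, computed on its own
print(len(column.percentiles()))  # 100
```

Keeping `percentiles()` as a thin loop over `percentile(n)` means there is only one place where the algorithm lives.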
from agate.
As to a good answer for the question of which method to use; Yeah, you and me both! FWIW, according to this, Excel and R both use the R-7 method. How those map to the stackoverflow one, I haven't quite figured out ...
Will note the algorithm source in a docstring.
And your structure hint seems a good one.
One thing I'm noticing immediately is the stackoverflow method doesn't return a member of the original set. i.e., if you've got this list of values:
(1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9)
the stackoverflow percentile() method returns:
percentile result
========== ======
0.01 1.22
0.02 1.44
0.03 1.66
0.04 1.88
0.05 2.1
0.06 2.32
0.07 2.54
0.08 2.76
...
rather than:
percentile result
========== ======
0.01 1
0.02 1
0.03 1
0.04 1
0.05 3
0.06 3
0.07 3
0.08 3
...
Don't we want the second result set rather than the first?
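To make the difference concrete, here is a small sketch that produces both kinds of result for the list above. The first function is linear interpolation between the two closest ranks, which reproduces the first table. The thread does not say how the second table was produced; snapping to the nearest index is an assumption that happens to match the rows shown. Both function names are hypothetical.

```python
from fractions import Fraction
from math import ceil, floor

VALUES = (1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9)  # already sorted


def interpolated_percentile(sorted_values, p):
    """Interpolate linearly between the two closest ranks (0 <= p <= 1)."""
    k = (len(sorted_values) - 1) * Fraction(p)
    lower, upper = floor(k), ceil(k)
    if lower == upper:
        return Fraction(sorted_values[lower])
    return sorted_values[lower] * (upper - k) + sorted_values[upper] * (k - lower)


def nearest_member_percentile(sorted_values, p):
    """Assumed alternative: return the list member at the nearest index."""
    k = (len(sorted_values) - 1) * Fraction(p)
    return sorted_values[round(k)]


for i in range(1, 9):
    p = Fraction(i, 100)  # exact fractions avoid float noise in the output
    print(float(p),
          float(interpolated_percentile(VALUES, p)),
          nearest_member_percentile(VALUES, p))
```

The interpolated version can return values that are not members of the original list, which is the behavior being asked about here.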
from agate.
Hrmm. Looking at how OpenOffice does it (I don't have Excel), it returns values like the first list:
With a column A:
Row | A |
---|---|
1 | 1 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 5 |
6 | 5 |
7 | 5 |
8 | 6 |
9 | 7 |
10 | 8 |
11 | 8 |
12 | 9 |
=PERCENTILE(A1:A12; 0.25) returns 4.75 (which is the same value returned by our stackoverflow method, yay!) so I guess this is a long way of saying go with the stackoverflow formula, no?
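For anyone without a spreadsheet handy, the same cross-check can be sketched with Python's standard library, assuming Python 3.8 or later, where `statistics.quantiles` is available. As far as I can tell its `method="inclusive"` option uses the same linear-interpolation scheme, but treat that correspondence as an assumption worth verifying.

```python
from statistics import quantiles

values = [1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9]

# n=4 yields the three quartile cut points; the first one is the
# 25th percentile.
first_quartile = quantiles(values, n=4, method="inclusive")[0]
print(first_quartile)  # 4.75

# n=100 yields 99 cut points; the first one is the 1st percentile.
first_percentile = quantiles(values, n=100, method="inclusive")[0]
print(first_percentile)  # 1.22
```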
from agate.
Yes, def. We want the exact percentile, not whatever value in the list happens to fall closest.
from agate.
OT: Is there in the docs a use case for a NumberColumn over an IntegerColumn?
Or is it just more of a case that a NumberColumn has characteristics of/is a base class for both an IntegerColumn and DecimalColumn?
Just curious ...
from agate.
It's the latter, though, as documented in #64, I'm probably going to be
eliminating both IntegerColumn and DecimalColumn. I think it makes more
sense to just have NumberColumn treat everything as Decimal type. I've been
putting off making the changes because it'll be a hard thing to undo and I
want to be certain I'm not shutting out any useful cases by doing that.
Thoughts?
Chris
On Wed, Apr 30, 2014 at 10:29 AM, John Heasly [email protected]:
OT: Is there in the docs a use case for an IntegerColumn over a
NumberColumn?
Or was it more of a case that a NumberColumn has characteristics of/is a
base class for both an IntegerColumn and DecimalColumn?
Just curious ...—
from agate.
Hrm, well, I can't really speak from a statistics perspective, as I don't really have one(!).
The only thing that comes to mind is sometimes folk from the not-so-computational side on the NICAR list get finicky about the display of things/getting rid of unwanted decimal bits and/or leading/trailing zeroes, which I guess ends up being either an issue of handling in the templating or just having to grok the difference between integer and decimal from the outset.
I guess at the end of the day, it's really more of an issue of what do the Major Spreadsheet™ (Excel) and its Knock-offs (OpenOffice, Google) offer?
from agate.
Now that I'm spelunking around, this is some finely crafted/workmanlike stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!
from agate.
Thanks! I appreciate that! :)
I am sensitive to the "no unnecessary rounding/conversion" issues, however,
I'm less worried about that with a purely analytical library than I would
be if there was a presentation component to this. Eliminating the extra
type has one major benefit: you couldn't specify the wrong type when
computing a new column. So for instance, if you create a column by dividing
two integers, you need to make sure you specify the new column is Decimal.
I can see people getting this wrong and it creating errors.
As best I can discern, Excel, OO, Google, etc all treat numbers internally
as decimals and it's only at presentation time that something is
"int-ified". I know from writing csvkit that the Excel file formats only
represent numbers one way, so I presume that's true at runtime as well.
Chris
On Wed, Apr 30, 2014 at 11:03 AM, John Heasly [email protected]:
Now that I'm spelunking around, this is some finely crafted/workmanlike
stuff!
i.e., Lots of approaches to emulate (i.e., steal) here!—
from agate.
Kind of like storing datetimes as UTC and converting to local at presentation time; a tried-and-true approach.
from agate.
Exactly. I've been mulling this for a week and the only reason I've been
able to come up with not to eliminate IntColumn is that sometimes it might
be more performant. That's not a good enough reason, so I'll probably pull
the trigger on this today.
On Wed, Apr 30, 2014 at 11:13 PM, John Heasly [email protected]:
Kind of like storing datetimes as UTC and converting to local at
presentation time; a tried-and-true approach.—
from agate.
Got something (finally) to test locally. Tonight I'd like to merge your latest from today into my fork and give my stuff a whirl ...
from agate.
Great! I made the changes to kill IntColumn and FloatColumn this morning,
which ended up being messier than I had hoped, but I think it's sorted now.
The cast and validate args when creating a table are gone now too. Data is
always cast the first time a table is created.
C
On Thu, May 1, 2014 at 11:08 AM, John Heasly [email protected]:
Got something (finally) to test locally. Tonight I'd like to merge your
latest from today into my fork and give my stuff a whirl ...—
from agate.
(Just putting this here for reference later.)
Numpy's implementation of percentile:
https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/function_base.py#L2720
(It's predictably unintelligible.)
from agate.
TIL: the generalized form is actually the quantile.
http://en.wikipedia.org/wiki/Quantile
from agate.
The numpy implementation unintelligibility did not disappoint.
The wikipedia Quantile explanation was a little more friendly/grok-able.
from agate.
I've got the argumentless, return-a-list-of-100 percentile values working pretty good.
But in executing the "Column.percentile(n) returns the nth percentile" bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator. I get this:
Traceback (most recent call last):
  File "./test_script.py", line 68, in <module>
    pct = states.columns['total'].percentile(5)
  File "/Users/jheasly/Development/journalism/journalism/columns.py", line 73, in check
    return func(c)
TypeError: percentile() takes exactly 2 arguments (1 given)
(The objectionable, referenced line 73 is here.)
If I comment out the decorator, it works fine. What to do?
Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?
And on a housekeeping note, I won't be able to hit this again until sometime Saturday. It's been a bit more demanding than I'd anticipated when I raised my hand, but I'm learning and it's fun. Thanks fer puttin' up with my nonsense.
from agate.
Awesome! Really excited to have this included.
bit, I can't pass an argument to my percentile() function with the @no_null_computations decorator.
That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.
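The traceback shows the wrapper calling `func(c)` with only the column, so any extra argument is dropped. A minimal sketch of a wrapper that forwards its arguments follows. The decorator name comes from the thread, but its body is not shown there, so the null check, the `NullComputationError` class, the `Column` class, and the `nth_smallest` method are all hypothetical.

```python
from functools import wraps


class NullComputationError(Exception):
    """Hypothetical error for computations on columns containing nulls."""


def no_null_computations(func):
    """Refuse to run func on a column that contains None (assumed behavior)."""
    @wraps(func)
    def check(column, *args, **kwargs):
        if any(value is None for value in column.values):
            raise NullComputationError(func.__name__)
        # Forward every argument, not just the column itself.
        return func(column, *args, **kwargs)
    return check


class Column:
    """Hypothetical stand-in for the library's column class."""

    def __init__(self, values):
        self.values = list(values)

    @no_null_computations
    def nth_smallest(self, n):
        return sorted(self.values)[n - 1]


print(Column([3, 1, 2]).nth_smallest(2))
```

Using `*args, **kwargs` keeps the decorator usable on methods with and without extra parameters.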
Also, currently there's no error-checking or sanitizing of the integer that gets passed in. I was going to make sure it was a.) an integer and b.) between 1 and 100, inclusive. Anything else I should be sniffing for?
You know, I did this same sort of thing for Table.__init__ last night and was then reminded (after some Googling) that this is actually considered bad juju in Python. Due to its "duck-typed" nature, you generally shouldn't test for type. Even though you expect an integer, a float, Decimal, or some other invented type could also be valid.
That being said, it's perfectly valid to test for value, so I'd check that a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.
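A short sketch of that value-based check. The function name and the choice of `ValueError` are assumptions for illustration; the two conditions are the ones described above.

```python
from decimal import Decimal


def validate_percentile_arg(n):
    """Check n by value rather than by type (hypothetical helper)."""
    if n % 1 != 0:
        raise ValueError("percentile must be a whole number, got %r" % (n,))
    if not 1 <= n <= 100:
        raise ValueError("percentile must be between 1 and 100, got %r" % (n,))
    return n


# An int, a float, and a Decimal all pass, as long as the value is valid.
for candidate in (5, 5.0, Decimal("50")):
    validate_percentile_arg(candidate)
```

Because nothing here inspects the type, any number-like object that supports `%` and comparison works.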
No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.
from agate.
On a whim did some more reading about this. Bugger, this stuff is a lot more complex / less standard than I realized. Didn't realize there was so much disagreement about how to calculate percentiles!
from agate.
Awesome! Really excited to have this included.
Cool! Happy to have been of use!
so I'd check that a.) it's a whole number (something like n % 1 == 0) and b.) it's between 1 and 100. I don't think you need to check anything else.
Sounds good. Nifty integer check!
No worries on the "deadline." It makes me happier/saner to have someone else hacking on it.
Awesomeness. Happy to do what I can.
a lot more complex / less standard than I realized.
Yeah, no kidding. Who knew stats could be such an untamed wilderness?
from agate.
Hi John, is this ready for a pull request?
from agate.
Hi Chris!
I was waiting for this:
That's a bug in the decorator brought on by my not having one with an arg to test until now. I'll fix and let you know when it's committed.
But I can omit the decorator, clean-up and run against my local little test script.
As for real testing, is there a recommended bit of test_columns.py to use as a model?
from agate.
Doh! That's my fault. I forgot all about it. I've made myself a high-pri ticket and will get to it soon:
For your testing I'd probably look at something like test_counts. That seems like the most similar method.
from agate.
High-pri ticket(!). Whoa.
I'll muck about with some test-making tonight.
from agate.
Blocker resolved by @mickaobrien!
from agate.
Excellent! Diving into the unittest module ...
from agate.
Coupla questions:
• Right now, percentile() returns a list (running the unittests revealed this to me!), whether it's delivering a requested percentile or the whole shebang. A list is okay, right?
• Speaking of the whole shebang (all 100 percentiles), should I make a test for that case too?
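For what it's worth, tests for both cases could look roughly like the sketch below. It is self-contained, so `percentile` and `percentiles` here are hypothetical module-level stand-ins rather than the real column methods, and the expected values assume linear interpolation between closest ranks.

```python
import unittest
from fractions import Fraction
from math import ceil, floor


def percentile(values, n):
    """Hypothetical stand-in: nth percentile by linear interpolation."""
    ordered = sorted(values)
    k = Fraction((len(ordered) - 1) * n, 100)
    lower, upper = floor(k), ceil(k)
    if lower == upper:
        return Fraction(ordered[lower])
    return ordered[lower] * (upper - k) + ordered[upper] * (k - lower)


def percentiles(values):
    """Hypothetical stand-in: all 100 percentiles as a list."""
    return [percentile(values, n) for n in range(1, 101)]


class TestPercentiles(unittest.TestCase):
    def setUp(self):
        self.values = [1, 3, 4, 5, 5, 5, 5, 6, 7, 8, 8, 9]

    def test_single_percentile(self):
        # One requested percentile comes back as one value.
        self.assertEqual(percentile(self.values, 25), Fraction(19, 4))

    def test_all_percentiles(self):
        result = percentiles(self.values)
        self.assertEqual(len(result), 100)
        self.assertEqual(result[-1], max(self.values))
        # Percentiles should never decrease as n grows.
        self.assertEqual(result, sorted(result))
```

Saved as a module, this can be run with `python -m unittest`.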
from agate.
Hi John! I merged this on a flight today so I didn't have access to your comments. I changed it to return a single value when a single value is requested and also made a few other small changes. And I added a few unit tests. I've opened two new tickets for minor outstanding issues, #129 and #130, but I wanted to go ahead and merge it.
Thanks again for the contribution! I also added you to the AUTHORS file.
from agate.
Hey Chris! Thanks for the improvements in both the code and the tests. They're educational for me to look at/grok/study. I appreciate that.
And thanks for the AUTHORS addition. Next time I visit, I'm going to show my Mom.
from agate.
Absolutely! Please let me know if any of the changes don't make sense. Mostly I just pruned back things we weren't using from the original source (lambda x: x) and tried to clean up the variable names, etc. (Also there was one issue where you were casting float to Decimal, which is bad juju and fails on Python 2.6).
Thanks again!
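A quick illustration of why casting float to Decimal is risky, shown on current Python 3, where the cast is allowed but carries the float's binary rounding error along with it. The variable names are just for this example.

```python
from decimal import Decimal

from_float = Decimal(0.1)     # exact expansion of the binary float
from_string = Decimal("0.1")  # exactly one tenth

print(from_float)
print(from_string)
print(from_float == from_string)  # False
```

Building the Decimal from a string (or from an int) avoids the surprise.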
from agate.