camdavidsonpilon / tdigest

376 stars · 12 watchers · 53 forks · 94 KB

t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark.

License: MIT License

Python 92.53% · Makefile 7.47%
Topics: python, percentile, quantile, estimate, pyspark, distributed-computing, mapreduce

tdigest's Issues

Best way to serialize and load

I have a basic question about serialization and deserialization. How do you suggest that this is done? I ask because my instinct was json.dumps(t.to_dict()) but on the reverse trip, the current implementation of update_from_dict seems to be recalibrating from scratch unless I misunderstand. It is very slow. How should one quickly save and load many tdigests?
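
For reference, the fast path most people want is a plain JSON round trip of the digest's state, rebuilding centroids directly rather than re-inserting points one by one. A minimal stdlib-only sketch; the `state` layout here is hypothetical and not necessarily the exact shape `to_dict()` emits:

```python
import json

# Hypothetical digest state: parameters plus a list of (mean, count)
# centroids. The real to_dict() layout may differ; this only illustrates
# that a state round trip is O(number of centroids), not O(number of points).
state = {
    "delta": 0.01,
    "K": 25,
    "centroids": [{"m": 1.5, "c": 3}, {"m": 4.2, "c": 7}],
}

blob = json.dumps(state)      # serialize once, store anywhere
restored = json.loads(blob)   # cheap to load: no re-clustering needed

assert restored == state
```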

Conda Distribution

Hi,
I believe it would be nice and useful to also provide a conda distribution. (I can help with that if needed.)

Corner case where values are identical?

I'm interested in the case where a variable takes on discrete values. I created a tdigest notebook to illustrate what might be an interesting issue.

Suppose I have sampled many rolls of a die. If I add a tiny amount of noise then tdigest works just fine as a nice representation of the data, with quite an accurate cdf and percentiles.

However, if you run the same notebook with HACK=False, then only six centroids are created. This leads to gross inaccuracy in both the cdf and the percentiles.

I am wondering whether there is a trick that would let tdigest handle cases like this without my hack.
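
The hack described above can be sketched in a few lines. This is a standalone sketch; `eps` and the choice of uniform noise are arbitrary, not anything tdigest prescribes:

```python
import random

def jitter(values, eps=1e-9):
    # Add tiny symmetric noise so repeated discrete values spread across
    # many centroids instead of collapsing into one centroid per value.
    return [v + random.uniform(-eps, eps) for v in values]

rolls = [random.randint(1, 6) for _ in range(1000)]
noisy = jitter(rolls)

# The perturbation is negligible relative to the data...
assert all(abs(a - b) <= 1e-9 for a, b in zip(noisy, rolls))
# ...but (almost surely) makes every value distinct.
assert len(set(noisy)) > len(set(rolls))
```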

serialization suggestion

Do you have a recommended serialization/deserialization strategy for passing these digests?

pickle would work, but before I add an attack vector I was wondering if you had another solution.

CDF benchmark

Hey there. Great work on this! I just wanted to let you know I've included your implementation in a benchmark I've started here. So far it is the most accurate method, but alas not the fastest.

Quantile broken if only one centroid

>>> from tdigest import TDigest as TD
>>> td = TD()
>>> td.update(1)
>>> td.quantile(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tdigest/tdigest.py", line 184, in quantile
    delta = (c_i.mean - self.C.prev_item(key)[1].mean) / 2.
  File "/usr/local/lib/python2.7/dist-packages/bintrees/abctree.py", line 684, in prev_item
    raise KeyError(str(key))
KeyError: '1.0'

A little bit of inspection shows that quantile breaks if there's only one Centroid in self.C.
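
A guard for the single-centroid case avoids looking up a predecessor that doesn't exist. This is a standalone sketch over plain `(mean, count)` pairs, not a patch to this repo's `quantile`:

```python
def quantile_of_centroids(centroids, q):
    # centroids: sorted list of (mean, count) pairs; q in [0, 1].
    if not centroids:
        raise ValueError("empty digest")
    if len(centroids) == 1:
        # Single-centroid guard: there is no neighbour to interpolate
        # against, so the only defensible answer is the one mean we have.
        return centroids[0][0]
    total = sum(c for _, c in centroids)
    target = q * total
    cum = 0.0
    for i, (m, c) in enumerate(centroids):
        if cum + c >= target:
            if i == 0:
                return m
            pm, _ = centroids[i - 1]
            # Linear interpolation between neighbouring centroid means.
            frac = (target - cum) / c if c else 0.0
            return pm + (m - pm) * min(max(frac, 0.0), 1.0)
        cum += c
    return centroids[-1][0]

assert quantile_of_centroids([(1.0, 1)], 1.0) == 1.0  # no KeyError
```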

wheel incorrectly contains `.pyc` files

$ ~/opt/venv/bin/pip download --no-deps tdigest
Collecting tdigest
  Downloading https://files.pythonhosted.org/packages/27/41/b714941a6dba3760ddf2c2604daabbb578bcd6063f57ecdbe2c1d8ce4a79/tdigest-0.5.2.1-py2.py3-none-any.whl
  Saved ./tdigest-0.5.2.1-py2.py3-none-any.whl
Successfully downloaded tdigest
You are using pip version 19.0.3, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
$ unzip -l tdigest-0.5.2.1-py2.py3-none-any.whl 
Archive:  tdigest-0.5.2.1-py2.py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      155  2016-08-27 01:30   MANIFEST
     4034  2018-03-12 13:33   README.md
     1089  2015-03-17 05:04   LICENSE.txt
    13056  2018-05-05 13:07   tdigest/tdigest.pyc
      861  2017-02-20 02:41   tdigest/test_convergence_of_ks_statistic_over_adding.py
       53  2018-05-05 13:24   tdigest/__init__.py
    10342  2018-05-05 13:24   tdigest/tdigest.py
      248  2018-05-05 13:07   tdigest/__init__.pyc
     1827  2018-05-05 13:07   tdigest/__pycache__/test_convergence_of_ks_statistic_over_adding.cpython-27-PYTEST.pyc
     4035  2018-05-05 13:25   tdigest-0.5.2.1.dist-info/DESCRIPTION.rst
      995  2018-05-05 13:25   tdigest-0.5.2.1.dist-info/metadata.json
        8  2018-05-05 13:25   tdigest-0.5.2.1.dist-info/top_level.txt
      110  2018-05-05 13:25   tdigest-0.5.2.1.dist-info/WHEEL
     4907  2018-05-05 13:25   tdigest-0.5.2.1.dist-info/METADATA
     1269  2018-05-05 13:25   tdigest-0.5.2.1.dist-info/RECORD
---------                     -------
    42989                     15 files

Wheels should contain only source files, yet this one contains Python 2.x .pyc files.

(This is easy to fix: a simple re-release with modern versions of wheel / setuptools / pip will not produce wheels in this manner.)
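
Whether a given wheel is affected is easy to check, since a wheel is just a zip archive. A small stdlib sketch that flags bytecode entries, demonstrated on an in-memory zip rather than the real wheel file:

```python
import io
import zipfile

def pyc_entries(wheel_bytes):
    # Scan a wheel (a zip archive) and report any compiled-bytecode
    # entries, which a pure-source wheel should not contain.
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        return [n for n in zf.namelist()
                if n.endswith(".pyc") or "__pycache__" in n]

# Build a tiny in-memory "wheel" to demonstrate the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("tdigest/tdigest.py", "# source")
    zf.writestr("tdigest/tdigest.pyc", "bytecode")

assert pyc_entries(buf.getvalue()) == ["tdigest/tdigest.pyc"]
```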

serialization method?

Dear Cam,
Have you figured out how to serialise the python version of t-digests?

Thanks,
Alex

python2.7 TypeError: unbound method _cython_3_0_0a9.cython_function_or_method object must be called with RBTree instance as first argument (got AccumulationTree instance instead)

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tdigest/tdigest.py", line 112, in update
    self._add_centroid(Centroid(x, w))
  File "/usr/local/lib/python2.7/dist-packages/tdigest/tdigest.py", line 67, in _add_centroid
    self.C.insert(centroid.mean, centroid)
  File "accumulation_tree/accumulation_tree.pyx", line 233, in accumulation_tree.accumulation_tree._AccumulationTree.insert
TypeError: unbound method _cython_3_0_0a9.cython_function_or_method object must be called with RBTree instance as first argument (got AccumulationTree instance instead)

my pip env
accumulation-tree==0.6.2 amqp==1.4.6 ansible==1.8 anyjson==0.3.3 APNSWrapper==0.6.1 astroid==1.4.8 atfork==0.1.2 Babel==1.3 backports.functools-lru-cache==1.2.1 backports.ssl-match-hostname==3.4.0.2 beautifulsoup4==4.1.3 billiard==3.3.0.20 biplist==0.6 boilerpipe==1.2.0 boilerpipy==0.2.1b0 boto==2.42.0 bz2file==0.98 cached-property==1.2.0 cachetools==2.0.1 cassandra-driver==3.0.0a1 celery==3.1.18 certifi==2015.4.28 cffi==1.1.0 chardet==2.3.0 Cheetah==2.4.4 chromium-compact-language-detector==0.31415 chronos-python==0.34.0 click==5.1 colorama==0.3.2 configparser==3.5.0 coverage==4.3.4 cssselect==0.9.1 cssutils==1.0 Cython==0.17.4 DateUtils==0.5.2 decorator==3.4.0 Django==1.3.1 django-ajax-selects==1.3.5 django-cache-machine==0.6 django-debug-toolbar-django13==0.8.4 django-mysqlpool==0.1.post8 dnspython==1.12.0 docker-py==1.7.2 docutils==0.8.1 dpkt==1.6 elasticsearch==1.3.0 elasticsearch-dsl==0.0.4.dev0 feedparser==5.1.3 fixture==1.5 Flask==0.10.1 Flask-Admin==1.1.0 flower==0.8.2 flup==1.0.2 functools32==3.2.3.post2 furl==0.3.6 futures==2.1.6 gensim==0.13.1 geojson==1.0.9 glob2==0.5 google-auth==1.3.0 gunicorn==19.3.0 h2==2.4.1 hpack==2.3.0 html2text==2015.6.21 html5lib==1.0b1 httpagentparser==1.1.2 httplib2==0.9.2 hyper==0.7.0 hyperframe==3.2.0 ImageHash==0.3 impyla==0.9.1 iotop==0.6 isort==4.2.5 itsdangerous==0.24 jenkinsapi==0.3.3 Jinja2==2.7.3 jsl==0.2.4 jsonpath-rw==1.4.0 jsonschema==2.5.1 kazoo==2.0 Keras==1.0.6 kombu==3.0.26 lazy-object-proxy==1.2.2 librabbitmq==1.0.0 lipton==0.2.0 lockfile==0.8 luigi==2.3.0 lxml==2.3.4 Mako==1.0.0 Markdown==2.1.1 MarkupSafe==0.23 matplotlib==1.2.0 mccabe==0.5.2 mmh3==2.3 mock==1.0.1 mockredispy==2.9.0.9 mrjob==0.4.1 msgpack-python==0.2.4 mysql-replication==0.1.0 nose==1.2.1 numexpr==2.4rc2 numpy==1.11.1 objgraph==1.7.2 opentracing==1.3.0 orderedmultidict==0.7.1 pandas==0.14.1 peewee==2.6.4 PGen==0.2.1 phonenumbers==7.7.5 pika==0.9.8 Pillow==2.5.1 ply==3.9 premailer==2.5.1 protobuf==2.6.1 publicsuffix==1.0.5 pudb==2013.5.1 
py-pypcap==1.1.2 pyasn1==0.4.2 pyasn1-modules==0.2.1 pycparser==2.13 pycurl==7.19.3.1 pyinotify==0.9.4 pylibmc==1.2.3 pylint==1.6.4 pylint-django==0.7.2 pylint-flask==0.5 pylint-plugin-utils==0.2.6 pylint-redis==0.1 pymongo==2.2 PyMySQL==0.5 PyNLPIR==0.4.6 pyparsing==1.5.7 PyStemmer==1.3.0 python-cjson==1.0.5 python-consul==0.4.0 python-daemon==1.5.5 python-dateutil==2.4.0 python-memcached==1.48 pytz==2014.4 pyudorandom==1.0.0 PyYAML==3.11 raven==5.2.0 recordtype==1.1 redis==2.10.1 redis-shard==0.1.6 requests==2.10.0 rsa==3.4.2 schedule==0.1.11 schematics==1.0.post0 scikit-learn==0.14.1 scipy==0.18.0 simplejson==2.3.0 six==1.10.0 slackclient==0.15 smart-open==1.3.3 SQLAlchemy==0.7.6 sqlparse==0.1.5 tables==3.1.1 tailer==0.3 tdigest==0.5.2.2 Theano==0.8.2 thrift==0.8.0 tldextract==1.7.1 tornado==4.1 twilio==6.3.dev0 ua-parser==0.3.6 ujson==1.33 urllib3==1.10 urlparse2==1.1.1 urwid==1.1.2 user-agents==0.3.2 virtualenv==13.0.3 voluptuous==0.8.4 web.py==0.34 websocket-client==0.32.0 Werkzeug==0.10.4 wrapt==1.10.8 WTForms==2.0.2 XlsxWriter==0.7.2 yappi==0.94 zkpython==0.4.2 ZooKeeper==0.4

Alternative faster t-digest implementation

Hi, thanks for the t-digest implementation for python!
I used this for my work and found that, in the end, computing and merging t-digests became the bottleneck. So I read the original paper and implemented another version of it (using the algorithm in the paper). I found its performance is better (around 50-100 times faster). I think the improvement is that we can keep a buffer and merge hundreds of values into the t-digest at once.
I wonder if I could open a PR against this repo to add an alternative implementation, so I can use it in my day-to-day work. Thanks.
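
The buffering idea can be illustrated without any of the paper's compression logic: accumulate raw values and fold them in sorted batches, so the per-value cost is amortised over one sort and one merge pass. This sketch keeps one centroid per value and does no merging, so it only shows the batching structure, not a real t-digest:

```python
class BufferedDigest:
    # Stand-in class for illustration; not this repo's TDigest.
    def __init__(self, buffer_size=500):
        self.buffer = []
        self.buffer_size = buffer_size
        self.centroids = []  # sorted (mean, count) pairs

    def update(self, x):
        self.buffer.append(x)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # One sort + one merge pass per batch, instead of one
        # tree insertion per incoming value.
        for v in sorted(self.buffer):
            self.centroids.append((v, 1))
        self.centroids.sort()
        self.buffer.clear()

d = BufferedDigest(buffer_size=4)
for v in [3, 1, 2, 5, 4]:
    d.update(v)
d.flush()
assert [m for m, _ in d.centroids] == [1, 2, 3, 4, 5]
```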

Not able to install python package

Environment
Operating System: Mac OSX
Python Version: Python 3.5.4
How did you install tdigest: pip

Error is:
DELC02RC08VG8WN:tdigest-0.5.2.2 priyagupta$ pip3 install tdigest
Requirement already satisfied: tdigest in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tdigest-0.5.2.2-py3.5.egg (0.5.2.2)
Collecting accumulation_tree (from tdigest)
Using cached https://files.pythonhosted.org/packages/e9/18/73c11ed9d379b5efea5cabcce4b53762ee4b0c3aea42bd944e992f8ee307/accumulation_tree-0.6.tar.gz
ERROR: Complete output from command python setup.py egg_info:
ERROR: Download error on https://pypi.org/simple/cython/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:719) -- Some packages may not be found!
Couldn't find index page for 'cython' (maybe misspelled?)

Download error on https://pypi.org/simple/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:719) -- Some packages may not be found!
No local packages or working download links found for cython
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/bv/n02l1vdn6mn_rvn5hlq454mjgfm0_c/T/pip-install-jmrecmdu/accumulation-tree/setup.py", line 28, in <module>
    Extension('accumulation_tree.accumulation_tree', ['accumulation_tree/accumulation_tree.pyx'])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/__init__.py", line 144, in setup
    _install_setup_requires(attrs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/__init__.py", line 139, in _install_setup_requires
    dist.fetch_build_eggs(dist.setup_requires)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/dist.py", line 717, in fetch_build_eggs
    replace_conflicting=True,
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 782, in resolve
    replace_conflicting=replace_conflicting
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1065, in best_match
    return self.obtain(req, installer)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1077, in obtain
    return installer(requirement)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/dist.py", line 784, in fetch_build_egg
    return cmd.easy_install(req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/setuptools/command/easy_install.py", line 673, in easy_install
    raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('cython')
----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/bv/n02l1vdn6mn_rvn5hlq454mjgfm0_c/T/pip-install-jmrecmdu/accumulation-tree/

Scale function

Hi, more of a question than an issue but I'm curious what scale function has been used in your implementation. On lines 101/102 you have the threshold function that defines the maximum centroid weight. Comparing this to the paper:

https://arxiv.org/pdf/1903.09921.pdf

your expression is close to that given in section 5.1 (using the k2 scale function), but if the definition of Z(n) is what the paper says, then they're not the same. Can you provide some details on this?

Also thanks so much for the python implementation, great piece of work!
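
For comparison, the k2 scale function and its normalisation Z(n) written out as stated in the paper (constants copied from arXiv:1903.09921; this repo's threshold on lines 101/102 may use different constants, which is precisely the question above):

```python
import math

def Z(n, delta):
    # Normalisation term as defined in the paper.
    return 4 * math.log(n / delta) + 24

def k2(q, n, delta):
    # k2 scale function: steepest near the tails, flat at the median.
    return (delta / Z(n, delta)) * math.log(q / (1 - q))

assert k2(0.5, 1000, 100) == 0.0  # log(1) = 0 at the median
assert k2(0.9, 1000, 100) > 0 > k2(0.1, 1000, 100)  # odd about q = 0.5
```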

Sometimes a test fails (test_uniform)

It seems like about half the time the one test, test_uniform, fails, and other times it's fine.
This issue is just to track that problem with test_uniform. We should figure out what's wrong and fix it.

Negative trimmed mean

The following program produces a negative trimmed mean:

import random
import tdigest

td = tdigest.TDigest()
for i in range(100):
    td.update(random.random())

for i in range(10):
    td.update(i*100)

mean = td.trimmed_mean(10, 99)
print(mean, td)

Output

-488.7492907267765 <T-Digest: n=110, centroids=110>
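
For the same kind of input, an exact trimmed mean computed directly on the sorted data is a useful reference: it can never be negative when every input is non-negative, which makes the -488 above clearly wrong. The percentile-index rounding below is a simplification, not tdigest's definition:

```python
import random

def exact_trimmed_mean(values, lower_pct, upper_pct):
    # Reference computation: sort, drop everything outside the two
    # percentiles (simple index truncation), and average what remains.
    xs = sorted(values)
    n = len(xs)
    lo = int(n * lower_pct / 100)
    hi = int(n * upper_pct / 100)
    window = xs[lo:hi]
    return sum(window) / len(window)

random.seed(0)
data = [random.random() for _ in range(100)] + [i * 100 for i in range(10)]
tm = exact_trimmed_mean(data, 10, 99)

# With only non-negative inputs, a trimmed mean can never be negative.
assert tm >= 0
```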

Percentiles unstable and not increasing

I found this issue with the median but then noticed that other percentiles have the same issue. In this case there should not be any negative percentiles, as all seen values are positive; furthermore, a large negative median when the 40th and 60th percentiles are positive is nonsensical. I wonder if this is a bug or a known limitation?

Please see below for an example:

from tdigest.tdigest import TDigest
import numpy as np

vals = [  8.11780000e+04,   2.14100000e+03,   8.29710000e+04,
         7.81110000e+04,   2.30000000e+02,   5.27661000e+05,
         2.63252000e+05,   9.16950000e+04,   1.08515000e+05,
         6.26000000e+02,   7.90000000e+02,   1.24600000e+03,
         4.31357000e+05,   4.64951000e+05,   1.30155000e+05,
         3.21239000e+05,   7.13940000e+04,   8.27000000e+02,
         1.18700000e+03,   8.00000000e+02,   5.29984000e+05,
         4.57174000e+05,   8.13000000e+02,   3.67000000e+02,
         5.25310000e+04,   5.62000000e+02,   4.50359000e+05,
         1.94000000e+03,   1.36000000e+02,   5.36088000e+05,
         4.45300000e+03,   8.06000000e+02,   4.64000000e+02,
         1.44000000e+02,   6.54000000e+02,   1.63800000e+03]

td = TDigest()
print("{: >10} {: >10} {: >10} {: >10}".format("value", "median", "td_median", "error"))
for i, val in enumerate(vals): 
    td.update(val)
    actual_median = np.median(vals[:i+1])
    td_median = td.percentile(50)
    print("%10.0f %10.0f %10.2f %10.2f%%" % (val, actual_median, td_median, abs(td_median - actual_median)/actual_median * 100), end="")
    print(("{: >10.0f} " * 9).format(*[td.percentile(x) for x in np.linspace(10, 90, 9)]))
    print("")

Results in:

     value     median  td_median      error
     81178      81178   81178.00       0.00%     81178      81178      81178      81178      81178      81178      81178      81178      81178 

      2141      41660   81178.00      94.86%      2141       2141       2141       2141      81178      81178      81178      81178      81178 

     82971      81178   81178.00       0.00%      2141       2141       2141      69054      81178      93302      82971      82971      82971 

     78111      79644   79963.00       0.40%      2141       2141      66255      82063      79963      80935      81907      82971      82971 

       230      78111   78111.00       0.00%       230     -17329       2141      58352      78111      79963      81178      82971      82971 

    527661      79644   79963.00       0.40%       230      -9541      13823      74159      79963      81421      15999     149943     527661 

    263252      81178   81178.00       0.00%       230      -1753      62304      89967      81178      55660     119386     285487     527661 

     91695      82074   80341.75       2.11%       230       6035      74159      80449      80342      84549     100709     241454     527661 

    108515      82971   82971.00       0.00%       230      13823      86015      81421      82971      90418      91359     200380     527661 

       626      82074   80341.75       2.11%       148     -17230      58352      79963      80342      85309      65626     158466     527661 

       790      81178   81178.00       0.00%       514        563      -5591      74159      81178      83497      94249     134249     347081 

      1246      79644   79963.00       0.40%       542        759       1314      13671      79963      81393      90418     117093     326124 

    431357      81178   81178.00       0.00%       570        821       1516      66255      81178      84549      74204     247110     457798 

    464951      82074   80341.75       2.11%       598        883      -9389      82063      80342      90418     134249     401102     469766 

    130155      82971   82971.00       0.00%       626        908       2141      79963      82971      98900     130155     380932     464951 

    321239      87333   85309.00       2.32%       654       1043      13671      80935      85309     110438     234589     346455     460136 

     71394      82971   82971.00       0.00%       682       1178      56200      79579      82971     102746     161102     329644     455321 

       827      82074   80341.75       2.11%       710        850      -1366      76643      80342      95527     137892     312834     450505 

      1187      81178   81178.00       0.00%       738        887       1341      75193      81178      90418     114681     296023     445690 

       800      79644   79963.00       0.40%       746        730       1008      52402      79963      85309      91471     279213     440875 

    529984      81178   81178.00       0.00%       755        769       1151      67596      81178      92972     145629     346455     484212 

    457174      82074   80341.75       2.11%       764        808       1294      82790      80342     102746     253698     438154     475524 

       813      81178   81178.00       0.00%       773        814       1271      59999      81178      95527     225035     424560     472000 

       367      79644   79963.00       0.40%       605        803       1124       5648      79963      90418     153366     410967     468475 

     52531      78111   78111.00       0.00%       626        806       1187      35218      78111      85309     130155     397373     464951 

       562      74752   75665.00       1.22%       575        797        883       -423      75665      83497     106944     346455     461427 

    450359      78111   78111.00       0.00%       588        799       1103       9834      78111      87863     161102     437813     457902 

      1940      74752   75665.00       1.22%       601        801       1166      -5448      75665      84549     137892     424901     454378 

       136      71394   71394.00       0.00%       433        816        864       1985      71394      82445     114681     411989     450854 

    536088      74752   75665.00       1.22%       497        794       1082     -10507      75665      85309     215481     443905     511403 

      4453      71394   71394.00       0.00%       510        797       1145       2015      71394      83497     145629     450725     479048 

       806      61962   64999.00       4.90%       523        799        846       2074      64999      81393     122418     437813     475524 

       464      52531   52531.00       0.00%       444        799        817       1806      52531      81907      99208     424901     472000 

       144      28492   35795.75      25.63%       355        660        810       1284      35796      80935     114284     411989     468475 

       654       4453    4453.00       0.00%       367        613        806       1058       4453      79963     108515     399077     464951 

      1638       3297   -8144.50     347.03%       379        629        808       1223      -8144      78600     102746     346455     461427 
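
A percentile curve should be non-decreasing in p, so one cheap guard (a post-processing sketch, not part of this library) is to clamp each estimate by the largest one seen so far. That at least removes inversions like the -17329 in the row for value 230 above, though it does nothing for the underlying accuracy:

```python
def monotone(estimates):
    # Enforce a non-decreasing percentile curve by running maximum.
    out, running = [], float("-inf")
    for e in estimates:
        running = max(running, e)
        out.append(running)
    return out

raw = [230, -17329, 2141, 58352, 78111]  # the inverted row from above
assert monotone(raw) == [230, 230, 2141, 58352, 78111]
```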

Horribly slow on PyPy

I ran the tests below and found out that on PyPy tdigest is horribly slow.

# -*- coding: utf-8 -*-
from __future__ import print_function

import sys
from tdigest import TDigest
from numpy.random import randint, random
from time import time

def make_tdigest(items):
    result = TDigest()
    for _ in range(items):
        result.update(random())
    return result


def make_tdigest2(items):
    result = TDigest()
    result.batch_update(random(items))
    return result


def tdigests(count, factory):
    i = 0
    for _ in range(count):
        i+=1
        if i%100==0:
            print('generated items:', i)
        yield dict(timestamp=randint(1,15), tdigest=factory(500))


if __name__=='__main__':
    print('running test in', sys.version)

    print('generating tdigests in batch')
    start = time()
    result = [t for t in tdigests(100, make_tdigest2)]
    end = time() - start
    print('generating tdigests took:', end)
    print('----------')

    print('generating tdigests one by one')
    start = time()
    tdigests = [t for t in tdigests(100, make_tdigest)]
    end = time() - start
    print('generating tdigests took:', end)
    print('----------')
==========
PyPy
running test in 2.7.13 (0e7ea4fe15e82d5124e805e2e4a37cae1a402d4b, Jan 06 2018, 12:46:49)
[PyPy 5.10.0 with GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]
generating tdigests in batch
generated items: 100
generating tdigests took: 32.5672068596
----------
generating tdigests one by one
generated items: 100
generating tdigests took: 17.4209430218
----------

==================
Python
running test in 2.7.14 (default, Mar  9 2018, 23:57:12) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]
generating tdigests in batch
generated items: 100
generating tdigests took: 4.16117596626
----------
generating tdigests one by one
generated items: 100
generating tdigests took: 2.38711595535
----------

I've repeated the test in the official PyPy docker container with PyPy 6.0.0 (compatible with python 3) with the same outcome: https://hub.docker.com/_/pypy/

Any ideas?

The trimmed mean estimate is bad

The implementation of the trimmed mean estimate (trimmed_mean method) doesn't look right. The estimate seems way off from the real value. Here is an example:

import numpy as np
from tdigest import TDigest

Create 10,000 samples from a random uniform distribution.

x = np.random.random(size=10000)*100

Create a T-Digest for this

d = TDigest()
d.batch_update(x)

Estimate the trimmed mean of x above the 25th percentile.

tm_estimate = d.trimmed_mean(25,100)
print(tm_estimate)
75.0410094085

Now, find the real 25th percentile and compute the real trimmed mean.

x_25 = np.percentile(x,25)

x_trimmed = x[x>=x_25]
tm_real = x_trimmed.mean()
print(tm_real)
62.3013933259

Usage of git tags

It would be really nice if tags were used in this project, so that one could easily see which commit the current version 0.4.1.0 on PyPI refers to, and therefore whether a certain bugfix is included or not.

Works poorly for integers

from tdigest import TDigest
t = TDigest()
t.batch_update(range(10000))
print t.percentile(.50)
# returns something pretty far away from 5000
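
For comparison, the exact percentile computed without a digest. Note also that this library's percentile() appears to take a 0-100 scale (the README uses percentile(15) for the 15th percentile), so percentile(.50) would ask for the 0.5th percentile rather than the median; the sketch below makes that distinction explicit:

```python
def exact_percentile(xs, p):
    # Exact empirical percentile with linear interpolation; p is 0-100.
    xs = sorted(xs)
    idx = (len(xs) - 1) * p / 100.0
    lo = int(idx)
    frac = idx - lo
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * frac

data = list(range(10000))
assert exact_percentile(data, 50) == 4999.5   # the median
assert exact_percentile(data, 0.5) < 100      # the 0.5th percentile
```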

Negative quantile approximation with high skew/less data

digest = TDigest()
digest.batch_update([62.0, 202.0, 1415.0, 1433.0])
digest.percentile(0.25)

Returns -136.25. This is because in https://github.com/CamDavidsonPilon/tdigest/blob/master/tdigest/tdigest.py#L166-L167, delta is computed as the mean of the means of the neighbouring centroids and is used as the slope to linearly approximate the quantile between the two centroids. In the following line, m_i + ((p - t) / k - 1/2) * delta is negative because delta is very large and p - t = 0, so the expression evaluates to m_i + (-1/2) * delta, which is negative.
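
One way to bound the damage (a sketch, not what the library currently does) is to clamp the interpolated estimate to the interval between the two neighbouring centroid means, so no slope can push it outside the observed data:

```python
def interpolate(m_prev, m_next, frac):
    # Linear interpolation between neighbouring centroid means, with the
    # result clamped to [m_prev, m_next] so a large slope can never
    # produce an estimate outside the data, as in the -136.25 case above.
    est = m_prev + (m_next - m_prev) * frac
    return min(max(est, m_prev), m_next)

assert interpolate(62.0, 202.0, -0.5) == 62.0  # clamped, not negative
assert interpolate(62.0, 202.0, 0.5) == 132.0  # normal case unchanged
```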

well done

Just saying hi. This is very nice work indeed. The application of your work at http://dev.microprediction.org/crawling.html may be more than obvious as this is essentially an online CDF estimation contest (or collection of the same). I've put tdigest top of my list at https://github.com/microprediction/microprediction/projects/4 to ensure it is included. This will, by the way, generate plenty of comparative data that might help your research publications. Happy to explain further. See also http://dev.microprediction.org/july.html

difference between java implementation and python

I have
my_set.zip.

When I'm using the java code:

import com.tdunning.math.stats.TDigest;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import java.io.*;
import java.util.stream.StreamSupport;

public class TDigestTry {
    public static void main(String[] args) throws IOException {
        ClassLoader classLoader = TDigestTry.class.getClassLoader();
        File file = new File(classLoader.getResource("my_set.csv").getFile());
        Reader in = new FileReader(file);
        Iterable<CSVRecord> records = CSVFormat.EXCEL.withHeader().parse(in);
        TDigest digest = TDigest.createAvlTreeDigest(20);
        StreamSupport.stream(records.spliterator(), false)
                .map(record -> new Double(record.get("change rate")))
                .forEach(digest::add);
        System.out.println(digest.quantile(0.05));
        System.out.println(digest.quantile(0.95));
    }
}

I'm getting the results:

3.0
5.0

But when I run this code:

from pathlib import Path

import pandas
from tdigest import TDigest

if __name__ == '__main__':
    frame = pandas.read_csv(Path(__file__).parents[0].joinpath("resources").joinpath("my_set.csv"))
    digest = TDigest()
    digest.batch_update(frame["change rate"].values)
    print(f"Quantile 0.05 = {digest.percentile(5)};\t\tQuantile 0.95 = {digest.percentile(95)}")

I'm getting the results:

Quantile 0.05 = 2.6495903059149586;		Quantile 0.95 = 3689686.790917569

How come there's such a large difference between the 0.95 quantiles?

P.S
I get the same results when I use:

for value in frame["change rate"].values:
        digest.update(value)
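
When two implementations disagree this much, an exact empirical quantile from the stdlib is a quick arbiter. `statistics.quantiles` with n=20 yields 19 cut points whose first and last are the 5th and 95th percentiles (stand-in data below, since my_set.csv isn't available here):

```python
import statistics

# Stand-in for the "change rate" column; the real data lives in my_set.csv.
data = [3.0, 3.5, 4.0, 4.2, 4.8, 5.0, 2.9, 3.1]

cuts = statistics.quantiles(data, n=20, method="inclusive")
q05, q95 = cuts[0], cuts[-1]

# With method="inclusive" the cut points stay inside the observed range,
# so a 0.95 quantile of 3,689,686 for data like this would be impossible.
assert min(data) <= q05 <= q95 <= max(data)
```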
