
Classification schemes for choropleth mapping.

Home Page: https://pysal.org/mapclassify

License: BSD 3-Clause "New" or "Revised" License


mapclassify's Introduction

mapclassify: Classification Schemes for Choropleth Maps

[badges: Continuous Integration, codecov, PyPI version, DOI, License, Code style: black, Binder]

mapclassify implements a family of classification schemes for choropleth maps. Its focus is on the determination of the number of classes, and the assignment of observations to those classes. It is intended for use with upstream mapping and geovisualization packages (see geopandas and geoplot) that handle the rendering of the maps.

For further theoretical background see Rey, S.J., D. Arribas-Bel, and L.J. Wolf (2020) "Geographic Data Science with PySAL and the PyData Stack".

Using mapclassify

Load built-in example data reporting employment density in 58 California counties:

>>> import mapclassify
>>> y = mapclassify.load_example()
>>> y.mean()
125.92810344827588
>>> y.min(), y.max()
(0.13, 4111.4499999999998)

Map Classifiers Supported

BoxPlot

>>> mapclassify.BoxPlot(y)
BoxPlot

     Interval        Count
--------------------------
(   -inf,  -52.88] |     0
( -52.88,    2.57] |    15
(   2.57,    9.36] |    14
(   9.36,   39.53] |    14
(  39.53,   94.97] |     6
(  94.97, 4111.45] |     9

EqualInterval

>>> mapclassify.EqualInterval(y)
EqualInterval

     Interval        Count
--------------------------
[   0.13,  822.39] |    57
( 822.39, 1644.66] |     0
(1644.66, 2466.92] |     0
(2466.92, 3289.19] |     0
(3289.19, 4111.45] |     1

FisherJenks

>>> import numpy as np
>>> np.random.seed(123456)
>>> mapclassify.FisherJenks(y, k=5)
FisherJenks

     Interval        Count
--------------------------
[   0.13,   75.29] |    49
(  75.29,  192.05] |     3
( 192.05,  370.50] |     4
( 370.50,  722.85] |     1
( 722.85, 4111.45] |     1

FisherJenksSampled

FisherJenksSampled fits the classifier on a random sample of the data, trading exactness for speed on large arrays. Compare the full fit and the sampled fit on 10,000 exponential draws:

>>> np.random.seed(123456)
>>> x = np.random.exponential(size=(10000,))
>>> mapclassify.FisherJenks(x, k=5)
FisherJenks

   Interval      Count
----------------------
[ 0.00,  0.64] |  4694
( 0.64,  1.45] |  2922
( 1.45,  2.53] |  1584
( 2.53,  4.14] |   636
( 4.14, 10.61] |   164

>>> mapclassify.FisherJenksSampled(x, k=5)
FisherJenksSampled

   Interval      Count
----------------------
[ 0.00,  0.70] |  5020
( 0.70,  1.63] |  2952
( 1.63,  2.88] |  1454
( 2.88,  5.32] |   522
( 5.32, 10.61] |    52

HeadTailBreaks

>>> mapclassify.HeadTailBreaks(y)
HeadTailBreaks

     Interval        Count
--------------------------
[   0.13,  125.93] |    50
( 125.93,  811.26] |     7
( 811.26, 4111.45] |     1

JenksCaspall

>>> mapclassify.JenksCaspall(y, k=5)
JenksCaspall

     Interval        Count
--------------------------
[   0.13,    1.81] |    14
(   1.81,    7.60] |    13
(   7.60,   29.82] |    14
(  29.82,  181.27] |    10
( 181.27, 4111.45] |     7

JenksCaspallForced

>>> mapclassify.JenksCaspallForced(y, k=5)
JenksCaspallForced

     Interval        Count
--------------------------
[   0.13,    1.34] |    12
(   1.34,    5.90] |    12
(   5.90,   16.70] |    13
(  16.70,   50.65] |     9
(  50.65, 4111.45] |    12

JenksCaspallSampled

>>> mapclassify.JenksCaspallSampled(y, k=5)
JenksCaspallSampled

     Interval        Count
--------------------------
[   0.13,   12.02] |    33
(  12.02,   29.82] |     8
(  29.82,   75.29] |     8
(  75.29,  192.05] |     3
( 192.05, 4111.45] |     6

MaxP

>>> mapclassify.MaxP(y)
MaxP

     Interval        Count
--------------------------
[   0.13,    8.70] |    29
(   8.70,   16.70] |     8
(  16.70,   20.47] |     1
(  20.47,   66.26] |    10
(  66.26, 4111.45] |    10

MaximumBreaks

>>> mapclassify.MaximumBreaks(y, k=5)
MaximumBreaks

     Interval        Count
--------------------------
[   0.13,  146.00] |    50
( 146.00,  228.49] |     2
( 228.49,  546.67] |     4
( 546.67, 2417.15] |     1
(2417.15, 4111.45] |     1

NaturalBreaks

>>> mapclassify.NaturalBreaks(y, k=5)
NaturalBreaks

     Interval        Count
--------------------------
[   0.13,   75.29] |    49
(  75.29,  192.05] |     3
( 192.05,  370.50] |     4
( 370.50,  722.85] |     1
( 722.85, 4111.45] |     1

Quantiles

>>> mapclassify.Quantiles(y, k=5)
Quantiles

     Interval        Count
--------------------------
[   0.13,    1.46] |    12
(   1.46,    5.80] |    11
(   5.80,   13.28] |    12
(  13.28,   54.62] |    11
(  54.62, 4111.45] |    12

Percentiles

>>> mapclassify.Percentiles(y, pct=[33, 66, 100])
Percentiles

     Interval        Count
--------------------------
[   0.13,    3.36] |    19
(   3.36,   22.86] |    19
(  22.86, 4111.45] |    20

PrettyBreaks

>>> np.random.seed(123456)
>>> x = np.random.randint(0, 10000, (100,1))
>>> mapclassify.PrettyBreaks(x)
Pretty

      Interval         Count
----------------------------
[  300.00,  2000.00] |    23
( 2000.00,  4000.00] |    15
( 4000.00,  6000.00] |    18
( 6000.00,  8000.00] |    24
( 8000.00, 10000.00] |    20

StdMean

>>> mapclassify.StdMean(y)
StdMean

     Interval        Count
--------------------------
(   -inf, -967.36] |     0
(-967.36, -420.72] |     0
(-420.72,  672.57] |    56
( 672.57, 1219.22] |     1
(1219.22, 4111.45] |     1

UserDefined

>>> mapclassify.UserDefined(y, bins=[22, 674, 4112])
UserDefined

     Interval        Count
--------------------------
[   0.13,   22.00] |    38
(  22.00,  674.00] |    18
( 674.00, 4112.00] |     2

Alternative API

As of version 2.4.0, the API has been extended with a classify function that provides a streamlined interface:

>>> from mapclassify import classify
>>> classify(y, 'boxplot')
BoxPlot

     Interval        Count
--------------------------
(   -inf,  -52.88] |     0
( -52.88,    2.57] |    15
(   2.57,    9.36] |    14
(   9.36,   39.53] |    14
(  39.53,   94.97] |     6
(  94.97, 4111.45] |     9

Use Cases

Creating and using a classification instance

>>> bp = mapclassify.BoxPlot(y)
>>> bp
BoxPlot

     Interval        Count
--------------------------
(   -inf,  -52.88] |     0
( -52.88,    2.57] |    15
(   2.57,    9.36] |    14
(   9.36,   39.53] |    14
(  39.53,   94.97] |     6
(  94.97, 4111.45] |     9

>>> bp.bins
array([ -5.28762500e+01,   2.56750000e+00,   9.36500000e+00,
         3.95300000e+01,   9.49737500e+01,   4.11145000e+03])
>>> bp.counts
array([ 0, 15, 14, 14,  6,  9])
>>> bp.yb
array([5, 1, 2, 3, 2, 1, 5, 1, 3, 3, 1, 2, 2, 1, 2, 2, 2, 1, 5, 2, 4, 1, 2,
       2, 1, 1, 3, 3, 3, 5, 3, 1, 3, 5, 2, 3, 5, 5, 4, 3, 5, 3, 5, 4, 2, 1,
       1, 4, 4, 3, 3, 1, 1, 2, 1, 4, 3, 2])

Binning new data

>>> bp = mapclassify.BoxPlot(y)
>>> bp
BoxPlot

     Interval        Count
--------------------------
(   -inf,  -52.88] |     0
( -52.88,    2.57] |    15
(   2.57,    9.36] |    14
(   9.36,   39.53] |    14
(  39.53,   94.97] |     6
(  94.97, 4111.45] |     9
>>> bp.find_bin([0, 7, 3000, 48])
array([1, 2, 5, 4])

Note that find_bin does not recalibrate the classifier:

>>> bp
BoxPlot

     Interval        Count
--------------------------
(   -inf,  -52.88] |     0
( -52.88,    2.57] |    15
(   2.57,    9.36] |    14
(   9.36,   39.53] |    14
(  39.53,   94.97] |     6
(  94.97, 4111.45] |     9
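Under the hood, find_bin is a lookup of new values against the already-fixed bin edges. The same lookup can be sketched (this is an illustration, not mapclassify's implementation) with numpy.searchsorted against the upper edges shown above:

```python
import numpy as np

# upper edges of the BoxPlot intervals shown above
bins = np.array([-52.88, 2.57, 9.36, 39.53, 94.97, 4111.45])

# searchsorted with side="left" reproduces the (lower, upper] lookup,
# because a value equal to an edge is assigned to that edge's interval
labels = np.searchsorted(bins, [0, 7, 3000, 48], side="left")
print(labels)  # matches bp.find_bin([0, 7, 3000, 48]) above
```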

Apply

>>> import mapclassify 
>>> import pandas
>>> from numpy import linspace as lsp
>>> data = [lsp(3,8,num=10), lsp(10, 0, num=10), lsp(-5, 15, num=10)]
>>> data = pandas.DataFrame(data).T
>>> data
          0          1          2
0  3.000000  10.000000  -5.000000
1  3.555556   8.888889  -2.777778
2  4.111111   7.777778  -0.555556
3  4.666667   6.666667   1.666667
4  5.222222   5.555556   3.888889
5  5.777778   4.444444   6.111111
6  6.333333   3.333333   8.333333
7  6.888889   2.222222  10.555556
8  7.444444   1.111111  12.777778
9  8.000000   0.000000  15.000000
>>> data.apply(mapclassify.Quantiles.make(rolling=True))
   0  1  2
0  0  4  0
1  0  4  0
2  1  4  0
3  1  3  0
4  2  2  1
5  2  1  2
6  3  0  4
7  3  0  4
8  4  0  4
9  4  0  4

Development Notes

Because we use geopandas in development, and geopandas depends on the stable release of mapclassify, setting up a local development installation involves creating a conda environment and then replacing the stable mapclassify in that environment with the development version. This can be accomplished with the following steps:

conda env create -f environment.yml
conda activate mapclassify
conda remove -n mapclassify mapclassify
pip install -e .

mapclassify's People

Contributors

burggraaff, darribas, dependabot[bot], dfolch, github-actions[bot], jeffcsauer, jgaboardi, jlaura, justinpihony, knaaptime, lanselin, ljwolf, martinfleis, mhwang4, mlyons-tcc, mriduls, nmalizia, pastephens, pre-commit-ci[bot], schmidtc, sjsrey, slumnitz, weikang9009


mapclassify's Issues

Difference between Natural Breaks and Fisher Jenks schemes

Hi all! I'm working with some census data to build choropleth maps, and I'm wondering what the difference is between the Natural Breaks and Fisher-Jenks schemes. As far as I know, both of them reduce the variance within each classification group relative to the variance between groups. But I'm not clear on the detailed difference between the two methods.

I was looking into the examples that are pointed out in the implementation section here in the repo and they look pretty similar. And here I share another one with the data I'm using now:


Since the results from both methods are very similar, I'm wondering: is the Fisher-Jenks method a kind of optimization of the Natural Breaks scheme?

Many thanks for your help!

Improving FisherJenks

The implemented FisherJenks classifier is very slow. I would like to suggest using jenkspy instead. It's written in cython and it's very fast.

`MaxP.update()` – bins used but not defined

In 8505795 I commented out the update method for MaxP due to linting failure, and then tests were passing without it. The linting was failing because bins is used but never passed in or defined within. I am wondering if this is a bug introduced through copy-paste of another classifier's update method, for example UserDefined?

BUG: HeadTailBreaks raise RecursionError

HeadTailBreaks raises a RecursionError if the maximum value occurs in the data twice (or more). It then locks itself in the recursion over values[values >= mean]: the mean never changes, and both copies of the maximum are always returned.

Steps to reproduce:

import numpy as np
import mapclassify as mc

data = np.random.pareto(2, 1000)
data = np.append(data, data.max())

mc.HeadTailBreaks(data)

I assume that once the remaining values are all identical, head_tail_breaks should stop:

def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(values) > 1:
        if len(set(values)) > 1:  # this seems to fix the issue
            return head_tail_breaks(values[values >= mean], cuts)
    return cuts

However, I am not sure whether stopping and keeping multiple values in the last bin is the intended behaviour, as it does not reflect the definition of the HeadTailBreaks algorithm (but I cannot see another solution). Happy to do a PR if this is how you want to fix it.

new release on pypi

We need a new release on PyPI, since api.py has now been removed and several packages have mapclassify as a dependency. The last release was in August 2017.

BUG: FisherJenksSampled returns ValueError if Series is passed as y

FisherJenksSampled seems to be the only classifier that has trouble processing a pandas.Series as y. Passing the same data as an array (df.pop_est.values) works flawlessly.

We should put y = np.asarray(y) somewhere around here to make sure we get an array every time.

df = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
mapclassify.FisherJenksSampled(df.pop_est)
/opt/miniconda3/envs/geo_dev/lib/python3.7/site-packages/pandas/core/computation/expressions.py:204: UserWarning: evaluating in Python space because the '*' operator is not supported by numexpr for the bool dtype, use '&' instead
  f"evaluating in Python space because the {repr(op_str)} "
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-5f392df02277> in <module>
----> 1 mapclassify.FisherJenksSampled(df.pop_est)

~/Git/mapclassify/mapclassify/classifiers.py in __init__(self, y, k, pct, truncate)
   1852         self.name = "FisherJenksSampled"
   1853         self.y = y
-> 1854         self._summary()  # have to recalculate summary stats
   1855 
   1856     def _set_bins(self):

~/Git/mapclassify/mapclassify/classifiers.py in _summary(self)
    624     def _summary(self):
    625         yb = self.yb
--> 626         self.classes = [np.nonzero(yb == c)[0].tolist() for c in range(self.k)]
    627         self.tss = self.get_tss()
    628         self.adcm = self.get_adcm()

~/Git/mapclassify/mapclassify/classifiers.py in <listcomp>(.0)
    624     def _summary(self):
    625         yb = self.yb
--> 626         self.classes = [np.nonzero(yb == c)[0].tolist() for c in range(self.k)]
    627         self.tss = self.get_tss()
    628         self.adcm = self.get_adcm()

<__array_function__ internals> in nonzero(*args, **kwargs)

/opt/miniconda3/envs/geo_dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py in nonzero(a)
   1894 
   1895     """
-> 1896     return _wrapfunc(a, 'nonzero')
   1897 
   1898 

/opt/miniconda3/envs/geo_dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56     bound = getattr(obj, method, None)
     57     if bound is None:
---> 58         return _wrapit(obj, method, *args, **kwds)
     59 
     60     try:

/opt/miniconda3/envs/geo_dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     49         if not isinstance(result, mu.ndarray):
     50             result = asarray(result)
---> 51         result = wrap(result)
     52     return result
     53 

/opt/miniconda3/envs/geo_dev/lib/python3.7/site-packages/pandas/core/generic.py in __array_wrap__(self, result, context)
   1788             return result
   1789         d = self._construct_axes_dict(self._AXIS_ORDERS, copy=False)
-> 1790         return self._constructor(result, **d).__finalize__(
   1791             self, method="__array_wrap__"
   1792         )

/opt/miniconda3/envs/geo_dev/lib/python3.7/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    312                     if len(index) != len(data):
    313                         raise ValueError(
--> 314                             f"Length of passed values is {len(data)}, "
    315                             f"index implies {len(index)}."
    316                         )

ValueError: Length of passed values is 1, index implies 177.

Add a Pooled classifier

For panel data, keeping the bins constant across periods and defined on the vectorized panel would facilitate snapshot comparisons:


Source: PySAL Gitter 2019-10-29.
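A rough numpy sketch of the idea (illustrative only, not a proposed implementation): compute breaks once on the pooled panel, then bin each period against those shared breaks so class membership is comparable across snapshots:

```python
import numpy as np

# toy panel: one column per period
panel = np.array([[1, 10], [2, 20], [3, 30], [4, 40]], dtype=float)

# quartile breaks computed once on the vectorized (pooled) panel
pooled_breaks = np.percentile(panel.ravel(), [25, 50, 75, 100])

# every period is binned against the same breaks, so classes line up
labels = [np.searchsorted(pooled_breaks, col, side="left").tolist()
          for col in panel.T]
print(labels)
```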

BUG: greedy(strategy='balanced') does not return correct labels

balanced strategy in greedy seems to be broken.

df = gpd.read_file('https://gist.githubusercontent.com/martinfleis/5c669d1204d120d87f179e31c896043d/raw/5799db9c48ec1d20a309fc1b18b7edf07a8aca6b/gb.geojson')
df.plot(greedy(df, strategy='balanced'), figsize=(12, 12), edgecolor='w')


Strategies from networkx work fine. I'll try to figure out what is going on later.

add `min` keyword to `UserDefined`

The UserDefined classifier is based on the upper bounds of each class, which means that the resulting legend, accessed via .get_legend_classes(), is inconsistent across different y arrays: it always uses the minimum value of the current y as the lower bound.

I would add an optional min keyword that fixes the minimum value regardless of y, giving consistent legends. That seems to be the best way of fixing geopandas/geopandas#2018.
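A sketch of how the proposed keyword could behave (the function and its `lower` handling are hypothetical, purely to illustrate the legend inconsistency and the fix):

```python
import numpy as np

def user_defined_intervals(y, bins, lower=None):
    # hypothetical: UserDefined derives the first lower bound from min(y);
    # an optional `lower` would pin it instead
    y = np.asarray(y)
    low = y.min() if lower is None else lower
    edges = [low] + list(bins)
    return list(zip(edges[:-1], edges[1:]))

a = np.array([5, 30, 700])
b = np.array([12, 40, 900])

# without a fixed lower bound, the first legend entry differs across datasets
assert user_defined_intervals(a, [22, 674, 4112])[0] != user_defined_intervals(b, [22, 674, 4112])[0]

# pinning the minimum yields identical legends
assert user_defined_intervals(a, [22, 674, 4112], lower=0) == user_defined_intervals(b, [22, 674, 4112], lower=0)
```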

EqualInterval unclear error when `max_y - min_y = 0`

Hi,

I'm currently using mapclassify to batch process some maps with the EqualInterval classifier via GeoPandas, and I happened upon an error when I had a column that contained only zeroes:

stuff.py:122: in _plot_diff
    a = EqualInterval(mapdata[tmp_col])
../../../.conda/envs/geo-env/lib/python3.7/site-packages/mapclassify/classifiers.py:1197: in __init__
    MapClassifier.__init__(self, y)
../../../.conda/envs/geo-env/lib/python3.7/site-packages/mapclassify/classifiers.py:614: in __init__
    self._classify()
../../../.conda/envs/geo-env/lib/python3.7/site-packages/mapclassify/classifiers.py:633: in _classify
    self._set_bins()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'EqualInterval' object has no attribute 'bins'") raised in repr()] EqualInterval object at 0x7f661c5d12d0>

    def _set_bins(self):
        y = self.y
        k = self.k
        max_y = max(y)
        min_y = min(y)
        rg = max_y - min_y
        width = rg * 1.0 / k
>       cuts = np.arange(min_y + width, max_y + width, width)
E       ValueError: arange: cannot compute length

../../../.conda/envs/geo-env/lib/python3.7/site-packages/mapclassify/classifiers.py:1207: ValueError

As far as I can tell, this happens because width becomes zero when max_y and min_y are equal. I would like a ValueError to be raised in this case explaining what's wrong. Would a PR for this be OK? Another possible solution would be to fall back to k = 1 and skip all the binning logic in this case, but that feels unintuitive to me.
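A sketch of the suggested guard (a hypothetical standalone function, not mapclassify's actual code path):

```python
import numpy as np

def equal_interval_bins(y, k=5):
    # mirrors the _set_bins logic above, with an explicit zero-range check
    y = np.asarray(y, dtype=float)
    rg = y.max() - y.min()
    if rg == 0:
        raise ValueError(
            "Minimum and maximum of input data are equal, cannot create bins."
        )
    width = rg / k
    return np.arange(y.min() + width, y.max() + width, width)

print(equal_interval_bins([0, 10], k=5))  # five evenly spaced upper bounds
```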

love and care for docstrings, etc.

In working through #135 and making some doc edits, I'm seeing that the docstrings and notebooks need a thorough scouring for consistent formatting, grammar, and spelling, along with an update to the docs/ infrastructure itself. I'll get working on that and split it off from the work I've already started in #135.

Extra files in PyPI sdist

Besides missing a few documentation files (this is arguable, though lacking README.md seems unfortunate), the sdist seems to contain a few extra files:

  • mapclassify/deprecation.py
  • mapclassify/flycheck_classifiers (serges-MacBook-Pro.local's conflicted copy 2019-07-03).py
  • mapclassify/test.py

REGR: UserDefined classifier returns ValueError("Minimum and maximum of input data are equal, cannot create bins.")

We have a regression in 2.4.0. This commit 7aad6fc introduced the following check:

if min(self.y) == max(self.y):
    raise ValueError(
        "Minimum and maximum of input data are equal, cannot create bins."
    )

However, that check is irrelevant when using the UserDefined classifier. momepy's CI went red with 2.4.0. I define bins using the whole array and then apply the same bins to subsets of the data to get counts per bin, so it is perfectly fine that min == max.

Would it not be better in general to fall back to k=1 and raise a warning instead of an error? You may have discussed that before, though... (cc @jeffcsauer)

We should ideally fix this and do a 2.4.1 bugfix release before we do the meta release.

conda-forge UnsatisfiableError on windows and python 3.7

Hi,

For some reason, if I try to install mapclassify with Python 3.7 on Windows from conda-forge, I get an UnsatisfiableError (https://ci.appveyor.com/project/martinfleis/momepy/builds/28862313). There is nothing else in the environment:

name: test
channels:
  - conda-forge
dependencies:
  - python=3.7
  - mapclassify

If I unpin Python, I get Python 3.8 but mapclassify 2.0.1, so I suspect an issue with scikit-learn, but I have no clue why this happens. The same situation occurs with pysal.

Edit: If I keep mapclassify only, I get 2.0.1; if I pin mapclassify to 2.1.1, I get the error.

Backwards compatibility

For others who may stumble onto this same issue, I just wanted to note that mapclassify will work with Python 2.7 if you clone the repository, make one small change to setup.py (add the line "from io import open"), and run pip install with the -e flag pointing to your local copy of the repository (e.g., python2 -m pip install -e ~/my_packages/mapclassify).

My experience has been that some of the other packages useful for choropleth visualizations are not compatible with Python3, so it's great that Mapclassify can work in both.

Thank you for making this excellent tool!

Maximum Likelihood–Based Classification Scheme

I randomly bumped into this paper, which proposes a Maximum Likelihood–Based Classification Scheme. It may be worth adding to mapclassify's portfolio.

Wangshu Mu & Daoqin Tong (2019) Choropleth Mapping with Uncertainty: A Maximum Likelihood–Based Classification Scheme, Annals of the American Association of Geographers, 109:5, 1493-1510, DOI: 10.1080/24694452.2018.1549971

Invalid escape sequences in strings

There are some warnings when byte-compiling the files under Python 3.8; these invalid escape sequences may eventually become errors in newer Python versions.

mapclassify-2.2.0/mapclassify/classifiers.py:568: DeprecationWarning: invalid escape sequence \l
mapclassify-2.2.0/mapclassify/classifiers.py:2593: DeprecationWarning: invalid escape sequence \s

Is mapclassify code black?

@sjsrey Is the "official" mapclassify code formatting style black? If not, shall I create a PR that blackens the repo and adds a black badge to README.md?

Strange behaviour in `Quantiles`

Quantiles seems to return a number of bins that differs from k in some contexts:

In [45]: y = db['HR60'] 
In [46]: breaks = ps.Quantiles(y, 9).bins 
In [47]: len(breaks) 
Out[47]: 8

In [49]: y.min()
Out[49]: 0.0

In [50]: y.max()
Out[50]: 92.936802974000003

In [51]: breaks
Out[51]:
array([  0.        ,   1.13094026,   2.19806302,   3.44429745,
         5.13821807,   7.47789373,  10.86644222,  92.93680297])

In [67]: [print(i, ps.Quantiles(y, k=i).bins.shape[0]) for i in range(1, 10)] 
1 1 
2 2 
3 3 
4 4 
5 5 
6 6 
7 7 
8 7 
9 8 
Out[67]:
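The behaviour comes from duplicate breakpoints: with heavily skewed data, adjacent sample quantiles can coincide, and collapsing the duplicates leaves fewer than k unique breaks. A small numpy illustration (not mapclassify's code):

```python
import numpy as np

# mostly-zero data: many of the requested quantiles land on the same value
y = np.array([0.0] * 50 + list(range(1, 11)))
breaks = np.percentile(y, np.linspace(100 / 8, 100, 8))

# 8 quantiles requested, but fewer unique breakpoints survive deduplication
print(len(breaks), len(np.unique(breaks)))
```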

current version of mapclassify in docs?

The docs list the current version of mapclassify as v2.1.1. The current (pre) released version is v2.3.0. Should the docs be rebuilt now or after the official release of v2.3.0? If the docs should be rebuilt now, I'll go ahead and make a PR.

Add streamlined API

Opening up a thread here to sketch out what a different, streamlined API could look like for mapclassify algorithms. This grows out of discussions with other projects (e.g. xarray, ipyleaflet). The basic idea is that we would like mapclassify to be as easy to use and as useful as possible, and we've started thinking our current API could be streamlined for the "80% of cases" where you have a 1D array of values and want back a set of labels.

Based on this, we were thinking of starting by adding a method that'd wrap around all the available methods available for classification. For a very rough sketch, something like:

import numpy
import mapclassify

a = numpy.random.random(100)
q = mapclassify.classify(a, "quantiles", k=5)

Another option would be to develop it following the sklearn pattern:

classifier = mapclassify.Quantiles(k=5)
q = classifier.fit_transform(a)

And there might be others more useful for folks. Please if you feel inclined, do drop your views here, it'd be really useful to get as many views as possible. Also, of course, any other ideas or "wish-list" items you may have that'd make you more likely to use/adopt mapclassify for choropleth classification tasks, we'd love to help if possible.

Tagging a few folks we imagine might be interested in some way (@brendancol, @kristinepetrosyan, @martinRenou, @jorisvandenbossche, @sjsrey, @ljwolf, @slumnitz ), but feel free to tag more as you go along!

Inconsistent UserDefined Scale With Multiple Axes

I am trying to compare the differences between the number of items in four different data sets using a consistent scale. The issue is that if one of the data sets does not have data in a certain interval, the color scheme is thrown off and one color goes missing. This means that the color that was supposed to represent the maximum value across all data sets (0.46) will now also represent the maximum value of a different data set (0.2).

The data was in a GeoDataFrame, with each of the four cities in a separate column. I defined a scale to match the largest value across the four columns and made a number of bins (3 or 4) from 0 to that maximum. The scale worked well for the column containing the largest value, but was incorrect for all of the other columns.

[image: map of cities and their regions]

This problem persists no matter how many bins I create, even when making sure that each column has data within every bin.

If there is a better way to create a consistent scale across multiple axes, please let me know. Thank you.

mapclassify.Natural_Breaks() does not return the specified k classes

Hi,

I use mapclassify.Natural_Breaks() to produce bins for my MapBox heatmap.

My code looks like this:

import pandas as pd
import mapclassify

df = pd.read_table('./files_output/customer_qty.txt', sep=',', header=None).iloc[:,1]
mapclassify.Natural_Breaks(df.iloc[:,1], k=5)

I thought it should return 5 classes, but it only returned 3.
The output is:

Natural_Breaks

         Lower            Upper              Count
==================================================
         x[i] <=          1.000            54428
 1.000 < x[i] <=         26.000             2475
26.000 < x[i] <=        212.000               66

Attachment is the customer_qty.txt data file.
customer_qty.txt

warn or raise vs. print statements for unexpected behavior

There are two places in the code base where print statements are used to alert the user to unexpected behavior. I am of the opinion that these should instead throw a warning (case 1) and raise an exception (case 2).

  1. This should probably throw a warning.
  2. This should probably raise a ValueError.
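As a general illustration of the suggested pattern (the conditions below are hypothetical stand-ins, since the two print sites are only referenced by link above):

```python
import warnings

def validate(k, n_unique):
    # case 1: recoverable oddity -> warn instead of print
    if k > n_unique:
        warnings.warn(
            f"Not enough unique values for {k} classes; using {n_unique}.",
            UserWarning,
        )
    # case 2: unusable input -> raise instead of print
    if n_unique == 0:
        raise ValueError("Input contains no values; cannot classify.")

# the warning is catchable by callers, unlike a bare print
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate(k=5, n_unique=3)
assert len(caught) == 1
```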

BUG: HeadTailBreaks RecursionError due to floating point issue

If you have an array containing values that differ only by a tiny amount, HeadTailBreaks is sent into endless recursion.


import numpy as np
from mapclassify import HeadTailBreaks

HeadTailBreaks(np.array([1 + 2**-52, 1, 1]))
---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-9-9dcdc720a93b> in <module>
----> 1 HeadTailBreaks(np.array([1 + 2**-52, 1, 1]))

/opt/conda/lib/python3.7/site-packages/mapclassify/classifiers.py in __init__(self, y)
   1128 
   1129     def __init__(self, y):
-> 1130         MapClassifier.__init__(self, y)
   1131         self.name = "HeadTailBreaks"
   1132 

/opt/conda/lib/python3.7/site-packages/mapclassify/classifiers.py in __init__(self, y)
    612         self.fmt = FMT
    613         self.y = y
--> 614         self._classify()
    615         self._summary()
    616 

/opt/conda/lib/python3.7/site-packages/mapclassify/classifiers.py in _classify(self)
    631 
    632     def _classify(self):
--> 633         self._set_bins()
    634         self.yb, self.counts = bin1d(self.y, self.bins)
    635 

/opt/conda/lib/python3.7/site-packages/mapclassify/classifiers.py in _set_bins(self)
   1135         x = self.y.copy()
   1136         bins = []
-> 1137         bins = head_tail_breaks(x, bins)
   1138         self.bins = np.array(bins)
   1139         self.k = len(self.bins)

/opt/conda/lib/python3.7/site-packages/mapclassify/classifiers.py in head_tail_breaks(values, cuts)
    182     cuts.append(mean)
    183     if len(set(values)) > 1:
--> 184         return head_tail_breaks(values[values >= mean], cuts)
    185     return cuts
    186 

... last 1 frames repeated, from the frame below ...

/opt/conda/lib/python3.7/site-packages/mapclassify/classifiers.py in head_tail_breaks(values, cuts)
    182     cuts.append(mean)
    183     if len(set(values)) > 1:
--> 184         return head_tail_breaks(values[values >= mean], cuts)
    185     return cuts
    186 

RecursionError: maximum recursion depth exceeded in comparison

def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(set(values)) > 1:
        return head_tail_breaks(values[values >= mean], cuts)
    return cuts
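The root cause can be demonstrated directly (assuming default numpy float64 summation): the mean of this array rounds back down to 1.0, so values >= mean never shrinks the array while the set of values still has two members, and the recursion never terminates.

```python
import numpy as np

values = np.array([1 + 2**-52, 1, 1])
mean = values.mean()                 # rounds to exactly 1.0 in float64
subset = values[values >= mean]

assert mean == 1.0
assert len(subset) == len(values)    # the recursion never shrinks the array
assert len(set(values)) > 1          # so the stopping condition never fires
```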

`plot` doesn't work for pooled classifications

The plot utility function for classification objects does not seem to work for pooled classifications. This might be expected behaviour or unsupported functionality, but flagging it just in case.

Reproducible error:

from pysal.lib import examples
import mapclassify
import geopandas


mx_ex = examples.load_example('mexico')
mx = geopandas.read_file(mx_ex.get_file_list()[0])

years = ['PCGDP1940', 'PCGDP1960', 'PCGDP1980', 'PCGDP2000']
pooled = mapclassify.Pooled(mx[years])

for i, y in enumerate(years):
    classi = pooled.col_classifiers[i]
    classi.plot(mx)

which on the gds_env:6.1 returns:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/geopandas/plotting.py in _mapclassify_choro(values, scheme, **classification_kwds)
   1010     try:
-> 1011         scheme_class = schemes[scheme]
   1012     except KeyError:

KeyError: 'pooled quantiles'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/geopandas/plotting.py in _mapclassify_choro(values, scheme, **classification_kwds)
   1014         try:
-> 1015             scheme_class = schemes[scheme]
   1016         except KeyError:

KeyError: 'pooled quantiles'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-20-f10e01e9c42e> in <module>
     12 for i, y in enumerate(years):
     13     classi = pooled.col_classifiers[i]
---> 14     classi.plot(mx)

/opt/conda/lib/python3.8/site-packages/mapclassify/classifiers.py in plot(self, gdf, border_color, border_width, title, legend, cmap, axis_on, legend_kwds, file_name, dpi, ax)
   2336             fmt = legend_kwds.pop("fmt")
   2337 
-> 2338         ax = gdf.assign(_cl=self.y).plot(
   2339             column="_cl",
   2340             ax=ax,

/opt/conda/lib/python3.8/site-packages/geopandas/plotting.py in __call__(self, *args, **kwargs)
    923             kind = kwargs.pop("kind", "geo")
    924             if kind == "geo":
--> 925                 return plot_dataframe(data, *args, **kwargs)
    926             if kind in self._pandas_kinds:
    927                 # Access pandas plots

/opt/conda/lib/python3.8/site-packages/geopandas/plotting.py in plot_dataframe(df, column, cmap, color, ax, cax, categorical, legend, scheme, k, vmin, vmax, markersize, figsize, legend_kwds, categories, classification_kwds, missing_kwds, aspect, **style_kwds)
    750             classification_kwds["k"] = k
    751 
--> 752         binning = _mapclassify_choro(values[~nan_idx], scheme, **classification_kwds)
    753         # set categorical to True for creating the legend
    754         categorical = True

/opt/conda/lib/python3.8/site-packages/geopandas/plotting.py in _mapclassify_choro(values, scheme, **classification_kwds)
   1015             scheme_class = schemes[scheme]
   1016         except KeyError:
-> 1017             raise ValueError(
   1018                 "Invalid scheme. Scheme must be in the set: %r" % schemes.keys()
   1019             )

ValueError: Invalid scheme. Scheme must be in the set: dict_keys(['boxplot', 'equalinterval', 'fisherjenks', 'fisherjenkssampled', 'headtailbreaks', 'jenkscaspall', 'jenkscaspallforced', 'jenkscaspallsampled', 'maxp', 'maximumbreaks', 'naturalbreaks', 'quantiles', 'percentiles', 'stdmean', 'userdefined'])
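A guess at the root cause, read off the traceback above: the sub-classifiers produced by `Pooled` report a scheme name like `'Pooled Quantiles'`, which geopandas lowercases and looks up in a fixed dict of supported schemes; the pooled variants are absent from that dict, hence the `ValueError`. A minimal sketch of that lookup (names and dict abridged and hypothetical, not geopandas' actual code):

```python
# Abridged, hypothetical sketch of geopandas' scheme lookup; the
# real mapping lives in geopandas.plotting._mapclassify_choro.
schemes = {
    "boxplot": "BoxPlot",
    "equalinterval": "EqualInterval",
    "quantiles": "Quantiles",
}

def lookup(scheme):
    try:
        return schemes[scheme.lower()]
    except KeyError:
        raise ValueError(
            "Invalid scheme. Scheme must be in the set: %r" % schemes.keys()
        )

print(lookup("Quantiles"))  # found: a plain classifier name resolves
```

`lookup("Pooled Quantiles")` raises the same `ValueError` shape seen in the traceback, because no "pooled ..." key exists in the dict.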

HeadTailBreaks RecursionError still not fully resolved

I have seen the RecursionError coming from HeadTailBreaks again, this time caused by floating-point imprecision.

This snippet reproduces the issue. The parquet file is just 2 kB and contains only the problematic subset of the values.

df = pandas.read_parquet("https://www.dropbox.com/s/p9pgg2pdvnhvgsw/sample.parquet?dl=1")
bins = mapclassify.HeadTailBreaks(df["values"])

The current workaround is to round the values before passing them to mapclassify, but we should somehow resolve this under the hood (though I don't know how off the top of my head).

bins = mapclassify.HeadTailBreaks(df["values"].round(6))

xref #46 and #95
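A guess at the mechanism, reconstructed from first principles rather than taken from the parquet file: with two adjacent doubles, the floating-point mean can round down to the smaller value, so `values[values >= mean]` keeps both elements, `set(values)` still sees two distinct values, and the recursion never shrinks its input. A minimal sketch:

```python
import numpy as np

# Two adjacent doubles whose float mean rounds down to the smaller
# value: the sum 2.0 + 2**-52 is a rounding tie at spacing 2**-51
# and rounds to 2.0, so the mean comes out exactly 1.0.
a = 1.0
b = 1.0 + 2 ** -52          # the next representable double above 1.0
vals = np.array([a, b])
mean = vals.mean()

print(mean == a)            # True
print((vals >= mean).sum()) # 2 -> head_tail_breaks recurses on the
                            # same two values forever
```

Rounding the inputs (as in the workaround above) collapses `a` and `b` to one value, which is why it sidesteps the failure.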

remove docs badge

Since mapclassify switched from readthedocs to GitHub-hosted docs (#41), the docs badges on README.md and README.rst are no longer functional or necessary.

  • remove doc badge from README.md
  • remove doc badge from README.rst
