
pdpbox's People

Contributors

angertdevsingh, carlosdanielcsantos, mqk, saucecat

pdpbox's Issues

Cannot use pdp.pdp_isolate() on multiclass problem

I have a random forest model that gives output in the format array([[0, 1]], dtype=uint8) or array([[1, 0]], dtype=uint8). I can't figure out how to use pdp_isolate; my code and error message are here:

CODE:
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

pdp_inc = pdp.pdp_isolate(model=model, dataset=pd.DataFrame(features.iloc[0]).T, model_features=features.columns.tolist(), feature='r1')
pdp.pdp_plot(pdp_inc, 'SCI_Mainframe')
plt.show()

ERROR:
TypeError: list indices must be integers or slices, not tuple

I don't know what to do. I have spent a couple of hours looking for a solution without luck.
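
For reference, a hedged sketch of a call that passes the full feature DataFrame rather than a single transposed row; whether this resolves the reported TypeError depends on the model's output format and the PDPbox version:

from matplotlib import pyplot as plt
from pdpbox import pdp

pdp_inc = pdp.pdp_isolate(
    model=model,                               # the random forest from above
    dataset=features,                          # full feature DataFrame, not one transposed row
    model_features=features.columns.tolist(),
    feature='r1',
)
fig, axes = pdp.pdp_plot(pdp_inc, 'r1')
plt.show()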

Sort plot results by mean

It would be nice if a sort parameter could be set in order to sort the results. The results would look something like this:

[attached screenshot of the desired sorted plot]

What do you think?

Rounding in `_get_grids()` causes issues for features with narrow scale

If a feature has percentiles that vary by less than 0.01, the generated grid has duplicate values, which (for some reason) leads to unequal dimensions of the .feature_grids and .pdp attributes of the pdpbox.pdp.pdp_isolate_obj.
When using this development version, the problem with the dimensions is gone, but the grid will have only one value in this example. I could fix the problem for my purposes by just removing the rounding statements in the _get_grids() function (see my fork), but I assume there's a reason for the rounding, so a real fix is probably more involved (?).

Here's a reproducible example (using the version from pypi):

import pandas as pd
import numpy as np

from pdpbox import pdp # Version from pypi
from sklearn.linear_model import SGDClassifier

np.random.seed(123)
df = pd.DataFrame({'y': np.random.randint(0,2,100), 
                  'x1': np.random.uniform(0.5, 0.5001, 100),
                  'x2': np.random.uniform(0.5, 0.5001, 100)})

clf = SGDClassifier()
X = df[['x1', 'x2']]
y = df['y']
clf.fit(X, y)
P = pdp.pdp_isolate(model=clf, train_X=X, feature='x1')

print(P.feature_grids)
print(P.pdp)
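
A possible workaround sketch, assuming the 0.2.x-style API used elsewhere in this thread (dataset/model_features/cust_grid_points): supply the grid explicitly so no rounding or deduplication happens inside the grid computation.

import numpy as np

# Explicit grid over the narrow feature range (hypothetical workaround that
# bypasses the internal percentile rounding).
custom_grid = np.linspace(df['x1'].min(), df['x1'].max(), 10)

P = pdp.pdp_isolate(
    model=clf, dataset=df, model_features=['x1', 'x2'],
    feature='x1', cust_grid_points=custom_grid,
)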

Error: 'PDPIsolate' object has no attribute 'pdp_isolate'

Hi,

I am unable to successfully plot the PDP. The code being used is below:

feat_name = 'avg_albumin'

x_test_pdp = x_test[x_test[feat_name].notnull()]  # did this to remove nulls since pdp can't take null values

pdp_data = pdp.pdp_isolate(model=best_gbm, dataset=x_test_pdp, model_features=cols_best_rfe, feature=feat_name)

# plot it
pdp.pdp_plot(pdp_data, feat_name)
plt.show()

The error statement is- 'PDPIsolate' object has no attribute 'pdp_isolate'.

It would be great if you could help me out with this. Thanks! @doctorperceptron @SauceCat
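
One hedged guess: this error message usually means the name pdp has been rebound to a PDPIsolate result object somewhere earlier (for example pdp = pdp.pdp_isolate(...)), so the next call looks up pdp_isolate on that object instead of the module. Re-importing the module restores it; a sketch reusing the names from the snippet above:

from pdpbox import pdp  # make sure `pdp` refers to the module, not a result object

pdp_data = pdp.pdp_isolate(
    model=best_gbm, dataset=x_test_pdp,
    model_features=cols_best_rfe, feature=feat_name,
)
fig, axes = pdp.pdp_plot(pdp_data, feat_name)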

logic error with one-hot encoding feature

In the example provided, 'embarked' has three labels 'C', 'S', and 'Q', and all three labels are presented as features of the model.
However, in practice, when training the model, encoding all labels as features causes a loss of rank (perfect multicollinearity). As the name "one-hot" indicates, the base label should not be used as a feature, so that the design matrix keeps full rank. In this case, "Embarked_C" would not be a feature used to train the model.
So the PDP should display correct dependency values for feature=['Embarked_S', 'Embarked_Q'].

Customize figures

Hi,

first, a really big 'Thank you' for this wonderful package! I also read the book from Chris and searched for Python implementations of ICE plots, and this really solves some open issues.
I assume you used matplotlib as the figure backend, and I found the list of parameters you can feed via the dict. But it would be great to have the possibility to:

  • scale the axes (if you don't want to center the plot)
  • increase the font size of the axis ticks (not only the labels)

I did not find a way to do this currently with this nice implementation.

Thanks
Jan
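
Until such options exist, a post-hoc matplotlib sketch may help, assuming pdp_plot returns a (fig, axes) pair as in the other examples in this thread; the tweaks below go through the figure object rather than any PDPbox parameter:

fig, axes = pdp.pdp_plot(pdp_iso, 'my_feature', center=False)  # pdp_iso: a pdp_isolate result

# bigger tick-label fonts on every axis of the returned figure
for ax in fig.get_axes():
    ax.tick_params(axis='both', labelsize=14)

# manual axis scaling on whichever axis you need (here the last one, as an example)
fig.get_axes()[-1].set_xlim(left=0)

fig.savefig('pdp_custom.png', dpi=150)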

pdp_isolate fails for regression tasks

Hi - firstly I'd like to thank you for producing this package, it's really great! I was just reading the ICEBox paper recently and was considering building something, but was delighted to see somebody else already had :)

I'm having issues with calling pdp_isolate on a regression model - it throws the following exception:

usr/local/lib/python3.5/dist-packages/PDPbox-0.1-py3.5.egg/pdpbox/pdp.py in pdp_isolate(model, train_X, feature, num_grid_points, percentile_range)
    113     # store the ICE lines
    114     # for multi-classifier, a dictionary is created
--> 115     if n_classes > 2:
    116         ice_lines = {}
    117         for n_class in range(n_classes):

TypeError: unorderable types: NoneType() > int()

Even the 'Regression.ipynb' example in PDPbox/test/Regression/ does this. A cursory glance at the codebase seems to suggest that when we have a sklearn model without a classes property, n_classes gets set to None on pdp.py line 64. Then all subsequent comparisons of n_classes to an integer will throw this error. Any suggestions?
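
A sketch of the kind of guard being suggested (not the actual patch): treat a model without a classes_ attribute as a regressor, so the comparison is never made against None.

n_classes = len(model.classes_) if hasattr(model, 'classes_') else None

# original: `if n_classes > 2:` fails when n_classes is None (regression);
# guarding with an explicit None check avoids the TypeError:
if n_classes is not None and n_classes > 2:
    ...  # per-class ICE lines (multi-class)
else:
    ...  # single set of ICE lines (binary classification or regression)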

An issue when changing num_grid_points in pdp_plot

Hi,

when I tried to increase num_grid_points to 100, the following ValueError happened:
"x and y must have same first dimension, but have shapes (81,) and (91,)"

I am confused, because shouldn't they always be the same?

Code:
pdp_age = pdp.pdp_isolate(clf, X, target_feature, num_grid_points=100, percentile_range=(1, 100))
pdp.pdp_plot(pdp_age, target_feature, center=False, figsize=(9, 5), plot_lines=False)

It seems that after "pdp_isolate", pdp.size (which determines y) and feature_grids.size become different.

Can you please explain more about how to make sure that x and y have same shape?

Thanks!
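
A quick check sketch that makes the reported size mismatch visible (attribute names follow the pdp_isolate output used elsewhere in this thread):

print(len(pdp_age.feature_grids))  # grid points actually kept after deduplication
print(len(pdp_age.pdp))            # PDP values; compare the two lengths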

Class labels

I am having trouble figuring out how pdpbox assigns class labels to multiclass plots for single features.

Here is a code snippet I'm using for a target that is either yes, no, or unknown:

feature = <feat>
pdp_isolated = pdp_isolate(
    model=xgbcl, 
    dataset=X_train_processed, 
    model_features=X_train_processed.columns, 
    feature=feature,
    num_grid_points=100
)

pdp_plot(
    pdp_isolated, 
    feature_name=feature, 
    center=True, 
    x_quantile=True, 
    ncols=2, 
    plot_lines=True, 
    frac_to_plot=100,
    which_classes=[0, 1], 
    plot_pts_dist=True
)

[attached screenshot of the resulting multiclass plots]

Enable custom color normalize object

Hi @SauceCat,

In my use case of PDPbox's pdp_interact_plot(), I create a bunch of interact plots choosing two features at a time and visualize which of the interactions produces the most variation in 'marginal effects' (encoded in the color scale). As a result, all of the plots have colors spread along the entire range. Tiny and large variations in marginal effects can be distinguished only by reading off the range of the color scale on each plot.

I think it would be an immense enhancement if we could pass along a custom normalize object that can be shared among multiple plots, thus encoding the same value with the same color in each plot. A fixed normalize routine shared among multiple plots brings consistency and highlights the degree of variation in marginal effect when sharing it with stakeholders.

The two plots below both use the full color scale; however, their ranges of marginal effect are quite different.
Plot 1: small range of marginal effect
[attached screenshot]

Plot 2: large range of marginal effect
[attached screenshot]
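
For illustration, a sketch of the requested behaviour: a single matplotlib Normalize object shared across plots. The norm-passing argument itself is the proposed enhancement, not an existing PDPbox parameter.

import matplotlib.colors as mcolors

# One normalizer reused for every interact plot, so identical marginal-effect
# values map to identical colors across figures.
shared_norm = mcolors.Normalize(vmin=-0.05, vmax=0.05)

# Hypothetical call shape once such a hook exists (norm= is not a real argument yet):
# fig, axes = pdp.pdp_interact_plot(inter_out, ['feat_a', 'feat_b'], norm=shared_norm)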

Having issue using info_plots.actual_plot

I am following your examples and getting a weird error.

"fig, axes, summary_df = info_plots.actual_plot(model=forest_reg, X=df_new, feature = '1', feature_name='1')"

[attached screenshots of the call and the resulting error traceback]

Thanks a lot in advance!

Error on running example

When trying to run an example from the docs, I get the following error:
https://pdpbox.readthedocs.io/en/latest/pdp_plot.html

/home/janvanrijn/anaconda3/envs/openml-defaults/bin/python /home/janvanrijn/projects/openml-defaults/test2.py
Traceback (most recent call last):
  File "/home/janvanrijn/projects/openml-defaults/test2.py", line 14, in <module>
    feature='Sex')
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/pdpbox/pdp.py", line 159, in pdp_isolate
    for feature_grid in feature_grids)
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/pdpbox/pdp_calc_utils.py", line 44, in _calc_ice_lines
    preds = predict(_data[model_features], **predict_kwds)
  File "/home/janvanrijn/anaconda3/envs/openml-defaults/lib/python3.6/site-packages/xgboost/sklearn.py", line 797, in predict_proba
    test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)
AttributeError: 'XGBClassifier' object has no attribute 'n_jobs'

my pip freeze:

AnyQt==0.0.8
asn1crypto==0.24.0
Babel==2.6.0
Bottleneck==1.2.1
certifi==2018.8.24
cffi==1.11.5
chardet==3.0.4
click==6.7
cloudpickle==0.5.3
commonmark==0.8.0
ConfigSpace==0.4.7
cryptography==2.3.1
cycler==0.10.0
Cython==0.28.2
dask==0.18.1
debtcollector==1.19.0
decorator==4.3.0
distributed==1.22.0
docutils==0.14
entrypoints==0.2.3
fasteners==0.14.1
feather-format==0.4.0
future==0.16.0
HeapDict==1.0.0
holoviews==1.10.7
idna==2.7
iso8601==0.1.12
jeepney==0.3.1
joblib==0.12.3
keyring==13.2.1
keyrings.alt==3.1
kiwisolver==1.0.1
liac-arff==2.2.2
matplotlib==2.2.2
mkl-fft==1.0.0
mkl-random==1.0.1
monotonic==1.5
msgpack==0.5.6
netaddr==0.7.19
netifaces==0.10.7
networkx==2.1
numpy==1.14.3
Orange3==3.15.0
oslo.concurrency==3.27.0
oslo.config==6.2.1
oslo.i18n==3.20.0
oslo.utils==3.36.2
pandas==0.24.0.dev0+997.ga197837
param==1.7.0
pbr==4.0.4
PDPbox==0.2.0
psutil==5.4.6
PuLP==1.6.8
pyarrow==0.9.0
pycparser==2.18
pyparsing==2.2.0
pyqtgraph==0.10.0
python-dateutil==2.7.2
python-louvain==0.11
pytz==2018.4
pyviz-comms==0.6.0
PyYAML==3.12
requests==2.19.1
rfc3986==1.1.0
scikit-learn==0.20.0
scikit-optimize==0.5.2
scipy==0.19.1
seaborn==0.9.0
SecretStorage==3.1.0
serverfiles==0.2.1
six==1.11.0
sortedcontainers==2.0.4
stevedore==1.28.0
tblib==1.3.2
toolz==0.9.0
tornado==5.0.2
typing==3.6.4
urllib3==1.23
wrapt==1.10.11
xgboost==0.81
xlrd==1.1.0
xmltodict==0.11.0
zict==0.1.3

Code:



from pdpbox import pdp, get_dataset

test_titanic = get_dataset.titanic()
titanic_data = test_titanic['data']
titanic_target = test_titanic['target']
titanic_features = test_titanic['features']
titanic_model = test_titanic['xgb_model']

pdp_sex = pdp.pdp_isolate(model=titanic_model,
                          dataset=titanic_data,
                          model_features=titanic_features,
                          feature='Sex')
fig, axes = pdp.pdp_plot(pdp_isolate_out=pdp_sex, feature_name='sex')

Which XGBoost version do I need?

cc @prerna135

pdp.pdp_plot for multiple features

Is there a way to plot multiple features using the pdp.pdp_plot function? Currently, as I understand it, the function generates a plot for an individual feature and returns a matplotlib figure and axes. It is hard to manage the individual axes, assign them to a new figure, and compile all the axes into one figure.

unexpected keyword argument 'dataset' using pdp.pdp_interact()

[attached screenshot of the error]

Still the same data ('Rossmann Store Sales') and the exact same code as the tutorial:

ross_data = test_ross['data']
ross_features = test_ross['features']
ross_model = test_ross['rf_model']
ross_target = test_ross['target']

%%time
inter_rf = pdp.pdp_interact(
    model=ross_model, dataset=ross_data, model_features=ross_features, 
    features=['weekofyear', ['StoreType_a', 'StoreType_b', 'StoreType_c', 'StoreType_d']]
)

'dataset' is not recognized as a keyword. I use Python 2.7 on Ubuntu 16.04.4 LTS.

Thank you!

Example Request: Run with own data set

When looking at the examples, we see various options where PDPbox works on self-included datasets. They all have a model stored in there. Is there a particular reason for this?

Furthermore, it is not clear how to run PDPbox on any other data frame. Should we provide such a model ourselves?

cc @prerna135

ImportError: cannot import name 'get_dataset'

Hi,

When I tried importing get_dataset, I got the import error below. Please advise.

ImportError Traceback (most recent call last)
in ()
----> 1 from pdpbox import get_dataset

ImportError: cannot import name 'get_dataset'

Thanks

one-hot encoding feature should contain more than 1 element

Hi,
I ran a multiclass XGB model with one target column and executed the following pdp_isolate code. I got the error: "ValueError: one-hot encoding feature should contain more than 1 element". How can I fix it? Thank you in advance.

pdp_xgb = pdp.pdp_isolate(
    model=xgb1, dataset=data, model_features=features, feature=['dis_suwK']
)

Allow keywords for predict method

Some non-native implementations of the sklearn interface (most notably XGBoost) allow for keywords to be passed to the model's predict or predict_proba methods.

Concrete example: if you train an XGBoost model with early stopping, you need to specify ntree_limit = clf.best_ntree_limit in order to get the score that actually corresponds to the best model, rather than just the last iteration.

It would be nice to allow such predict keywords to be passed in when calling pdp methods.

I have implemented this feature in a local branch, and would be happy to submit a PR.
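
For example, the call might look like this once such a pass-through exists; a sketch using placeholder names (clf, X), with predict_kwds being the parameter name that appears in the 0.2.x signatures quoted elsewhere in this thread:

pdp_result = pdp.pdp_isolate(
    model=clf,
    dataset=X,
    model_features=X.columns.tolist(),
    feature='some_feature',                              # placeholder feature name
    predict_kwds={'ntree_limit': clf.best_ntree_limit},  # forwarded to predict / predict_proba
)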

Generating plots with sklearn Pipeline objects

Thanks for creating such a tool for partial dependence plots in Python. I do find an issue, though. Right now in my project, the trained model is wrapped as a pipeline. Incoming data has a handful of features, and the categorical ones are transformed into one-hot columns by preprocessors within the pipeline object. PDPbox works fine when I'm calling the pipeline with a numerical feature that is available in the test dataframe. However, things get interesting when I'm trying to plot a one-hot encoded categorical feature:

  1. Cannot pass the original dataframe and the list of one-hot encoded feature names: the feature names are not found in the dataframe.
  2. Cannot pass the transformed dataframe (by first extracting the preprocessor from the sklearn pipeline and applying it on the data) and the list of one-hot encoded feature names: the package only accepts Pandas dataframe (error message: ValueError: only accept pandas DataFrame)
  3. Cannot pass the original dataframe and the original name of the feature: as the feature is one-hot encoded in the pipeline, plots cannot be generated correctly.

Is there a way to better support sklearn Pipeline objects? Ideally, users should be able to pass a pipeline and one-hot encoded feature names as arguments.
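
A possible workaround sketch for case 2, under the assumption that the preprocessor step can be pulled out of the pipeline and that the transformed column names can be recovered: wrap the transformed array back into a pandas DataFrame before handing it to PDPbox. The step names 'preprocess' and 'clf', transformed_feature_names, and onehot_cols are all hypothetical placeholders.

import pandas as pd

preprocessor = pipeline.named_steps['preprocess']   # assumed step name
estimator = pipeline.named_steps['clf']             # assumed step name

X_arr = preprocessor.transform(X)
if hasattr(X_arr, 'toarray'):            # densify if the transform returns a sparse matrix
    X_arr = X_arr.toarray()
X_trans = pd.DataFrame(X_arr, columns=transformed_feature_names, index=X.index)

pdp_cat = pdp.pdp_isolate(
    model=estimator,                      # the bare estimator, not the whole pipeline
    dataset=X_trans,
    model_features=list(X_trans.columns),
    feature=onehot_cols,                  # e.g. ['color_red', 'color_green', 'color_blue']
)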

pdp_isolate_obj, pdp_interact_obj don't pickle

If you try to pickle a pdp_isolate_obj you get a PicklingError:

PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed

The reason is that currently the model's predict (or predict_proba) method is added as a member of the object, and pickling bound instance methods is verboten.

As far as I can tell, there's no reason to add the predict method to the class. The pdp_isolate_obj.predict member isn't used anywhere in the code, and it could quite easily be reconstructed from the model if it were needed. I'd propose to simply remove this member. Happy to submit a PR, if desired.
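
Until that change lands, a workaround sketch consistent with the description above: strip the bound method off the instance before pickling. Attribute names are assumed from this issue, and the argument names follow the older train_X-style API used in other issues here.

import pickle

p = pdp.pdp_isolate(model=clf, train_X=X, feature='x1')

# Remove the bound predict / predict_proba method attached to the instance,
# since bound methods cannot be pickled.
for attr in ('predict', 'predict_proba'):
    if attr in p.__dict__:
        del p.__dict__[attr]

with open('pdp_isolate.pkl', 'wb') as f:
    pickle.dump(p, f)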

PDP for One hot encoded features

I have a one-hot encoded feature with a resulting numeric datatype. When I plot the PDP for this feature, I get a weird plot representing nothing, as below:
[attached screenshot of the resulting plot]

The plot works fine for other numeric feature columns; it is only not working for this OHE feature. Any suggestions?

ValueError: No objects to concatenate error using pdp.pdp_isolate

I've got pdp_isolate to work on other features, but it throws an exception "ValueError: No objects to concatenate" when plotted for column "f66". Does this mean there isn't enough information to make a plot of this feature?

The code that produces the error is:

pdp_feature = pdp.pdp_isolate(
    model=model, 
    dataset=generic_features, 
    model_features=generic_features.columns.tolist(), 
    feature="f66",
    num_grid_points=10, grid_type='percentile', percentile_range=None, grid_range=None, cust_grid_points=None, memory_limit=0.5, n_jobs=1, predict_kwds={}, data_transformer=None
)

Error printout is:

ValueError                                Traceback (most recent call last)
<ipython-input-48-65f3cae83da8> in <module>
      7         model_features=self._xgb_col_names,
      8         feature=''.join(['f', str(self._featureName_to_featureIdx_map[feature])]),
----> 9         num_grid_points=10, grid_type='percentile', percentile_range=None, grid_range=None, cust_grid_points=None, memory_limit=0.5, n_jobs=1, predict_kwds={}, data_transformer=None
     10     )

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pdpbox\pdp.py in pdp_isolate(model, dataset, model_features, feature, num_grid_points, grid_type, percentile_range, grid_range, cust_grid_points, memory_limit, n_jobs, predict_kwds, data_transformer)
    165             ice_lines.append(ice_line_n_class)
    166     else:
--> 167         ice_lines = pd.concat(grid_results, axis=1)
    168
    169     # calculate the counts

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    226                        keys=keys, levels=levels, names=names,
    227                        verify_integrity=verify_integrity,
--> 228                        copy=copy, sort=sort)
    229     return op.get_result()
    230

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    260
    261         if len(objs) == 0:
--> 262             raise ValueError('No objects to concatenate')
    263
    264         if keys is None:

ValueError: No objects to concatenate

The array of data is:

array([ 1.42646187e-04,  2.20339505e-03, -3.71780779e-02, -3.07990126e-02,
       -1.16102087e-03, -1.56650202e-02, -2.06472276e-03, -2.08325083e-02,
        1.91286310e-02,  8.36141875e-03, -8.11077609e-04, -1.92386611e-02,
        2.00920603e-02, -3.85844310e-02, -3.05273896e-03,  1.50174930e-03,
        4.46882570e-03, -2.53156515e-02,  7.88625472e-03, -4.30667359e-03,
       -9.39565665e-03,  8.74410281e-04,  5.34033060e-02, -8.75319328e-03,
       -9.87543920e-03, -5.46208786e-03, -6.50628504e-03, -7.57054927e-03,
       -3.93052166e-03,  2.16708143e-03,  1.42646187e-04, -7.25448253e-03,
       -1.06866675e-02,  3.59743997e-02, -3.22111433e-03, -7.26964438e-03,
        2.44697544e-02,  1.66259945e-02, -3.95567627e-03,  1.93527167e-02,
        1.64243560e-03,  2.66140257e-02,  3.36265867e-02, -1.45173875e-03,
       -6.50628504e-03,  2.01777145e-02,  5.00185982e-03, -2.68591401e-02,
        5.35982088e-03,  7.15823115e-02, -4.28643707e-03,  1.62057894e-02,
       -6.83444130e-03,  5.02746656e-02,  2.21616214e-03, -9.68313760e-03,
       -3.70581025e-03,  8.12628710e-03, -3.70581025e-03, -8.75319328e-03,
        9.87255512e-03,  6.73500014e-03,  5.45383777e-02, -3.86630195e-03,
        2.85044068e-03,  3.06522130e-02,  3.29460041e-03,  9.06824300e-03,
       -3.16553335e-03,  6.66199922e-03,  5.24429083e-04, -3.71585490e-03,
        6.27464537e-03, -8.96973965e-03,  8.17747652e-03, -5.51070443e-04,
        5.00562964e-03, -2.66748822e-02,  1.38245766e-02,  6.50494563e-04,
       -9.44020712e-03,  1.96937834e-02, -2.77638032e-03, -1.82182974e-04,
        7.87300556e-03,  4.61156800e-03, -4.69126662e-02,  1.10852780e-02,
        9.89909316e-04,  3.60295491e-04,  1.57297735e-02,  4.62766075e-02,
        1.33140739e-02,  6.19039552e-03,  1.39541582e-02, -1.69532139e-03,
       -1.16102087e-03,  3.80017680e-03, -6.50628504e-03,  6.10508974e-03,
        1.15785507e-03,  1.68395233e-02,  3.28879539e-03,  1.30998048e-02,
       -7.75223512e-03,  1.76696777e-02, -2.06107510e-03,  2.55053545e-02,
       -4.71198937e-02,  6.10950021e-02,  4.04423462e-05, -3.27229081e-02,
       -1.28115640e-03, -8.02660978e-02,  4.14655803e-02, -2.31697945e-02,
        3.25156125e-02,  3.42166041e-02, -8.96781154e-02,  1.60982957e-03,
        4.81951811e-02,  1.33224275e-03,  1.14265095e-02, -2.24592418e-02,
       -2.36253054e-03,  2.93535839e-02,  1.33670513e-02,  6.95616360e-03,
       -4.05554484e-02, -1.17983934e-02, -4.75658884e-03, -1.99465422e-02,
        1.67505552e-02, -3.21242621e-03,  4.16127660e-03, -3.64498680e-02,
       -2.67821052e-03, -1.50303162e-02,  1.17175118e-02,  1.08760602e-02,
       -5.16998437e-03, -2.69782135e-04, -2.70069166e-03, -4.10286181e-02,
        1.24384512e-03,  1.07679725e-03, -2.89405080e-02, -1.72169721e-03,
        6.27027032e-02, -5.21718738e-04,  3.55731258e-02, -2.26476632e-02,
        1.39934603e-02,  1.19824431e-02, -1.38995866e-02,  4.92600824e-03,
       -7.93047226e-03,  4.36035646e-02, -1.69812569e-04, -6.50628504e-03,
       -7.92308831e-02,  1.37241203e-02,  1.26096808e-02,  1.72127291e-03,
       -2.10604483e-02, -3.86630195e-03,  1.76504359e-02, -6.50628504e-03,
       -4.36824084e-03,  3.47388067e-03,  2.25694330e-03, -1.09188288e-02,
       -2.17498918e-02, -2.32474048e-02,  2.99691884e-02, -2.52163580e-02,
        1.43217263e-02, -8.43296067e-04,  2.70521511e-02, -4.18762082e-02,
       -1.55742569e-02,  9.67830047e-02,  3.12716409e-02,  1.42646187e-04,
        3.46233434e-02, -4.70378050e-03, -1.00597171e-02, -3.10874944e-03,
        8.58906318e-04, -5.80679408e-04, -1.04566714e-02,  7.34680904e-03,
        1.05924652e-02, -2.06613277e-02, -3.16553335e-03,  1.77879256e-03,
       -4.37393282e-02, -5.83798230e-03, -6.18573615e-03, -1.14453858e-02,
       -7.59866074e-03, -2.21871359e-02, -3.16553335e-03, -1.00102993e-04,
        1.18235288e-03,  8.06930693e-03, -1.96816126e-04,  1.84993239e-02,
        2.13461349e-02,  7.07266741e-03,  1.42646187e-04, -5.47744218e-03,
       -1.03254946e-02, -1.73400374e-02, -2.58117361e-04,  1.24010993e-02,
        3.92948733e-03, -1.82873090e-02,  1.75378801e-04, -2.85078366e-03,
        5.82887766e-03, -9.36706224e-03,  2.93535839e-02,  1.69861258e-02,
        1.29833871e-02, -2.34977297e-03,  4.12969746e-04, -4.58429383e-03,
       -1.97912848e-02, -1.71597336e-02,  1.00754861e-02,  1.46873022e-02,
        1.29070842e-02, -8.75319328e-03, -2.44563897e-02, -2.40753008e-02,
       -2.29969049e-03,  5.32787323e-03, -9.60620144e-04,  5.82650104e-03,
        1.23422518e-03,  1.23086831e-02, -4.98560347e-03, -2.14837586e-03,
       -2.19718025e-03,  3.04158477e-03,  1.42646187e-04, -6.03845600e-03,
        2.68922183e-03,  4.90456739e-03, -2.64048818e-02, -4.25583690e-03,
        1.13904307e-03,  1.82139863e-02,  5.86379367e-03,  3.09502502e-02,
        1.04408664e-02,  3.30992549e-02,  3.26630269e-03, -8.84241306e-03,
        2.53524525e-03, -1.93375418e-02, -1.65741420e-03,  3.01436365e-02,
       -3.25712075e-04, -3.89183492e-02, -2.70689577e-03,  1.42646187e-04,
       -4.11961023e-02, -1.19694286e-02,  3.78084147e-03,  5.03621425e-03,
       -1.31770086e-03,  1.73333962e-03,  1.83711198e-02,  4.24728106e-03,
        4.81215520e-03, -2.05966443e-03,  1.55374284e-02,  2.09652199e-04,
       -5.55169662e-02,  1.13634914e-02,  2.09302581e-02,  1.42646187e-04,
        1.55923464e-02,  5.68731417e-03, -3.66920155e-03,  3.35823081e-02,
        2.59721886e-02,  1.55376680e-02,  1.98637057e-03, -5.16036812e-02,
        2.23336166e-02,  8.63353952e-03,  7.94881900e-03, -6.61724505e-04,
        1.48939556e-02,  1.56944799e-02, -1.52858182e-02,  1.28012873e-03,
       -1.90668483e-02, -6.50628504e-03,  7.59052706e-03, -3.16190751e-02,
       -2.24662311e-02,  1.34058745e-03, -2.34977297e-03,  6.91485104e-03,
        2.07674911e-02, -3.43720015e-03,  5.12697636e-03,  5.68418815e-03,
       -1.98302377e-02,  3.17619165e-03,  2.87206334e-02, -4.16068265e-04,
       -1.70174842e-02, -4.25886490e-03, -1.90537711e-02,  1.40145629e-02,
        1.91054495e-03, -1.76012040e-02,  5.62004485e-02, -7.56296167e-03,
        1.93897412e-02, -1.08269795e-03, -4.12494685e-02, -8.52674004e-02,
        2.93535839e-02, -2.34977297e-03, -4.35269843e-02,  8.49059462e-04,
       -3.10987210e-02, -2.11906108e-03,  2.75347206e-02, -1.16102087e-03,
       -2.14935108e-04, -9.37163282e-04,  2.08917108e-02, -1.49624887e-04,
       -1.20360853e-03, -1.84003701e-02,  5.69942226e-02, -6.51881530e-03,
        1.19581645e-02,  9.34846000e-03,  1.95967201e-02,  1.83098260e-02,
        1.50381701e-02, -1.01716106e-02,  1.19326200e-02, -8.45938115e-05,
       -5.39846427e-02,  4.12628460e-02,  5.94171647e-02, -3.16112209e-02,
       -1.04566714e-02,  2.13981361e-03,  2.17117295e-03,  5.24219656e-03,
       -7.49931840e-04, -7.69360893e-02,  1.04539920e-03, -1.25689487e-02,
       -1.11945172e-02, -1.06232041e-02,  7.95284125e-03, -1.19005699e-03,
       -7.01669455e-03,  1.35516141e-03, -1.18424677e-02,  7.34416499e-03,
        1.82665318e-02, -6.50628504e-03, -1.66917762e-04,  7.93482177e-03,
        7.15381390e-02,  7.11127527e-05, -1.48615269e-03,  2.64163194e-02,
       -4.44932982e-02, -2.34977297e-03,  4.55293195e-02, -8.35069547e-03,
        3.70642948e-03,  1.53445914e-02,  5.59201845e-03, -4.88137601e-02,
        1.52571208e-03,  2.52508119e-03, -1.16102087e-03,  1.21301764e-02,
        7.62865196e-03, -9.38930734e-03, -1.87311721e-03,  1.73434631e-02,
       -9.26327361e-03, -7.73302270e-03, -8.66310438e-03,  1.59627685e-02,
        2.15100444e-02, -2.01720393e-02, -6.26028935e-03, -1.04566714e-02,
        1.94997264e-02,  7.40022891e-03,  6.63133282e-02, -2.86199529e-02,
        1.88261116e-03,  1.98344906e-03,  2.16258155e-03,  1.30730564e-02,
       -1.16102087e-03, -2.62273463e-02, -6.53665300e-03, -2.45184235e-02,
       -4.12609788e-02, -2.54090496e-03, -2.54805998e-02,  6.54251064e-03,
       -6.50628504e-03,  1.90305898e-02,  3.77064587e-02,  7.18832359e-03,
       -3.14250026e-04, -3.75230486e-02, -3.92870243e-03,  1.00036256e-02,
        1.99581498e-02, -1.45951448e-02,  3.94904799e-04, -1.47333697e-02,
       -1.12384254e-02, -2.02326112e-02,  1.89755598e-02, -5.89852839e-03,
        7.44942716e-03, -1.07168755e-02, -1.68000304e-02,  1.22239083e-03])

predict_proba() argument after ** must be a mapping, not NoneType

Hi,

I just installed the latest version. I am running your tutorial script "pdpbox_binary_classification.ipynb".

When running step 1.2

fig, axes, summary_df = info_plots.actual_plot(
    model=titanic_model, X=titanic_data[titanic_features], feature='Sex', feature_name='gender'
)

I got an error as below:

TypeError Traceback (most recent call last)
in ()
1 fig, axes, summary_df = info_plots.actual_plot(
----> 2 model=titanic_model, X=titanic_data[titanic_features], feature='Sex', feature_name='gender'
3 )

~/.local/lib/python3.6/site-packages/pdpbox/info_plots.py in actual_plot(model, X, feature, feature_name, num_grid_points, grid_type, percentile_range, grid_range, cust_grid_points, show_percentile, show_outliers, endpoint, which_classes, predict_kwds, ncols, figsize, plot_params)
289 # make predictions
290 # info_df only contains feature value and actual predictions
--> 291 prediction = predict(X, **predict_kwds)
292 info_df = X[_make_list(feature)]
293 actual_prediction_columns = ['actual_prediction']

TypeError: predict_proba() argument after ** must be a mapping, not NoneType

I tried a different model and got the same error here.
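
One workaround consistent with the traceback (predict_kwds defaults to None and is then unpacked with **) is to pass an explicit empty mapping, as some other snippets in this thread do; a sketch:

fig, axes, summary_df = info_plots.actual_plot(
    model=titanic_model, X=titanic_data[titanic_features],
    feature='Sex', feature_name='gender',
    predict_kwds={},  # explicit empty dict instead of the None default
)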

pdp_interact_plot dimension reference subplot out of alignment.

Here is my code to reproduce the problem:

from pdpbox import pdp, get_dataset, info_plots
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Setup data
data = load_iris()
df = pd.DataFrame(data.data, columns = data.feature_names)
df.index = data.target

# Train basic model
estimator = RandomForestClassifier()
model = estimator.fit(df, df.index)

#  pdp_interactions
pdp_paid= pdp.pdp_interact(
    model=model, dataset=df, model_features=df.columns, features=df.columns, 
    num_grid_points=[5, 5, 5], 
    percentile_ranges=[None, None, None], 
    n_jobs=4
)

# plotting
fig, axes = pdp.pdp_interact_plot(
    pdp_paid, ['petal length (cm)', 'petal width (cm)'], plot_type='grid',x_quantile=True, ncols=2, plot_pdp=True, 
    which_classes=[0, 1, 2]
)

[attached screenshot of the misaligned subplots]

  • pdpbox.__version__ == 0.2.0
  • matplotlib.__version__ == 3.0.2

The problem is that in your reference docs, the subplots that show the dimensional values to the left of and above each class plot are aligned with the grid of the figure, but here they seem to be squished. I can probably figure out how to reference the axes or the figure directly and correct them, but is this expected? Any easy fix?

Thanks! Great library!

Fails for Keras Model

I understand that this package supports all scikit-learn models, and PDPs should technically work as long as the model.predict function works; however, I get this error when trying to run pdp_isolate for a Keras model. Can you confirm it's the model that's causing this error? I was able to run it successfully for a scikit-learn RF model.
[attached screenshot of the error]

ValueError: cannot reindex from a duplicate axis

I have a couple of features which are scaled between 0 and 1. For all of those I get a "ValueError: cannot reindex from a duplicate axis". I assume that when creating the columns for the different values of a feature, some rounding happens in their naming, which results in several columns having the same name, although I couldn't trace the error back in the code. Multiplying the column by 10 solves the problem but is of course unintended.

The error message below.

Thanks for this beautiful package.

/home/cdsw/.local/lib/python3.6/site-packages/pdpbox/pdp.py in pdp_plot(pdp_isolate_out, feature_name, center, plot_org_pts, plot_lines, frac_to_plot, cluster, n_cluster_centers, cluster_method, x_quantile, figsize, ncols, plot_params, multi_flag, which_class)
546 _pdp_plot(pdp_isolate_out=pdp_isolate_out, feature_name=feature_name, center=center, plot_org_pts=plot_org_pts, plot_lines=plot_lines,
547 frac_to_plot=frac_to_plot, cluster=cluster, n_cluster_centers=n_cluster_centers, cluster_method=cluster_method, x_quantile=x_quantile,
--> 548 ax=ax2, plot_params=plot_params)
549
550

/home/cdsw/.local/lib/python3.6/site-packages/pdpbox/pdp.py in _pdp_plot(pdp_isolate_out, feature_name, center, plot_org_pts, plot_lines, frac_to_plot, cluster, n_cluster_centers, cluster_method, x_quantile, ax, plot_params)
616 pdp_y -= pdp_y[0]
617 for col in display_columns[1:]:
--> 618 ice_lines[col] -= ice_lines[display_columns[0]]
619 ice_lines['actual_preds'] -= ice_lines[display_columns[0]]
620 ice_lines[display_columns[0]] = 0

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other)
895
896 def f(self, other):
--> 897 result = method(self, other)
898
899 # this makes sure that we are aligned like the input

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other, axis, level, fill_value)
1552 return _combine_series_frame(self, other, na_op,
1553 fill_value=fill_value, axis=axis,
-> 1554 level=level, try_cast=True)
1555 else:
1556 if fill_value is not None:

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/ops.py in _combine_series_frame(self, other, func, fill_value, axis, level, try_cast)
1437 # default axis is columns
1438 return self._combine_match_columns(other, func, level=level,
-> 1439 try_cast=try_cast)
1440
1441

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/frame.py in _combine_match_columns(self, other, func, level, try_cast)
4767 def _combine_match_columns(self, other, func, level=None, try_cast=True):
4768 left, right = self.align(other, join='outer', axis=1, level=level,
-> 4769 copy=False)
4770
4771 new_data = left._data.eval(func=func, other=right,

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/frame.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
3548 method=method, limit=limit,
3549 fill_axis=fill_axis,
-> 3550 broadcast_axis=broadcast_axis)
3551
3552 @appender(_shared_docs['reindex'] % _shared_doc_kwargs)

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/generic.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
7364 copy=copy, fill_value=fill_value,
7365 method=method, limit=limit,
-> 7366 fill_axis=fill_axis)
7367 else: # pragma: no cover
7368 raise TypeError('unsupported type: %s' % type(other))

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/generic.py in _align_series(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis)
7461
7462 if lidx is not None:
-> 7463 fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
7464 else:
7465 raise ValueError('Must specify axis=0 or 1')

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/internals.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy)
4412 # some axes don't allow reindexing with dups
4413 if not allow_dups:
-> 4414 self.axes[axis]._can_reindex(indexer)
4415
4416 if axis >= self.ndim:

/home/cdsw/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3558 # trying to reindex on an axis with duplicates
3559 if not self.is_unique and len(indexer):
-> 3560 raise ValueError("cannot reindex from a duplicate axis")
3561
3562 def reindex(self, target, method=None, level=None, limit=None,

ValueError: cannot reindex from a duplicate axis

info_plots.actual_plot() got an error

When I execute the following code, just like the binary_classification tutorial:

fig, axes, summary_df = info_plots.actual_plot(
    model=titanic_model, X=titanic_data[titanic_features], feature=['Embarked_C', 'Embarked_S', 'Embarked_Q'],
    feature_name='embarked'
)

I got the following error:

TypeError: predict_proba() argument after ** must be a mapping, not NoneType

I also tried lgb.LGBMClassifier and a raw LightGBM model on my own data but got the same error.
Does anyone know how to fix it?

Upstream to sklearn?

Hey.
Do you have any interest in upstreaming part of this to scikit-learn?
We had a PR here: scikit-learn/scikit-learn#5653
but it's a bit stalled.
We probably don't want to have as much plotting code as you do, but some basics in sklearn would be cool.

PDPbox saved XGBoost models do not play well with latest XGBoost

I am trying to execute the code:

from pdpbox import pdp, get_dataset, info_plots
test_titanic = get_dataset.titanic()

And I'm getting the error below.
PDP 0.2.0+13.g73c6966
XGBoost 1.1.0-SNAPSHOT
conda environment

Stacktrace:

XGBoostError                              Traceback (most recent call last)
<ipython-input-2-931a5e8d7b9f> in <module>
----> 1 test_titanic = get_dataset.titanic()

~/anaconda3/lib/python3.6/site-packages/PDPbox-0.2.0+13.g73c6966-py3.6.egg/pdpbox/get_dataset.py in titanic()
      7 
      8 def titanic():
----> 9         dataset = joblib.load(os.path.join(DIR, 'datasets/test_titanic.pkl'))
     10         return dataset
     11 

~/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
    603                     return load_compatibility(fobj)
    604 
--> 605                 obj = _unpickle(fobj, filename, mmap_mode)
    606 
    607     return obj

~/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
    527     obj = None
    528     try:
--> 529         obj = unpickler.load()
    530         if unpickler.compat_mode:
    531             warnings.warn("The file '%s' has been generated with a "

~/anaconda3/lib/python3.6/pickle.py in load(self)
   1048                     raise EOFError
   1049                 assert isinstance(key, bytes_types)
-> 1050                 dispatch[key[0]](self)
   1051         except _Stop as stopinst:
   1052             return stopinst.value

~/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py in load_build(self)
    340         NDArrayWrapper is used for backward compatibility with joblib <= 0.9.
    341         """
--> 342         Unpickler.load_build(self)
    343 
    344         # For backward compatibility, we support NDArrayWrapper objects.

~/anaconda3/lib/python3.6/pickle.py in load_build(self)
   1505         setstate = getattr(inst, "__setstate__", None)
   1506         if setstate is not None:
-> 1507             setstate(state)
   1508             return
   1509         slotstate = None

~/anaconda3/lib/python3.6/site-packages/xgboost/core.py in __setstate__(self, state)
   1096             ptr = (ctypes.c_char * len(buf)).from_buffer(buf)
   1097             _check_call(
-> 1098                 _LIB.XGBoosterUnserializeFromBuffer(handle, ptr, length))
   1099             state['handle'] = handle
   1100         self.__dict__.update(state)

~/anaconda3/lib/python3.6/site-packages/xgboost/core.py in _check_call(ret)
    187     """
    188     if ret != 0:
--> 189         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    190 
    191 

XGBoostError: [18:53:06] /home/sergey/xgboost/src/learner.cc:834: Check failed: header == serialisation_header_: 

  If you are loading a serialized model (like pickle in Python) generated by older
  XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version.  There's a simple script for helping
  the process. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for reference to the script, and more details about differences between saving model and
  serializing.


Stack trace:
  [bt] (0) /home/sergey/anaconda3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x64) [0x7fe81e08c784]
  [bt] (1) /home/sergey/anaconda3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerIO::Load(dmlc::Stream*)+0x674) [0x7fe81e19f444]
  [bt] (2) /home/sergey/anaconda3/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUnserializeFromBuffer+0x5e) [0x7fe81e07f61e]
  [bt] (3) /home/sergey/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe84c23d630]
  [bt] (4) /home/sergey/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fe84c23cfed]
  [bt] (5) /home/sergey/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7fe84b3c509e]
  [bt] (6) /home/sergey/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x13ad5) [0x7fe84b3c5ad5]
  [bt] (7) /home/sergey/anaconda3/bin/python -m ipykernel -f /home/sergey/.local/share/jupyter/runtime/kernel-813f0269-7bc5-4ef8-b890-fb9b799698ce.json(_PyObject_FastCallDict+0x8b) [0x559094256f8b]
  [bt] (8) /home/sergey/anaconda3/bin/python -m ipykernel -f /home/sergey/.local/share/jupyter/runtime/kernel-813f0269-7bc5-4ef8-b890-fb9b799698ce.json(+0x1a162e) [0x5590942e562e]

PDP plot: use train or test set in pdp_isolate()?

Hi @SauceCat, a quick conceptual question:

Say I selected a given feature to analyze in my multiclass classifier.

In the pdp.pdp_isolate() function for the PDP plot, when would it make sense to use the train set versus the test set for the dataset parameter?

Initially, I'd say it is more complete to build two PDP plots for the same feature, one using the train set and another using the test set, so you can verify whether that feature has an equivalent impact on both sets. But I am interested in your thoughts.

Regards, Fernando

Submit to pypi

Thanks for your great library! I'm using it in a course I'm teaching, so to make things easy, I'd like to make it installable from pypi. I'm happy to submit it, so you don't have to worry about it - but I figured I'd double-check that there wasn't any reason you wanted to avoid posting it to pypi (or whether you'd rather do it yourself). If I don't hear from you, I'll assume it's OK - but just ping me if you have any issues! :)

Fontsize/Label error in pdp.pdp_interact_plot when contour = True

This command works fine and produces the expected results:

fig, axes = pdp.pdp_interact_plot(
    pdp_interact_out = inter1,
    feature_names=['NOx', 'NO_2'],
    plot_type='grid'
)

[attached screenshot of the resulting grid plot]

However, changing only plot_type to 'contour' gives an error related to the labels and the font size. The figure appears label-less at the bottom after this error. Any guess or help is appreciated.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-363-7b31c15b4793> in <module>()
      2     pdp_interact_out = inter1,
      3     feature_names=['NOx', 'NO_2'],
----> 4     plot_type='contour'
      5 )

/Users/jsg/Documents/DrivenData_Cold_Forecast/venv/lib/python3.6/site-packages/pdpbox/pdp.py in pdp_interact_plot(pdp_interact_out, feature_names, plot_type, x_quantile, plot_pdp, which_classes, figsize, ncols, plot_params)
    773             fig.add_subplot(inter_ax)
    774             _pdp_inter_one(pdp_interact_out=pdp_interact_plot_data[0], inter_ax=inter_ax, norm=None,
--> 775                            feature_names=feature_names_adj, **inter_params)
    776     else:
    777         wspace = 0.3

/Users/jsg/Documents/DrivenData_Cold_Forecast/venv/lib/python3.6/site-packages/pdpbox/pdp_plot_utils.py in _pdp_inter_one(pdp_interact_out, feature_names, plot_type, inter_ax, x_quantile, plot_params, norm, ticks)
    330             # for numeric not quantile
    331             X, Y = np.meshgrid(pdp_interact_out.feature_grids[0], pdp_interact_out.feature_grids[1])
--> 332         im = _pdp_contour_plot(X=X, Y=Y, **inter_params)
    333     elif plot_type == 'grid':
    334         im = _pdp_inter_grid(**inter_params)

/Users/jsg/Documents/DrivenData_Cold_Forecast/venv/lib/python3.6/site-packages/pdpbox/pdp_plot_utils.py in _pdp_contour_plot(X, Y, pdp_mx, inter_ax, cmap, norm, inter_fill_alpha, fontsize, plot_params)
    249     c1 = inter_ax.contourf(X, Y, pdp_mx, N=level, origin='lower', cmap=cmap, norm=norm, alpha=inter_fill_alpha)
    250     c2 = inter_ax.contour(c1, levels=c1.levels, colors=contour_color, origin='lower')
--> 251     inter_ax.clabel(c2, contour_label_fontsize=fontsize, inline=1)
    252     inter_ax.set_aspect('auto')
    253 

/Users/jsg/Documents/DrivenData_Cold_Forecast/venv/lib/python3.6/site-packages/matplotlib/axes/_axes.py in clabel(self, CS, *args, **kwargs)
   6221 
   6222     def clabel(self, CS, *args, **kwargs):
-> 6223         return CS.clabel(*args, **kwargs)
   6224     clabel.__doc__ = mcontour.ContourSet.clabel.__doc__
   6225 

TypeError: clabel() got an unexpected keyword argument 'contour_label_fontsize'

[attached screenshot of the label-less contour plot]

Thank you in advance. Awesome library by the way!
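
A hedged local patch sketch: matplotlib's clabel takes fontsize (there is no contour_label_fontsize keyword in matplotlib 3.x), so editing the call flagged in the traceback along these lines avoids the TypeError:

# in pdpbox/pdp_plot_utils.py, _pdp_contour_plot (per the traceback above):
inter_ax.clabel(c2, fontsize=fontsize, inline=1)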

Display results using inverse logit function

I find this library to be incredibly useful, though I would like to know if there are ways to customize the pdp_plot a bit more. Specifically, I would like to be able to:

  1. Display the values on the distribution of data points as percentages.
  2. Adjust the results using the inverse logit function so that the values are on a 0 to 1 scale rather than -1 to 1 scale. This would allow the ability to interpret results as more of a probability, which is easier for non-technical stakeholders to understand.

I don't see a way to do either of these at present, but it would be incredibly helpful to have these as optional arguments.
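
Until such arguments exist, point 2 can be approximated by post-processing the computed values yourself; a sketch, assuming the model's raw PDP output is on a log-odds/margin scale and using the .pdp and .feature_grids attributes referenced in other issues here:

import numpy as np
from scipy.special import expit  # inverse logit

p = pdp.pdp_isolate(model=clf, dataset=X, model_features=list(X.columns), feature='age')  # 'age' is a placeholder
pdp_probs = expit(np.asarray(p.pdp))  # map margin values onto a 0-1 probability scale
print(list(zip(p.feature_grids, pdp_probs)))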

Couldn't plot using info_plots.target_plot

[attached screenshot of the error and the empty plot]

ross_data = test_ross['data']
ross_features = test_ross['features']
ross_model = test_ross['rf_model']
ross_target = test_ross['target']

fig, axes, summary_df = info_plots.target_plot(
    df=ross_data, feature='SchoolHoliday', feature_name='SchoolHoliday', target=ross_target
)
_ = axes['bar_ax'].set_xticklabels(['Not SchoolHoliday', 'SchoolHoliday'])

I use the exact same dataset 'Rossmann Store Sales' and the same code (Tutorial: pdpbox_regression.ipynb), but I encounter the error, and the plot is empty.

Please advise, thank you!

2D PDP plot with iso-line

It would be very handy to be able to draw the 2D PDP plot with an option for a manual iso-line.

Take the example of a credit risk application: I need to have in my plot the cutoff line for the probability of default. See the attached example. If you need further details, let me know.
[attached example plot: 2D PDP with the desired cutoff iso-line]

Not exactly an issue: dedup DataFrame

I just recently started to use this excellent repository to fill a much-needed gap in scikit-learn. A suggestion for clarity in the parameters of pdpbox.pdp.pdp_isolate: require train_X to be a deduplicated pandas DataFrame, because it caused a bit of confusion on my part when I wasn't able to plot due to indexing issues from duplicated values. It's really just as simple as df.drop_duplicates(). Thanks for all of your work!

EDIT:

Another data-checking step should be added at line 303 in pdp.py for pdp.pdp_interact. If the feature grids are not specified and default to 10, and train_X.shape[0] is less than 100, then you will get an error on line 305 since data_chunk_size will round to 0. I just needed to specify num_grid_points=[5, 5] so that it would run when train_X.shape[0] = 25.
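
A minimal sketch of both points; argument names follow the train_X-style API the report uses, and clf stands in for the fitted model, so treat these as assumptions for other versions:

train_X = train_X.drop_duplicates().reset_index(drop=True)  # avoid duplicate-index issues

P = pdp.pdp_isolate(model=clf, train_X=train_X, feature='x1')
inter = pdp.pdp_interact(model=clf, train_X=train_X, features=['x1', 'x2'],
                         num_grid_points=[5, 5])  # needed when train_X has few rows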

Support for PySpark Models

Does PDPbox support PySpark models as well, or is there any plan to release PySpark support in a future release?

x_quantile fails with binary features

When I run

obj = pdp.pdp_isolate(model, X_train, X_train.columns, 'addy_change')
pdp.pdp_plot(obj, 'addy_change', plot_pts_dist=True, x_quantile=True)

where "addy_change" is a binary variable, I get the error pasted below.

The problem seems to be that count_data['xticklabels'] doesn't exist for binary variables, but when x_quantile = True, _pdp_plot looks for that key anyway.

My use case is that I'm actually looping through a large list of variables, with x_quantile set to True for all of them. I'm wondering if it would make sense for pdp_plot to ignore x_quantile=True if the variable is binary.

Barring that, it would be helpful to have a more informative error message.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.conda/envs/checking/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'xticklabels'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-70-faa0732e94a6> in <module>()
----> 1 pdp.pdp_plot(obj, 'addy_change', plot_pts_dist=True, x_quantile=True)

~/.conda/envs/checking/lib/python3.6/site-packages/pdpbox/pdp.py in pdp_plot(pdp_isolate_out, feature_name, center, plot_pts_dist, plot_lines, frac_to_plot, cluster, n_cluster_centers, cluster_method, x_quantile, show_percentile, figsize, ncols, plot_params, which_classes)
    414 
    415             _pdp_plot(pdp_isolate_out=pdp_plot_data[0], feature_name=feature_name_adj, pdp_ax=_pdp_ax,
--> 416                       count_ax=_count_ax, **pdp_plot_params)
    417         else:
    418             pdp_ax = plt.subplot(outer_grid[1])

~/.conda/envs/checking/lib/python3.6/site-packages/pdpbox/pdp_plot_utils.py in _pdp_plot(pdp_isolate_out, feature_name, center, plot_lines, frac_to_plot, cluster, n_cluster_centers, cluster_method, x_quantile, show_percentile, pdp_ax, count_data, count_ax, plot_params)
     97             # need to plot data distribution
     98             if x_quantile:
---> 99                 count_display_columns = count_data['xticklabels'].values
    100                 # number of grids = number of bins + 1
    101                 # count_x: min -> max + 1

~/.conda/envs/checking/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

~/.conda/envs/checking/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

~/.conda/envs/checking/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

~/.conda/envs/checking/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

~/.conda/envs/checking/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'xticklabels'
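
A workaround sketch for that loop in the meantime, reusing the call shape from the snippet above: only request x_quantile for non-binary columns.

for col in X_train.columns:
    obj = pdp.pdp_isolate(model, X_train, X_train.columns, col)
    is_binary = X_train[col].nunique() <= 2
    pdp.pdp_plot(obj, col, plot_pts_dist=True, x_quantile=not is_binary)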

How to train titanic_model

When I run with my own data set, I get the following error:
AttributeError Traceback (most recent call last)
in
4 feature='sex',
5 feature_name='Gender',
----> 6 predict_kwds={}
7 )

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/pdpbox/info_plots.py in actual_plot(model, X, feature, feature_name, num_grid_points, grid_type, percentile_range, grid_range, cust_grid_points, show_percentile, show_outliers, endpoint, which_classes, predict_kwds, ncols, figsize, plot_params)
289 # make predictions
290 # info_df only contains feature value and actual predictions
--> 291 prediction = predict(X, **predict_kwds)
292 info_df = X[_make_list(feature)]
293 actual_prediction_columns = ['actual_prediction']

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
1282
1283 if validate_features:
-> 1284 self._validate_features(data)
1285
1286 length = c_bst_ulong()

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in _validate_features(self, data)
1669 """
1670 if self.feature_names is None:
-> 1671 self.feature_names = data.feature_names
1672 self.feature_types = data.feature_types
1673 else:

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/pandas/core/generic.py in getattr(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.getattribute(self, name)
5068
5069 def setattr(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'feature_names'

So I want to know how the titanic_model in the example was trained.
Thanks for your advice.
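
The traceback suggests a raw xgboost Booster is being passed (its predict expects a DMatrix, which is why feature_names is being looked up on the DataFrame). A hedged sketch of training a model the way PDPbox expects, via the sklearn wrapper on the bundled Titanic data; hyperparameters are arbitrary, and test_titanic['target'] is assumed to be the target column name, as elsewhere in this thread:

from xgboost import XGBClassifier
from pdpbox import get_dataset

test_titanic = get_dataset.titanic()
titanic_data = test_titanic['data']
titanic_features = test_titanic['features']
titanic_target = test_titanic['target']  # assumed: the name of the target column

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(titanic_data[titanic_features], titanic_data[titanic_target])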

Interpretation of PD Values in a classification setting

Hi,

I was wondering how exactly to interpret the values on the y-axis of the partial dependence plots in the case of classification. The classifier outputs probabilities between 0 and 1; however, the plot shows negative and positive values, which can also be greater than one.

Thanks in advance
