chasedehan / boostaroota

A fast xgboost feature selection algorithm

License: MIT License

Python 98.27% R 1.73%
machine-learning machine-learning-algorithms feature-selection xgboost-algorithm xgboost dimension-reduction algorithm boruta data-science datascientist datascience machinelearning

boostaroota's Introduction

BoostARoota

A Fast XGBoost Feature Selection Algorithm (plus other sklearn tree-based classifiers)

Why Create Another Algorithm?

Automated processes like Boruta showed early promise, as they were able to provide superior performance with Random Forests, but they have some deficiencies, including slow computation time, especially with high-dimensional data. Regardless of the run time, Boruta does perform well with Random Forests, but it performs poorly with other algorithms such as boosting or neural networks. Similar deficiencies occur with regularization via LASSO, elastic net, or ridge regression: these perform well for linear regression but poorly with other modern algorithms.

I am proposing and demonstrating a feature selection algorithm (called BoostARoota) in a similar spirit to Boruta, utilizing XGBoost as the base model rather than a Random Forest. The algorithm runs in a fraction of the time Boruta takes and has superior performance on a variety of datasets. While the spirit is similar to Boruta, BoostARoota takes a slightly different approach to the removal of attributes that executes much faster.

Installation

The easiest way is to install with pip:

$ pip install boostaroota

Usage

This module is built to be used in a similar manner to sklearn, with fit(), transform(), etc. The package requires X to be one-hot-encoded (OHE), so the pandas function pd.get_dummies(X) may be helpful, as it determines which variables are categorical and converts them into dummy variables. The package relies on pandas under the hood, so data must be passed in as a pandas DataFrame.

Assuming you have X and Y split, you can run the following:

from boostaroota import BoostARoota
import pandas as pd

#OHE the variables - BoostARoota may break if not done
x = pd.get_dummies(x)
#Specify the evaluation metric: can use whichever you like as long as it is recognized by XGBoost
  #EXCEPTION: multi-class currently only supports "mlogloss", so it must be passed in as eval_metric
br = BoostARoota(metric='logloss')

#Fit the model for the subset of variables
br.fit(x, y)

#Can look at the important variables - will return a pandas series
br.keep_vars_

#Then modify dataframe to only include the important variables
br.transform(x)

It's really that simple! Of course, as we build more functionality there may be a few more steps. Keep in mind that since you are one-hot-encoding, if you have a numeric variable that is imported by Python as a character, pd.get_dummies() will convert that numeric variable into many columns. This can cause your DataFrame to explode in size, giving unexpected results and long run times.
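Before encoding, it can help to check dtypes and convert any numeric-looking string columns back to numbers. A minimal sketch (the column names here are hypothetical):

import pandas as pd

#Hypothetical frame where "age" was read in as strings
x = pd.DataFrame({"age": ["23", "35", "41"], "city": ["NY", "LA", "NY"]})
print(x.dtypes)

#Convert columns that should be numeric, then OHE only the remaining categorical columns
x["age"] = pd.to_numeric(x["age"])
x = pd.get_dummies(x)
print(x.dtypes)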

New as of 1/22/2018: you can insert any sklearn tree-based learner into BoostARoota. Please be aware that this hasn't been fully tested to determine which parameters (cutoff, iterations, etc.) are optimal. Currently, that will require some trial and error on the user's part.

For example, to use another classifier, you initialize the object and then pass that object into the BoostARoota object like so:

from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier()

br = BoostARoota(clf=clf)
new_train = br.fit_transform(x, y)

You can also view a complete demo here.

Usage - Choosing Parameters

The default parameters are chosen to work well for the widest range of input dataframes. However, there are cases where other values work better; an example configuration follows the list below.

  • clf [default=None] - optional, recommended to leave empty
    • Will default to xgboost if left empty
    • For use with any tree-based learner from sklearn.
      • The default parameters are not optimal and will require user experimentation.
  • cutoff [default=4] - float (cutoff > 0)
    • Adjustment to removal cutoff from the feature importances
      • Larger values will be more conservative - if the value is set too high, very few features may end up being removed.
      • Smaller values will be more aggressive; any value above zero works (it can be a float)
  • iters [default=10] - int (iters > 0)
    • The number of iterations to average for the feature importances
      • While it will run with a value of 1, that is not recommended, as there is quite a bit of random variation
      • Smaller values will run faster as it is running through XGBoost a smaller number of times
      • Scales linearly. iters=4 takes 2x time of iters=2 and 4x time of iters=1
  • max_rounds [default=100] - int (max_rounds > 0)
    • The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features
      • Default is set high enough that it really shouldn't be reached under normal circumstances
      • You would want to set this value low if you felt that it was aggressively removing variables.
  • delta [default=0.1] - float (0 < delta <= 1)
    • Stopping criteria for whether another round is started
      • Regardless of this value, the algorithm will not progress past max_rounds
      • A value of 0.1 means that at least 10% of the features must be removed in order to move on to the next round
      • Setting higher values makes it more difficult to move to follow-on rounds (e.g. setting it to 1 guarantees only one round)
      • Setting delta too low may result in eliminating too many features; the process is still constrained by max_rounds
  • silent [default=False] - boolean
    • Set to True if you don't want to see the BoostARoota output printed. Errors or warnings will still be shown.
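For example, a more aggressive configuration than the defaults might look like the following (the values are illustrative, not recommendations; x and y are assumed to be prepared as in the Usage section):

from boostaroota import BoostARoota

br = BoostARoota(metric='logloss',  #any eval metric recognized by XGBoost
                 cutoff=2,          #lower cutoff -> more aggressive removal
                 iters=5,           #fewer averaging iterations -> faster but noisier
                 max_rounds=20,     #hard cap on the number of elimination rounds
                 delta=0.2,         #require 20% of features removed to start another round
                 silent=True)       #suppress progress output
new_x = br.fit_transform(x, y)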

How it works

Similar in spirit to Boruta, BoostARoota creates shadow features, but modifies the removal step; a condensed sketch follows the numbered steps below.

  1. One-Hot-Encode the feature set
  2. Double the width of the dataset by making a copy of all features in the original dataset
  3. Randomly shuffle the new features created in (2). These duplicated and shuffled features are referred to as "shadow features"
  4. Run the XGBoost classifier on the entire data set ten times. Running it ten times allows random noise to be smoothed out, resulting in more robust estimates of importance. The number of repeats is a parameter that can be changed.
  5. Obtain importance values for each feature. This is a simple importance metric that sums up how many times the particular feature was split on in the XGBoost algorithm.
  6. Compute "cutoff": the average feature importance value for all shadow features and divide by four. Shadow importance values are divided by four (parameter can be changed) to make it more difficult for the variables to be removed. With values lower than this, features are removed at too high of a rate.
  7. Remove features with average importance across the ten iterations that is less than the cutoff specified in (6)
  8. Go back to (2) until the number of features removed is less than ten percent of the total.
  9. Method returns the features remaining once completed.
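The steps above can be condensed into a short sketch. This is only an illustration of the idea, not the package's actual implementation, and one_round is a hypothetical helper name:

import numpy as np
import pandas as pd
import xgboost as xgb

def one_round(x, y, cutoff=4, iters=10):
    #Illustrative sketch of one elimination round (steps 2-7 above)
    real_imp = pd.DataFrame(index=x.columns)
    shadow_means = []
    for i in range(iters):
        #Steps 2-3: duplicate every feature, then shuffle each duplicate
        shadow = x.apply(np.random.permutation)
        shadow.columns = ['shadow_' + str(c) for c in x.columns]
        both = pd.concat([x, shadow], axis=1)
        #Steps 4-5: fit XGBoost and record split-count importances
        model = xgb.XGBClassifier(n_estimators=100, importance_type='weight')
        model.fit(both, y)
        imp = pd.Series(model.feature_importances_, index=both.columns)
        real_imp['iter' + str(i)] = imp[x.columns]
        shadow_means.append(imp[shadow.columns].mean())
    #Step 6: threshold = average shadow importance divided by the cutoff
    threshold = np.mean(shadow_means) / cutoff
    #Step 7: keep features whose averaged importance beats the threshold
    keep = real_imp.mean(axis=1) > threshold
    return keep[keep].index.tolist()

The real algorithm repeats rounds like this one until fewer than delta (10% by default) of the remaining features are removed in a round, or max_rounds is hit.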

Algorithm Performance

BoostARoota is shortened to BAR, and the table below uses the LSVT dataset from the UCI repository. The algorithm has been tested on other datasets as well. If you are interested in the specifics of the testing, please take a look at the testBAR.py script. The basics: each dataset is run through 5-fold CV, with feature selection performed on the training set and prediction on the held-out test set. It is done this way to avoid overfitting the feature selection process.
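A rough sketch of that evaluation protocol (this is not the actual testBAR.py code; file paths and model settings are placeholders):

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
from boostaroota import BoostARoota

x = pd.get_dummies(pd.read_csv('features.csv'))   #placeholder paths
y = pd.read_csv('target.csv').squeeze()

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    x_tr, x_te = x.iloc[train_idx], x.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
    #Feature selection is fit on the training fold only
    br = BoostARoota(metric='logloss')
    x_tr_sel = br.fit_transform(x_tr, y_tr)
    x_te_sel = br.transform(x_te)
    #The model is trained on the selected features and scored on the held-out fold
    clf = xgb.XGBClassifier(n_estimators=100)
    clf.fit(x_tr_sel, y_tr)
    scores.append(log_loss(y_te, clf.predict_proba(x_te_sel)[:, 1]))

print('Mean 5-fold log loss:', np.mean(scores))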

All tests are run on a 12-core (hyperthreaded) Intel i7. Future iterations will compare run times on a 28-core Xeon, 120 cores on Spark, and XGBoost on a GPU.

| Data Set | Target | Boruta Time | BoostARoota Time | BoostARoota LogLoss | Boruta LogLoss | All Features LogLoss | BAR >= All |
|----------|--------|-------------|------------------|---------------------|----------------|----------------------|------------|
| LSVT     | 0/1    | 50.289s     | 0.487s           | 0.5617              | 0.6950         | 0.7311               | Yes        |
| HR       | 0/1    | 33.704s     | 0.485s           | 0.1046              | 0.1003         | 0.1047               | Yes        |
| Fraud    | 0/1    | 38.619s     | 1.790s           | 0.4333              | 0.4353         | 0.4333               | Yes        |

As can be seen, the speed-up from BoostARoota is around 100x, with comparable or lower log loss. Part of this speed-up is that Boruta runs single-threaded, while BoostARoota (on XGBoost) runs on all 12 cores. It is not yet clear how this speed-up behaves on larger datasets.

This has also been tested on Kaggle's House Prices competition. With nothing done except running BoostARoota, and evaluating on RMSE, all features scored 0.15669 while the BoostARoota-selected features scored 0.1560.

Future Functionality (i.e. Current Shortcomings)

The text file FS_algo_basics.txt details how I was thinking through the algorithm and what additional functionality was thought about during the creation.

  • Preprocessing Steps - Need some first pass filters for reducing dimensionality right off the bat
    • Check for and drop identical features, leaving the option to drop highly correlated variables (a rough sketch of such a filter follows this list)
    • Drop variables with near-zero variance with respect to the target variable (creating a threshold will be difficult)
    • LDA, PCA, PLS rankings
      • The challenge with these is that they remove features based on linear relationships, whereas trees can pick out non-linear relationships; a variable with low linear dependency may be powerful when combined with others.
    • t-SNE - Has shown some promise in high-dimensional data
  • Algorithm could use a better stopping criteria
    • The next step is to test against y and the eval_metric to see when performance falls off.
  • Expand compute to handle larger datasets (if user has the hardware)
    • Run on Dask - Issue was opened up and Chase is working on it
    • Run on PySpark: make it easy enough that can just pass in SparkContext - will require some refactoring
    • Run XGBoost on GPU - although may run into memory issues with the shadow features.
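A rough sketch of what the identical/correlated-feature prefilter mentioned above could look like (the threshold and approach are assumptions, not a settled design):

import numpy as np
import pandas as pd

def first_pass_filter(x, corr_threshold=0.99):
    #Drop exact duplicate columns
    x = x.loc[:, ~x.T.duplicated()]
    #For each highly correlated pair of columns, drop the second one
    corr = x.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return x.drop(columns=to_drop)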

Updates

  • 1/22/18 - Added functionality to insert any tree based classifier from sklearn into BoostARoota.
  • 10/26/17 - Modified Structure to resemble sklearn classes and added tuning parameters.
  • 9/22/17 - Uploaded to PyPI and expanded tests
  • 9/8/17 - Added Support for multi-class classification, but only for the logloss eval_metric. Need to pass in eval="mlogloss"
  • 9/6/17 - Implemented in BoostARoota2() a stopping criterion specifying that at least 10% of features need to be dropped to continue.
  • 8/25/17 - The testBAR.py testing framework was just completed and ran through a number of datasets

Want to Contribute?

This project has found some initial success and there are a number of directions it can head. It would be great to have some additional help if you are willing and able. Whether it is directly contributing to the codebase or just offering ideas, any help is appreciated. The goal is to make the algorithm as robust as possible. The primary focus right now is on the components under Future Functionality, which are in active development. Please reach out to see if there is anything you would like to contribute, so we aren't duplicating work.

A special thanks to Progressive Leasing for sponsoring this research.

boostaroota's People

Contributors

chasedehan, zriddle


boostaroota's Issues

Dask integration

Much like your idea for PySpark integration, I would like to see similar support for passing in a Dask client, as is supported by the dask-xgboost library. I have found initial success in reducing high-dimensional data using the BoostARoota library, but I find the bottleneck to be the initial load of the parquet file repository. I'll offer what assistance I can regarding this work.

Ben.

'can only concatenate str (not "float") to str' while all dtypes are float64, int64

I just ran it as in the example

from boostaroota import BoostARoota

br = BoostARoota(metric='logloss')
br.fit(X, y)

for a classification task

getting

[06:31:00] WARNING: ../src/learner.cc:767: 
Parameters: { "silent" } are not used.

Round:  1  iteration:  10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 4
      1 br = BoostARoota(metric='logloss')
      3 #Fit the model for the subset of variables
----> 4 br.fit(X, y)

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/boostaroota/boostaroota.py:46, in BoostARoota.fit(self, x, y)
     45 def fit(self, x, y):
---> 46     self.keep_vars_ = _BoostARoota(x, y,
     47                                    metric=self.metric,
     48                                    clf = self.clf,
     49                                    cutoff=self.cutoff,
     50                                    iters=self.iters,
     51                                    max_rounds=self.max_rounds,
     52                                    delta=self.delta,
     53                                    silent=self.silent)
     54     return self

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/boostaroota/boostaroota.py:219, in _BoostARoota(x, y, metric, clf, cutoff, iters, max_rounds, delta, silent)
    217 i += 1
    218 if clf is None:
--> 219     crit, keep_vars = _reduce_vars_xgb(new_x,
    220                                        y,
    221                                        metric=metric,
    222                                        this_round=i,
    223                                        cutoff=cutoff,
    224                                        n_iterations=iters,
    225                                        delta=delta,
    226                                        silent=silent)
    227 else:
    228     crit, keep_vars = _reduce_vars_sklearn(new_x,
    229                                            y,
    230                                            clf=clf,
   (...)
    234                                            delta=delta,
    235                                            silent=silent)

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/boostaroota/boostaroota.py:130, in _reduce_vars_xgb(x, y, metric, this_round, cutoff, n_iterations, delta, silent)
    127     if not silent:
    128         print("Round: ", this_round, " iteration: ", i)
--> 130 df['Mean'] = df.mean(axis=1)
    131 #Split them back out
    132 real_vars = df[~df['feature'].isin(shadow_names)]

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/generic.py:11556, in NDFrame._add_numeric_operations.<locals>.mean(self, axis, skipna, numeric_only, **kwargs)
  11539 @doc(
  11540     _num_doc,
  11541     desc="Return the mean of the values over the requested axis.",
   (...)
  11554     **kwargs,
  11555 ):
> 11556     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/generic.py:11201, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11194 def mean(
  11195     self,
  11196     axis: Axis | None = 0,
   (...)
  11199     **kwargs,
  11200 ) -> Series | float:
> 11201     return self._stat_function(
  11202         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  11203     )

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/generic.py:11158, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  11154     nv.validate_stat_func((), kwargs, fname=name)
  11156 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11158 return self._reduce(
  11159     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11160 )

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/frame.py:10524, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  10520     df = df.T
  10522 # After possibly _get_data and transposing, we are now in the
  10523 #  simple case where we can use BlockManager.reduce
> 10524 res = df._mgr.reduce(blk_func)
  10525 out = df._constructor(res).iloc[0]
  10526 if out_dtype is not None:

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/internals/managers.py:1534, in BlockManager.reduce(self, func)
   1532 res_blocks: list[Block] = []
   1533 for blk in self.blocks:
-> 1534     nbs = blk.reduce(func)
   1535     res_blocks.extend(nbs)
   1537 index = Index([None])  # placeholder

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/internals/blocks.py:339, in Block.reduce(self, func)
    333 @final
    334 def reduce(self, func) -> list[Block]:
    335     # We will apply the function and reshape the result into a single-row
    336     #  Block with the same mgr_locs; squeezing will be done at a higher level
    337     assert self.ndim == 2
--> 339     result = func(self.values)
    341     if self.values.ndim == 1:
    342         # TODO(EA2D): special case not needed with 2D EAs
    343         res_values = np.array([[result]])

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/frame.py:10487, in DataFrame._reduce.<locals>.blk_func(values, axis)
  10485     return values._reduce(name, skipna=skipna, **kwds)
  10486 else:
> 10487     return op(values, axis=axis, skipna=skipna, **kwds)

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/nanops.py:96, in disallow.__call__.<locals>._f(*args, **kwargs)
     94 try:
     95     with np.errstate(invalid="ignore"):
---> 96         return f(*args, **kwargs)
     97 except ValueError as e:
     98     # we want to transform an object array
     99     # ValueError message to the more typical TypeError
    100     # e.g. this is normally a disallowed function on
    101     # object arrays that contain strings
    102     if is_object_dtype(args[0]):

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/nanops.py:158, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    156         result = alt(values, axis=axis, skipna=skipna, **kwds)
    157 else:
--> 158     result = alt(values, axis=axis, skipna=skipna, **kwds)
    160 return result

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/nanops.py:421, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    418 if datetimelike and mask is None:
    419     mask = isna(values)
--> 421 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    423 if datetimelike:
    424     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/pandas/core/nanops.py:727, in nanmean(values, axis, skipna, mask)
    724     dtype_count = dtype
    726 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 727 the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    729 if axis is not None and getattr(the_sum, "ndim", False):
    730     count = cast(np.ndarray, count)

File ~/anaconda3/envs/filter/lib/python3.11/site-packages/numpy/core/_methods.py:48, in _sum(a, axis, dtype, out, keepdims, initial, where)
     43 def _amin(a, axis=None, out=None, keepdims=False,
     44           initial=_NoValue, where=True):
     45     return umr_minimum(a, axis, None, out, keepdims, initial, where)
     47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
---> 48          initial=_NoValue, where=True):
     49     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     51 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,
     52           initial=_NoValue, where=True):

TypeError: can only concatenate str (not "float") to str


When I calculate `df.mean(axis=1)` myself, it outputs the correct answer without failing.

Sklearn implementation has an error

Hi @chasedehan,

I think I found an error in the sklearn implementation.

At the moment you add one column to df2 for every iteration, and then df2 is joined to df again. This way, many duplicate columns are created, which dilute the mean of the feature importance later on. You can see this if you print out df after every iteration:

try:
    importance = clf.feature_importances_
    df2['fscore' + str(i)] = importance
except ValueError:
    print("this clf doesn't have the feature_importances_ method.  Only Sklearn tree based methods allowed")

# importance = sorted(importance.items(), key=operator.itemgetter(1))

# df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
df = pd.merge(df, df2, on='feature', how='outer')
if not silent:
    print("Round: ", this_round, " iteration: ", i)

Here is a suggestion how to fix it:

if len(getattr(clf, 'feature_importances_', [])) == 0:
    raise ValueError(
        "this clf doesn't have the feature_importances_ method. Only Sklearn tree based methods allowed"
    )

if i == 1:
    df = pd.DataFrame({'feature': new_x.columns})

# importance = sorted(importance.items(), key=operator.itemgetter(1))

importance = clf.feature_importances_
importance = np.column_stack([new_x.columns, importance])
df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
df = pd.merge(df, df2, on='feature', how='outer')
if not silent:
    print("Round: ", this_round, " iteration: ", i) ```

Pyspark integration

Would it be possible to share your PySpark implementation of these functions? I have seen that the full integration is planned for future updates, but you mention that you've already run the tests using Spark.
Thanks!
Vykintas

Add correlation preprocessing

Hello

I've written a library that could be an implementation of your idea of correlation preprocessing: https://github.com/bukson/nancorrmp

It is designed to work on multiple cores in parallel and can handle NaNs and infs as feature values.

I can contribute and add code in some way, but you probably want to establish a standard way of doing preprocessing first, so for now I am just creating an issue.

Please contact me if you think that I can help you

Data must be 1-dimensional

I would appreciate if you could let me know how to deal with this error:

X = np.array(pd.read_csv('tot_X_1.csv',header=None).values)
y = np.array(pd.read_csv('tot_Y_1.csv',header=None).values.ravel())

# Split data set to train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,stratify=y, test_size=0.3, random_state=42)

X_train=pd.get_dummies(X_train)


br = BoostARoota(metric='f1')

#Fit the model for the subset of variables
br.fit(X_train,y_train)

#Can look at the important variables - will return a pandas series
br.keep_vars_

#Then modify dataframe to only include the important variables
br.transform(X_train)

Error:

  File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Bankrupt_2/Bankrupt/data/chase.py", line 15, in <module>
    X_train=pd.get_dummies(X_train)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 1215, in get_dummies
    sparse=sparse, drop_first=drop_first)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 1222, in _get_dummies_1d
    codes, levels = _factorize_from_iterable(Series(data))
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\series.py", line 264, in __init__
    raise_cast_failure=True)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\series.py", line 3234, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

Best regards,

merging outside of the loop

It seems to me that in the "_reduce_vars..." functions you need to move the merging of the DataFrames df and df2 out of the loop; otherwise you get more and more duplicate columns in the df DataFrame on each iteration, so the df.mean(axis=1) call returns the wrong value.
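A tiny illustration of the effect being described (hypothetical data, not the package's code):

import numpy as np
import pandas as pd

features = ['a', 'b', 'c']
df = pd.DataFrame({'feature': features})
df2 = pd.DataFrame({'feature': features})

for i in range(1, 4):
    #df2 keeps every previous fscore column, and df is re-merged each iteration
    df2['fscore' + str(i)] = np.random.rand(len(features))
    df = pd.merge(df, df2, on='feature', how='outer')

print(df.columns.tolist())
#fscore1 appears three times and fscore2 twice (with _x/_y suffixes),
#so a row-wise mean over-weights the earlier iterations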

ZeroDivisionError: division by zero


ZeroDivisionError Traceback (most recent call last)
in ()
1 br = BoostARoota(metric='logloss',delta=0.05)
----> 2 br.fit(all_feats,target_data);

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in fit(self, x, y)
51 max_rounds=self.max_rounds,
52 delta=self.delta,
---> 53 silent=self.silent)
54 return self
55

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _BoostARoota(x, y, metric, clf, cutoff, iters, max_rounds, delta, silent)
224 n_iterations=iters,
225 delta=delta,
--> 226 silent=silent)
227 else:
228 crit, keep_vars = _reduce_vars_sklearn(new_x,

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _reduce_vars_xgb(x, y, metric, this_round, cutoff, n_iterations, delta, silent)
139 #Check for the stopping criteria
140 #Basically looking to make sure we are removing at least 10% of the variables, or we should stop
--> 141 if (len(real_vars['feature']) / len(x.columns)) > (1-delta):
142 criteria = True
143 else:

ZeroDivisionError: division by zero

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

When running br.fit(train.values,labels.values) I get the following error:


AttributeError Traceback (most recent call last)
in ()
1 br = BoostARoota(metric='logloss')
2
----> 3 br.fit(train.values,labels.values)
4 len(train.columns)
5 len(br.keep_vars_)

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in fit(self, x, y)
51 max_rounds=self.max_rounds,
52 delta=self.delta,
---> 53 silent=self.silent)
54 return self
55

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _BoostARoota(x, y, metric, clf, cutoff, iters, max_rounds, delta, silent)
224 n_iterations=iters,
225 delta=delta,
--> 226 silent=silent)
227 else:
228 crit, keep_vars = _reduce_vars_sklearn(new_x,

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _reduce_vars_xgb(x, y, metric, this_round, cutoff, n_iterations, delta, silent)
113 for i in range(1, n_iterations+1):
114 # Create the shadow variables and run the model to obtain importances
--> 115 new_x, shadow_names = _create_shadow(x)
116 dtrain = xgb.DMatrix(new_x, label=y)
117 bst = xgb.train(param, dtrain, verbose_eval=False)

~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _create_shadow(x_train)
77 """
78 x_shadow = x_train.copy()
---> 79 for c in x_shadow.columns:
80 np.random.shuffle(x_shadow[c].values)
81 # rename the shadow

AttributeError: 'numpy.ndarray' object has no attribute 'columns'
