
few's People

Contributors

lacava, ohjeah

few's Issues

ImportError: dlopen ... symbol not found

Hi, I've cloned few, built and installed on OS X 10.12 using:

CC=gcc-7 python setup.py install

But I'm getting a symbol not found error on import of the few module.

I note a few warnings during the build process beginning with: #warning "Using deprecated NumPy API, disable it by ...

and then finally:

g++ -bundle -undefined dynamic_lookup -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/few/lib/few_lib.o -o build/lib.macosx-10.7-x86_64-3.6/few_lib.cpython-36m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]

Any advice on what to check next?
I'm also not entirely clear on why I'm seeing a clang message at all, so that, along with the deprecation warning above, is my first avenue to explore.

Error with installation

Hello,

I'm trying to install this package but running into some issues; any ideas? I have VS 14.16 on my PC and I get the error below when running 'pip install few'. At first it asked for eigency, but after installing that, this error popped up.

(screenshot of the pip install error)

Sincerely,
G

installing few with pip

few cannot be installed with pip because setup.py imports eigency at the top level, and at that point the build requirements are not installed yet. A possible workaround is sketched below.
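
Here is a hedged sketch (not FEW's actual setup.py) of guarding that import so pip at least fails with a clear message; eigency.get_includes() is the piece the real build needs, and with modern packaging the cleaner fix would be declaring eigency in a pyproject.toml [build-system] block.

from setuptools import setup

try:
    import eigency
    eigen_includes = eigency.get_includes()
except ImportError:
    # eigency is not available yet, so stop with an actionable message
    raise SystemExit('few requires eigency to build; run `pip install eigency` first')

setup(
    name='few',
    install_requires=['eigency'],
    # ... Extension definitions would use eigen_includes here ...
)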

GridSearchCV error

Concerning errors of the form

self.ml.named_steps = undefined
    205                   hasattr(self.ml.named_steps['ml'],'feature_importances_')):
    208                 coef = (self.ml.named_steps['ml'].coef_ if
AttributeError: 'SGDClassifier' object has no attribute 'named_steps'

when using FEW in GridSearchCV while changing the ML parameter. The pipeline object needs to be redefined in the fit method so that GridSearch can change self.ml and the pipeline gets updated.
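
A minimal sketch of that fix, assuming a FEW-like wrapper class; the class and attribute names below are illustrative, not FEW's actual code.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class FeatureLearner:
    def __init__(self, ml=None):
        self.ml = ml  # bare estimator; GridSearchCV may replace this between fits

    def fit(self, X, y):
        # Rebuild the pipeline here rather than in __init__, so the current
        # value of self.ml is always the object wrapped as named_steps['ml'].
        self.pipeline_ = Pipeline([('scaler', StandardScaler()),
                                   ('ml', self.ml)])
        self.pipeline_.fit(X, y)
        return self

    def feature_weights(self):
        # Works for both linear models (coef_) and tree ensembles (feature_importances_).
        step = self.pipeline_.named_steps['ml']
        return getattr(step, 'coef_', getattr(step, 'feature_importances_', None))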

Issues with current ML validation score

Hello,

Thanks for the help so far. I was able to get the tool up and running in windows.

However, I am observing two odd things:

  1. When I use the Gradient Boosting Regressor, my score gets worse with each generation, even when I switch the sign of the scoring function. The first score is nearly the best score I have gotten on my own (with no feature engineering on the data set).

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_GB.ipynb

  2. When I use Random Forest with the same scorer, the current ML validation score returns 0 and the run finishes very quickly.

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_RF.ipynb

I think I am missing something about how to use this tool, but I have no idea what. I am trying to use it in tandem with TPOT, as I am exploring GA/GP-based feature creation tools. I sincerely appreciate any advice/guidance you can provide.

Sincerely,
G
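
One hedged sanity check (not from the FEW docs): compare FEW's reported validation score against a plain cross-validated baseline of the same estimator, to see whether the scorer sign convention is the culprit. The make_regression data below is a stand-in for the house-price training set in the linked notebooks.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# stand-in data; replace with the notebook's training features and target
X_train, y_train = make_regression(n_samples=500, n_features=20, noise=10.0,
                                   random_state=0)

baseline = cross_val_score(GradientBoostingRegressor(), X_train, y_train,
                           cv=3, scoring='r2')
print('baseline R^2 per fold:', baseline)
# If this baseline looks reasonable while FEW's "current ML validation score"
# degrades (or reads 0 with Random Forest), the problem is more likely in how
# the scoring function is passed/interpreted than in the data itself.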

If original features are found by FEW, transform() method fails with TypeError

eg:

print('Model: {}'.format(learner.print_model()))

Model: original features

Phi = learner.transform(X_test.values)


TypeError Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 Phi = learner.transform(X_test.values)

~/anaconda3/envs/ml/lib/python3.6/site-packages/FEW-0.0.38-py3.6-macosx-10.7-x86_64.egg/few/few.py in transform(self, x, inds, labels)
395 # return np.asarray(Parallel(n_jobs=10)(delayed(self.out)(I,x,labels,self.otype) for I in self._best_inds)).transpose()
396 return np.asarray(
--> 397 [self.out(I,x,labels,self.otype) for I in self._best_inds]).transpose()
398
399

TypeError: 'NoneType' object is not iterable
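
A hedged workaround sketch (not FEW's actual fix): judging from the traceback, _best_inds can be None when only the original features are kept, so a caller can fall back to the raw feature matrix in that case.

def safe_transform(learner, x):
    # If no engineered features were kept, pass the original features through.
    if getattr(learner, '_best_inds', None) is None:
        return x
    return learner.transform(x)

# usage with the objects from the snippet above
Phi = safe_transform(learner, X_test.values)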

value error in lasso

occasional error:
File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 85, in run_code
exec(code, run_globals)
File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 506, in
main()
File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 495, in main
learner.fit(training_features, training_labels)
File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 181, in fit
self.ml.fit(pop.X.transpose(),y_t)
File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 1132, in fit
Lars.fit(self, X, y)
File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 671, in fit
return_n_iter=True, positive=self.positive)
File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 260, in lars_path
sign_active[n_active] = np.sign(C
)
ValueError: cannot convert float NaN to integer

This should not occur, since the operators are supposed to produce safe (non-NaN) outputs.
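
A hedged workaround sketch (not FEW's code), in the spirit of the fit call at few.py line 181: sanitize the engineered feature matrix before handing it to the ML model, so LassoLarsCV never sees NaN or inf.

import numpy as np

def fit_ml_safely(ml, X_features, y):
    # Replace any NaN/inf produced by the engineered features before fitting.
    X_clean = np.nan_to_num(np.asarray(X_features, dtype=float))
    ml.fit(X_clean, y)
    return ml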

stall effects

Track stalling in runs and act on it.
Stalling occurs when the engineered features are no longer improving either 1) the ML model's CV score or 2) the median fitness of the features themselves.
If stalling occurs, there should be options to:

  • exit
  • modify the search to capture a different part of the search space; this could be achieved by increasing the complexity of the features, increasing the variation steps, or lowering the selection pressure (a tracking sketch follows this list).
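
A minimal sketch of the tracking part, assuming a max_stall threshold like the constructor argument that appears later on this page; this is hypothetical, not FEW's implementation.

import numpy as np

class StallTracker:
    def __init__(self, max_stall=10, tol=1e-6):
        self.max_stall = max_stall
        self.tol = tol
        self.best_cv = -np.inf
        self.best_median_fitness = -np.inf
        self.stall_count = 0

    def update(self, cv_score, feature_fitnesses):
        # A generation counts as progress if either signal improves.
        median_fitness = float(np.median(feature_fitnesses))
        improved = (cv_score > self.best_cv + self.tol or
                    median_fitness > self.best_median_fitness + self.tol)
        self.best_cv = max(self.best_cv, cv_score)
        self.best_median_fitness = max(self.best_median_fitness, median_fitness)
        self.stall_count = 0 if improved else self.stall_count + 1
        return self.stall_count >= self.max_stall  # True => exit or perturb the search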

cythonize evaluation routine

Write the feature evaluation routine in C++ with Eigen and interface it with the main codebase via Cython. Include the distutils changes needed to support package distribution (a build sketch is below).
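
A hedged sketch of the build side, using eigency to expose the Eigen headers to a Cython extension; the file names below are illustrative, not FEW's actual layout.

import eigency
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    'few_lib',
    sources=['few/lib/few_lib.pyx', 'few/lib/evaluation.cpp'],
    include_dirs=['few/lib'] + eigency.get_includes(),
    language='c++',
    extra_compile_args=['-std=c++11'],
)

setup(name='few', ext_modules=cythonize([ext]))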

few.model() and few.print_model()

Hello!

Thanks for sharing your work, this is really cool!

I was wondering if you could provide a bit of explanation as to the difference between these two outputs of the algorithm.

Also, is there any (outside) documentation on all this?

Thanks in advance!

Kind regards,
Theodore.

implement 3-fold cross validation for internal updating of best model

Currently the training data is split into training and validation sets, and the best model is updated whenever a model with a higher validation score is found. We could simplify quite a bit, and get a more robust validation measure, by removing train_test_split and the associated numpy arrays and fit/predict code in favor of a direct call to cross_val_score(self.ml, features, labels, cv=3) or cross_val_score(self.ml, self.X[self.valid_loc(), :].transpose(), labels, cv=3).
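
A minimal sketch of the proposed replacement, assuming the self.ml, self.X, labels, and valid_loc() referenced above; not FEW's actual implementation.

import numpy as np
from sklearn.model_selection import cross_val_score

def internal_cv_score(ml, X, labels, valid_loc=None):
    # X is the (feature x sample) matrix of engineered features; valid_loc
    # optionally selects the rows that are valid for validation.
    if valid_loc is not None:
        X = X[valid_loc, :]
    return np.mean(cross_val_score(ml, X.transpose(), labels, cv=3))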

low GPU utilization with tf option

I'm getting low utilization of the GPU using the tensorflow evaluation strategy. There are a few things to try:

  • use this method to profile tensorflow and see where the inefficiencies lie.

  • according to this, using feed_dict is not a good idea; we need to look into using input pipelines or variables for feeding data to the graphs (a rough sketch follows this list).
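
A rough TF 1.x-style sketch of swapping feed_dict for a tf.data input pipeline; the evaluation op below is a stand-in, not FEW's tensorflow backend.

import numpy as np
import tensorflow as tf

features = np.random.rand(100000, 50).astype(np.float32)  # stand-in data

dataset = tf.data.Dataset.from_tensor_slices(features).batch(4096).prefetch(1)
iterator = dataset.make_one_shot_iterator()
batch = iterator.get_next()

evaluated = tf.reduce_sum(tf.square(batch), axis=1)  # stand-in evaluation graph

with tf.Session() as sess:
    results = []
    try:
        while True:
            results.append(sess.run(evaluated))
    except tf.errors.OutOfRangeError:
        pass  # end of the dataset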

add encoding operators for GWAS

Add operators that re-encode input SNPs under different encodings: include (add, dom, rec, het, sub-add, super-add). We need to resolve how the underlying data would be represented; maybe assume the input is additive? A sketch of the simpler encodings is below.
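
A hedged sketch of the simpler encodings, assuming the input genotypes are additively coded as 0/1/2 copies of the minor allele; the sub-additive and super-additive encodings would need a definition before implementing.

import numpy as np

def encode_dominant(g):      # carries at least one minor allele
    return (np.asarray(g) >= 1).astype(int)

def encode_recessive(g):     # homozygous for the minor allele
    return (np.asarray(g) == 2).astype(int)

def encode_heterozygous(g):  # exactly one copy of the minor allele
    return (np.asarray(g) == 1).astype(int)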

normalize feature transformations

Normalize feature transformations automatically before feeding them into the ML fit method, and store the fitted transformer so it can be reused in prediction/transformation as well (see the sketch below).
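
A minimal sketch of that idea (not FEW's code): fit the scaler on the engineered features during training and reuse the same fitted object at prediction/transform time.

from sklearn.preprocessing import StandardScaler

class NormalizedFeatures:
    def fit(self, Phi_train, y=None):
        # Phi_train: engineered feature matrix produced during training
        self.scaler_ = StandardScaler().fit(Phi_train)
        return self

    def transform(self, Phi):
        # reuse the training-time statistics on new engineered features
        return self.scaler_.transform(Phi)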

random numbers seed not working?

Greetings!

I have the following code:

feats_gen = FEW(
                ml=DecisionTreeClassifier(random_state=10, max_depth=None, min_samples_leaf=5), 
                population_size=100, tourn_size=2,                 
                mutation_rate=0.5, crossover_rate=0.5, 
                sel='epsilon_lexicase',   
                clean=True,                
                mdr=True, boolean=True, 
                random_state=10, verbosity=1, 
                scoring_function=roc_auc_score, 
                max_depth=10, min_depth=1, max_depth_init=1, 
                classification=True, 
                generations=50, max_stall=None, 
                names=list(X_train.select_dtypes(include=[np.number]).columns))

feats_gen.fit(X_train.select_dtypes(include=[np.number]).values, 
              y_train.astype(int).values)

test_ = preprocessing_pipeline.transform(e.test)

X_test = test_.X
y_test = test_[test_.target_name].astype(int)

roc_auc_score(y_test, feats_gen._best_estimator.predict_proba(feats_gen.transform(X_test.select_dtypes(include=[np.number]).values))[:, 1])

Every time I run this code, I get different ROC AUC values on both training and test. I'm pretty sure preprocessing_pipeline is deterministic.
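
One hedged debugging step (not a confirmed fix): pin the global RNGs as well, to check whether the nondeterminism comes from somewhere outside the random_state arguments shown above.

import random
import numpy as np

random.seed(10)
np.random.seed(10)
# ...then construct and fit FEW exactly as above and compare the ROC AUC across runs.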

feat vs few?

Greetings!

I would like to know if there is any practical difference between the two projects. I'm asking this because testing feat would require a lot more effort than few and, as such, I need to know if it is worth it.

Thanks in advance!
