cavalab / feat
A feature engineering automation tool for learning data representations
Home Page: https://cavalab.org/feat
License: GNU General Public License v3.0
implement the survival portion of NSGA-II
handle inf and nan outputs in generated features
create python wrapper class to use fewtwo with sklearn
build examples of regression and classification using fewtwo and add them to the documentation.
right now Node has an arity_f and an arity_b to handle multiple data types. to make it more extensible, we should roll arity into one variable of type std::map<char,unsigned int>, where the character corresponds to the output/input type of interest.
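A minimal sketch of the single-map idea, assuming a Node base class roughly like the one in the repo (names here are illustrative, not the actual class layout):

```cpp
#include <cassert>
#include <map>

// Hypothetical sketch: replace arity_f / arity_b with one map keyed by a
// type character ('f' = float, 'b' = bool), so new types need no new fields.
struct Node {
    std::map<char, unsigned int> arity;  // input arity per type

    // total number of arguments across all types
    unsigned int total_arity() const {
        unsigned int n = 0;
        for (const auto& kv : arity)
            n += kv.second;
        return n;
    }
};
```

Adding a third type (e.g. a longitudinal stack) then only requires inserting another key rather than another member variable.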
when reading in a csv file, load_csv should be able to determine whether certain features are boolean by checking the number of unique values and whether they are 0 and 1.
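One way the check could look, as a free helper (the function name and signature are assumptions, not the actual load_csv interface):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper for load_csv: a column is treated as boolean when it
// holds at most two unique values and every value is 0 or 1.
bool is_boolean_column(const std::vector<double>& col) {
    std::set<double> uniq(col.begin(), col.end());
    if (uniq.size() > 2)
        return false;
    for (double v : uniq)
        if (v != 0.0 && v != 1.0)
            return false;
    return true;
}
```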
fewtwo should take dtypes as a parameter, and set terminal output types accordingly. the NodeVariable class should be able to take output type as an argument, and push its values to the appropriate typed stack.
have two constructors, one for floating point values, one for boolean values. set otype accordingly.
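One way to realize this (a sketch, not the actual NodeVariable declaration): a constructor that takes the output type as an argument with a float default covers both cases described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: NodeVariable records the column it reads from and an
// output type character, so boolean features push to the 'b' stack.
struct NodeVariable {
    std::size_t loc;  // column index in X
    char otype;       // output type: 'f' (float) or 'b' (bool)

    NodeVariable(std::size_t loc_, char otype_ = 'f')
        : loc(loc_), otype(otype_) {}
};
```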
distinguish between selection and survival methods for the selectionOperator class. Implement basic operators for survival that take offspring plus elite and that do random selection for parents.
Least angle regression appears to fail under some circumstances, throwing an eigen assertion error. This appears to be an issue in the shogun implementation.
need to add more ML options to ml class.
allow dimensionality to be set as a fraction of the dimensionality of the data, so the user can write
set_dim('1.5x')
and the maximum dimensionality will be 1.5 times the number of features in the raw dataset.
this will have to be resolved within the fit method to work; otherwise the size of X is unknown.
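A sketch of the parsing step, assuming a trailing 'x' marks a multiplier and a bare number an absolute cap (the helper name is hypothetical):

```cpp
#include <cassert>
#include <string>

// Hypothetical resolver called from fit(), once the column count of X is
// known: "1.5x" means 1.5 * n_features; a plain integer is used as-is.
unsigned int resolve_max_dim(const std::string& spec, unsigned int n_features) {
    if (!spec.empty() && spec.back() == 'x') {
        double mult = std::stod(spec.substr(0, spec.size() - 1));
        return static_cast<unsigned int>(mult * n_features);
    }
    return static_cast<unsigned int>(std::stoul(spec));
}
```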
create and print stats based on verbosity.
if verbosity = 0, nothing should be printed.
if verbosity = 1, the current best score, current generation, and a portion of the current archive should be printed. at the end, the archive should be shown.
if verbosity = 2, the population equations each generation, the current stage of execution, and the parameter settings should be printed, in addition to what is printed with verbosity = 1.
replace switch case node style with polymorphic classes, with a class corresponding to each node type.
read in csv files from the command line. read the 'class' labelled column into a vector y, and the rest of the data into matrix X. the load_csv function should be modified to implement this.
add member functions to fewtwo to set parameter values after construction. make sure to set parameters of downstream classes as well.
implement these nodes:
write Fewtwo scoped functions to initialize and exit shogun with ref counter to make sure it happens only once
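The usual pattern here is an RAII guard with a reference count: the first instance performs the init, the last one performs the exit. A sketch with the shogun calls stubbed out as comments (init_shogun_with_defaults / exit_shogun are the real shogun entry points; the guard class itself is hypothetical):

```cpp
#include <cassert>

// Hypothetical scoped guard: shogun is initialized by the first live guard
// and torn down by the last, so nested or repeated scopes stay safe.
struct ShogunGuard {
    static int refs;
    ShogunGuard() {
        if (refs++ == 0) {
            // init_shogun_with_defaults();
        }
    }
    ~ShogunGuard() {
        if (--refs == 0) {
            // exit_shogun();
        }
    }
};
int ShogunGuard::refs = 0;
```

For thread-safe use the counter would need to be atomic or mutex-protected.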
split the input data into training and validation sets. update best_ind to the individual found with the best validation score.
use eigen's normalize function rather than shogun's in out_ml()
use doxygen to generate documentation.
implement lexicase and epsilon-lexicase selection
add a longitudinal data variable of type vector<vector<ArrayXd>> that contains longitudinal feature data for each sample. it should default to empty and be an argument to the fit() method.
make a function for generating prediction probabilities from fewtwo so that the python version can use sklearn's AUC tools.
specialize crossover and variation to allow dimensionality changes to occur
the shogun ML methods need to be extended and repackaged with Feat. I'm getting an occasional error in model training just from parallelizing the evaluation routine. I think it would be better to compile the shogun methods with Fewtwo and include them in the repo. Additionally, the tree-based methods do not have feature importance measures, as far as I can tell. Packaging the shogun methods would make it easy for us to extend them.
Since the goal is to keep this a header-only library, we also need to make the shogun learners header-only.
It would also be desirable to remove the shogunmatrix type and use eigen matrices directly.
implement unit tests for each class.
make classification fitness default to balanced accuracy.
add outputs from the boolean stack to the representation.
add a boolean argument to the Fewtwo class called erc that turns on ephemeral random constants in the trees. the default value should be false.
if true, include terminal values that are constants in CreateNode and in params::set_terminals.
write a generic stack class that can be used in place of type-specific stacks and handle multi-typed operators.
it should handle pushing and popping with single commands (wrapping the vector back() and pop_back() methods) and should handle types internally, e.g.
Stack stack;
stack['f'].push(stack['b'].pop().select(stack['f'].pop(), stack['f'].pop()));
would represent an if-then-else operation.
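A minimal sketch of the interface described above, using scalars for brevity. Indexing the outer stack by a type character as in the snippet would need extra template machinery, so this sketch uses named typed members instead (all names here are illustrative):

```cpp
#include <cassert>
#include <vector>

// Inner typed stack: push() and a pop() that returns the popped value,
// wrapping vector's back() and pop_back() in single calls.
template <typename T>
struct TypedStack {
    std::vector<T> data;
    void push(const T& v) { data.push_back(v); }
    T pop() {
        T v = data.back();
        data.pop_back();
        return v;
    }
};

// Outer stack holding one typed stack per output type.
struct Stack {
    TypedStack<double> f;  // floating point stack
    TypedStack<bool>   b;  // boolean stack
};
```

An if-then-else then reads: pop a condition from b, pop two values from f, and push the selected one back onto f.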
add nodes to process longitudinal data. these nodes represent multi-dimensional data from the ehr.
insertion mutation is currently set up to handle only floating point terminals. fix!
Write a cross-validation wrapper for fewtwo.
setup a command line parser in the main.cc file that allows arguments to fewtwo class to be passed via the command line.
replace javadoc style documentation with simpler ///
style brief documentation notation. in other words, replace
/*
* @brief comment
*/
with
/// comment
in all class declarations. see doxygen guide or my recent changes to fewtwo.h on master
move help text to method and display help from the command line if no dataset is specified.
let's add eigen version 3.3 (or greater) to the source code to make installation easier. This version is also required for GPU support of eigen operations.
make an add_function(shared_ptr<Node> N){params.functions.push_back(N);} function in the fewtwo class that allows users to register a custom node function with fewtwo.
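The registration pattern in isolation, with the surrounding classes stubbed (Node, Params, and Fewtwo here are minimal stand-ins, not the real declarations):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stubbed-down sketch of the user-facing registration hook.
struct Node {
    virtual ~Node() = default;
};
struct NodeCustom : Node {};  // hypothetical user-defined operator

struct Params {
    std::vector<std::shared_ptr<Node>> functions;
};

struct Fewtwo {
    Params params;
    void add_function(std::shared_ptr<Node> N) {
        params.functions.push_back(N);
    }
};
```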
Add methods to compute the area under the receiver-operating curve for models in feat.
roc() should translate these probabilities into an ROC curve.
auc_roc() should calculate the area under the ROC curve.
make an overloaded () method in rnd.h for double, OR set the default value of unsigned i to 1.0 in double operator()(unsigned i).
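A sketch of what auc_roc() could compute, using the rank-statistic formulation (AUC is the probability that a random positive is scored above a random negative, with ties counted as half). The O(n^2) loop keeps it short; a sort-based version would be used in practice. Function name and signature are assumptions:

```cpp
#include <cassert>
#include <vector>

// Hypothetical AUC: compare every (positive, negative) pair of predicted
// probabilities and count correctly ordered pairs (ties count 0.5).
double auc_roc(const std::vector<double>& prob, const std::vector<int>& y) {
    double pairs = 0.0, correct = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
        for (std::size_t j = 0; j < y.size(); ++j)
            if (y[i] == 1 && y[j] == 0) {
                pairs += 1.0;
                if (prob[i] > prob[j])       correct += 1.0;
                else if (prob[i] == prob[j]) correct += 0.5;
            }
    return pairs > 0.0 ? correct / pairs : 0.0;
}
```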
use variable importance (individual::weights) to control variation probabilities.
mutation: the probability of a node being mutated should be (1-w)/n, where w is the node's weight and n is the number of nodes in the program.
crossover: in general we want more important engineered features to be shared, and less important features to be replaced. however, we have to be careful because we don't want individuals to end up with redundant features, which could occur if unimportant features are consistently replaced with important ones. A conservative approach is to simply weight the probability of being swapped OUT, and not the probability of being chosen as a subtree.
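The mutation rule above can be sketched directly; this assumes weights are already scaled to [0, 1] (the helper name is illustrative):

```cpp
#include <cassert>
#include <vector>

// Per-node mutation probability (1 - w) / n: nodes with importance near 1
// are rarely disturbed, unimportant nodes absorb most of the mutation.
std::vector<double> mutation_probs(const std::vector<double>& w) {
    std::vector<double> p;
    double n = static_cast<double>(w.size());
    for (double wi : w)
        p.push_back((1.0 - wi) / n);
    return p;
}
```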
each node needs an eval_complexity() method that operates on a vector stack to calculate the complexity of an expression.
make a softmax function in utils.h that takes in a vector and returns its softmax transform.
vector<T> softmax(const vector<T>& w)
w_new[i] = exp(w[i]) / sum_j exp(w[j])
return w_new
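A runnable version of that pseudocode. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// softmax transform: out[i] = exp(w[i]) / sum_j exp(w[j]),
// computed with the max-subtraction trick for numerical stability.
template <typename T>
std::vector<T> softmax(const std::vector<T>& w) {
    T mx = w[0];
    for (T v : w)
        if (v > mx) mx = v;
    T sum = 0;
    std::vector<T> out;
    for (T v : w) {
        out.push_back(std::exp(v - mx));
        sum += out.back();
    }
    for (T& v : out)
        v /= sum;
    return out;
}
```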
setup travis continuous integration using unit tests
write getter functions to get parameter values from fewtwo object.
need to store best individual to do this.
add insertion and delete mutation operators
we should add the initial model to the population as an individual in the first generation. this would consist of simply constructing an individual by pushing each feature into a program stack, subject to limits on dimensionality.