cavalab / feat
A feature engineering automation tool for learning data representations
Home Page: https://cavalab.org/feat
License: GNU General Public License v3.0
implement the survival portion of NSGA-II
handle inf and nan outputs in generated features
create python wrapper class to use fewtwo with sklearn
build examples of regression and classification using fewtwo and add them to the documentation.
right now Node has an arity_f and an arity_b to handle multiple data types. to make it more extensible, we should roll arity into one variable of type std::map<char,unsigned int>, where the character corresponds to the output/input type of interest.
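A minimal sketch of the single-map idea, assuming a Node base class roughly like the one in the repo (names here are illustrative, not the actual class layout):

```cpp
#include <cassert>
#include <map>

// Hypothetical sketch: replace arity_f / arity_b with one map keyed by a
// type character ('f' = float, 'b' = bool), so new types need no new fields.
struct Node {
    std::map<char, unsigned int> arity;  // input arity per type

    // total number of arguments across all types
    unsigned int total_arity() const {
        unsigned int n = 0;
        for (const auto& kv : arity)
            n += kv.second;
        return n;
    }
};
```

Adding a third type (e.g. a longitudinal stack) then only requires inserting another key rather than another member variable.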
when reading in a csv file, load_csv should be able to determine whether certain features are boolean by checking the number of unique values and whether they are 0 and 1.
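One way the check could look, as a free helper (the function name and signature are assumptions, not the actual load_csv interface):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper for load_csv: a column is treated as boolean when it
// holds at most two unique values and every value is 0 or 1.
bool is_boolean_column(const std::vector<double>& col) {
    std::set<double> uniq(col.begin(), col.end());
    if (uniq.size() > 2)
        return false;
    for (double v : uniq)
        if (v != 0.0 && v != 1.0)
            return false;
    return true;
}
```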
fewtwo should take dtypes as a parameter, and set terminal output types accordingly. the NodeVariable class should be able to take output type as an argument, and push its values to the appropriate typed stack.
have two constructors, one for floating point values, one for boolean values. set otype accordingly.
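One way to realize this (a sketch, not the actual NodeVariable declaration): a constructor that takes the output type as an argument with a float default covers both cases described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: NodeVariable records the column it reads from and an
// output type character, so boolean features push to the 'b' stack.
struct NodeVariable {
    std::size_t loc;  // column index in X
    char otype;       // output type: 'f' (float) or 'b' (bool)

    NodeVariable(std::size_t loc_, char otype_ = 'f')
        : loc(loc_), otype(otype_) {}
};
```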
distinguish between selection and survival methods for the selectionOperator class. Implement basic operators for survival that take offspring plus elite and that do random selection for parents.
Least angle regression appears to fail under some circumstances, throwing an eigen assertion error. This appears to be an issue in the shogun implementation.
need to add more ML options to ml class.
allow dimensionality to be set as a fraction of the dimensionality of the data, so the user can write
set_dim('1.5x')
and the maximum dimensionality will be 1.5 times the number of features in the raw dataset.
this will have to be resolved within the fit method to work; otherwise the size of X is unknown.
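A sketch of the parsing step, assuming a trailing 'x' marks a multiplier and a bare number an absolute cap (the helper name is hypothetical):

```cpp
#include <cassert>
#include <string>

// Hypothetical resolver called from fit(), once the column count of X is
// known: "1.5x" means 1.5 * n_features; a plain integer is used as-is.
unsigned int resolve_max_dim(const std::string& spec, unsigned int n_features) {
    if (!spec.empty() && spec.back() == 'x') {
        double mult = std::stod(spec.substr(0, spec.size() - 1));
        return static_cast<unsigned int>(mult * n_features);
    }
    return static_cast<unsigned int>(std::stoul(spec));
}
```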
create and print stats based on verbosity.
if verbosity = 0, nothing should be printed.
if verbosity = 1, the current best score, current generation, and a portion of the current archive should be printed. at the end, the archive should be shown.
if verbosity = 2, the population equations each generation, the current stage of execution, and the parameter settings should be printed, in addition to what is printed with verbosity = 1.
replace switch case node style with polymorphic classes, with a class corresponding to each node type.
read in csv files from the command line. read the 'class' labelled column into a vector y, and the rest of the data into matrix X. the load_csv function should be modified to implement this.
add member functions to fewtwo to set parameter values after construction. make sure to set parameters of downstream classes as well.
implement these nodes:
write Fewtwo scoped functions to initialize and exit shogun with ref counter to make sure it happens only once
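The usual pattern here is an RAII guard with a reference count: the first instance performs the init, the last one performs the exit. A sketch with the shogun calls stubbed out as comments (init_shogun_with_defaults / exit_shogun are the real shogun entry points; the guard class itself is hypothetical):

```cpp
#include <cassert>

// Hypothetical scoped guard: shogun is initialized by the first live guard
// and torn down by the last, so nested or repeated scopes stay safe.
struct ShogunGuard {
    static int refs;
    ShogunGuard() {
        if (refs++ == 0) {
            // init_shogun_with_defaults();
        }
    }
    ~ShogunGuard() {
        if (--refs == 0) {
            // exit_shogun();
        }
    }
};
int ShogunGuard::refs = 0;
```

For thread-safe use the counter would need to be atomic or mutex-protected.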
split the input data into training and validation sets. update best_ind to the individual found with the best validation score.
use eigen's normalize function rather than shogun's in out_ml()
use doxygen to generate documentation.
implement lexicase and epsilon-lexicase selection
add a longitudinal data variable of type vector<vector<ArrayXd>> that contains longitudinal feature data for each sample. it should default to empty and be an argument to the fit() method.
make a function for generating prediction probabilities from fewtwo so that the python version can use sklearn's AUC tools.
specialize crossover and variation to allow dimensionality changes to occur
the shogun ML methods need to be extended and repackaged with Feat. I'm getting an occasional error in model training just from parallelizing the evaluation routine. I think it would be better to compile the shogun methods with Fewtwo and include them in the repo. Additionally, the tree-based methods do not have feature importance measures, as far as I can tell. Packaging the shogun methods would make it easy for us to extend them.
Since the goal is to keep this a header-only library, we also need to make the shogun learners header-only.
It would also be desirable to remove the shogunmatrix type and use eigen matrices directly.
implement unit tests for each class.
make classification fitness default to balanced accuracy.
add outputs from the boolean stack to the representation.
add a boolean argument to the Fewtwo class called erc that turns on ephemeral random constants in the trees. the default value should be false.
if true, include terminal values that are constants in CreateNode and in params::set_terminals.
write a generic stack class that can be used in place of type-specific stacks and handle multi-typed operators.
it should handle pushing and popping with single commands (wrapping the vector back() and pop_back() methods) and should handle types internally, e.g.
Stack stack;
stack['f'].push(stack['b'].pop().select(stack['f'].pop(), stack['f'].pop()));
would represent an if-then-else operation.
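A minimal sketch of the interface described above, using scalars for brevity. Indexing the outer stack by a type character as in the snippet would need extra template machinery, so this sketch uses named typed members instead (all names here are illustrative):

```cpp
#include <cassert>
#include <vector>

// Inner typed stack: push() and a pop() that returns the popped value,
// wrapping vector's back() and pop_back() in single calls.
template <typename T>
struct TypedStack {
    std::vector<T> data;
    void push(const T& v) { data.push_back(v); }
    T pop() {
        T v = data.back();
        data.pop_back();
        return v;
    }
};

// Outer stack holding one typed stack per output type.
struct Stack {
    TypedStack<double> f;  // floating point stack
    TypedStack<bool>   b;  // boolean stack
};
```

An if-then-else then reads: pop a condition from b, pop two values from f, and push the selected one back onto f.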
add nodes to process longitudinal data. these nodes represent multi-dimensional data from the ehr.
insertion mutation is currently set up to handle only floating point terminals. fix!
Write a cross-validation wrapper for fewtwo.
setup a command line parser in the main.cc file that allows arguments to fewtwo class to be passed via the command line.
replace javadoc style documentation with simpler ///
style brief documentation notation. in other words, replace
/*
* @brief comment
*/
with
/// comment
in all class declarations. see doxygen guide or my recent changes to fewtwo.h on master
move help text to method and display help from the command line if no dataset is specified.
let's add eigen version 3.3 (or greater) to the source code to make installation easier. This version is also required for GPU support of eigen operations.
make an add_function(shared_ptr<Node> N){params.functions.push_back(N);} function in the fewtwo class that allows users to register a custom node function with fewtwo.
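The registration pattern in isolation, with the surrounding classes stubbed (Node, Params, and Fewtwo here are minimal stand-ins, not the real declarations):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stubbed-down sketch of the user-facing registration hook.
struct Node {
    virtual ~Node() = default;
};
struct NodeCustom : Node {};  // hypothetical user-defined operator

struct Params {
    std::vector<std::shared_ptr<Node>> functions;
};

struct Fewtwo {
    Params params;
    void add_function(std::shared_ptr<Node> N) {
        params.functions.push_back(N);
    }
};
```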
Add methods to compute the area under the receiver-operating curve for models in feat.
roc() should translate these probabilities into an ROC curve.
auc_roc() should calculate the area under the ROC curve.
make an overloaded () method in rnd.h for double, OR set the default value of unsigned i to 1.0 in double operator()(unsigned i).
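A sketch of what auc_roc() could compute, using the rank-statistic formulation (AUC is the probability that a random positive is scored above a random negative, with ties counted as half). The O(n^2) loop keeps it short; a sort-based version would be used in practice. Function name and signature are assumptions:

```cpp
#include <cassert>
#include <vector>

// Hypothetical AUC: compare every (positive, negative) pair of predicted
// probabilities and count correctly ordered pairs (ties count 0.5).
double auc_roc(const std::vector<double>& prob, const std::vector<int>& y) {
    double pairs = 0.0, correct = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
        for (std::size_t j = 0; j < y.size(); ++j)
            if (y[i] == 1 && y[j] == 0) {
                pairs += 1.0;
                if (prob[i] > prob[j])       correct += 1.0;
                else if (prob[i] == prob[j]) correct += 0.5;
            }
    return pairs > 0.0 ? correct / pairs : 0.0;
}
```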
use variable importance (individual::weights) to control variation probabilities.
mutation: the probability of a node being mutated should be (1-w)/n, where w is the node's weight and n is the number of nodes in the program.
crossover: in general we want more important engineered features to be shared, and less important features to be replaced. however, we have to be careful because we don't want individuals to end up with redundant features, which could occur if unimportant features are consistently replaced with important ones. A conservative approach is to simply weight the probability of being swapped OUT, and not the probability of being chosen as a subtree.
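The mutation rule above can be sketched directly; this assumes weights are already scaled to [0, 1] (the helper name is illustrative):

```cpp
#include <cassert>
#include <vector>

// Per-node mutation probability (1 - w) / n: nodes with importance near 1
// are rarely disturbed, unimportant nodes absorb most of the mutation.
std::vector<double> mutation_probs(const std::vector<double>& w) {
    std::vector<double> p;
    double n = static_cast<double>(w.size());
    for (double wi : w)
        p.push_back((1.0 - wi) / n);
    return p;
}
```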
each node needs an eval_complexity() method that operates on a vector stack to calculate the complexity of an expression.
make a softmax function in utils.h that takes in a vector and returns its softmax transform.
vector<T> softmax(const vector<T>& w)
w_new[i] = exp(w[i]) / sum_j exp(w[j])
return w_new
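A runnable version of that pseudocode. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// softmax transform: out[i] = exp(w[i]) / sum_j exp(w[j]),
// computed with the max-subtraction trick for numerical stability.
template <typename T>
std::vector<T> softmax(const std::vector<T>& w) {
    T mx = w[0];
    for (T v : w)
        if (v > mx) mx = v;
    T sum = 0;
    std::vector<T> out;
    for (T v : w) {
        out.push_back(std::exp(v - mx));
        sum += out.back();
    }
    for (T& v : out)
        v /= sum;
    return out;
}
```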
setup travis continuous integration using unit tests
write getter functions to get parameter values from fewtwo object.
need to store best individual to do this.
add insertion and delete mutation operators
we should add the initial model to the population as an individual in the first generation. this would consist of simply constructing an individual by pushing each feature into a program stack, subject to limits on dimensionality.