cavalab / feat

A feature engineering automation tool for learning data representations

Home Page: https://cavalab.org/feat

License: GNU General Public License v3.0

CMake 0.63% C++ 86.77% Shell 0.12% Python 3.93% Makefile 0.19% Cuda 8.16% C 0.20%

feat's Issues

build examples

build examples of regression and classification using fewtwo and add them to the documentation.

make arity members of Node class into one map

right now Node has an arity_f and arity_b to handle multiple data types. to make it more extensible we should roll arity into one variable of type std::map<char,unsigned int> where the character corresponds to the output/input type of interest.
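
A minimal sketch of what the combined arity member could look like. The `Node` struct here is a simplified stand-in, not the actual class, and the type characters `'f'` (floating point) and `'b'` (boolean) are assumed from the existing `arity_f` and `arity_b` names.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the Node class.
struct Node
{
    std::string name;
    // arity keyed by type character: 'f' = floating point, 'b' = boolean (assumed codes)
    std::map<char, unsigned int> arity;

    // total number of arguments across all types
    unsigned int total_arity() const
    {
        unsigned int total = 0;
        for (const auto& kv : arity)
            total += kv.second;
        return total;
    }
};

// an if-then-else node consumes two floats and one boolean
inline Node make_if_then_else()
{
    return Node{"ite", {{'f', 2}, {'b', 1}}};
}
```

Supporting a new data type then only means adding a new key to the map, with no new member variable.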

make load_csv determine data types

when reading in a csv file, load_csv should be able to determine whether certain features are boolean by checking the number of unique values and whether they are 0 and 1.

fewtwo should take dtypes as a parameter, and set terminal output types accordingly. the NodeVariable class should be able to take output type as an argument, and push its values to the appropriate typed stack.
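
The boolean check could be sketched roughly as follows. The helper names `infer_dtype` and `infer_dtypes` and the type codes `'b'` and `'f'` are assumptions for illustration, and plain `std::vector` columns are used instead of the matrix type that load_csv actually fills.

```cpp
#include <set>
#include <vector>

// Returns 'b' if every value in the column is 0 or 1, otherwise 'f'.
// Hypothetical helper; the name and the type codes are assumptions.
inline char infer_dtype(const std::vector<double>& column)
{
    std::set<double> unique(column.begin(), column.end());
    if (unique.empty() || unique.size() > 2)
        return 'f';
    for (double v : unique)
        if (v != 0.0 && v != 1.0)
            return 'f';
    return 'b';
}

// Infers one type code per feature; X is stored column-wise in this sketch.
inline std::vector<char> infer_dtypes(const std::vector<std::vector<double>>& X)
{
    std::vector<char> dtypes;
    for (const auto& column : X)
        dtypes.push_back(infer_dtype(column));
    return dtypes;
}
```

Note that a constant column of all 0s or all 1s is also reported as boolean here; whether that is desirable is a design choice.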

enhanced selection/survival options

distinguish between selection and survival methods for the selectionOperator class. Implement basic operators for survival that take offspring plus elite and that do random selection for parents.

LeastAngleRegression fails

Least angle regression appears to fail under some circumstances, throwing an eigen assertion error. This appears to be an issue in the shogun implementation.

add more ML options

need to add more ML options to ml class.

  • liblinear (logistic regression)
  • svm
  • nearest centroid

set dimensionality as fraction

allow dimensionality to be set as a fraction of the dimensionality of the data. so user can write

set_dim('1.5x')

and the maximum dimensionality will be 1.5 times the number of features in the raw dataset.

this will have to be set within the fit method to work, otherwise the size of X is unknown.
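
One possible way to resolve such a setting once the number of features is known. The helper name `resolve_max_dim` is hypothetical, and rounding up is an assumption; the issue does not say how fractional results should be rounded.

```cpp
#include <cmath>
#include <string>

// Resolves a dimensionality setting against the number of raw features.
// "1.5x" means 1.5 times n_features; "10" means an absolute value of 10.
// Hypothetical helper, intended to be called from inside fit().
inline unsigned int resolve_max_dim(const std::string& setting, unsigned int n_features)
{
    if (!setting.empty() && (setting.back() == 'x' || setting.back() == 'X'))
    {
        // strip the trailing 'x' and treat the rest as a multiplier
        double factor = std::stod(setting.substr(0, setting.size() - 1));
        return static_cast<unsigned int>(std::ceil(factor * n_features));
    }
    return static_cast<unsigned int>(std::stoul(setting));
}
```

The string can be stored as-is by set_dim and only converted to a number at the start of fit, when X is available.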

print stats

create and print stats based on verbosity.

if verbosity = 0, nothing should be printed.
if verbosity = 1, the current best score, current generation, and a portion of the current archive should be printed. at the end, the archive should be shown.
if verbosity = 2, the population equations each generation, the current part of execution, and the parameter settings should be printed, in addition to what is printed with verbosity = 1.

class polymorphism for node class

replace switch case node style with polymorphic classes, with a class corresponding to each node type.

  • create a helper class to instantiate nodes from string names that are modifiable by the user.
  • replace terminals and functions in params class with smart pointers.
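
A rough sketch of the polymorphic layout with a string-based helper. All class names (`NodeAdd`, `NodeMultiply`, `NodeFactory`) are hypothetical, and a plain `std::vector<double>` stands in for the real typed stacks.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: each node type overrides evaluate().
struct Node
{
    virtual ~Node() = default;
    // pops its arguments from the stack and pushes its result
    virtual void evaluate(std::vector<double>& stack) const = 0;
};

struct NodeAdd : Node
{
    void evaluate(std::vector<double>& stack) const override
    {
        double b = stack.back(); stack.pop_back();
        double a = stack.back(); stack.pop_back();
        stack.push_back(a + b);
    }
};

struct NodeMultiply : Node
{
    void evaluate(std::vector<double>& stack) const override
    {
        double b = stack.back(); stack.pop_back();
        double a = stack.back(); stack.pop_back();
        stack.push_back(a * b);
    }
};

// Helper that builds nodes from string names; users can register new names.
class NodeFactory
{
public:
    using Maker = std::function<std::unique_ptr<Node>()>;

    NodeFactory()
    {
        add("+", [] { return std::unique_ptr<Node>(new NodeAdd()); });
        add("*", [] { return std::unique_ptr<Node>(new NodeMultiply()); });
    }

    void add(const std::string& name, Maker maker) { makers[name] = std::move(maker); }

    // throws std::out_of_range for unknown names
    std::unique_ptr<Node> create(const std::string& name) const { return makers.at(name)(); }

private:
    std::map<std::string, Maker> makers;
};
```

With this layout the terminals and functions can be held as `std::vector<std::unique_ptr<Node>>`, and adding a node type means adding one class and one `add` call instead of a new switch case.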

read files from command line

read in csv files from the command line. read the 'class' labelled column into a vector y, and the rest of the data into matrix X. the load_csv function should be modified to implement this.

additional nodes

implement these nodes:

  • xor
  • gaussian = exp(-x^2)
  • 2d-gaussian = exp(-((x-mean(x))^2/(2*var(x)) + (y-mean(y))^2/(2*var(y))))
  • logit = 1/(1+exp(-x))
  • step = 1 if positive, 0 otherwise
  • sign = 1 if positive, 0 if 0, -1 if negative
  • tanh = hyperbolic tangent, with range (-1, 1)
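
The scalar versions of most of these are short. Below is a sketch using plain doubles; the real nodes would operate on whole arrays. The namespace and function names are assumptions. The function listed above as logit is the logistic sigmoid, so it is named `logistic` here. The 2d-gaussian is left out because it needs the mean and variance over all samples.

```cpp
#include <cmath>

namespace node_sketch
{
    // boolean exclusive or
    inline bool xor_op(bool a, bool b) { return a != b; }

    // gaussian = exp(-x^2)
    inline double gaussian(double x) { return std::exp(-x * x); }

    // listed as "logit" in the issue: 1/(1+exp(-x))
    inline double logistic(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    // 1 if positive, 0 otherwise
    inline double step(double x) { return x > 0.0 ? 1.0 : 0.0; }

    // 1 if positive, 0 if 0, -1 if negative
    inline double sign(double x) { return (x > 0.0) - (x < 0.0); }

    // hyperbolic tangent, taken directly from the standard library
    inline double tanh_op(double x) { return std::tanh(x); }
}
```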

add longitudinal data input

add a longitudinal data variable of type vector<vector<ArrayXd>> that contains longitudinal feature data for each sample. it should default to empty and be an argument to the fit() method.

extract and package ML methods

the shogun ML methods need to be extended and repackaged with Feat. I'm getting an occasional error in model training just from parallelizing the evaluation routine. I think it would be better to compile the shogun methods with Fewtwo and include them in the repo. Additionally, the tree-based methods do not have feature importance measures, as far as I can tell. Packaging the shogun methods would make it easy for us to extend them.

Since the goal is to keep this a header-only library, we also need to make the shogun learners header-only.

It would also be desirable to remove the shogunmatrix type and use eigen matrices directly.

add constants flag to fewtwo.h

add a boolean argument to Fewtwo class called erc that turns on ephemeral random constants in the trees. default value should be false.

if true, include terminal values that are constants in CreateNode and in params::set_terminals.

Stack class

write a generic stack class that can be used in place of type-specific stacks and handle multi-typed operators.

it should handle pushing and popping with single commands (wrapping the vector back() and pop_back()) and should handle types internally, e.g.

Stack stack;
stack['f'].push(stack['b'].pop().select(stack['f'].pop(), stack['f'].pop()));

would represent an if-then-else operation.
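
A minimal sketch of such a class, under a few assumptions: scalars stand in for the array type, and the named members `f` and `b` replace `stack['f']` and `stack['b']`, since a single `operator[](char)` cannot return differently typed stacks. The pops are written as separate statements because C++ leaves the evaluation order of function arguments unspecified, so popping twice inside one call would not guarantee which value is which.

```cpp
#include <cstddef>
#include <vector>

// Single-type stack wrapping std::vector so that pop() returns the value.
template <typename T>
class TypedStack
{
public:
    void push(const T& value) { data.push_back(value); }

    // removes and returns the top element; the stack must not be empty
    T pop()
    {
        T value = data.back();
        data.pop_back();
        return value;
    }

    std::size_t size() const { return data.size(); }

private:
    std::vector<T> data;
};

// Multi-type stack; member names f and b are assumptions.
struct Stack
{
    TypedStack<double> f;  // floating-point values
    TypedStack<bool> b;    // boolean values
};

// if-then-else: the top float is taken as the else branch,
// the one below it as the then branch (an assumed convention)
inline void if_then_else(Stack& stack)
{
    double else_value = stack.f.pop();
    double then_value = stack.f.pop();
    bool condition = stack.b.pop();
    stack.f.push(condition ? then_value : else_value);
}
```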

longitudinal data nodes

add nodes to process longitudinal data. these nodes represent multi-dimensional data from the ehr.

  • long variable node (1)
  • mean (1)
  • median (1)
  • max (1)
  • min (1)
  • var (2)
  • skew (3)
  • kurtosis (4)
  • slope (4) (covariance(time, y)/variance(time))
  • count (1) : returns count of readings
  • recent (1) : returns the most recent reading (should this be merged with the time node? comment for time node: by default, return the latest measurement. returns data with a set time delay, initialized during construction. if the delay is longer than the data, return last.)
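
A few of these reductions sketched for a single sample, where `time` and `y` hold the reading times and values of one longitudinal feature. Names and the namespace are assumptions. The slope here is the least-squares slope of y against time, cov(time, y) / var(time), using population (divide by n) statistics. All functions assume non-empty inputs of equal length.

```cpp
#include <cstddef>
#include <vector>

namespace longitudinal_sketch
{
    inline double mean(const std::vector<double>& v)
    {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.size();
    }

    // population covariance (divides by n)
    inline double covariance(const std::vector<double>& a, const std::vector<double>& b)
    {
        double ma = mean(a), mb = mean(b), sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            sum += (a[i] - ma) * (b[i] - mb);
        return sum / a.size();
    }

    inline double variance(const std::vector<double>& v) { return covariance(v, v); }

    // least-squares slope of y over time
    inline double slope(const std::vector<double>& time, const std::vector<double>& y)
    {
        return covariance(time, y) / variance(time);
    }

    // number of readings
    inline std::size_t count(const std::vector<double>& y) { return y.size(); }

    // value of the reading with the latest time stamp
    inline double recent(const std::vector<double>& time, const std::vector<double>& y)
    {
        std::size_t latest = 0;
        for (std::size_t i = 1; i < time.size(); ++i)
            if (time[i] > time[latest]) latest = i;
        return y[latest];
    }
}
```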

FewtwoCV

Write a crossvalidation wrapper for fewtwo.

command line arguments

setup a command line parser in the main.cc file that allows arguments to fewtwo class to be passed via the command line.

clean up documentation

replace javadoc style documentation with simpler /// style brief documentation notation. in other words, replace

/**
 * @brief comment
 */

with

/// comment

in all class declarations. see doxygen guide or my recent changes to fewtwo.h on master

incorporate eigen

let's add eigen version 3.3 (or greater) to the source code to make installation easier. This version is also required for GPU support of eigen operations.

auc roc

Add methods to compute the area under the receiver-operating curve for models in feat.

  • feat should have a method to predict the probability of each sample being a positive class
  • a utility function roc() should translate these probabilities into an roc curve
  • a utility function auc_roc() should calculate the area under the roc curve
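
The two utility functions could be sketched as below. The signatures are assumptions: probabilities and 0/1 labels come in as plain vectors, `roc()` returns (false positive rate, true positive rate) points, and `auc_roc()` integrates them with the trapezoid rule. Both classes are assumed to be present in `y`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// One (fpr, tpr) point per distinct probability threshold, starting at (0, 0).
inline std::vector<std::pair<double, double>> roc(const std::vector<double>& prob,
                                                   const std::vector<int>& y)
{
    // sort sample indices by decreasing predicted probability
    std::vector<std::size_t> idx(prob.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return prob[a] > prob[b]; });

    double n_pos = std::count(y.begin(), y.end(), 1);
    double n_neg = y.size() - n_pos;

    std::vector<std::pair<double, double>> curve = {{0.0, 0.0}};
    double tp = 0.0, fp = 0.0;
    for (std::size_t k = 0; k < idx.size(); ++k)
    {
        if (y[idx[k]] == 1) tp += 1.0; else fp += 1.0;
        // emit a point only after the last sample of a group of tied probabilities
        if (k + 1 == idx.size() || prob[idx[k + 1]] != prob[idx[k]])
            curve.push_back({fp / n_neg, tp / n_pos});
    }
    return curve;
}

// Area under the curve by the trapezoid rule.
inline double auc_roc(const std::vector<std::pair<double, double>>& curve)
{
    double area = 0.0;
    for (std::size_t k = 1; k < curve.size(); ++k)
        area += (curve[k].first - curve[k - 1].first)
              * (curve[k].second + curve[k - 1].second) / 2.0;
    return area;
}
```

Handling ties as a group means that a model giving every sample the same probability gets an area of 0.5 rather than a value that depends on sample order.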

default random methods

make an overloaded () method in rnd.h for double OR set unsigned i default value to 1.0 in double operator()(unsigned i)

integrate feature importance into variation step

use variable importance (individual::weights) to control variation probabilities.

mutation: for mutation, the probability of a node being mutated should be (1-w)/n
crossover: in general we want more important engineered features to be shared, and less important features to be replaced. however, we have to be careful because we don't want individuals to end up with redundant features, which could occur if unimportant features are consistently replaced with important ones. A conservative approach is to simply weight the probability of being swapped OUT, and not the probability of being chosen as a subtree.
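
For the mutation rule, a small sketch of the per-node probabilities, assuming the weights have already been scaled into [0, 1] (for example by a softmax) and that `n` is the number of entries in the weight vector. The function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Per-node mutation probability p_i = (1 - w_i) / n.
// Assumes each weight w_i lies in [0, 1]; more important nodes mutate less often.
inline std::vector<double> mutation_probabilities(const std::vector<double>& w)
{
    std::vector<double> p(w.size());
    const double n = static_cast<double>(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        p[i] = (1.0 - w[i]) / n;
    return p;
}
```

If the weights sum to 1, these probabilities sum to (n-1)/n, so on average slightly less than one node is mutated per individual.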

softmax transformation of weights

make a softmax function in utils.h that takes in a vector and returns its softmax transform.

vector<T> softmax(vector<T>& w)

w_new = exp(i)/sum(exp(j) for j in w) for i in w
return w_new
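
A runnable version of the pseudocode above might look like this. Two details are my own choices rather than part of the issue: the argument is taken by const reference, and the maximum is subtracted before exponentiating, which gives the same result mathematically but avoids overflow for large weights.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Returns the softmax transform of w: exp(w_i) / sum_j exp(w_j).
template <typename T>
std::vector<T> softmax(const std::vector<T>& w)
{
    std::vector<T> w_new;
    if (w.empty())
        return w_new;

    // shifting by the maximum keeps exp() from overflowing
    const T w_max = *std::max_element(w.begin(), w.end());
    T sum = 0;
    for (const T& x : w)
    {
        w_new.push_back(std::exp(x - w_max));
        sum += w_new.back();
    }
    for (T& x : w_new)
        x /= sum;
    return w_new;
}
```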

add initial model to population

we should add the initial model to the population as an individual in the first generation. this would consist of simply constructing an individual by pushing each feature into a program stack, subject to limits on dimensionality.
