A feature engineering automation tool for learning data representations

Home Page: https://cavalab.org/feat

License: GNU General Public License v3.0


FEAT


FEAT is a feature engineering automation tool that learns new representations of raw data to improve classifier and regressor performance. The underlying methods use Pareto optimization and evolutionary computation to search the space of possible transformations.

FEAT wraps around a user-chosen ML method and provides a set of representations that give the best performance for that method. Each individual in FEAT's population is its own data representation.

FEAT uses the Shogun C++ ML toolbox to fit models.

Check out the documentation for installation and examples.

References

  1. La Cava, W., Singh, T. R., Taggart, J., Suri, S., & Moore, J. H. Learning concise representations for regression by evolving networks of trees. ICLR 2019. arXiv:1807.0091

  2. La Cava, W., & Moore, J. H. (2020). Genetic programming approaches to learning fair classifiers. GECCO 2020. Best Paper Award. ACM, arXiv, experiments

  3. La Cava, W., Lee, P. C., Ajmal, I., Ding, X., Cohen, J. B., Solanki, P., Moore, J. H., & Herman, D. S. (2021). Application of concise machine learning to construct accurate and interpretable EHR computable phenotypes. In review. medRxiv, experiments

Contact

Maintained by William La Cava (william.lacava at childrens.harvard.edu)

Acknowledgments

Special thanks to these contributors:

  • Tilak Raj Singh @tilakhere
  • Srinivas Suri @srinu634
  • James P Taggert @JPT2
  • Daniel Herman
  • Paul Lee

This work is supported by grants K99-LM012926 and R00-LM012926 from the National Library of Medicine. FEAT is being developed to learn clinical diagnostics in the Cava Lab at Harvard Medical School.

License

GNU GPLv3, see LICENSE

Installation

To see our installation process from scratch, check out the GitHub Actions workflow.

Dependencies

Feat uses cmake to build. It also depends on the Eigen matrix library for C++ and the Shogun ML library. Both are available as conda packages that should work across platforms.

Install in a Conda Environment

The easiest installation option is to use the conda environment we provide. The build process is then the following:

git clone https://github.com/lacava/feat # clone the repo
cd feat # enter the directory
conda env create
conda activate feat
pip install .

If you want to manage the dependencies yourself, some other options are shown below. In that case, you need to tell the configure script where Shogun and Eigen are. Edit these lines:

export SHOGUN_LIB=/your/shogun/lib/
export SHOGUN_DIR=/your/shogun/include/
export EIGEN3_INCLUDE_DIR=/your/eigen/eigen3/

If you need Eigen and Shogun and don't want to use conda, follow these instructions.

Eigen

Eigen is a header-only package. We need Eigen 3 or greater.

Debian/Ubuntu

On Debian systems, you can grab the package:

sudo apt-get install libeigen3-dev

You can also download the headers and put them somewhere, then tell cmake where they are with the environment variable EIGEN3_INCLUDE_DIR. Example:

# Eigen 3.3.4
wget "http://bitbucket.org/eigen/eigen/get/3.3.4.tar.gz"
tar xzf 3.3.4.tar.gz 
mkdir eigen-3.3.4 
mv eigen-eigen*/* eigen-3.3.4
# set an environment variable to tell cmake where Eigen is
export EIGEN3_INCLUDE_DIR="$(pwd)/eigen-3.3.4/"

Shogun

You don't have to compile Shogun; just download the binaries. Their install guide is good. We've listed one of the options here.

Debian/Ubuntu

You can also get the Shogun packages:

sudo add-apt-repository ppa:shogun-toolbox/nightly -y
sudo apt-get update -y
sudo apt-get install -qq --force-yes --no-install-recommends libshogun18
sudo apt-get install -qq --force-yes --no-install-recommends libshogun-dev

Running the tests

(optional) If you want to run the C++ tests, you need to install Google Test; a useful guide to doing so is available here. Then you can use cmake to build the tests. From the repo root:

./configure tests   # builds the test Makefile
make -C build tests # compiles the tests
./build/tests # runs the tests

For the python tests, run

python tests/wrappertest.py

Contributing

Please follow the GitHub flow guidelines for contributing to this project.

In general, this is the approach:

  • Fork the repo into your own repository and clone it locally.

    git clone https://github.com/my_user_name/feat
    
  • Have an idea for a code change. Checkout a new branch with an appropriate name.

    git checkout -b my_new_change
    
  • Make your changes.

  • Commit your changes to the branch.

    git commit -m "adds my new change"
    
  • Check that your branch has no conflict with Feat's master branch by merging the master branch from the upstream repo.

    git remote add upstream https://github.com/lacava/feat
    git fetch upstream
    git merge upstream/master
    
  • Fix any conflicts and commit.

    git commit -m "Merges upstream master"
    
  • Push the branch to your forked repo.

    git push origin my_new_change
    
  • Go to your forked repo on GitHub and make a new Pull Request for your branch. Be sure to reference any relevant issues.


feat's Issues

additional nodes

implement these nodes:

  • xor
  • gaussian = exp(-x^2)
  • 2d-gaussian = exp(-((x-mean(x))^2/(2*var(x)) + (y-mean(y))^2/var(y)))
  • logit = 1/(1+exp(-x))
  • step = 1 if positive, 0 otherwise
  • sign = 1 if positive, 0 if 0, -1 if negative
  • tanh = hyperbolic tangent [-1, 1]

default random methods

make an overloaded () method in rnd.h that returns a double, OR give unsigned i a default value of 1.0 in double operator()(unsigned i)

class polymorphism for node class

replace switch case node style with polymorphic classes, with a class corresponding to each node type.

  • create a helper class to instantiate nodes from string names that are modifiable by the user.
  • replace terminals and functions in the params class with smart pointers.

LeastAngleRegression fails

Least angle regression appears to fail under some circumstances, throwing an Eigen assertion error. This appears to be an issue in the Shogun implementation.

integrate feature importance into variation step

use variable importance (individual::weights) to control variation probabilities.

mutation: for mutation, the probability of a node being mutated should be (1-w)/n, where w is the node's weight and n is the program size.
crossover: in general we want more important engineered features to be shared, and less important features to be replaced. However, we have to be careful: we don't want individuals to end up with redundant features, which could occur if unimportant features are consistently replaced with important ones. A conservative approach is to weight only the probability of a subtree being swapped OUT, not the probability of being chosen as a replacement.

extract and package ML methods

the shogun ML methods need to be extended and repackaged with Feat. I'm getting an occasional error in model training just from parallelizing the evaluation routine. I think it would be better to compile the shogun methods with Fewtwo and include them in the repo. Additionally, the tree-based methods do not have feature importance measures, as far as I can tell. Packaging the shogun methods would make it easy for us to extend them.

Since the goal is to keep this a header-only library, we also need to make the shogun learners header-only.

It would also be desirable to remove the shogunmatrix type and use eigen matrices directly.

longitudinal data nodes

add nodes to process longitudinal data. these nodes represent multi-dimensional data from the ehr.

  • long variable node ( 1)
  • mean ( 1)
  • median (1)
  • max (1)
  • min (1)
  • var (2 )
  • skew (3)
  • kurtosis (4)
  • slope (4) (covariance(time, y)/variance(time))
  • count (1) : returns count of readings
  • recent (1) : returns the most recent reading (should be merged with time? comment for the time node: by default, return the latest measurement. returns data with a set time delay, initialized during construction. if the delay is longer than the data, return the last value.)

build examples

build examples of regression and classification using fewtwo and add them to the documentation.

make load_csv determine data types

when reading in a csv file, load_csv should be able to determine whether certain features are boolean by checking the number of unique values and whether they are 0 and 1.

fewtwo should take dtypes as a parameter, and set terminal output types accordingly. the NodeVariable class should be able to take output type as an argument, and push its values to the appropriate typed stack.

FewtwoCV

Write a crossvalidation wrapper for fewtwo.

add constants flag to fewtwo.h

add a boolean argument to Fewtwo class called erc that turns on ephemeral random constants in the trees. default value should be false.

if true, include terminal values that are constants in CreateNode and in params::set_terminals.

print stats

create and print stats based on verbosity.

if verbosity = 0, nothing should be printed.
if verbosity = 1, the current best score, current generation, and a portion of the current archive should be printed. at the end, the archive should be shown.
if verbosity = 2, the population equations each generation, the current part of execution, the parameter settings should be printed, in addition to what is printed with verbosity =1.

add longitudinal data input

add a longitudinal data variable of type vector<vector<ArrayXd>> that contains longitudinal feature data for each sample. it should default to empty and be an argument to the fit() method.

Stack class

write a generic stack class that can be used in place of type-specific stacks and handle multi-typed operators.

it should handle pushing and popping with single commands (wrapping the vector back() and pop_back() methods) and should handle types internally, e.g.

Stack stack;
stack['f'].push(stack['b'].pop().select(stack['f'].pop(), stack['f'].pop()));

would represent an if-then-else operation.

incorporate eigen

let's add Eigen version 3.3 (or greater) to the source code to make installation easier. This version is also required for GPU support of Eigen operations.

clean up documentation

replace javadoc style documentation with simpler /// style brief documentation notation. in other words, replace

/** 
* @brief comment
*/

with

/// comment

in all class declarations. see doxygen guide or my recent changes to fewtwo.h on master

read files from command line

read in csv files from the command line. read the 'class' labelled column into a vector y, and the rest of the data into matrix X. the load_csv function should be modified to implement this.

auc roc

Add methods to compute the area under the receiver-operating curve for models in feat.

  • feat should have a method to predict the probability of each sample being a positive class
  • a utility function roc() should translate these probabilities into an roc curve
  • a utility function auc_roc() should calculate the area under the roc curve

command line arguments

setup a command line parser in the main.cc file that allows arguments to fewtwo class to be passed via the command line.

enhanced selection/survival options

distinguish between selection and survival methods for the selectionOperator class. Implement basic operators for survival that take offspring plus elite and that do random selection for parents.

make arity members of Node class into one map

right now Node has an arity_f and arity_b to handle multiple data types. to make it more extensible we should roll arity into one variable of type std::map<char,unsigned int> where the character corresponds to the output/input type of interest.

add more ML options

need to add more ML options to ml class.

  • liblinear (logistic regression)
  • svm
  • nearest centroid

add initial model to population

we should add the initial model to the population as an individual in the first generation. this would consist of simply constructing an individual by pushing each feature into a program stack, subject to limits on dimensionality.

set dimensionality as fraction

allow dimensionality to be set as a fraction of the dimensionality of the data. so user can write

set_dim('1.5x')

and the maximum dimensionality will be 1.5 times the number of features in the raw dataset.

this will have to be set within the fit method to work; otherwise the size of X is unknown.

softmax transformation of weights

make a softmax function in utils.h that takes in a vector and returns its softmax transform.

vector<T> softmax(const vector<T>& w)

w_new[i] = exp(w[i]) / sum(exp(w[j]) for j in w)
return w_new
