
crossfader

Demo screenshot: https://rawgit.com/bettermg/crossfader/master/demo.png

What is this?

An experimental model for robust dimensionality reduction of arbitrary data sets. It lets you explore the marginal distributions of all parameters while holding any subset of parameters fixed, and it computes these probability distributions extremely fast. It assumes nothing about the distribution of the features, which can have any units and any scale.

What can it be used for?

  • Predicting missing values
  • Visualizing probability distributions
  • Finding outliers
  • Modelling uncertainties

Demo Time

Here is a demo trained on a bunch of different data sets. The demo is written in JS and uses pre-trained models.

More background

Tools like Crossfilter are great at visualizing data sets and how features are correlated. Crossfilter renders real data points based on a number of feature selections. However, as the number of features increases, it becomes harder to find data points that satisfy all the criteria. This is the curse of dimensionality, which often makes analysis of high-dimensional data hard.

This package takes a different approach. It computes a statistical model of the underlying data. The downside is that you can no longer explore real data points. The upside is that you can explore conditional dependencies and make predictions about data you have not observed, as the toy sketch below illustrates.
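To make that concrete, here is a toy illustration of the general idea of querying a fitted model rather than raw points. This is not crossfader's actual code or model; the bivariate Gaussian is just a stand-in to show that a fitted model can answer conditional queries even where no exact data point was observed.

import numpy as np

rng = np.random.default_rng(0)
# Toy training data: two correlated features.
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=10_000)

# "Fit" the model: estimate means and covariance from the training data.
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

def conditional(x0, mu, cov):
    """Distribution of feature 1 given feature 0 == x0 (Gaussian conditioning)."""
    m = mu[1] + cov[1, 0] / cov[0, 0] * (x0 - mu[0])
    v = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
    return m, v

# Query a region where we likely never observed an exact data point.
mean, var = conditional(3.5, mu, cov)
print(f"feature1 | feature0=3.5  ~  N({mean:.2f}, {var:.2f})")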

How does it work?

It builds an autoencoder that learns to reconstruct missing data.

To be able to work with any distribution, it reduces all inputs to a series of binary values. Every feature is encoded as a binary feature vector by constructing splits from the empirical distribution of the training data. The predicted probabilities for each split then give the CDF directly.
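A rough sketch of that encoding step is below. The number of splits, the indicator direction (value > split), and the helper names are assumptions for illustration, not the repo's actual code; the idea is just that quantile splits binarize a feature and per-split probabilities read back out as an approximate CDF.

import numpy as np

def make_splits(train_values, n_splits=20):
    """Split points taken from the empirical distribution of one feature."""
    qs = np.linspace(0, 1, n_splits + 2)[1:-1]   # interior quantiles
    return np.quantile(train_values, qs)

def encode(value, splits):
    """Binary feature vector: 1 where the value exceeds the split."""
    return (value > splits).astype(np.float32)

def cdf_from_probs(p_exceeds):
    """Predicted P(x > split_i) per split  ->  CDF values P(x <= split_i)."""
    return 1.0 - np.asarray(p_exceeds)

# Example: encode one income-like feature.
train = np.random.lognormal(mean=10, sigma=1, size=5000)
splits = make_splits(train)
print(encode(np.exp(10), splits))   # near the median: roughly half ones, half zeros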

The autoencoder has a series of hidden bottleneck layers (typically 2-5 layers with 20-100 units). One way to think of it is that the autoencoder finds a low-dimensional manifold in the high-dimensional space. This manifold can be highly nonlinear due to the nonlinearities in the autoencoder. The autoencoder then essentially learns a projection from the high dimensional space onto the manifold and another projection back to the original space. The dimensionality reduction is effectively a way of getting around the curse of dimensionality.
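The following is a minimal sketch of that bottleneck architecture in plain numpy, not the Theano code the project actually uses; the layer sizes, masking rate, and learning rate are illustrative assumptions. Inputs are the binarized features, some inputs are randomly hidden, and the network is trained to reconstruct all of them, which is what lets it predict missing values.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 200, 40                      # e.g. 200 binary bits, 40-unit bottleneck

W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x, lr=0.1, drop=0.3):
    """One SGD step on a batch of binary vectors x with shape [batch, n_in]."""
    global W1, b1, W2, b2
    mask = rng.random(x.shape) > drop         # randomly hide some inputs
    h = np.tanh((x * mask) @ W1 + b1)         # project onto the low-dimensional manifold
    p = sigmoid(h @ W2 + b2)                  # project back: predicted P(bit == 1)
    # Cross-entropy gradient wrt the pre-sigmoid output is simply (p - x).
    g_out = (p - x) / len(x)
    g_W2 = h.T @ g_out;            g_b2 = g_out.sum(0)
    g_h = (g_out @ W2.T) * (1 - h ** 2)
    g_W1 = (x * mask).T @ g_h;     g_b1 = g_h.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_b1; W2 -= lr * g_W2; b2 -= lr * g_b2
    return -(x * np.log(p + 1e-9) + (1 - x) * np.log(1 - p + 1e-9)).mean()

# Toy structured data: bits generated from a 5-dimensional latent factor,
# so the bottleneck has low-dimensional structure to capture.
z = rng.normal(size=(512, 5))
proj = rng.normal(size=(5, n_in))
x = (z @ proj > 0).astype(np.float32)
for epoch in range(200):
    loss = step(x)
print("reconstruction loss:", round(loss, 3))   # should trend below the ~0.69 chance level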

The autoencoder is trained using Theano in Python. You can run it on a GPU, although the speed improvement is not drastic because of some bottlenecks.

Limitations and future work

  • Training the autoencoder is very slow. It can take a few hours for small data sets (with 10-100 features).
  • The training time grows with the number of features.
  • The training time should stay relatively constant with the size of the training data, since it uses stochastic gradient descent.
  • It only handles numerical attributes, although categorical attributes should be pretty easy to add.
  • The model probably overfits quite a bit; proper cross-validation should be added to measure this.
  • Similarly, there are a number of hyperparameters that have to be tuned to your data set.

License

Released under the Apache 2.0 license.
