
brushfire's Introduction

Brushfire

Brushfire is a framework for distributed supervised learning of decision tree ensemble models in Scala.

The basic approach to distributed tree learning is inspired by Google's PLANET, but considerably generalized thanks to Scala's type parameterization and Algebird's aggregation abstractions.

Brushfire currently supports:

  • binary and multi-class classifiers
  • numeric features (discrete and continuous)
  • categorical features (including those with very high cardinality)
  • k-fold cross validation and random forests
  • chi-squared test as a measure of split quality
  • feature importance and Brier scores
  • Scalding/Hadoop as a distributed computing platform

In the future we plan to add support for:

  • regression trees
  • CHAID-like multi-way splits
  • error-based pruning
  • many more ways to evaluate splits and trees
  • Spark and single-node in-memory platforms

Authors

Thanks for assistance and contributions (see the Contributors list below).

Quick start

```
sbt brushfireScalding/assembly
cd example
./iris
cat iris.output/step_03
```

If it worked, you should see a JSON representation of 4 versions of a decision tree for classifying irises.

To use brushfire in your own SBT project, add the following to your build.sbt:

```scala
libraryDependencies += "com.stripe" %% "brushfire" % "0.6.3"
```

To use brushfire as a jar in your own Maven project, add the following to your POM file:

```xml
<dependency>
  <groupId>com.stripe</groupId>
  <artifactId>brushfire_${scala.binary.version}</artifactId>
  <version>0.6.3</version>
</dependency>
```

Using Brushfire with Scalding

The only distributed computing platform that Brushfire currently supports is Scalding, version 0.12 or later.

The simplest way to use Brushfire with Scalding is by subclassing TrainerJob and overriding trainer to return an instance of Trainer. Example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  import JsonInjections._

  def trainer = ???
}
```

You should import either `JsonInjections` or `KryoInjections` to serialize trees as JSON or base64-encoded Kryo, respectively. The former has the advantage of being human-readable; the latter is more efficient, which can be important for very large trees.
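For example, a job that serializes its trees with Kryo differs only in the import (a minimal sketch; everything else matches the example above, and the class name here is just illustrative):

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyKryoJob(args: Args) extends TrainerJob(args) {
  // Base64-encoded Kryo trees; swap in JsonInjections for human-readable output.
  import KryoInjections._

  def trainer = ???
}
```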

To construct a `Trainer`, you need to pass it training data as a Scalding `TypedPipe` of Brushfire [Instance[K, V, T]](http://stripe.github.io/brushfire/#com.stripe.brushfire.Instance) objects. `Instance` looks like this:

```scala
case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)
```

  • The id should be unique for each instance.
  • If there's an associated observation time, it should be the timestamp. (Otherwise, 0L is fine.)
  • features is a Map from feature name (type K, usually String) to some value of type V. There's built-in implicit support for Int, Double, Boolean, and String values (with the assumption, for Int and String, that there is a small, finite number of possible values). If, as is common, you need to mix different feature types, see the section on Dispatched below.
  • The only built-in target type currently supported is Map[L, Long], where L is a label type (for example, Boolean for a binary classifier or String for multi-class). The Long values represent the weight for the instance, which is usually 1.

Example:

Instance("AS-2014", 1416168857L, Map("lat" -> 49.2, "long" -> 37.1, "altitude" -> 35000.0), Map(true -> 1L))

You also need to pass it a Sampler, which controls how instances are allocated to trees and split between the training and test sets; for example, KFoldSampler(4) is used in the example below.

Once you have constructed a Trainer, you most likely want to call expandTimes(base: String, times: Int). This will build a new ensemble of trees from the training data and expand them `times` times, to a depth of `times`. At each step, the trees will be serialized to a directory under `base` (on HDFS, unless you're running in local mode).

Fuller example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  import JsonInjections._

  def trainingData: TypedPipe[Instance[K, V, T]] = ???
  def trainer = Trainer(trainingData, KFoldSampler(4)).expandTimes(args("output"), 5)
}
```

In-Memory Expansion

Having expanded as deep as you want using the distributed algorithm, you may wish to further expand any nodes that are sufficiently small at this point in memory, by calling expandSmallNodes(path: String, times: Int). By default, this will downsample every node to at most 10,000 instances of training data and expand until leaves have fewer than 10 instances. You may need to tune these thresholds, which you can do by setting an implicit Stopper:

```scala
implicit val stopper = FrequencyStopper(10000, 10)
trainer.expandInMemory(args("output") + "/mem", 100)
```

Note that the distributed algorithm will *stop* expanding at the same instance count that the in-memory algorithm wants, i.e., 10,000 instances by default.

Dispatched

If you have mixed features, the recommended value type is `Dispatched[Int, String, Double, String]`, which requires your feature values to match any one of these four cases:

* `Ordinal(v: Int)` for numeric features with a reasonably small number of possible values
* `Nominal(v: String)` for categorical features with a reasonably small number of possible values
* `Continuous(v: Double)` for numeric features with a large or infinite number of possible values
* `Sparse(v: String)` for categorical features with a large or infinite number of possible values

Note that using `Sparse` and especially `Continuous` features will currently slow learning down considerably. (But on the other hand, if you try to use `Ordinal` or `Nominal` with a feature that has hundreds of thousands of unique values, it will be even slower, and then fail).

Example of a features map:

```scala
Map("age" -> Ordinal(35), "gender" -> Nominal("male"), "weight" -> Continuous(130.23), "name" -> Sparse("John"))
```

Extending Brushfire

Brushfire is designed to be extremely pluggable. Some ways you might want to extend it are (from simplest to most involved):

  • Adding a new sampling strategy, to get finer-grained control over how instances are allocated to trees, or between the training set and the test set: define a new Sampler.
  • Adding a new evaluation strategy (such as log-likelihood or entropy): define a new Evaluator.
  • Adding a new feature type, or a new way of binning an existing feature type (such as log-binning real numbers): define a new Splitter.
  • Adding a new target type (such as real-valued targets for regression trees): define a new Evaluator, a new Stopper, and quite likely also a new Splitter for any continuous or sparse feature types you want to be able to use.
  • Adding a new distributed computation platform: define a new equivalent of Trainer, idiomatic to the platform you're using. (There's no specific interface this should implement.)

brushfire's People

Contributors

arelra, avi-stripe, avibryant, brahn, brahn-stripe, colinmarc, df-stripe, echen, erik-stripe, franklin-stripe, gkk-stripe, johnynek, mlm-stripe, roban, roban-stripe, thomas-stripe, tixxit, travisbrown-stripe, vendethiel, vitalyli


brushfire's Issues

output Trainer metadata

A simpler form of #19: instead of having a two-way serialization of the sampler etc., just write out to HDFS somewhere a description of the sampler, evaluator, etc. being used for a given expansion.

`Voter` trait

To encapsulate the various strategies for combining predictions from multiple trees. Probably something like:

```scala
trait Voter[T] {
  type V
  def create(target: T): V
  def monoid: Monoid[V]
  def result(votes: V): T
}
```

with subclasses like `class Plurality[L] extends Voter[Map[L, Long]]`, `class SoftVote[L] extends Voter[Map[L, Long]]`, and so on.
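Under that proposal, a plurality voter might be sketched roughly as follows (hypothetical implementation details only; it extends the trait proposed above and assumes Algebird supplies the implicit Monoid[Map[L, Long]]):

```scala
import com.twitter.algebird.Monoid

class Plurality[L] extends Voter[Map[L, Long]] {
  type V = Map[L, Long]

  // Each tree casts a single vote for its most frequent label.
  def create(target: Map[L, Long]): Map[L, Long] =
    if (target.isEmpty) Map.empty
    else Map(target.maxBy(_._2)._1 -> 1L)

  // Algebird's map monoid sums vote counts label by label.
  def monoid: Monoid[Map[L, Long]] = implicitly[Monoid[Map[L, Long]]]

  // The summed vote counts serve directly as the ensemble's prediction.
  def result(votes: Map[L, Long]): Map[L, Long] = votes
}
```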

Add count to LeafNode

This would let us implement the stopper threshold/sampling logic without needing a full Stopper trait. This should possibly be a prereq to merging #11

Serialize samplers along with trees

It's not totally clear how this should work, but the trees are best interpreted in the context of a specific sampler, so it seems funny not to store that. (This is especially true for OutOfTimeSampler, for example, where it would be better to store the time threshold for later use).

use TreeTraversal in training

We should be able to use the fancy missing-feature traversals during training, too (which may mean splitting weight for an instance of training data across leaves).

Downsample large nodes for in-memory splitting?

Currently, Trainer's expandSmallNodes will just ignore any leaves that are sufficiently large (i.e., that would have been split by the distributed expand, had it been given the chance). Given that you are likely to stop the distributed splitting at some point (so it won't get the chance), you end up in a somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded but the smaller, possibly less important ones do.

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (individually computing the rate based on the current leaf's target) to make sure that each one can fit into memory. This would at least get a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this would create would be underweighted relative to stuff that hadn't been downsampled, but a) it's not clear that matters much and b) they could be fixed up with an updateTargets at the end if desired.

It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way, especially if you only did a small number of levels each time, though I think it would be both more expensive and less effective than the current distributed approach.
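A minimal sketch of the per-leaf rate computation described above, assuming the leaf's target is the usual Map[L, Long] of instance counts (the names here are illustrative, not part of Brushfire's API):

```scala
// Downsampling rate for one leaf: keep everything if it already fits in memory,
// otherwise sample just enough to stay under the in-memory instance budget.
def downsampleRate[L](target: Map[L, Long], maxInstances: Long): Double = {
  val leafCount = target.values.sum
  if (leafCount <= maxInstances) 1.0
  else maxInstances.toDouble / leafCount
}
```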

Thoughts, @snoble @RyW90 @mlmanapat?

per-node error output

Closely related to #24 (in that #24 can be built on top of this): we should be able to produce a representation of the tree in which each node is annotated with its error, meaning not the sum of its leaves' errors but the error of the sum of its leaves' distributions.

separate brushfire-tree from brushfire-training

Some applications, like model servers, will not need the training code. It would be nice if they could depend on a smaller subset of the code. Likely, this would include:

  • Tree
  • Predicate
  • Voter
  • the serialization Injections
  • Dispatched

but hopefully would not include:

  • Instance
  • Splitter
  • Error
  • Sampler

Export `TreeGenerator` and maybe some tests

Our tree generators for ScalaCheck are fairly complex, so if we created a new brushfire-laws package or something that included them, then they could be reused elsewhere.

Brushfire should use Bonsai trees in training

To support faster training (especially in local mode), we could use compressed Bonsai trees directly in Brushfire. There's really no advantage to Brushfire's native tree type, so we should remove it. The idea is that we'd batch our operations for one step (e.g. growing the tree by one level) and make the changes to the Bonsai tree in bulk.

This involves both exposing some new functionality from Bonsai, and replacing the use of trees (and the use of the tree ops type class) in Brushfire.

Multi-label targets

It would be useful to support multi-label (not just multi-class) prediction targets.

Voter refactor to work with sumLeaves

If we follow through with #40 and #42, then prediction will be handled by sumLeaves and each current Voter will essentially become a VectorSpace instance + a method to turn targets into predictions (vectors in the vectorspace). It may also be that we get rid of Voter completely, and instead just create various distribution types + VectorSpace instances for them.

Add some property-based tests

Most of the interfaces brushfire exposes have "laws" associated with them which we could write tests for, such that any new implementation would have to conform to them. This would make it much easier to extend brushfire with confidence.
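For example, one such law could be checked with ScalaCheck roughly as follows (illustrative only; it assumes Algebird's map monoid is what aggregates Map[String, Long] target distributions, and the object name is invented):

```scala
import com.twitter.algebird.Monoid
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// Illustrative sketch: associativity of the monoid used to sum target distributions.
object TargetMonoidLaws extends Properties("TargetMonoid") {
  val monoid = implicitly[Monoid[Map[String, Long]]]

  property("plus is associative") =
    forAll { (a: Map[String, Long], b: Map[String, Long], c: Map[String, Long]) =>
      monoid.plus(monoid.plus(a, b), c) == monoid.plus(a, monoid.plus(b, c))
    }
}
```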

Add an Encoder to the pipeline

It would be helpful to have a concept of an Encoder[K,U,V] which can transform Map[K,U] to the Map[K,V] used by a given set of trees, and which can be serialized along with the trees.

We also want (in some cases) to be able to "train" these encoders from a training set - for example to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction etc.
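A minimal sketch of the shape such an Encoder could take, based on the description above (names and methods here are hypothetical, not an existing Brushfire interface):

```scala
// Hypothetical: transforms raw feature maps into the value type a given set of trees expects.
trait Encoder[K, U, V] {
  def encode(features: Map[K, U]): Map[K, V]
}

// Hypothetical: an encoder that can be "trained" from a corpus of raw feature maps,
// e.g. to decide which numeric features are continuous vs. discrete.
trait EncoderTrainer[K, U, V] {
  def train(examples: Iterable[Map[K, U]]): Encoder[K, U, V]
}
```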

error-minimizing pruner

This is only really relevant to people building single-tree models, but you should be able to prune a single tree to minimize validation error.

Stacking

Having trained N trees, you can then treat each of those N predictions as a feature itself, and run a logistic regression over that (thus "stacking" a logistic regression model on top of the decision tree model). It would be interesting to experiment with this.

Add Spark support

Most of the code is generic enough to run on a framework other than Scalding. However, there is some dependency on the new Execution module of Scalding that I couldn't completely get around.
Is it possible to refactor the code so that it is less Scalding-specific, or just explain to me how it all works, and I'll try to do it?

README points to nonexistent artifacts

The instructions under Quick Start are incorrect. They show brushfire being compiled for multiple Scala versions, but neither a brushfire_2.10 nor a brushfire_2.11 artifact exists on oss.sonatype.org, only a non-versioned brushfire.

Include Ordering[E] in Error[T, P, E]

It seems like we'll need an ordering at some point in order for an error to be useful. In a world of incoherent type classes, it may be prudent to trap the Ordering in the Error instance as well, rather than relying on the correct instance being passed in as an implicit parameter.

Add new `sumLeaves` method (or similar)

We currently get only one leaf from leafFor, but we may want to allow paths to diverge during tree evaluation (e.g. when a feature is missing). To do this, we should add a new sumLeaves method which allows us to traverse down multiple paths and aggregate the results using the prior probabilities of each diverging path.

During tree traversal, we'd always follow edges whose predicates return true, and we would allow nodes to have multiple true edges. For each true edge, we'd determine its prior probability relative to all true edges, use it to scale the result returned for that edge, and then sum the results. The method may look something like:

```scala
def sumLeaves[C[_], A: Numeric](row: Map[K, V])(f: Leaf => C[A])(implicit vs: VectorSpace[A, C]): C[A] = ???
```

Unify Error and Evaluator

It should be sufficiently general/flexible to always just split to minimize training error, rather than having a separate evaluator.
