
brushfire's Introduction

Brushfire

Brushfire is a framework for distributed supervised learning of decision tree ensemble models in Scala.

The basic approach to distributed tree learning is inspired by Google's PLANET, but considerably generalized thanks to Scala's type parameterization and Algebird's aggregation abstractions.

Brushfire currently supports:

  • binary and multi-class classifiers
  • numeric features (discrete and continuous)
  • categorical features (including those with very high cardinality)
  • k-fold cross validation and random forests
  • chi-squared test as a measure of split quality
  • feature importance and Brier scores
  • Scalding/Hadoop as a distributed computing platform

In the future we plan to add support for:

  • regression trees
  • CHAID-like multi-way splits
  • error-based pruning
  • many more ways to evaluate splits and trees
  • Spark and single-node in-memory platforms

Authors

Thanks for assistance and contributions (see the Contributors list below).

Quick start

```
sbt brushfireScalding/assembly
cd example
./iris
cat iris.output/step_03
```

If it worked, you should see a JSON representation of 4 versions of a decision tree for classifying irises.

To use brushfire in your own SBT project, add the following to your build.sbt:

```scala
libraryDependencies += "com.stripe" %% "brushfire" % "0.6.3"
```

To use brushfire as a jar in your own Maven project, add the following to your POM file:

```xml
<dependency>
  <groupId>com.stripe</groupId>
  <artifactId>brushfire_${scala.binary.version}</artifactId>
  <version>0.6.3</version>
</dependency>
```

Using Brushfire with Scalding

The only distributed computing platform that Brushfire currently supports is Scalding, version 0.12 or later.

The simplest way to use Brushfire with Scalding is by subclassing TrainerJob and overriding trainer to return an instance of Trainer. Example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  import JsonInjections._

  def trainer = ???
}
```

You should import either `JsonInjections` or `KryoInjections` to serialize trees as JSON or base64-encoded Kryo, respectively. The former has the advantage of being human-readable; the latter is more efficient, which can be important for very large trees.
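For example, a job that serializes its trees with Kryo differs only in the import (a minimal sketch; everything else matches the example above, and the class name here is just illustrative):

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyKryoJob(args: Args) extends TrainerJob(args) {
  // Base64-encoded Kryo trees; swap in JsonInjections for human-readable output.
  import KryoInjections._

  def trainer = ???
}
```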

To construct a `Trainer`, you need to pass it training data as a Scalding `TypedPipe` of Brushfire [Instance[K, V, T]](http://stripe.github.io/brushfire/#com.stripe.brushfire.Instance) objects. `Instance` looks like this:

```scala
case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)
```

  • The id should be unique for each instance.
  • If there's an associated observation time, it should be the timestamp. (Otherwise, 0L is fine.)
  • features is a Map from feature name (type K, usually String) to some value of type V. There's built-in implicit support for Int, Double, Boolean, and String values (with the assumption, for Int and String, that there is a small, finite number of possible values). If, as is common, you need to mix different feature types, see the section on Dispatched below.
  • The only built-in target type currently supported is Map[L, Long], where L is a label type (for example, Boolean for a binary classifier or String for multi-class). The Long values represent the weight for the instance, which is usually 1.

Example:

Instance("AS-2014", 1416168857L, Map("lat" -> 49.2, "long" -> 37.1, "altitude" -> 35000.0), Map(true -> 1L))

You also need to pass it a Sampler, which controls how instances are allocated to trees and split between the training and test sets; for example, KFoldSampler(4) is used in the example below.

Once you have constructed a Trainer, you most likely want to call expandTimes(base: String, times: Int). This will build a new ensemble of trees from the training data and expand them `times` times, to a depth of `times`. At each step, the trees will be serialized to a directory under `base` (on HDFS, unless you're running in local mode).

Fuller example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  import JsonInjections._

  def trainingData: TypedPipe[Instance[K, V, T]] = ???
  def trainer = Trainer(trainingData, KFoldSampler(4)).expandTimes(args("output"), 5)
}
```

In-Memory Expansion

Having expanded as deep as you want using the distributed algorithm, you may wish to further expand any nodes that are sufficiently small at this point in memory, by calling expandSmallNodes(path: String, times: Int). By default, this will downsample every node to at most 10,000 instances of training data and expand until leaves have fewer than 10 instances. You may need to tune these thresholds, which you can do by setting an implicit Stopper:

```scala
implicit val stopper = FrequencyStopper(10000, 10)
trainer.expandInMemory(args("output") + "/mem", 100)
```

Note that the distributed algorithm will *stop* expanding at the same instance count that the in-memory algorithm wants, i.e., 10,000 instances by default.

Dispatched

If you have mixed features, the recommended value type is `Dispatched[Int, String, Double, String]`, which requires your feature values to match any one of these four cases:

* `Ordinal(v: Int)` for numeric features with a reasonably small number of possible values
* `Nominal(v: String)` for categorical features with a reasonably small number of possible values
* `Continuous(v: Double)` for numeric features with a large or infinite number of possible values
* `Sparse(v: String)` for categorical features with a large or infinite number of possible values

Note that using `Sparse` and especially `Continuous` features will currently slow learning down considerably. (But on the other hand, if you try to use `Ordinal` or `Nominal` with a feature that has hundreds of thousands of unique values, it will be even slower, and then fail).

Example of a features map:

```scala
Map("age" -> Ordinal(35), "gender" -> Nominal("male"), "weight" -> Continuous(130.23), "name" -> Sparse("John"))
```

Extending Brushfire

Brushfire is designed to be extremely pluggable. Some ways you might want to extend it are (from simplest to most involved):

  • Adding a new sampling strategy, to get finer-grained control over how instances are allocated to trees, or between the training set and the test set: define a new Sampler.
  • Adding a new evaluation strategy (such as log-likelihood or entropy): define a new Evaluator.
  • Adding a new feature type, or a new way of binning an existing feature type (such as log-binning real numbers): define a new Splitter.
  • Adding a new target type (such as real-valued targets for regression trees): define a new Evaluator, a new Stopper, and quite likely also a new Splitter for any continuous or sparse feature types you want to be able to use.
  • Adding a new distributed computation platform: define a new equivalent of Trainer, idiomatic to the platform you're using. (There's no specific interface this should implement.)

brushfire's People

Contributors

arelra, avi-stripe, avibryant, brahn, brahn-stripe, colinmarc, df-stripe, echen, erik-stripe, franklin-stripe, gkk-stripe, johnynek, mlm-stripe, roban, roban-stripe, thomas-stripe, tixxit, travisbrown-stripe, vendethiel, vitalyli


brushfire's Issues

output Trainer metadata

A simpler form of #19: instead of having a two-way serialization of the sampler etc., just write out to HDFS somewhere a description of the sampler, evaluator, etc. being used for a given expansion.

`Voter` trait

To encapsulate the various strategies for combining predictions from multiple trees. Probably something like:

```scala
trait Voter[T] {
  type V
  def create(target: T): V
  def monoid: Monoid[V]
  def result(votes: V): T
}
```

with subclasses like `class Plurality[L] extends Voter[Map[L, Long]]`, `class SoftVote[L] extends Voter[Map[L, Long]]`, and so on.
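Under that proposal, a plurality voter might be sketched roughly as follows (hypothetical implementation details only; it extends the trait proposed above and assumes Algebird supplies the implicit Monoid[Map[L, Long]]):

```scala
import com.twitter.algebird.Monoid

class Plurality[L] extends Voter[Map[L, Long]] {
  type V = Map[L, Long]

  // Each tree casts a single vote for its most frequent label.
  def create(target: Map[L, Long]): Map[L, Long] =
    if (target.isEmpty) Map.empty
    else Map(target.maxBy(_._2)._1 -> 1L)

  // Algebird's map monoid sums vote counts label by label.
  def monoid: Monoid[Map[L, Long]] = implicitly[Monoid[Map[L, Long]]]

  // The summed vote counts serve directly as the ensemble's prediction.
  def result(votes: Map[L, Long]): Map[L, Long] = votes
}
```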

Add count to LeafNode

This would let us implement the stopper threshold/sampling logic without needing a full Stopper trait. This should possibly be a prereq to merging #11

Serialize samplers along with trees

It's not totally clear how this should work, but the trees are best interpreted in the context of a specific sampler, so it seems funny not to store that. (This is especially true for OutOfTimeSampler, for example, where it would be better to store the time threshold for later use).

use TreeTraversal in training

We should be able to use the fancy missing-feature traversals during training, too (which may mean splitting weight for an instance of training data across leaves).

Downsample large nodes for in-memory splitting?

Currently, Trainer's expandSmallNodes will just ignore any leaves that are sufficiently large (i.e., that would have been split by the distributed expand, had it been given the chance). Given that you are likely to stop the distributed splitting at some point (so it won't get the chance), you end up in a somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded but the smaller, possibly less important ones do.

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (individually computing the rate based on the current leaf's target) to make sure that each one can fit into memory. This would at least get a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this would create would be underweighted relative to stuff that hadn't been downsampled, but a) it's not clear that matters much and b) they could be fixed up with an updateTargets at the end if desired.

It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way, especially if you only did a small number of levels each time, though I think it would be both more expensive and less effective than the current distributed approach.
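A minimal sketch of the per-leaf rate computation described above, assuming the leaf's target is the usual Map[L, Long] of instance counts (the names here are illustrative, not part of Brushfire's API):

```scala
// Downsampling rate for one leaf: keep everything if it already fits in memory,
// otherwise sample just enough to stay under the in-memory instance budget.
def downsampleRate[L](target: Map[L, Long], maxInstances: Long): Double = {
  val leafCount = target.values.sum
  if (leafCount <= maxInstances) 1.0
  else maxInstances.toDouble / leafCount
}
```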

Thoughts, @snoble @RyW90 @mlmanapat?

per-node error output

Closely related to #24 (in that #24 can be built on top of this): we should be able to produce a representation of the tree in which each node is annotated with its error, meaning not the sum of its leaves' errors but the error of the sum of its leaves' distributions.

separate brushfire-tree from brushfire-training

Some applications, like model servers, will not need the training code. It would be nice if they could depend on a smaller subset of the code. Likely, this would include:

  • Tree
  • Predicate
  • Voter
  • the serialization Injections
  • Dispatched

but hopefully would not include:

  • Instance
  • Splitter
  • Error
  • Sampler

Export `TreeGenerator` and maybe some tests

Our tree generators for ScalaCheck are fairly complex, so if we created a new brushfire-laws package or something that included them, then they could be reused elsewhere.

Brushfire should use Bonsai trees in training

To support faster training (especially in local mode), we could use compressed Bonsai trees directly in Brushfire. There's really no advantage to Brushfire's native tree type, so we should remove it. The idea is that we'd batch our operations for one step (e.g. growing the tree by one level) and make the changes to the Bonsai tree in bulk.

This involves both exposing some new functionality from Bonsai, and replacing the use of trees (and the use of the tree ops type class) in Brushfire.

Multi-label targets

It would be useful to support multi-label (not just multi-class) prediction targets.

Voter refactor to work with sumLeaves

If we follow through with #40 and #42, then prediction will be handled by sumLeaves and each current Voter will essentially become a VectorSpace instance + a method to turn targets into predictions (vectors in the vectorspace). It may also be that we get rid of Voter completely, and instead just create various distribution types + VectorSpace instances for them.

Add some property-based tests

Most of the interfaces brushfire exposes have "laws" associated with them which we could write tests for, such that any new implementation would have to conform to them. This would make it much easier to extend brushfire with confidence.
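For example, one such law could be checked with ScalaCheck roughly as follows (illustrative only; it assumes Algebird's map monoid is what aggregates Map[String, Long] target distributions, and the object name is invented):

```scala
import com.twitter.algebird.Monoid
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// Illustrative sketch: associativity of the monoid used to sum target distributions.
object TargetMonoidLaws extends Properties("TargetMonoid") {
  val monoid = implicitly[Monoid[Map[String, Long]]]

  property("plus is associative") =
    forAll { (a: Map[String, Long], b: Map[String, Long], c: Map[String, Long]) =>
      monoid.plus(monoid.plus(a, b), c) == monoid.plus(a, monoid.plus(b, c))
    }
}
```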

Add an Encoder to the pipeline

It would be helpful to have a concept of an Encoder[K,U,V] which can transform Map[K,U] to the Map[K,V] used by a given set of trees, and which can be serialized along with the trees.

We also want (in some cases) to be able to "train" these encoders from a training set - for example to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction etc.
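A minimal sketch of the shape such an Encoder could take, based on the description above (names and methods here are hypothetical, not an existing Brushfire interface):

```scala
// Hypothetical: transforms raw feature maps into the value type a given set of trees expects.
trait Encoder[K, U, V] {
  def encode(features: Map[K, U]): Map[K, V]
}

// Hypothetical: an encoder that can be "trained" from a corpus of raw feature maps,
// e.g. to decide which numeric features are continuous vs. discrete.
trait EncoderTrainer[K, U, V] {
  def train(examples: Iterable[Map[K, U]]): Encoder[K, U, V]
}
```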

error-minimizing pruner

This is only really relevant to people building single-tree models, but you should be able to prune a single tree to minimize validation error.

Stacking

Having trained N trees, you can then treat each of those N predictions as a feature itself, and run a logistic regression over that (thus "stacking" a logistic regression model on top of the decision tree model). It would be interesting to experiment with this.

Add Spark support

Most of the code is generic enough to run on a framework other than Scalding. However, there is some dependency on the new Execution module of Scalding that I couldn't completely get around.
Is it possible to refactor the code so that it is less Scalding-specific, or just explain to me how it all works, and I'll try to do it?

README points to nonexistent artifacts

The instructions under Quick Start are incorrect. They show brushfire being compiled for multiple Scala versions, but neither a brushfire_2.10 nor a brushfire_2.11 artifact exists on oss.sonatype.org, only a non-versioned brushfire.

Include Ordering[E] in Error[T, P, E]

It seems like we'll need an ordering at some point in order for an error to be useful. In a world of incoherent type classes, it may be prudent to trap the Ordering in the Error instance as well, rather than relying on the correct instance being passed in as an implicit parameter.

Add new `sumLeaves` method (or similar)

We currently get only one leaf from leafFor, but we may want to allow paths to diverge during tree evaluation (e.g. when a feature is missing). To do this, we should add a new sumLeaves method which allows us to traverse down multiple paths and aggregate the results using the prior probabilities of each diverging path.

During tree traversal, we'd always follow edges whose predicates return true, and we would allow nodes to have multiple true edges. For each true edge, we'd determine its prior probability relative to all true edges, use it to scale the result returned for that edge, and then sum the results. The method may look something like:

```scala
def sumLeaves[C[_], A: Numeric](row: Map[K, V])(f: Leaf => C[A])(implicit vs: VectorSpace[A, C]): C[A] = ???
```

Unify Error and Evaluator

It should be sufficiently general/flexible to always just split to minimize training error, rather than having a separate evaluator.
