stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala.
License: Other
This would let us implement the stopper threshold/sampling logic without needing a full `Stopper` trait. This should possibly be a prereq to merging #11.
Our tree generators for ScalaCheck are fairly complex, so if we created a new `brushfire-laws` package or something that included them, they could be re-used elsewhere.
We already have the timestamp in `Instance`; we should make it easy to create target distributions from a label and a half-life.
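As an illustration, a minimal sketch of what that could look like, assuming a `Double`-weighted target distribution and algebird's map monoid (`decayedTarget` and its parameters are hypothetical names):

```scala
import com.twitter.algebird.Monoid

// weight an instance's label by exponential decay: an observation `age`
// time units old contributes 2^(-age/halfLife) to its label's mass
def decayedTarget[L](label: L, timestamp: Long, now: Long, halfLife: Long): Map[L, Double] = {
  val age = (now - timestamp).toDouble
  Map(label -> math.pow(2.0, -age / halfLife))
}

// the usual map monoid then sums the decayed weights per label
def combine[L](dists: Seq[Map[L, Double]]): Map[L, Double] =
  Monoid.sum(dists)
```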
To encapsulate the various strategies for combining predictions from multiple trees. Probably something like:

```scala
trait Voter[T] {
  type V
  def create(target: T): V
  def monoid: Monoid[V]
  def result(votes: V): T
}
```

with subclasses like `class Plurality[L] extends Voter[Map[L, Long]]`, `class SoftVote[L] extends Voter[Map[L, Long]]`, and so on.
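For example, a minimal sketch of what `Plurality` could look like under this design, assuming algebird's `Monoid` (the eventual API may well differ):

```scala
import com.twitter.algebird.Monoid

// each tree casts one vote, for its target's most frequent label; votes
// are merged by the map monoid, and the winning label takes the result
class Plurality[L] extends Voter[Map[L, Long]] {
  type V = Map[L, Long]
  def create(target: Map[L, Long]): V = Map(target.maxBy(_._2)._1 -> 1L)
  def monoid: Monoid[V] = implicitly[Monoid[Map[L, Long]]]
  def result(votes: V): Map[L, Long] = Map(votes.maxBy(_._2))
}
```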
It seems like we'll need an ordering at some point in order for an error to be useful. In a world of incoherent type classes, it may be prudent to trap the `Ordering` in the `Error` instance as well, rather than relying on the correct instance being passed in as an implicit parameter.
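Concretely, the shape might be something like this (a hypothetical sketch, not the current `Error` signature):

```scala
// the Ordering lives on the instance itself, so call sites can't
// accidentally summon a different (incoherent) one implicitly
trait Error[T, E] {
  def error(actual: T, predicted: T): E
  def ordering: Ordering[E]
  def isBetter(a: E, b: E): Boolean = ordering.lt(a, b)
}
```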
It's not totally clear how this should work, but the trees are best interpreted in the context of a specific sampler, so it seems funny not to store that. (This is especially true for OutOfTimeSampler, for example, where it would be better to store the time threshold for later use).
It would be helpful to have a concept of an `Encoder[K,U,V]` which can transform the `Map[K,U]` to the `Map[K,V]` used by a given set of trees, and which can be serialized along with the trees. We also want (in some cases) to be able to "train" these encoders from a training set: for example, to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction, etc.
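A hedged sketch of the shape this could take (the trainer trait and its `train` method are just one possible design):

```scala
// the core concept: a serializable transformation from raw feature maps
// to the encoded feature maps a given set of trees expects
trait Encoder[K, U, V] extends Serializable {
  def encode(row: Map[K, U]): Map[K, V]
}

// a "trainable" encoder is produced from the training set itself, e.g. by
// scanning it to decide which numeric features are continuous vs. discrete
trait EncoderTrainer[K, U, V] {
  def train(trainingSet: Iterable[Map[K, U]]): Encoder[K, U, V]
}
```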
It should be sufficiently general/flexible to always just split to minimize training error, rather than having a separate evaluator.
Closely related to #24 (in that #24 can be built on top of this): we should be able to produce some representation of the tree with each node annotated with its error: not the sum of its leaves' errors, but the error of the sum of its leaves' distributions.
Currently, Trainer's `expandSmallNodes` will just ignore any leaves that are sufficiently large (i.e. that would have been split by the distributed expand, had it been given the chance). Given that you are likely to stop the distributed splitting at some point (and so it won't get the chance), you end up in a somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded, but the smaller, possibly less important ones do.

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (individually computing the rate based on the current leaf's target) to make sure each one can fit into memory. This would at least get a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this would create would be underweighted relative to anything that hadn't been downsampled, but (a) it's not clear that matters much, and (b) they could be fixed up with an `updateTargets` at the end if desired.
It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way, especially if you only did a small number of levels each time, though I think it would be both more expensive and less effective than the current distributed approach.
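A minimal sketch of the per-leaf downsampling, with hypothetical names (`maxInMemory` is the instance budget per leaf; `leafCount` would come from the leaf's current target distribution):

```scala
import scala.util.Random

// keep each instance with probability maxInMemory / leafCount, so each
// leaf's sample is expected to fit in memory; small leaves are untouched
def sampleRate(leafCount: Long, maxInMemory: Long): Double =
  math.min(1.0, maxInMemory.toDouble / leafCount)

def keepInstance(rng: Random, rate: Double): Boolean =
  rng.nextDouble() < rate
```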
Thoughts, @snoble @RyW90 @mlmanapat?
Having trained N trees, you can then treat each of those N predictions as a feature itself, and run a logistic regression over them (thus "stacking" a logistic regression model on top of the decision tree model). It would be interesting to experiment with this.
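As a sketch of the scoring side, assuming each tree produces a single score (say, P(positive)) and the logistic regression has already been fit over those N scores:

```scala
// stacked prediction: a sigmoid over a learned linear combination of the
// per-tree scores (weights and bias come from the logistic regression)
def stackedScore(treeScores: Vector[Double], weights: Vector[Double], bias: Double): Double = {
  require(treeScores.length == weights.length, "one weight per tree")
  val z = bias + treeScores.zip(weights).map { case (s, w) => s * w }.sum
  1.0 / (1.0 + math.exp(-z))
}
```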
Create a `SparseEqualTo[V]` predicate that defaults to `false` for missing data (e.g. `None`).
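A minimal sketch, assuming predicates are evaluated against an `Option[V]` for possibly-missing features:

```scala
// equality predicate that treats missing data as "not equal":
// SparseEqualTo("us")(Some("us")) == true, SparseEqualTo("us")(None) == false
case class SparseEqualTo[V](value: V) extends (Option[V] => Boolean) {
  def apply(v: Option[V]): Boolean = v.exists(_ == value)
}
```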
Some applications, like model servers, will not need the training code. It would be nice if they could depend on a smaller subset of the code. Likely, this would include:
but hopefully would not include:
Not sure why. Commenting out the plugin from `brushfire-parent/pom.xml` gets the big jars building again.
We currently get only one leaf from `leafFor`, but it is possible we may want to allow paths to diverge during tree evaluation (e.g. when a feature is missing). To do this, we should add a new `sumLeaves` method which allows us to traverse down multiple paths and aggregate the results using the prior probabilities of each diverging path.

During tree traversal, we'd always follow edges whose predicates return `true`, and we would also allow nodes to have multiple `true` predicates. For each `true` edge, we'd determine its prior probability relative to all `true` edges, use it to scale the result returned for that edge, and then sum the results. The method may look something like:
```scala
def sumLeaves[C[_], A: Numeric](row: Map[K, V])(f: Leaf => C[A])(implicit vs: VectorSpace[A, C]): C[A] = ???
```
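To make the intended semantics concrete, here is a self-contained sketch over a simplified node type (not brushfire's actual tree, and specialized to `Double` rather than a generic vector space):

```scala
// simplified tree: a split holds (predicate, prior probability, child) edges
sealed trait SimpleNode[V, T]
case class SimpleLeaf[V, T](target: T) extends SimpleNode[V, T]
case class SimpleSplit[V, T](edges: Seq[(V => Boolean, Double, SimpleNode[V, T])]) extends SimpleNode[V, T]

// follow every edge whose predicate is true, renormalize the priors of the
// surviving edges, and sum the scaled results of each sub-traversal
def sumLeaves[V, T](node: SimpleNode[V, T], value: V)(f: T => Double): Double = node match {
  case SimpleLeaf(target) => f(target)
  case SimpleSplit(edges) =>
    val live = edges.filter { case (pred, _, _) => pred(value) }
    val total = live.map { case (_, prior, _) => prior }.sum
    live.map { case (_, prior, child) =>
      (prior / total) * sumLeaves(child, value)(f)
    }.sum
}
```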
Most of the interfaces brushfire exposes have "laws" associated with them which we could write tests for, such that any new implementation would have to conform to them. This would make it much easier to extend brushfire with confidence.
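For example, the monoid laws behind something like `Voter` could be checked once, generically, with ScalaCheck (a sketch assuming algebird's `Monoid`):

```scala
import com.twitter.algebird.Monoid
import org.scalacheck.{Arbitrary, Prop}
import org.scalacheck.Prop.forAll

// associativity and identity: laws any aggregation monoid must satisfy
// for results to be independent of how the reduction is grouped
def monoidLaws[V: Arbitrary](m: Monoid[V]): Prop =
  forAll { (a: V, b: V, c: V) =>
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c)) &&
      m.plus(m.zero, a) == a &&
      m.plus(a, m.zero) == a
  }
```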
Most of the code is generic enough to run on a framework other than Scalding. However, there is some dependency on the new Execution module of Scalding that I couldn't completely get around. Is it possible to refactor the code to be less Scalding-specific, or could you explain how it all works, and I'll try to do it?
If we follow through with #40 and #42, then prediction will be handled by `sumLeaves`, and each current `Voter` will essentially become a `VectorSpace` instance plus a method to turn targets into predictions (vectors in the vector space). It may also be that we get rid of `Voter` completely, and instead just create various distribution types plus `VectorSpace` instances for them.
To support faster training (especially in local mode) we could use compressed Bonsai trees directly in Brushfire. There's really no advantage to Brushfire's native tree type, so we should remove it. The idea is that we'd batch our operations for one step (e.g. growing the tree by 1) and make the changes to the Bonsai tree in bulk.

This involves both exposing some new functionality from Bonsai and replacing the use of trees (and the use of the tree-ops type class) in Brushfire.
The instructions under Quick Start are incorrect. They show that brushfire is compiled for multiple Scala versions; however, neither brushfire_2.10 nor brushfire_2.11 exists in oss.sonatype.org, only an unversioned brushfire.
If data is missing, we should return `true` (indicating the edge should be followed). This works in conjunction with #40.
A simpler form of #19 - instead of having a two-way serialization of sampler etc, just write out to HDFS somewhere a description of the sampler, evaluator, etc being used for a given expansion.
It would be useful to support multi-label (not just multi-class) prediction targets.
This is only really relevant to people building single-tree models, but you should be able to prune a single tree to minimize validation error.
We should be able to use the fancy missing-feature traversals during training, too (which may mean splitting weight for an instance of training data across leaves).
I think we can do this fairly easily with `(QTree, QTree)`, i.e. one for each sign.
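A hedged sketch of the idea, using algebird's `QTree` (assuming its `apply(v: Double)` constructor): keep one tree per sign and route each observation by sign, storing magnitudes on the negative side, so quantiles from that tree must be negated on the way out.

```scala
import com.twitter.algebird.QTree

// a signed distribution as (negatives, positives); negative observations
// are stored by magnitude in the first tree
def signedValue(x: Double): (Option[QTree[Double]], Option[QTree[Double]]) =
  if (x >= 0.0) (None, Some(QTree(x)))
  else (Some(QTree(-x)), None)
```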