brushfire's Issues

Add count to LeafNode

This would let us implement the stopper threshold/sampling logic without needing a full Stopper trait. This should possibly be a prereq to merging #11
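
For concreteness, a minimal sketch of what this might enable (field and method names here are illustrative, not Brushfire's actual definitions):

case class LeafNode[V, T](
  index: Int,
  target: T,
  count: Long)  // number of training instances that reached this leaf

// the kind of threshold check a count would make possible:
def shouldExpand(leaf: LeafNode[_, _], minInstances: Long): Boolean =
  leaf.count >= minInstances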

Export `TreeGenerator` and maybe some tests

Our tree generators for ScalaCheck are fairly complex, so if we created a new brushfire-laws package or something that included them, they could be re-used elsewhere.
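
As a rough sketch of the kind of generator such a package might export (the tree type here is deliberately simplified; the real generators build Brushfire trees):

import org.scalacheck.Gen

sealed trait SimpleTree
case class SimpleLeaf(target: Long) extends SimpleTree
case class SimpleSplit(left: SimpleTree, right: SimpleTree) extends SimpleTree

def genTree(depth: Int): Gen[SimpleTree] =
  if (depth <= 0) Gen.posNum[Long].map(SimpleLeaf(_))
  else Gen.oneOf(
    Gen.posNum[Long].map(SimpleLeaf(_)),
    for {
      left  <- genTree(depth - 1)
      right <- genTree(depth - 1)
    } yield SimpleSplit(left, right))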

`Voter` trait

To encapsulate the various strategies for combining predictions from multiple trees. Probably something like:

trait Voter[T] {
  type V
  def create(target: T): V
  def monoid: Monoid[V]
  def result(votes: V): T
}

with subclasses like `class Plurality[L] extends Voter[Map[L, Long]]`, `class SoftVote[L] extends Voter[Map[L, Long]]`, and so on.
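
For example, a hedged sketch of Plurality under this design (assuming Algebird's Monoid; untested):

import com.twitter.algebird.Monoid

class Plurality[L] extends Voter[Map[L, Long]] {
  type V = Map[L, Long]  // accumulated vote counts per label
  // each tree casts a single vote for its most frequent label
  def create(target: Map[L, Long]): V =
    target.toList.sortBy(-_._2).headOption
      .map { case (label, _) => Map(label -> 1L) }
      .getOrElse(Map.empty)
  def monoid: Monoid[V] = Monoid.mapMonoid[L, Long]
  def result(votes: V): Map[L, Long] = votes
}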

Include Ordering[E] in Error[T, P, E]

It seems like we'll need an ordering at some point in order for an error to be useful. In a world of incoherent type classes, it may be prudent to trap the Ordering in the Error instance as well, rather than relying on the correct instance being passed in as an implicit parameter.
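
Concretely, that might look something like this (a sketch of the shape, not the exact current trait):

import com.twitter.algebird.Semigroup

trait Error[T, P, E] {
  def create(actual: T, predicted: P): E  // error contribution of one instance
  def semigroup: Semigroup[E]             // how to combine errors
  def ordering: Ordering[E]               // trapped here, so the right instance travels with the Error
}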

Serialize samplers along with trees

It's not totally clear how this should work, but the trees are best interpreted in the context of a specific sampler, so it seems funny not to store that. (This is especially true for OutOfTimeSampler, for example, where it would be better to store the time threshold for later use).

Add an Encoder to the pipeline

It would be helpful to have a concept of an Encoder[K,U,V] which can transform Map[K,U] to the Map[K,V] used by a given set of trees, and which can be serialized along with the trees.

We also want (in some cases) to be able to "train" these encoders from a training set - for example to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction etc.
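
A minimal sketch of the shape this could take (EncoderTrainer is a hypothetical name):

trait Encoder[K, U, V] {
  // transform a raw row into the feature space a given set of trees expects
  def encode(row: Map[K, U]): Map[K, V]
}

// some encoders would be "trained" from a training set before use:
trait EncoderTrainer[K, U, V] {
  def train(rows: Iterable[Map[K, U]]): Encoder[K, U, V]
}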

Unify Error and Evaluator

It should be sufficiently general/flexible to always just split to minimize training error, rather than having a separate evaluator.

per-node error output

closely related to #24 (in that it can be built on top of this): we should be able to produce some representation of the tree with each node annotated with its error - not the sum of its leaves' errors, but the error of the sum of its leaves' distributions.
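
To make the distinction concrete, a toy sketch with simplified distributions (the error function here is an illustrative majority-vote misclassification rate):

// toy error: fraction of instances not in the majority class of a distribution
def error(dist: Map[String, Long]): Double = {
  val total = dist.values.sum.toDouble
  if (total == 0) 0.0 else 1.0 - dist.values.max / total
}

def mergedDistribution(leaves: Seq[Map[String, Long]]): Map[String, Long] =
  leaves.foldLeft(Map.empty[String, Long]) { (acc, d) =>
    d.foldLeft(acc) { case (a, (k, v)) => a.updated(k, a.getOrElse(k, 0L) + v) }
  }

// annotate a node with error(mergedDistribution(itsLeaves)),
// which is not the same quantity as itsLeaves.map(error).sum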

Downsample large nodes for in-memory splitting?

Currently, Trainer's expandSmallNodes will just ignore any leaves that are sufficiently large (ie, that would have been split by the distributed expand, if it were given the chance). Given that you are likely to stop the distributed splitting at some point (and so it won't get the chance), you end up in this somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded but the smaller, possibly less important ones do.

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (individually computing the rate based on the current leaf's target) to make sure that each one can fit into memory. This would at least get a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this would create would be underweighted relative to stuff that hadn't been downsampled, but a) it's not clear that matters much and b) they could be fixed up with an updateTargets at the end if desired.

It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way, especially if you only did a small number of levels each time, though I think it would be both more expensive and less effective than the current distributed approach.
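
As a rough sketch of the per-leaf rate computation this implies (maxInMemory is a hypothetical memory budget):

// downsample each large leaf just enough that its instances fit in memory;
// leafCount would come from the leaf's target distribution
def sampleRate(leafCount: Long, maxInMemory: Long): Double =
  math.min(1.0, maxInMemory.toDouble / leafCount)

// keep an instance with that probability, e.g. deterministically via a seeded hash
def keep(instanceHash: Int, rate: Double): Boolean =
  (instanceHash & 0x7fffffff) / Int.MaxValue.toDouble < rate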

Thoughts @snoble @RyW90 @mlmanapat ?

Stacking

Having trained N trees, you can then treat each of those N predictions as a feature itself, and run a logistic regression over that (thus "stacking" a logistic regression model on top of the decision tree model). It would be interesting to experiment with this.
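
A minimal illustration of the idea, with no particular logistic regression library assumed:

// each trained tree contributes one feature: its prediction for the row
def stackedFeatures[Row](trees: Seq[Row => Double], row: Row): Array[Double] =
  trees.map(_(row)).toArray

// the stacked model is then an ordinary logistic regression over those features
def predict(weights: Array[Double], bias: Double, features: Array[Double]): Double = {
  val z = bias + weights.zip(features).map { case (w, x) => w * x }.sum
  1.0 / (1.0 + math.exp(-z))  // sigmoid
}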

separate brushfire-tree from brushfire-training

Some applications, like model servers, will not need the training code. It would be nice if they could depend on a smaller subset of the code (see the sbt sketch after these lists). Likely, this would include:

  • Tree
  • Predicate
  • Voter
  • the serialization Injections
  • Dispatched

but hopefully would not include:

  • Instance
  • Splitter
  • Error
  • Sampler
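
In sbt terms the split might look something like this (module names as proposed above, everything else illustrative):

// build.sbt sketch: brushfire-training depends on brushfire-tree,
// so model servers can depend on brushfire-tree alone
lazy val brushfireTree = project.in(file("brushfire-tree"))

lazy val brushfireTraining = project.in(file("brushfire-training"))
  .dependsOn(brushfireTree)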

Add new `sumLeaves` method (or similar)

We currently get only one leaf from leafFor, but we may want to allow paths to diverge during tree evaluation (e.g. when a feature is missing). To do this, we should add a new sumLeaves method which allows us to traverse down multiple paths and aggregate the results using the prior probabilities of each diverging path.

During tree traversal, we'd always follow edges whose predicates return true, and we would allow nodes to have multiple true predicates. For each true edge, we'd determine its prior probability relative to all true edges, use it to scale the result returned for that edge, and then sum the results. The method may look something like:

def sumLeaves[C[_], A: Numeric](row: Map[K, V])(f: Leaf => C[A])(implicit vs: VectorSpace[A, C]): C[A] = ???
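
For illustration, a self-contained sketch of the weighted multi-path traversal described above (simplified types, not Brushfire's actual Node and Leaf):

sealed trait Node
case class Leaf(target: Map[String, Long]) extends Node
// each child: (predicate, prior weight, subtree)
case class Split(children: List[(Map[String, Double] => Boolean, Double, Node)]) extends Node

def sumLeaves(node: Node, row: Map[String, Double], weight: Double = 1.0): Map[String, Double] =
  node match {
    case Leaf(target) =>
      target.map { case (k, v) => k -> v * weight }
    case Split(children) =>
      // follow every edge whose predicate is true, dividing the weight
      // among them in proportion to their prior weights
      val matching = children.filter { case (pred, _, _) => pred(row) }
      val total = matching.map(_._2).sum
      matching
        .map { case (_, prior, child) => sumLeaves(child, row, weight * prior / total) }
        .foldLeft(Map.empty[String, Double]) { (acc, m) =>
          m.foldLeft(acc) { case (a, (k, v)) => a.updated(k, a.getOrElse(k, 0.0) + v) }
        }
  }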

Add some property-based tests

Most of the interfaces brushfire exposes have "laws" associated with them which we could write tests for, such that any new implementation would have to conform to them. This would make it much easier to extend brushfire with confidence.
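
For example, the vote-combining operation described in the `Voter` issue above should be associative; with ScalaCheck that law might look like this (illustrative, using a plain map-merge as the combine):

import org.scalacheck.{Prop, Properties}

object VoterLaws extends Properties("Voter") {
  // combining votes must be associative, or distributed aggregation
  // of predictions would depend on how the data happened to be partitioned
  def combine(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  property("associativity") = Prop.forAll {
    (x: Map[String, Long], y: Map[String, Long], z: Map[String, Long]) =>
      combine(combine(x, y), z) == combine(x, combine(y, z))
  }
}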

Add Spark support

Most of the code is generic enough to run on a framework other than Scalding. However, there is some dependency on the new Execution module of Scalding that I couldn't completely get around.
Is it possible to refactor the code to be less Scalding-specific, or could someone explain how it all works so I can try to do it myself?

Voter refactor to work with sumLeaves

If we follow through with #40 and #42, then prediction will be handled by sumLeaves and each current Voter will essentially become a VectorSpace instance + a method to turn targets into predictions (vectors in the vector space). It may also be that we get rid of Voter completely, and instead just create various distribution types + VectorSpace instances for them.

Brushfire should use Bonsai trees in training

To support faster training (especially in local mode) we could be using compressed Bonsai trees directly in Brushfire. There's really no advantage to Brushfire's native tree type, so we should remove it. The idea is that we'd batch our operations for one step (e.g. growing the tree by one level) and make the changes to the Bonsai tree in bulk.

This involves both exposing some new functionality from Bonsai, and replacing the use of trees (and the use of the tree ops type class) in Brushfire.

README points to nonexistent artifacts

The instructions under Quick Start are incorrect: they show brushfire compiled for multiple Scala versions, but neither a brushfire_2.10 nor a brushfire_2.11 artifact exists on oss.sonatype.org, only an unversioned brushfire.

output Trainer metadata

A simpler form of #19 - instead of having a two-way serialization of the sampler etc., just write out to HDFS somewhere a description of the sampler, evaluator, etc. being used for a given expansion.

Multi-label targets

It would be useful to support multi-label (not just multi-class) prediction targets.

error-minimizing pruner

This is only really relevant to people building single-tree models, but you should be able to prune a single tree to minimize validation error.
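
A rough sketch of the greedy bottom-up version (reduced-error pruning over simplified types; a real pruner would go through the Error type class and a validation set):

sealed trait PTree
case class PLeaf(dist: Map[String, Long]) extends PTree  // validation counts per label
case class PSplit(left: PTree, right: PTree) extends PTree

// validation instances misclassified under a majority-vote prediction
def err(dist: Map[String, Long]): Long =
  if (dist.isEmpty) 0L else dist.values.sum - dist.values.max

def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
  b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

def distOf(t: PTree): Map[String, Long] = t match {
  case PLeaf(d)     => d
  case PSplit(l, r) => merge(distOf(l), distOf(r))
}

def errOf(t: PTree): Long = t match {
  case PLeaf(d)     => err(d)
  case PSplit(l, r) => errOf(l) + errOf(r)
}

// collapse a split into a leaf whenever the merged leaf does no worse
def prune(t: PTree): PTree = t match {
  case leaf: PLeaf => leaf
  case PSplit(l, r) =>
    val pruned = PSplit(prune(l), prune(r))
    val merged = PLeaf(distOf(pruned))
    if (err(merged.dist) <= errOf(pruned)) merged else pruned
}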

use TreeTraversal in training

We should be able to use the fancy missing-feature traversals during training, too (which may mean splitting weight for an instance of training data across leaves).
