stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala.
License: Other
This would let us implement the stopper threshold/sampling logic without needing a full `Stopper` trait. This should possibly be a prereq to merging #11.
Our tree generators for ScalaCheck are fairly complex, so if we created a new `brushfire-laws` package or something that included them, they could be re-used elsewhere.
We already have the timestamp in `Instance`; we should make it easy to create target distributions from a label and a half-life.
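As an illustration, a minimal sketch of what that could look like, assuming a `Double`-weighted target distribution and algebird's map monoid (`decayedTarget` and its parameters are hypothetical names):

```scala
import com.twitter.algebird.Monoid

// weight an instance's label by exponential decay: an observation `age`
// time units old contributes 2^(-age/halfLife) to its label's mass
def decayedTarget[L](label: L, timestamp: Long, now: Long, halfLife: Long): Map[L, Double] = {
  val age = (now - timestamp).toDouble
  Map(label -> math.pow(2.0, -age / halfLife))
}

// the usual map monoid then sums the decayed weights per label
def combine[L](dists: Seq[Map[L, Double]]): Map[L, Double] =
  Monoid.sum(dists)
```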
To encapsulate the various strategies for combining predictions from multiple trees. Probably something like:

```scala
trait Voter[T] {
  type V
  def create(target: T): V
  def monoid: Monoid[V]
  def result(votes: V): T
}
```

with subclasses like `class Plurality[L] extends Voter[Map[L, Long]]`, `class SoftVote[L] extends Voter[Map[L, Long]]`, and so on.
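For example, a minimal sketch of what `Plurality` could look like under this design, assuming algebird's `Monoid` (the eventual API may well differ):

```scala
import com.twitter.algebird.Monoid

// each tree casts one vote, for its target's most frequent label; votes
// are merged by the map monoid, and the winning label takes the result
class Plurality[L] extends Voter[Map[L, Long]] {
  type V = Map[L, Long]
  def create(target: Map[L, Long]): V = Map(target.maxBy(_._2)._1 -> 1L)
  def monoid: Monoid[V] = implicitly[Monoid[Map[L, Long]]]
  def result(votes: V): Map[L, Long] = Map(votes.maxBy(_._2))
}
```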
It seems like we'll need an ordering at some point in order for an error to be useful. In a world of incoherent type classes, it may be prudent to trap the `Ordering` in the `Error` instance as well, rather than relying on the correct instance being passed in as an implicit parameter.
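Concretely, the shape might be something like this (a hypothetical sketch, not the current `Error` signature):

```scala
// the Ordering lives on the instance itself, so call sites can't
// accidentally summon a different (incoherent) one implicitly
trait Error[T, E] {
  def error(actual: T, predicted: T): E
  def ordering: Ordering[E]
  def isBetter(a: E, b: E): Boolean = ordering.lt(a, b)
}
```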
It's not totally clear how this should work, but the trees are best interpreted in the context of a specific sampler, so it seems funny not to store that. (This is especially true for OutOfTimeSampler, for example, where it would be better to store the time threshold for later use).
It would be helpful to have a concept of an `Encoder[K,U,V]` which can transform the `Map[K,U]` to the `Map[K,V]` used by a given set of trees, and which can be serialized along with the trees. We also want (in some cases) to be able to "train" these encoders from a training set: for example, to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction, etc.
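A hedged sketch of the shape this could take (the trainer trait and its `train` method are just one possible design):

```scala
// the core concept: a serializable transformation from raw feature maps
// to the encoded feature maps a given set of trees expects
trait Encoder[K, U, V] extends Serializable {
  def encode(row: Map[K, U]): Map[K, V]
}

// a "trainable" encoder is produced from the training set itself, e.g. by
// scanning it to decide which numeric features are continuous vs. discrete
trait EncoderTrainer[K, U, V] {
  def train(trainingSet: Iterable[Map[K, U]]): Encoder[K, U, V]
}
```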
It should be sufficiently general/flexible to always just split to minimize training error, rather than having a separate evaluator.
Closely related to #24 (in that #24 can be built on top of this): we should be able to produce some representation of the tree with each node annotated with its error: not the sum of its leaves' errors, but the error of the sum of its leaves' distributions.
Currently, Trainer's `expandSmallNodes` will just ignore any leaves that are sufficiently large (i.e. that would have been split by the distributed expand, had it been given the chance). Given that you are likely to stop the distributed splitting at some point (and so it won't get the chance), you end up in a somewhat strange dynamic where the largest (and thus perhaps most important) nodes don't get fully expanded, but the smaller, possibly less important ones do.

An alternative would be to have the in-memory algorithm expand everything, but downsample at each node (individually computing the rate based on the current leaf's target) to make sure each one can fit into memory. This would at least get a few more levels of depth for those nodes, although they wouldn't go as deep as a true, distributed full expansion would. The distributions of the leaves this would create would be underweighted relative to anything that hadn't been downsampled, but (a) it's not clear that matters much, and (b) they could be fixed up with an `updateTargets` at the end if desired.
It's also interesting to note that you could iterate this, with progressively less downsampling needed each time, or even build the whole tree this way, especially if you only did a small number of levels each time, though I think it would be both more expensive and less effective than the current distributed approach.
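A minimal sketch of the per-leaf downsampling, with hypothetical names (`maxInMemory` is the instance budget per leaf; `leafCount` would come from the leaf's current target distribution):

```scala
import scala.util.Random

// keep each instance with probability maxInMemory / leafCount, so each
// leaf's sample is expected to fit in memory; small leaves are untouched
def sampleRate(leafCount: Long, maxInMemory: Long): Double =
  math.min(1.0, maxInMemory.toDouble / leafCount)

def keepInstance(rng: Random, rate: Double): Boolean =
  rng.nextDouble() < rate
```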
Thoughts, @snoble @RyW90 @mlmanapat?
Having trained N trees, you can then treat each of those N predictions as a feature itself, and run a logistic regression over them (thus "stacking" a logistic regression model on top of the decision tree model). It would be interesting to experiment with this.
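As a sketch of the scoring side, assuming each tree produces a single score (say, P(positive)) and the logistic regression has already been fit over those N scores:

```scala
// stacked prediction: a sigmoid over a learned linear combination of the
// per-tree scores (weights and bias come from the logistic regression)
def stackedScore(treeScores: Vector[Double], weights: Vector[Double], bias: Double): Double = {
  require(treeScores.length == weights.length, "one weight per tree")
  val z = bias + treeScores.zip(weights).map { case (s, w) => s * w }.sum
  1.0 / (1.0 + math.exp(-z))
}
```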
Create a `SparseEqualTo[V]` predicate that defaults to `false` for missing data (e.g. `None`).
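A minimal sketch, assuming predicates are evaluated against an `Option[V]` for possibly-missing features:

```scala
// equality predicate that treats missing data as "not equal":
// SparseEqualTo("us")(Some("us")) == true, SparseEqualTo("us")(None) == false
case class SparseEqualTo[V](value: V) extends (Option[V] => Boolean) {
  def apply(v: Option[V]): Boolean = v.exists(_ == value)
}
```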
Some applications, like model servers, will not need the training code. It would be nice if they could depend on a smaller subset of the code. Likely, this would include:
but hopefully would not include:
Not sure why. Commenting out the plugin from `brushfire-parent/pom.xml` gets the big jars building again.
We currently get only one leaf from `leafFor`, but it is possible we may want to allow paths to diverge during tree evaluation (e.g. when a feature is missing). To do this, we should add a new `sumLeaves` method which allows us to traverse down multiple paths and aggregate the results using the prior probabilities of each diverging path.

During tree traversal, we'd always follow edges whose predicates return `true`, and we would also allow nodes to have multiple `true` predicates. For each `true` edge, we'd determine its prior probability relative to all `true` edges, use it to scale the result returned for that edge, and then sum the results. The method may look something like:
```scala
def sumLeaves[C[_], A: Numeric](row: Map[K, V])(f: Leaf => C[A])(implicit vs: VectorSpace[A, C]): C[A] = ???
```
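To make the intended semantics concrete, here is a self-contained sketch over a simplified node type (not brushfire's actual tree, and specialized to `Double` rather than a generic vector space):

```scala
// simplified tree: a split holds (predicate, prior probability, child) edges
sealed trait SimpleNode[V, T]
case class SimpleLeaf[V, T](target: T) extends SimpleNode[V, T]
case class SimpleSplit[V, T](edges: Seq[(V => Boolean, Double, SimpleNode[V, T])]) extends SimpleNode[V, T]

// follow every edge whose predicate is true, renormalize the priors of the
// surviving edges, and sum the scaled results of each sub-traversal
def sumLeaves[V, T](node: SimpleNode[V, T], value: V)(f: T => Double): Double = node match {
  case SimpleLeaf(target) => f(target)
  case SimpleSplit(edges) =>
    val live = edges.filter { case (pred, _, _) => pred(value) }
    val total = live.map { case (_, prior, _) => prior }.sum
    live.map { case (_, prior, child) =>
      (prior / total) * sumLeaves(child, value)(f)
    }.sum
}
```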
Most of the interfaces brushfire exposes have "laws" associated with them which we could write tests for, such that any new implementation would have to conform to them. This would make it much easier to extend brushfire with confidence.
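For example, the monoid laws behind something like `Voter` could be checked once, generically, with ScalaCheck (a sketch assuming algebird's `Monoid`):

```scala
import com.twitter.algebird.Monoid
import org.scalacheck.{Arbitrary, Prop}
import org.scalacheck.Prop.forAll

// associativity and identity: laws any aggregation monoid must satisfy
// for results to be independent of how the reduction is grouped
def monoidLaws[V: Arbitrary](m: Monoid[V]): Prop =
  forAll { (a: V, b: V, c: V) =>
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c)) &&
      m.plus(m.zero, a) == a &&
      m.plus(a, m.zero) == a
  }
```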
Most of the code is generic enough to run on a framework other than Scalding. However, there is some dependency on the new Execution module of Scalding that I couldn't completely get around. Is it possible to refactor the code to be less Scalding-specific, or could you explain how it all works, and I'll try to do it?
If we follow through with #40 and #42, then prediction will be handled by `sumLeaves`, and each current `Voter` will essentially become a `VectorSpace` instance plus a method to turn targets into predictions (vectors in the vector space). It may also be that we get rid of `Voter` completely, and instead just create various distribution types plus `VectorSpace` instances for them.
To support faster training (especially in local mode) we could use compressed Bonsai trees directly in Brushfire. There's really no advantage to Brushfire's native tree type, so we should remove it. The idea is that we'd batch our operations for one step (e.g. growing the tree by 1) and make the changes to the Bonsai tree in bulk.

This involves both exposing some new functionality from Bonsai and replacing the use of trees (and the use of the tree-ops type class) in Brushfire.
The instructions under Quick Start are incorrect. They show that brushfire is compiled for multiple Scala versions; however, neither brushfire_2.10 nor brushfire_2.11 exists in oss.sonatype.org, only an unversioned brushfire.
If data is missing, we should return `true` (indicating the edge should be followed). This works in conjunction with #40.
A simpler form of #19 - instead of having a two-way serialization of sampler etc, just write out to HDFS somewhere a description of the sampler, evaluator, etc being used for a given expansion.
It would be useful to support multi-label (not just multi-class) prediction targets.
This is only really relevant to people building single-tree models, but you should be able to prune a single tree to minimize validation error.
We should be able to use the fancy missing-feature traversals during training, too (which may mean splitting weight for an instance of training data across leaves).
I think we can do this fairly easily with `(QTree, QTree)`, i.e. one for each sign.
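A hedged sketch of the idea, using algebird's `QTree` (assuming its `apply(v: Double)` constructor): keep one tree per sign and route each observation by sign, storing magnitudes on the negative side, so quantiles from that tree must be negated on the way out.

```scala
import com.twitter.algebird.QTree

// a signed distribution as (negatives, positives); negative observations
// are stored by magnitude in the first tree
def signedValue(x: Double): (Option[QTree[Double]], Option[QTree[Double]]) =
  if (x >= 0.0) (None, Some(QTree(x)))
  else (Some(QTree(-x)), None)
```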