Comments (3)
A much more radical design. @tixxit, @NathanHowell, curious what you think.
Assume that `Instance` just has a single type for its feature vector, `V`. An `Encoder[V]` produces a bitset (or `Vector[Boolean]`) from each `V`. We assume that an `Encoder` is generally composed of many other `Encoder`s. Each bit represents something that would appear as a predicate in the tree, like `age > 20`.
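As a minimal sketch of this idea (hypothetical names, not brushfire's actual API), an `Encoder` could look like:

```scala
// Hypothetical sketch: an Encoder turns a feature vector V into a bitset,
// one bit per candidate predicate.
trait Encoder[V] {
  def bits: Int                    // number of predicates this encoder produces
  def apply(v: V): Vector[Boolean] // one Boolean per predicate
}

object Encoder {
  // Compose many sub-encoders by concatenating their bit ranges.
  def combine[V](encoders: Seq[Encoder[V]]): Encoder[V] = new Encoder[V] {
    val bits = encoders.map(_.bits).sum
    def apply(v: V) = encoders.iterator.flatMap(e => e(v)).toVector
  }
}

// Example sub-encoder: threshold predicates on one numeric feature,
// so each bit means something like "age > 20".
case class Thresholds(key: String, cuts: List[Double])
    extends Encoder[Map[String, Double]] {
  val bits = cuts.size
  def apply(v: Map[String, Double]): Vector[Boolean] =
    cuts.map(c => v.getOrElse(key, 0.0) > c).toVector
}
```

With this, `Encoder.combine(Seq(Thresholds("age", List(20.0, 40.0))))` applied to `Map("age" -> 30.0)` yields `Vector(true, false)`.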
All the games with `QTree` and so on for determining good split points happen in the stage that trains the `Encoder`, which in many cases can probably be done once at the root. The expansion then becomes much simpler: we don't have fancy `Splitter`s or anything like that; we just accumulate a separate `(T, T)` for each bit in the encoding and pick the best one based on the error. In fact, because we have the parent `T`, and `T` is almost always a `Group`, we can just accumulate the `T` for the instances with the given bit set and figure out the other half.
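A sketch of that per-bit accumulation, under the assumption that `T` is a group (here `T` is just a count distribution over target classes; all names are hypothetical):

```scala
// Sketch: T is a simple count distribution over target classes, which
// forms a group under pointwise addition/subtraction.
type T = Map[String, Long]

def plus(a: T, b: T): T =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

def minus(a: T, b: T): T =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) - b.getOrElse(k, 0L))).toMap

// For each bit, accumulate T over the instances with that bit set; since T
// is a group, the other half is just the parent minus the accumulated side.
def splitCandidates(parent: T, instances: Seq[(Vector[Boolean], T)], bits: Int): Vector[(T, T)] = {
  val trueSide = Array.fill[T](bits)(Map.empty)
  for ((enc, t) <- instances; i <- 0 until bits if enc(i))
    trueSide(i) = plus(trueSide(i), t)
  trueSide.toVector.map(l => (l, minus(parent, l)))
}
```

Picking the best bit is then just scoring each `(T, T)` pair with whatever error function you're using.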
(In a world with metadata, we also want an `(M, M)` for each bit. `M` is less likely to be a `Group`, although even then it seems pretty plausible that with one side's `M` and the parent's `M` you could make a reasonable determination as to whether this was an allowable split.)
If it's unsatisfying to determine all of the possible splits up front at the root, you can always add a re-encoding phase where, given an established tree, you train encoders for each of the leaves and then union the resulting encodings to get a superset of all of them.
Some advantages:
- the tree itself can be much more compactly represented (this will be especially of interest to @tixxit) since the predicates are just indexes into an encoding. Of course we separately need to serialize the encoder which will be large.
- we're doing less work while expanding each leaf, and sending less info around. Getting rid of all the joint distribution splitter stuff would be really nice and makes the training phase much easier to grok
- we no longer have the goofy special case for "sparse" features; in effect, everything becomes a sparse feature. In general, we can ditch `Dispatched`, or at least move its equivalent into the `Encoder` layer.
Cons:
- as described, locks us to binary splits
- makes it less likely we'll take advantage of local distributions for choosing splits (I'm not really sure how well the "re-encoding" thing would work in practice but presumably you wouldn't do that at every level anyway)
Does this all make sense?
from brushfire.
... apologies for thinking out loud, but:
The above - encoding to a bitset - assumes that there's a bounded, finite number of predicates we might want in the tree. But that's more limiting than we want, and more limiting than necessary. As a simple example, imagine a text feature where we want the predicates to be of the form `contains_token("foo")`. There's an infinite number of these predicates possible.
The properties we actually want from an encoder are:
- for any given instance, we produce a finite (and reasonably small) number of predicates which will return true for this instance
- any predicate outside this set that gets produced for some other instance must return false for this instance
I'm not sure yet how that translates to an implementation.
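The text example does seem to have both properties, though. As a sketch (hypothetical names): the predicate set an instance produces is just its token set, which is finite, and `contains_token(t)` for any token `t` produced only by other instances is false for this instance.

```scala
// Sketch: the predicates an instance produces are exactly its tokens.
// contains_token(t) is true for an instance iff t is in the set it
// produced, so no predicate from outside that set can be true here.
def tokenPredicates(text: String): Set[String] =
  text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

def containsToken(text: String, token: String): Boolean =
  tokenPredicates(text).contains(token)
```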
Ok, so in concrete terms, maybe something like:
- you have `encoders: Vector[Encoder[V]]` for the model (where, again, `V` is now a generic type for your feature vector)
- you have `apply(vec: V): Set[U]` on `Encoder[V]`
- each split has a `pred: (Int, U)`, where the `Int` indexes into the vector of encoders, and the idea is that you're splitting on `encoders(pred._1)(vec).contains(pred._2)`
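That scheme could be sketched like this (hypothetical names, not brushfire's real types):

```scala
// Sketch: each encoder maps the feature vector to a Set[U], and a split
// predicate is an (Int, U) pair indexing into the vector of encoders.
trait SetEncoder[V, U] { def apply(vec: V): Set[U] }

case class Pred[U](encoderIndex: Int, value: U)

// A split sends an instance left iff the indexed encoder's set contains
// the predicate's value.
def evaluate[V, U](encoders: Vector[SetEncoder[V, U]], pred: Pred[U], vec: V): Boolean =
  encoders(pred.encoderIndex)(vec).contains(pred.value)
```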
The `U` just needs to be something with equality/hashing/etc. that you can serialize into your model. I can almost see an argument for forcing it to be `String` (for comprehensibility of the serialized models) or `Long` (for efficiency of tree representation - anything that doesn't map nicely to a `Long` you just hash). But that's a pretty unimportant side note.
In this scheme you would probably specify all the available `Encoder` instances up front; the "training" phase (including re-training with a larger tree later) would be to increase the available set of `U` values for each `Encoder` (in the cases where that made sense). So, for example, for a continuous numeric feature, you'd imagine the encoder keeping an internal `List[Double]` of split points, which it would use to produce the `Set[U]` of all split points where the `V`'s value was < the split point; it would choose this `List` empirically based on looking at a `QTree` or whatever of the training set.
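The continuous case might look like this sketch (hypothetical names; `U` is a `String` here for readability, and training uses crude quantiles rather than a real `QTree`):

```scala
// Sketch: a continuous encoder keeps a list of candidate split points and,
// at encoding time, emits one U per split point the value falls below.
case class ContinuousEncoder(feature: String, splitPoints: List[Double]) {
  def apply(value: Double): Set[String] =
    splitPoints.filter(value < _).map(c => s"$feature<$c").toSet
}

// "Training": choose split points empirically from the data; here simple
// quantiles stand in for the QTree-based choice described above.
def trainSplits(values: Seq[Double], k: Int): List[Double] = {
  val sorted = values.sorted
  (1 until k).map(i => sorted(i * sorted.size / k)).toList.distinct
}
```

Re-training with a larger tree would then just mean growing each encoder's `splitPoints` list, which grows the available set of `U` values without changing the tree representation.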