
Comments (3)

avibryant commented on June 2, 2024

A much more radical design. @tixxit, @NathanHowell, curious what you think.

Assume that Instance just has a single type for its feature vector, V. An Encoder[V] produces a bitset (or Vector[Boolean]) from each V. We assume that Encoder is generally composed of many other Encoders. Each bit represents something that would appear as a predicate in the tree, like age > 20.
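A minimal sketch of what that could look like; Encoder, threshold, and concat are hypothetical names here, not an existing brushfire API:

```scala
// Hypothetical sketch: an Encoder[V] turns a feature vector into a bitset,
// one bit per candidate predicate, and is composed from smaller encoders.
trait Encoder[V] {
  def size: Int                    // number of bits (candidate predicates)
  def apply(v: V): Vector[Boolean] // one bit per predicate, e.g. age > 20
}

object Encoder {
  // One bit for a single predicate like `age > 20`
  def threshold(field: Map[String, Double] => Double, cutoff: Double): Encoder[Map[String, Double]] =
    new Encoder[Map[String, Double]] {
      val size = 1
      def apply(v: Map[String, Double]) = Vector(field(v) > cutoff)
    }

  // Compose encoders by concatenating their bits
  def concat[V](encoders: Seq[Encoder[V]]): Encoder[V] =
    new Encoder[V] {
      val size = encoders.map(_.size).sum
      def apply(v: V) = encoders.flatMap(_.apply(v)).toVector
    }
}
```

Composing, say, a threshold encoder on age with one on income yields a two-bit encoding, and each bit index identifies a predicate the tree can split on.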

All the games with QTree and so on for determining good split points happen in the stage that trains the Encoder, which in many cases can probably be done once at the root. Then the expansion becomes much simpler - we don't have fancy Splitters or anything like that; we just accumulate a separate (T,T) for each bit in the encoding, and pick the best one based on the error. In fact, because we have the parent T and T is almost always a Group, we can just accumulate the T for the instances with the given bit set and figure out the other half.
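A sketch of that accumulation step, assuming T is a group of label counts under pointwise addition and subtraction (the names here are illustrative):

```scala
// For each bit we accumulate T only over instances with that bit set;
// because T is a group, the other half of the split is parentT minus that T.
type T = Map[String, Long] // label -> count

def plus(a: T, b: T): T =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
def minus(a: T, b: T): T =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) - b.getOrElse(k, 0L))).toMap

def candidateSplits(
    instances: Seq[(Vector[Boolean], T)], // (encoded bits, target stat)
    numBits: Int
): Vector[(T, T)] = {
  val parent = instances.map(_._2).foldLeft(Map.empty: T)(plus)
  val setSide = Vector.tabulate(numBits) { bit =>
    instances.collect { case (bits, t) if bits(bit) => t }
      .foldLeft(Map.empty: T)(plus)
  }
  setSide.map(t => (t, minus(parent, t))) // (bit set, bit unset)
}
```

Picking the best split is then just scoring each (T, T) pair with whatever error measure you like; note that minus can leave zero-count entries behind, which a real implementation might prune.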

(In a world with metadata, we also want (M, M) for each bit. This is less likely to be a Group, although it seems pretty plausible even then that with one side's M and the parent's M you could make a reasonable determination as to whether this was an allowable split.)

If it's unsatisfying to determine all of the possible splits up front at the root, you can always add in a re-encoding phase where, given an established tree, you train encoders for each of the leaves, and then union the resulting encodings to get a superset of all of them.

Some advantages:

  • the tree itself can be much more compactly represented (this will be especially of interest to @tixxit) since the predicates are just indexes into an encoding. Of course we separately need to serialize the encoder which will be large.
  • we're doing less work while expanding each leaf, and sending less info around. Getting rid of all the joint distribution splitter stuff would be really nice and makes the training phase much easier to grok
  • we no longer have the goofy special case for "sparse" features; in effect, everything becomes a sparse feature. In general, we can ditch Dispatched or at least move its equivalent into the Encoder layer.

Cons:

  • as described, locks us to binary splits
  • makes it less likely we'll take advantage of local distributions for choosing splits (I'm not really sure how well the "re-encoding" thing would work in practice but presumably you wouldn't do that at every level anyway)

Does this all make sense?

from brushfire.

avibryant commented on June 2, 2024

... apologies for thinking out loud, but:

The above - encoding to a bitset - assumes that there's a bounded, finite number of predicates we might want in the tree. But that's more limiting than we want, and more limiting than necessary. As a simple example, imagine a text feature, where we want the predicates to be of the form contains_token("foo"). There's an infinite number of possible predicates of that form.

The properties we actually want from an encoder are:

  • for any given instance, we produce a finite (and reasonably small) set of predicates, each of which returns true for that instance
  • no predicate that would be produced for some other instance, but is missing from this set, returns true for this instance
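The text-feature example illustrates both properties; this is a toy sketch, with tokenPredicates as a hypothetical name:

```scala
// The family of predicates contains_token(t) is unbounded, but any single
// instance activates only finitely many of them: exactly its own tokens.
def tokenPredicates(text: String): Set[String] =
  text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

// Property 1: the set is finite and small (bounded by the instance's length).
// Property 2: contains_token(t) is true of `text` exactly when t is in
// tokenPredicates(text), so no predicate outside the set can match it.
```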

I'm not sure yet how that translates to an implementation.


avibryant commented on June 2, 2024

Ok, so in concrete terms, maybe something like:

  • you have encoders: Vector[Encoder[V]] for the model (where, again, V is now a generic type for your feature vector)
  • you have apply(vec: V): Set[U] on Encoder[V]
  • each split has a pred: (Int,U) where the Int indexes into the vector of encoders, and the idea is that you're splitting on encoders(pred._1)(vec).contains(pred._2)

The U just needs to be something with equality/hash/etc, and that you can serialize into your model. I can almost see an argument for forcing it to be String (for comprehensibility of the serialized models) or to be Long (for efficiency of tree representation - and anything that doesn't map nicely to a Long you just hash). But that's a pretty unimportant side note.
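A hedged sketch of this revised scheme in code, fixing U to String as floated above; Encoder, Split, and matches are assumed names:

```scala
// Each encoder maps a feature vector to the set of U values (predicates)
// that hold for it; a split is an (encoder index, U value) pair.
trait Encoder[V] {
  def apply(vec: V): Set[String] // the Set[U] of predicates true for vec
}

final case class Split(pred: (Int, String)) // (index into encoders, U value)

def matches[V](encoders: Vector[Encoder[V]], split: Split, vec: V): Boolean =
  encoders(split.pred._1)(vec).contains(split.pred._2)
```

With a token encoder at index 0, for instance, Split((0, "foo")) is exactly the contains_token("foo") predicate.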

In this scheme you would probably specify all the available Encoder instances up front; the "training" phase (including re-training with a larger tree later) would be to increase the available set of U values for the Encoder (in the cases where that made sense). So for example in a continuous numeric feature, you'd imagine the encoder keeping a List[Double] internally of split points which it would use to produce the Set[U] of all split points where the V's value was < the split point; it would choose this List empirically based on looking at a QTree or whatever of the training set.
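A sketch of such a numeric encoder; here the split points are passed in directly rather than learned from a QTree, and all names are assumptions:

```scala
// Holds an empirically chosen list of split points and emits the U value
// for every split point the feature value falls below.
final case class NumericEncoder(splitPoints: List[Double]) {
  def apply(value: Double): Set[String] =
    splitPoints.filter(value < _).map(p => s"<$p").toSet

  // "Training" here just means enlarging the available set of U values.
  def retrain(morePoints: List[Double]): NumericEncoder =
    NumericEncoder((splitPoints ++ morePoints).distinct.sorted)
}
```

Re-training with a larger tree then only ever adds candidate split points; existing tree predicates stay valid because their U values remain in the set.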

