Comments (6)

jekbradbury commented on May 13, 2024

This is related to support for explicit loops in the frontend. Supporting variable sizes in general (i.e. symbolic shapes, as in TVM) is probably a substantial change, but allowing explicit loops, declared in the frontend code, over a single variable-sized dimension is very similar to the existing support for variable batch sizes and would cover many NLP use cases (like RNNs/QRNNs).

ftynse commented on May 13, 2024

Technically, this is simpler than it looks. Most of the compilation flow should support this transparently (Halide and polyhedral passes). For example, polyhedral scheduling is meant to operate on symbolic parameters, and we have an option to substitute them with inferred numerical values before or after the scheduling itself. We even have tests that emit parametric code.

However, this will degrade performance. Simply put, the more information we have about the operation, the deeper we can analyze it and the better it can be optimized. So I'd argue for generating code that is as specialized as possible.

The main problem with RNNs now would be their outer sequentiality. But this is mostly orthogonal to variable sizes.

jekbradbury commented on May 13, 2024

An RNN kernel would look something like this:

def elman_rnn(float(T,B,Ci) input, float(B,Co) h0, float(Ci,Co) i2h, float(Co,Co) h2h) -> (hidden) {
    for t in T {
        if t == 0 {hidden(t,b,co) +=! h2h(ci,co) * h0(b,ci)}
        else {hidden(t,b,co) +=! h2h(ci,co) * hidden(t-1,b,ci)}
        hidden(t,b,co) += i2h(ci,co) * input(t,b,ci)
    }
}

which does indeed seem pretty annoying to support, going far beyond just the variable T. I was also wrong: TC doesn't appear to currently support optimizing for a variable batch size.

A QRNN kernel would also have these issues, just without the reduction inside the loop.

ftynse commented on May 13, 2024

Indeed, the "imperative syntax" proposed earlier in the TC context is not yet implemented in the language, and it is annoying to support in efficient code generation. The "imperative loop" is outer-sequential, so even if the frontend supported that syntax, the current compilation pass would just map the computation to a single block. Mapping to more than one block, even naïvely, would require emitting global synchronizations, which we currently cannot do. But this seems orthogonal to variable batch sizes.

Turning on parametric batch sizes is a small change. In general, TC looks at the actual sizes of the supplied tensors and infers the numerical values of all symbolic parameters. These values are substituted in some optimization passes. Disabling this substitution altogether looks trivial. However, disabling it for a specific parameter requires the user to somehow tell us which one.
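
To make the inference step concrete, here is a minimal sketch of how this looks from the PyTorch frontend. It assumes the early tensor_comprehensions Python API (tc.define and the call/caching behavior shown are assumptions and may differ from the actual code): the symbolic sizes B and C are never spelled out by the user, they are read off the shapes of the tensors passed at call time, and today each distinct set of inferred values yields its own specialized kernel.

import torch
import tensor_comprehensions as tc  # assumed package name for the PyTorch frontend

# Symbolic sizes B and C appear only in the TC signature; their numerical
# values are inferred from the shapes of the tensors supplied at call time.
lang = """
def affine(float(B,C) input, float(C,C) weight) -> (output) {
    output(b, c) +=! input(b, k) * weight(k, c)
}
"""
affine = tc.define(lang, name="affine")  # assumed tc.define(lang, name=...) signature

w = torch.randn(128, 128).cuda()
x32 = torch.randn(32, 128).cuda()   # B inferred as 32, C as 128
x64 = torch.randn(64, 128).cuda()   # B inferred as 64

out_a = affine(x32, w)  # compiles and caches a kernel specialized for B=32
out_b = affine(x64, w)  # today this triggers a new compilation specialized for B=64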

nicolasvasilache commented on May 13, 2024

Let me first share some context about parametric sizes and then a short-term "solution" that should work in practice.

Solving the general problem is complicated, as it requires inferring proper parameter regions and devising a proper strategy for each region; making this push-button and "just work" is a longer-term goal. Additionally, emitting symbolic code involves inefficiencies (control flow that would simply disappear with JIT'ing, or missing information about parameter ranges that throws off internal heuristics).

The current approach works because it is pragmatic and goes for the lowest-hanging fruit; however, it suffers from needing to autotune for the various sizes if one isn't careful. Compilation given fixed options, however, is not a big deal. It already happens under the hood all the time when SASS is emitted from PTX (the first time you run a kernel, after which it gets cached to disk).

One simple way to circumvent the autotuning pain on the user side is to just reuse the options found by an autotuning run when you change sizes. This would give the same type of behavior that one would get from parametric codegen:
a. with parametric codegen, the options don't change and the code is compiled, cached, and autotuned only once;
b. with options reuse, the options don't change and the code is autotuned only once, but the code is compiled and cached for each new size; this should still significantly improve the user experience.

So for a short-term improvement in the workflow I would recommend the following (see the sketch after this list):

  1. take a TC + sizes and run the autotuner on it with large enough generations/candidates, and save the best options
  2. take the same TC, change sizes, reuse the options from 1, and just call compile
    2b. alternatively, use the options from 1 as a starting point for a small autotuning run with few generations/candidates to do some quick, minimal exploration
  3. when sizes change too much and perf is too far from some notion of good, repeat step 1.
  4. as always, save your cached results in a proto FB file that you can reuse across runs so you only need to autotune / compile the least amount possible; note, however, that we are still very early in the life of the project and the proto itself is subject to change, so steps 1-3 will probably need to be redone a few times until we iterate to a stable enough system.
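
Concretely, steps 1-3 might look roughly like the sketch below from the PyTorch frontend. This is only a sketch under the assumption of the early tensor_comprehensions Python API; the autotune()/cache=/options= names are recalled from that API and may need adjusting against the current code.

import torch
import tensor_comprehensions as tc  # assumed package name for the PyTorch frontend

lang = """
def affine(float(B,C) input, float(C,C) weight) -> (output) {
    output(b, c) +=! input(b, k) * weight(k, c)
}
"""
affine = tc.define(lang, name="affine")
w = torch.randn(128, 128).cuda()

# Step 1: autotune once on a representative size and keep the best options.
x = torch.randn(32, 128).cuda()
best_options = affine.autotune(x, w, cache="affine_cache")  # assumed autotune()/cache= API

# Step 2: when sizes change, reuse the saved options; only compilation is redone.
x_new = torch.randn(57, 128).cuda()
out = affine(x_new, w, options=best_options)  # assumed options= keyword

# Step 2b: alternatively, seed a short autotuning run with best_options.
# Step 3: if performance drops too far for the new sizes, repeat step 1.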

Hiding the above from the user amounts to solving the general problem, but I think the faster workflow outlined in 1-4 is very easy to set up and should improve things significantly. @jekbradbury's example about RNNs is different; we have similar things in the works, but only at a conceptual stage at the moment.

@roadwalker does the above sound like a reasonable solution?

nicolasvasilache commented on May 13, 2024

@jekbradbury @seongwook-ham see if #225 starts addressing your needs. The code is still JIT'ed and will take a few seconds to compile for each new set of parameter sizes, but autotuning results can be easily reused.

If you have real-world parametric needs where you see many different values for one parameter (i.e. > 100), then we can probably give you 1-3 parameters in a relatively short term, but it would be great to work on a concrete example.

RNN loops are still outside of the scope at this point though.
