facebookresearch / tensorcomprehensions

A domain specific language to express machine learning workloads.

Home Page: https://facebookresearch.github.io/TensorComprehensions/

License: Apache License 2.0

CMake 2.40% Shell 1.47% C++ 91.97% C 0.11% Python 3.91% Dockerfile 0.14%
machine-learning domain-specific-language

tensorcomprehensions's Introduction

Tensor Comprehensions

Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. TC additionally provides basic integration with Caffe2 and PyTorch. We provide more details in our paper on arXiv.

This library is designed to be highly portable, machine-learning-framework agnostic and only requires a simple tensor library with memory allocation, offloading and synchronization capabilities.

For now, we have integrated TC with Caffe2 and PyTorch.

A simple example

The following example illustrates a simple but powerful feature of the library: the ability to JIT-compile high-performance machine learning kernels on demand, for specific sizes.

import tensor_comprehensions as tc
import torch
lang = """
def tensordot(float(N, C1, C2, H, W) I0, float(N, C2, C3, H, W) I1) -> (O) {
    O(n, c1, c3, h, w) +=! I0(n, c1, c2, h, w) * I1(n, c2, c3, h, w)
}
"""
N, C1, C2, C3, H, W = 32, 512, 8, 2, 28, 28
tensordot = tc.define(lang, name="tensordot")
I0, I1 = torch.randn(N, C1, C2, H, W).cuda(), torch.randn(N, C2, C3, H, W).cuda()
best_options = tensordot.autotune(I0, I1, cache=True)
out = tensordot(I0, I1, options=best_options)

After a few generations of autotuning on a 2-GPU P100 system, we see results resembling:

Autotuning Sample

In C++ a minimal autotuning example resembles the following:

TEST(TensorDot, SimpleAutotune) {
  // 1. Define and setup the TC compilation unit with CUDA memory
  // management backed by ATen tensors.
  std::string tc = R"TC(
def tensordot(float(N, C1, C2, H, W) I0,
              float(N, C2, C3, H, W) I1)  -> (O)
{
    O(n, c1, c3, h, w) +=! I0(n, c1, r_c2, h, w) * I1(n, r_c2, c3, h, w)
}
  )TC";

  // 2. Allocate tensors with random data.
  at::Tensor I0 = at::CUDA(at::kFloat).rand({32,  8, 16, 17, 25});
  at::Tensor I1 = at::CUDA(at::kFloat).rand({32, 16, 2, 17, 25});

  // 3. Run autotuning with evolutionary search starting from a naive option.
  auto naiveOptions = Backend::MappingOptionsType::makeNaiveMappingOptions();
  tc::aten::ATenAutotuner<tc::CudaBackend, tc::autotune::GeneticSearch>
      geneticAutotuneATen(tc);
  auto bestOption =
      geneticAutotuneATen.tune("tensordot", {I0, I1}, {naiveOptions});

  // 4. Compile and run the TC with the best option after allocating output
  //    tensors.
  auto pExecutor =
      tc::aten::compile<Backend>(tc, "tensordot", {I0, I1}, bestOption[0]);
  auto outputs = tc::aten::prepareOutputs(tc, "tensordot", {I0, I1});
  auto timings = tc::aten::profile(*pExecutor, {I0, I1}, outputs);
  std::cout << "tensordot size I0: " << I0.sizes() << ", "
            << "size I1: " << I1.sizes()
            << " ran in: " << timings.kernelRuntime.toMicroSeconds() << "us\n";
}

Note that we only need to autotune a TC once to obtain reasonable mapping options; for a given TC, these options can also translate to other problem sizes, as the following snippet illustrates:

// 5. Reuse bestOptions from autotuning on another kernel
for (auto sizes : std::vector<std::pair<at::IntList, at::IntList>>{
         {{4, 9, 7, 16, 14}, {4, 7, 3, 16, 14}},
         {{8, 5, 11, 10, 10}, {8, 11, 16, 10, 10}},
     }) {
  at::Tensor I0 = makeATenTensor<Backend>(sizes.first);
  at::Tensor I1 = makeATenTensor<Backend>(sizes.second);
  auto pExecutor =
      tc::aten::compile<Backend>(tc, "tensordot", {I0, I1}, bestOption[0]);
  auto outputs = tc::aten::prepareOutputs(tc, "tensordot", {I0, I1});
  auto timings = tc::aten::profile(*pExecutor, {I0, I1}, outputs);
  std::cout << "tensordot size I0: " << I0.sizes() << ", "
            << "size I1: " << I1.sizes()
            << " ran in: " << timings.kernelRuntime.toMicroSeconds()
            << "us\n";
}

Putting it all together, one may see:

> build$ ./examples/example_simple
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TensorDot
[ RUN      ] TensorDot.SimpleAutotune
Generation 0    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 226/4238/7345
Generation 1    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 220/221/233
Generation 2    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 220/221/234
tensordot size I0: [16, 8, 16, 17, 25], size I1: [16, 16, 2, 17, 25] ran in: 239us
tensordot size I0: [4, 9, 7, 16, 14], size I1: [4, 7, 3, 16, 14] ran in: 56us
tensordot size I0: [8, 5, 11, 10, 10], size I1: [8, 11, 16, 10, 10] ran in: 210us
[       OK ] TensorDot.SimpleAutotune (27812 ms)
[----------] 1 test from TensorDot (27812 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (27812 ms total)
[  PASSED  ] 1 test.

We have not yet characterized the precise fraction of peak performance we obtain, but it is not uncommon to reach 80%+ of peak shared memory bandwidth after autotuning. Solid register-level optimizations are still in the works, but TC in its current form already addresses the productivity gap between the needs of research and the needs of production. This is why we are excited to share it with the entire community and to develop this collaborative effort in the open.

Documentation

General: You can find detailed information about Tensor Comprehensions here.

C++ API: We also provide documentation for our C++ API, which can be found here.

Installation

Binaries

We provide a conda package to make it easy to install and use the TC binaries. Please refer to our documentation here for instructions.

From Source

You can find documentation here with instructions for building TC via docker, conda packages, or in a non-conda environment.

Communication

  • Email: [email protected]
  • GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.

Code of Conduct

See the CODE_OF_CONDUCT.md file for more details.

License

Tensor Comprehensions is distributed under a permissive Apache v2.0 license, see the LICENSE file for more details.

Contributing

See the CONTRIBUTING.md file for more details.

tensorcomprehensions's People

Contributors

abadams, akhti, alok, apaszke, brettkoonce, cbalint13, chr1sj0nes, dchichkov, dingobye, doodlesbykumbi, ftynse, jekbradbury, lvdmaaten, math-fehr, mingzhe09088, mitmul, moore123, moskomule, nicolasvasilache, prigoyal, protonu, salexspb, skimo-openhub, thetheodor, tosaka2, wsmoses, zdevito, zpao


tensorcomprehensions's Issues

Jenkins CI for running GPU tests for CI

Creating this master task for people to track progress. I am actively working on this and will post updates here.

  • Setup the dockerfiles repo where CI can pull docker build info from
  • Setup a tensorcomp bot and add to this repo
  • ssh setup for the tensorcomp bot on Jenkins CI
  • Remove the tiering of the docker images #79 done in #91
  • Groovy skeleton for TC CI https://github.com/pietern/ossci-job-dsl/pull/81
  • setup the anonymous readability of jobs
  • build docker and push to the nexus registry https://registry.pytorch.org/nexus/#browse/search/docker
  • whitelist users for whom builds are kicked off automatically (TC core team + some external users)
  • Docker version handling (proper automatic handling is done)
  • Refactor and install Jenkins in docker files #122
  • Refactor jenkins groovy scripts for Tensor Comprehensions with Ed
  • setup the test in the docker build before pushing it to registry https://github.com/pietern/ossci-job-dsl/pull/85
  • Add tensorcomp to the HUD for monitoring jobs pytorch/ci-hud#1
  • Set tensorcomp docker cleanup job https://github.com/pietern/ossci-job-dsl/pull/93
  • end to end testing of the docker build, push, tc build and test on Jenkins
  • setup the sccache for the docker image building
  • submit Jenkins configuration and test that the configuration is correct
  • test that PRs trigger builds

Semantic analysis fails to catch invalid TC involving temporaries

We don't support temporaries very well right now, i.e. everything has to be either an input or an output. However, I have myself fallen into the trap of forgetting to mark temporary variables as outputs. The semantic analysis doesn't catch such a TC and propagates it further to Halide to get the IR; this should be corrected.
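
For illustration, a hedged sketch (hypothetical kernel) of the kind of TC that currently slips through: tmp is used as a temporary but is neither an input nor a declared output, so semantic analysis should reject it.

# Hypothetical example: tmp is not declared in the output list, so this TC
# is invalid, yet semantic analysis currently lets it propagate to Halide.
lang = """
def scale_by_rowsum(float(N, D) I) -> (O) {
    tmp(n) +=! I(n, d)
    O(n, d) = I(n, d) / tmp(n)
}
"""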

[Build] error: expected type-specifier for llvm::CilkABI()

I'm trying to build the TC project from the docker images and default settings, but I'm stuck with the following error:

/root/TensorComprehensions/src/core/polyhedral/codegen_llvm.cc: In member function 'void tc::polyhedral::{anonymous}::CodeGen_TC::optimize_module()':
/root/TensorComprehensions/src/core/polyhedral/codegen_llvm.cc:331:25: error: expected type-specifier
     b.tapirTarget = new llvm::CilkABI();

I'm using the following docker image to automate the build, using the command docker build -t tc-local . to build:

FROM tensorcomprehensions/linux-xenial-gcc5-tapir5.0-cuda9-cudnn7-py3:x86

ENV DEBIAN_FRONTEND noninteractive

RUN cd /root && \
  git clone https://github.com/facebookresearch/TensorComprehensions.git --recursive

ENV TC_DIR=/root/TensorComprehensions
RUN cd ${TC_DIR} && \
  git submodule update --init --recursive && \
  BUILD_TYPE=Release PYTHON=$(which python3) WITH_CAFFE2=OFF ./build.sh --all

I had to remove the manual CLANG_HOME environment variable from the instructions because in the docker image the LLVM location is different and CLANG_HOME is already set.

Configuration

  • OS: Ubuntu 14.04
  • How you installed TC: docker
  • Python version: Python 3.5.2
  • CUDA/cuDNN version: CUDA v9.1.85, cuDNN 7
  • Conda version (if using conda): None
  • Docker image (if using docker): tensorcomprehensions/linux-xenial-gcc5-tapir5.0-cuda9-cudnn7-py3:x86
  • GCC/GXX version (if compiling from source): 5.4.1
  • LLVM/Tapir git hash used (if compiling from source): 5.0.0

Commit hash of the repo:

32414e3ef77215a0bea834dccdc85b046a66f02f

Commit hash of all the submodules: gist

Writes to non-zero based tensors

Consider:

out(i) = in(10 - i)

It will infer a range for i that does not necessarily start at zero. What should this mean for the domain of out? I don't think we have any tests for this pattern yet, and I don't have a firm idea of what the semantics should be.

Investigate and fix python and C++ flags logging

Right now LOG_IF doesn't play nicely with the user experience of dumping CUDA code from the Python interpreter. Normally users can do

env GLOG_logtostderr=1 python my_test.py -v

but this is not friendly when they have to do things from inside the Python interpreter. For this reason, I had added a separate dump_cuda flag, and GlobalDebugInit from Python was setting it. However, using LOG_IF requires setting the environment variable and doesn't play nicely with logging.

This should be fixed properly.

Protobuf version mismatch from fresh install

  • OS: Fedora 25
  • How you installed TC (docker, conda, source): conda
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • Conda version (if using conda): 4.4.11

repro:

conda clean --all
conda create -y --name tc-install python=3.6
conda activate tc-install
conda install -y -c pytorch -c tensorcomp tensor_comprehensions

# compile whatever TC
> Proto version doesn't match. TC git version is: 8e112e9dccda62c30ef29208a827e783b9a7f156 and Proto version is: ea0c39dcede6850bd3e9553eee0d4f083681c314 .This proto might be incompatible with your TC binary and can break. Please autotune against the correct TC version.

Comparison operators

The front-end doesn't understand ==, <, >, !=, etc. Hilariously, we do have a ternary operator, but I'm not sure what you would use for the first argument :)

Tagging @zdevito to add them to the lexer/parser
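
For context, a hedged sketch (hypothetical, not currently accepted by the front-end) of the kind of comparison one might want to write once the lexer/parser support it:

# Hypothetical TC; ==, <, > are not accepted by the front-end today.
lang = """
def threshold(float(N) I, float(N) T) -> (O) {
    O(n) = I(n) > T(n) ? I(n) : 0.0
}
"""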

Ternary term generates cuda errors and warnings

Hi,

  • OS: Fedora 25
  • How you installed TC (docker, conda, source): conda
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • Conda version (if using conda): 4.4.11

I am trying to write the backward of this TC. The forward works as expected, but autotuning the backward generates weird CUDA errors that are hard to understand.
I noticed that removing the ternary term on the second line of the backward solves the issue, so I guess this is the root cause.
Thanks for your help.

import tensor_comprehensions as tc
import torch


norm_lang = """
def norm_max(float(N, C, H, W) I) -> (O, P, maxVal) {{
    O(n, c, h, w) = exp(I(n, c, h, w) * {beta})
    maxVal(n, h, w) max= O(n, c, h, w)
    P(n, c, h, w) = O(n, c, h, w) / maxVal(n, h, w)
}}

def norm_max_grad(float(N, C, H, W) I, float(N, C, H, W) O_grad, float(N, C, H, W) P_grad, float(N, H, W) maxVal) -> (I_grad){{
    I_grad(n, c, h, w) +=! exp(I(n, c, h, w) * {beta}) * {beta} * maxVal(n, h, w)
    I_grad(n, c, h, w) = I_grad(n, c, h, w) - (exp(I(n, c, h, w) * {beta}) * (I(n, c, h, w) == maxVal(n, h, w) ? 1: 0))
    I_grad(n, c, h, w) = I_grad(n, c, h, w) / pow(maxVal(n, h, w), 2)
    I_grad(n, c, h, w) = I_grad(n, c, h, w) * I_grad(n, c, h, w)
}}
"""

norm_max = tc.define(norm_lang, training=True, name="norm_max",
                     backward="norm_max_grad",
                     constants={'beta': 1})
norm_max.autotune(torch.randn(8, 135, 184, 184).cuda(), generations=1, pop_size=10)

[WARNING]: Autotuning results won't be cached. 'cache' option is not set
[WARNING]: Using 'naive' type mapping options for autotuning. See help(your_layer.autotune) for how to set mapping options.
Generation 0    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 21383/66695/86822
[INFO]: Autotuning the backward layer now
[WARNING]: Using 'naive' type mapping options for autotuning. See help(your_layer.autotune) for how to set mapping options.
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0313 14:40:08.700757 14911 rtc.cc:103] Compilation failure for nvrtc(NVRTC_ERROR_COMPILATION): 
default_program(32): error: expected a ")"

default_program(32): error: expected a ")"

default_program(32): error: expected a ")"

default_program(32): error: expected an expression

default_program(18): warning: variable "b0" was declared but never referenced

default_program(18): warning: variable "b2" was declared but never referenced

default_program(19): warning: variable "t2" was declared but never referenced

default_program(22): warning: variable "O_grad" was declared but never referenced

default_program(23): warning: variable "P_grad" was declared but never referenced

4 errors detected in the compilation of "default_program".
 source:
template<typename T> inline __device__ T floord(T n, T d) {
  return n < 0 ? - (-n + d - 1)/d : n / d;
}
#define if_then_else(cond,a,b) (cond) ? (a) : (b);

// Halide type handling
typedef int int32;
typedef long int64;
typedef float float32;
typedef double float64;

#define inff __int_as_float(0x7f800000)
#define inf __longlong_as_double(0x7ff0000000000000LL)

extern "C" {
__global__ void norm_max_grad_135_184_8_184(int32 C, int32 H, int32 N, int32 W, float32* pI_grad, float32* pI, float32* pO_grad, float32* pP_grad, float32* pmaxVal) {
  int b0 = blockIdx.x; int b1 = blockIdx.y; int b2 = blockIdx.z;
  int t0 = threadIdx.x; int t1 = threadIdx.y; int t2 = threadIdx.z;
  float32 (*I_grad)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pI_grad);
  float32 (*I)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pI);
  float32 (*O_grad)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pO_grad);
  float32 (*P_grad)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pP_grad);
  float32 (*maxVal)[184][184] = reinterpret_cast<float32 (*)[184][184]>(pmaxVal);
  for (int c2 = 0; c2 <= 183; c2 += 32) {
    for (int c4 = 0; c4 <= 7; c4 += 1) {
      for (int c5 = 0; c5 <= min(31, -32 * b1 + 134); c5 += 1) {
        for (int c6 = t1; c6 <= min(31, -c2 + 183); c6 += 8) {
          for (int c7 = t0; c7 <= 183; c7 += 32) {
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = 0.000000f;
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7] + (exp(I[c4][32*b1 + c5][c2 + c6][c7])*maxVal[c4][c2 + c6][c7]));
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7] - (exp(I[c4][32*b1 + c5][c2 + c6][c7])*float32(if_then_else((I[c4][32*b1 + c5][c2 + c6][c7] == maxVal[c4][c2 + c6][c7]), 1, 0))));
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7]/pow(maxVal[c4][c2 + c6][c7], 2));
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7]*I_grad[c4][32*b1 + c5][c2 + c6][c7]);
          }
        }
      }
    }
  }
}
}

/*
Mapping Options:
tc::MappingOptions::makeNaiveMappingOptions()
    .outerScheduleFusionStrategy(tc::FusionStrategy::Preserve3Coincident)
    .outerScheduleAllowSkewing(false)
    .outerSchedulePositiveOrthant(true)
    .intraTileScheduleFusionStrategy(tc::FusionStrategy::Preserve3Coincident)
    .intraTileScheduleAllowSkewing(false)
    .intraTileSchedulePositiveOrthant(true)
    .tile(32, 32, 32)
    .mapToThreads(32, 8)
    .mapToBlocks(256, 256)
    .unroll(1)
    .tileImperfectlyNested(false)
    .useSharedMemory(false)
    .usePrivateMemory(false)
    .unrollCopyShared(false)
    .matchLibraryCalls(false);
TC version: 8e112e9dccda62c30ef29208a827e783b9a7f156
*/
[ERROR]: Raised exception: Could not compile function

[Build] ScheduleTree struct error schedule_tree.h

  • OS: Fedora 28
  • Python version: 3.6
  • CUDA: 9.1
  • GCC: 8.0.1

Hi Folks,

I am trying to package TC as an rpm package on the RedHat/Fedora platform but have issues with tc_executor.cc & schedule_tree.h. I'm not sure what the right declaration would be.

  • In short:
polyhedral/schedule_tree.h:141:44: error: ‘std::ostream& tc::polyhedral::detail::operator<<(std::ostream&, tc::polyhedral::detail::ScheduleTree&)’ should have been declared inside ‘tc::polyhedral::detail’
  • In detail:
/usr/bin/c++  -DCAFFE2_USE_GOOGLE_GLOG -DDMLC_USE_GLOG -Dtc_core_EXPORTS -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/. -I/usr/lib/python2.7/site-packages/torch/lib/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/third-party/caffe2/third_party/eigen -I/usr/include/cuda -I/usr/include/cuda/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/third-party/dlpack/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/third-party/islpp/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/version -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/build/src/proto  -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -mcet -fcf-protection -DCUDA_HOME="\"/usr/include/cuda\"" -DCUB_HOME="\"/usr/include\""  -Wall -Wno-sign-compare -fPIC  -o CMakeFiles/tc_core.dir/tc_executor.cc.o -c /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/core/tc_executor.cc -std=gnu++11
In file included from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/halide2isl.h:25,
                 from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/polyhedral/scop.h:27,
                 from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/tc_executor.h:20,
                 from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/core/tc_executor.cc:16:
/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/polyhedral/schedule_tree.h:141:44: error: ‘std::ostream& tc::polyhedral::detail::operator<<(std::ostream&, tc::polyhedral::detail::ScheduleTree&)’ should have been declared inside ‘tc::polyhedral::detail’
       tc::polyhedral::detail::ScheduleTree&);

Add support for strided tensors

We currently (incorrectly) treat all tensors as dense, so we need to add support for strided tensors. This can be done by adding the strides as extra parameters and either constructing the tensor views in the emitted source using the strides instead of the sizes, or doing the storage flattening ourselves before or during codegen.

checklist after talking to @ftynse [@prigoyal @protonu is working on it]

  • propagate DLTensor stride information to Halide (it has similar struct and already takes bits, lanes info, so should be easy)
  • codegen, where we emit "tensor views" by casting a pointer into a multidimensional array; the cast should be aware of strides (see the sketch after this list). Also, in the Halide->polyhedral lowering there is a place that creates constraints on tensor sizes; it may or may not need to be adapted for strides.
  • add the stride info to the tuner proto
  • make proper changes to the frontend (aten, pytorch, caffe2 etc.)
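
For intuition on the codegen item above, here is a hedged NumPy sketch (illustrative shapes only; the real change lives in the CUDA/LLVM pointer casts) of the difference between a size-derived dense view and an explicit stride-aware view:

import numpy as np

# A flat buffer viewed densely (strides derived from sizes) vs. with explicit
# strides: the strided view reads every other column, which a size-only cast
# would misinterpret.
buf = np.arange(24, dtype=np.float32)
dense = buf.reshape(4, 6)
strided = np.lib.stride_tricks.as_strided(
    buf, shape=(4, 3), strides=(6 * 4, 2 * 4))  # strides in bytes: row=24, col=8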

[Build] build on Ubuntu16.04 and fail when building third-party

  • OS: Ubuntu 16.04

  • How you installed TC (docker, conda, source): source

  • Python version: 3.5.2

  • CUDA/cuDNN version: 8.0

  • GCC/GXX version (if compiling from source): 5.4.0

  • LLVM/Tapir git hash used (if compiling from source): clang version 5.0.0 (get from the command below)

    To get the hash, run: $HOME/clang+llvm-tapir5.0/bin/clang --version

  • Commit hash of our repo and submodules (if compiling from source): ea0c39d

In addition, including the following information will also be very helpful for us to diagnose the problem:

I followed "Building from Source in Non-Conda Environment"
of https://facebookresearch.github.io/TensorComprehensions/installation_non_conda.html and problems happened when installing TC with caffe2 intergration. While the build.sh is trying to build some third-party such as Halide I got some error:

/home/jimmyoic/TensorComprehensions/third-party/halide/src/Error.cpp: In destructor 'Halide::Internal::ErrorReport::~ErrorReport()':
/home/jimmyoic/TensorComprehensions/third-party/halide/src/Error.cpp:121:15: error: exception handling disabled, use -fexceptions to enable
throw err;
^
At global scope:
cc1plus: error: unrecognized command line option '-Wno-unknown-warning-option' [-Werror]
cc1plus: all warnings being treated as errors

Note that since I already had protobuf 2.6 on my server, I built protobuf 3.4 and changed the path in build.sh where it finds protobuf.

BTW, is it possible to build those third-party dependencies myself rather than through the build script? I want to make sure I configure this correctly (if it works).

Thanks!

Unable to use same variable for reduction twice or more in TC

Do we not allow people to use the same variable twice? A repro is also available in #61.

FAILS:
def softmax(float(N, D) I) -> (O, tmp) {
        tmp(n) max=! I(n, d)
        O(n, d) = exp(I(n, d) - tmp(n))
        tmp(n) +=! O(n, d)
        O(n, d) = O(n, d) / tmp(n)
}

PASSES:
def softmax(float(N, D) I) -> (O, expsum, maxVal) {
        maxVal(n) max= I(n, d)
        expsum(n) +=! exp(I(n, d) - maxVal(n))
        O(n, d) = exp(I(n, d) - maxVal(n)) / expsum(n)
 }

Scattering

We currently support computed expressions in the indices on the RHS. We should plumb through computed expressions on the LHS too, e.g.:

hist(im(i)) +=! 1
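
For reference, a hedged NumPy sketch of the scatter semantics this implies (hypothetical sizes):

import numpy as np

# Equivalent of hist(im(i)) +=! 1: every pixel value scatters an increment
# into the histogram bucket it indexes; duplicate indices must accumulate.
im = np.random.randint(0, 256, size=1000)
hist = np.zeros(256, dtype=np.int64)
np.add.at(hist, im, 1)  # unbuffered scatter-add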

Improve SIGINT/SIGTERM during autotuning

When the signal handling logic receives a SIGINT/SIGTERM it dumps the compilation caches, waits for the tuning sessions to finish and then throws an exception.

Maybe something got lost during refactoring, but the current logic is not very useful:

  • The caches are stored after autotuning even if a signal was not raised.
  • If the user has to wait until autotuning exits normally, why is an exception thrown?
  • Not being able to immediately stop the autotuner with a SIGTERM is annoying.

@prigoyal what are the constraints imposed by the python bindings?

Reference group belongs to two error

Generation 0    Job[Compiled, GPU] (10, 8)/10   Time (us): best: 97 median: 347 worst: 427
F0223 14:03:31.068509 111930 codegen_cuda.cc:730] Check failed: !promotionInfo.groupId reference __tc_ref_10 belongs to two groups: _centered_0 and _var_0
*** Check failure stack trace: ***
    @     0x7fce13cf2c3d  google::LogMessage::Fail()
    @     0x7fce13cf4b88  google::LogMessage::SendToLog()
    @     0x7fce13cf2723  google::LogMessage::Flush()
    @     0x7fce13cf54be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fce135fdfae  tc::polyhedral::detail::emitMappedTensorAccess()
    @     0x7fce135fa860  tc::polyhedral::(anonymous namespace)::emitUserStmt()
    @     0x7fce135fc01a  tc::polyhedral::(anonymous namespace)::AstPrinter::emitStmt()
    @     0x7fce135fc586  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f805f  tc::polyhedral::(anonymous namespace)::AstPrinter::emitFor()
    @     0x7fce135fc389  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f805f  tc::polyhedral::(anonymous namespace)::AstPrinter::emitFor()
    @     0x7fce135fc389  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f830e  tc::polyhedral::(anonymous namespace)::AstPrinter::emitIf()
    @     0x7fce135fc3cf  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f6bcb  tc::polyhedral::(anonymous namespace)::AstPrinter::emit()
    @     0x7fce1360065a  tc::polyhedral::emitCudaKernel()
    @     0x7fce1364c6b2  tc::polyhedral::MappedScop::codegen()
    @     0x7fce1356c73b  tc::TcExecutor::compileWithTcMapper()
    @     0x7fce1356b7af  tc::TcExecutor::compile()
    @     0x7fce13554725  tc::ExecutionEngine::compile()
    @     0x7fce14777746  tc::autotune::detail::GeneticTunerHarness::doCompile()
    @     0x7fce147790bc  _ZZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmENKUlvE1_clEv
    @     0x7fce1477cb6e  _ZNSt12_Bind_simpleIFZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmEUlvE1_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7fce1477ca1d  _ZNSt12_Bind_simpleIFZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmEUlvE1_vEEclEv
    @     0x7fce1477c94e  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmEUlvE1_vEEE6_M_runEv
    @     0x7fce260a6c5c  execute_native_thread_routine_compat
    @     0x7fce54fcce25  start_thread

Typo in Start example

in Docs: Let’s start with a simple example is a matrix vector product:
def mv(float(R,C) A, float(C) x) -> (o) {
o(i) += A(i,j) * b(j)
}

should be:
o(i) += A(i,j) * x(j)

Variable tensor sizes support for TC

Is there a way for us to support variable tensor sizes? Right now, if the tensor size changes, we have to recompile and cache again. But in computer vision and NLP, people often have models where the tensor size for a layer changes at every step of training.

for example:

def sum_reduce_dims023_4d(float(N, C, H, W) X) -> (Y) {
   Y(c) +=! X(n, c, h, w)
}

Should the user be writing this for all input sizes, compiling and caching every time?

This is a big use case for computer vision and NLP models, and having to compile at every step is going to slow things down quite a bit.
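
For illustration, a hedged sketch (hypothetical helper, not part of the TC API) of the per-size recompilation this issue describes, caching the autotuned options keyed by input shape:

import tensor_comprehensions as tc
import torch

lang = """
def sum_reduce_dims023_4d(float(N, C, H, W) X) -> (Y) {
    Y(c) +=! X(n, c, h, w)
}
"""
layer = tc.define(lang, name="sum_reduce_dims023_4d")
options_by_shape = {}  # hypothetical per-shape cache

def run(X):
    key = tuple(X.shape)
    if key not in options_by_shape:  # autotune once per unseen shape
        options_by_shape[key] = layer.autotune(X, cache=True)
    return layer(X, options=options_by_shape[key])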

TC conda package for CUDA9

There is interest in having a TC conda package that is compatible with CUDA 9. I'll look into packaging that.

Indirections on LHS [for indexing grad]

The reason for wanting LHS indirection is that, say, someone has a lookup table:

def lut(float(B, R) M, int32(B, N) I) -> (O) {
    O(b, n) +=! M(I(b, n), r)
}

Now they want to define the gradient for this; the most intuitive way to write the backward TC would be

M(I(b, n), r) = O_grad(b, r)

from @ftynse

  • without other uses of M, we cannot infer its size in the first dimension, even with a where clause; this is not a problem for input tensors whose size is specified
  • Halide->Polyhedral lowering pass should differentiate between must writes (all writes are must-writes now) and may writes
  • so should polyhedral dependence analysis
  • advanced transformations (reductions, shared memory) should be disabled
  • codegen should properly emit LHS indirections; hopefully this just works thanks to Halide
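
For intuition, a rough PyTorch sketch (hypothetical shapes; the exact gradient formula is the one sketched in the issue above) of the scatter-style accumulation such an LHS indirection would have to lower to:

import torch

B, N, R = 4, 6, 8                      # hypothetical sizes
M = torch.randn(B, R)                  # lookup table indexed indirectly
I = torch.randint(0, B, (B, N))        # integer indices
O_grad = torch.randn(B, N)             # gradient flowing back from O(b, n)

# Each (b, n) scatters its incoming gradient into row I(b, n) of M_grad,
# accumulating when indices repeat (a may-write with reduction semantics).
M_grad = torch.zeros_like(M)
source = O_grad.reshape(-1, 1).expand(-1, R).contiguous()
M_grad.index_add_(0, I.reshape(-1), source)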

[Build] Using protobuf installation at non-standard location

I want to use a protobuf installation within $HOME to build TC. CMake picks up the path to the protoc application but seems to fail to pick up the right include directories:

In file included from /home/sanjay/Projects/TensorComprehensions/build/src/proto/compcache.pb.cc:5:0:
/home/sanjay/Projects/TensorComprehensions/build/src/proto/compcache.pb.h:9:42: fatal error: google/protobuf/stubs/common.h: No such file or directory
compilation terminated.
In file included from /home/sanjay/Projects/TensorComprehensions/build/src/proto/mapping_options.pb.cc:5:0:
/home/sanjay//Projects/TensorComprehensions/build/src/proto/mapping_options.pb.h:9:42: fatal error: google/protobuf/stubs/common.h: No such file or directory
compilation terminated.
src/proto/CMakeFiles/tc_proto.dir/build.make:110: recipe for target 'src/proto/CMakeFiles/tc_proto.dir/compcache.pb.cc.o' failed
make[2]: *** [src/proto/CMakeFiles/tc_proto.dir/compcache.pb.cc.o] Error 1

with tc_proto.dir/build.make having the line,

cd /home/sanjay/Projects/TensorComprehensions/build/src/proto && /home/sanjay/Projects/proto-install/protobuf-3.4.0/install/bin/protoc --cpp_out /home/sanjay/Projects/TensorComprehensions/build/src/proto -I /home/sanjay/Projects/TensorComprehensions/src/proto /home/sanjay/Projects/TensorComprehensions/src/proto/compcache.proto

What can I do to ensure that the requisite include paths are used?

Setup:

  • OS: Ubuntu 16.04
  • How you installed TC (docker, conda, source): build from source within a conda environment
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 8
  • Conda version (if using conda): 4.4.11
  • Docker image (if using docker):
  • GCC/GXX version (if compiling from source): 5.4
  • LLVM/Tapir git hash used (if compiling from source): 5.0.0git-195a877
  • Commit hash of our repo and submodules (if compiling from source): recursively clone from 8c345a5 of TC.

Improved Layer Fusion Support

It seems like one of the huge benefits of a tool like TC is the ability to fuse layers and optimize the schedule jointly. There may be significant speed-ups from fusing all the layers of a full model and auto-scheduling for example. Unless I'm reading the docs wrong, the only way to currently do this fusion is to define an entire model in one TC language. However, since there are already many standard layers in tc.database, it would be much cleaner to define the model as a series of TC layers. This could be implemented by adding a fuse function that takes a list of individual languages and produces one large one that can then be optimized. If something like this feature already exists, adding a fusion tutorial to the docs would prevent other people like me from getting confused.
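
A hedged sketch of the proposal (a hypothetical fuse helper, not part of the current API), which simply concatenates individual TC definitions into one language string that can then be defined and tuned as a single unit:

def fuse(*langs):
    """Hypothetical helper: concatenate several TC definition strings into one
    language string so the kernels can be compiled and autotuned together."""
    return "\n\n".join(l.strip() for l in langs)

# Usage sketch (assuming conv_lang and relu_lang are existing TC strings):
# fused_lang = fuse(conv_lang, relu_lang)
# model = tc.define(fused_lang, name="conv_relu")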

TC build support for PyTorch master

There is user interest in making TC compatible with PyTorch master. This also means that users want to use and develop TC against current PyTorch master.

cc @soumith

  • Bump aten submodule and make necessary changes #155
  • add instructions to build pytorch from source with/without conda deps

Bitwise Operation Support

Tensor Comprehensions seem like a great way to implement binary approximation of layers as in BinaryNet. However, TC does not currently support bitwise operations such as xor (^), or (|), and and (&). Additionally, implementing these approximations would require the popc CUDA intrinsic, which is not supported. Getting these functions integrated would enable really cool and fast applications that could highlight the strengths of TC.
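
For context, a hedged plain-Python sketch (hypothetical 32-bit packing) of the xnor-plus-popcount inner product that bitwise operators and a popc intrinsic would enable inside a TC kernel:

def binary_dot(a_bits, b_bits, nbits=32):
    """Dot product of two {-1, +1} vectors packed as bits."""
    xnor = ~(a_bits ^ b_bits) & ((1 << nbits) - 1)  # 1 where the signs agree
    matches = bin(xnor).count("1")                  # popcount
    return 2 * matches - nbits                      # agreements minus disagreements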

Nested Function Calls in TC

There is some interest in being able to do nested function calls, i.e. define one TC and call it from inside another one.

lang = """
def max() {
...
}
"""

lang1 = """
def elementwise() -> (output) {
    output = max()
}
"""

This also makes a lot of sense for cases where you want to use a predefined database of TCs and just make calls into it.

[EDIT]: pasting here @leonbottou 's remarks:

I'd love to see the ability to invoke a TC from another. Not as a procedure call, but as type-aware inlining. Here is an example for a possible syntax:
def weirdbatchedmatmul( float(N,W,H) x, float(N,M,W) k )
    -> float(M,W,H) y 
{
  y(*,i,j) = matmul( x(*,i,j), k(*,*,i) )
}

Then we can reimplement select and narrow
def select0( float(A,B,C) x, int S) -> o {
   o(i,j) = x(S,i,j)
}
def narrow0( float(A,B,C) x, int L, int F) -> o {
   o(k,i,j) = x(k+F,i,j) where k in 0:L-1
}

Benefits:

  • Making the TC language a little less write-only!
  • Bridging the gap between the TC style and the Lush/Torch/TF/PyTorch style of tensor programming.
  • Encapsulating layers into a library of ready-to-inline TCs.

max=! fails with inff not known to nvrtc

FAILS
def softmax(float(N, D) I) -> (O, expsum, maxVal) {
        maxVal(n) max=! I(n, d)
        expsum(n) +=! exp(I(n, d) - maxVal(n))
        O(n, d) = exp(I(n, d) - maxVal(n)) / expsum(n)
}
Error: https://gist.github.com/prigoyal/0318a964368f86b3a6a09228c7bfbf71

PASSES (bang removed from max=)
def softmax(float(N, D) I) -> (O, expsum, maxVal) {
        maxVal(n) max= I(n, d)
        expsum(n) +=! exp(I(n, d) - maxVal(n))
        O(n, d) = exp(I(n, d) - maxVal(n)) / expsum(n)
}

Please support Tensor Comprehensions on macOS


Please consider the following information:

  • OS: macOS 10.13.3
  • How you installed TC (docker, conda, source): could not install (not supported)
  • Python version: 2.7.x
  • Conda version (if using conda): 4.4.11

Allow using scalars for bounds inference

We can't use scalar inputs in bounds inference right now. For example:

LANG="""
def avgpool(float(B, C, H, W) input, float kH, float kW, float sH, float sW) -> (output) {
    output(b, c, h, w) += input(b, c, h * sH + kh, w * sW + kw) where kh in 0:kH, kw in 0:kW
}
"""

LANG="""
def avgpool(float(B, C, H, W) input, float kH, float kW) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:kH, kw in 0:kW
}
"""

avgpool = tc.define(LANG, name="avgpool")
inp = torch.ones(1, 1, 4, 4).cuda()
kH = torch.randn(1).fill_(2.0).cuda()
kW = torch.randn(1).fill_(2.0).cuda()
sH = torch.randn(1).fill_(1.0).cuda()
sW = torch.randn(1).fill_(1.0).cuda()
out = avgpool(inp, kH, kW, sH, sW)

This will fail.

The workaround right now is to substitute those scalars into the TC text before passing it to the backend, as sketched below.
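
A hedged sketch of that workaround, using the constants= mechanism (as in the norm_max issue above) to bake the scalars into the TC text before it reaches the backend:

import tensor_comprehensions as tc
import torch

# Workaround sketch: pass kernel/stride scalars as substituted constants
# rather than as tensor inputs.
LANG = """
def avgpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) += input(b, c, h * {sH} + kh, w * {sW} + kw)
        where kh in 0:{kH}, kw in 0:{kW}
}}
"""
avgpool = tc.define(LANG, name="avgpool",
                    constants={"kH": 2, "kW": 2, "sH": 1, "sW": 1})
inp = torch.ones(1, 1, 4, 4).cuda()
out = avgpool(inp)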

Handle temporaries

Right now, everything is either an input or an output, the reason being that TC does not do any allocation by itself. We should find a better way to handle this.

cc Albert Cohen who is interested in this

Uniformize docker images

All our trusty docker images used to inherit from FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 and all our xenial docker images used to inherit from FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04.

This has changed drastically: I see that we now start from a bare Ubuntu image unvetted by NVIDIA and install CUDA/cuDNN with a custom script.

Still, when providing the Docker image for the conda_recipes we use the NVIDIA image again.

I would prefer that we only use the NVIDIA images, as we did for months, but let's discuss whether something else makes more sense in a different CI context.

Can't compile max reduction

When submitting a bug report, please include the following information (where relevant):

  • OS: ubuntu 16.04
  • How you installed TC (docker, conda, source): conda
  • Python version: 3.6
  • CUDA/cuDNN version: 9 / 7
  • Conda version (if using conda): conda 4.3.30
  • Docker image (if using docker): nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
  • GCC/GXX version (if compiling from source): 5.4.0
  • LLVM/Tapir git hash used (if compiling from source): NA
  • Commit hash of our repo and submodules (if compiling from source): NA

Hi, I'm experimenting, and in particular I'm trying to implement the convolutional part of LeNet as a single TC function. However, max pooling doesn't work for me.

Here is the script:

lang = """
def lenet_convpart(float(B, 32, 32) I0, 
                   float(6, 5, 5) Weight1, float(6) Bias1) -> (O1, O2) {

    O1(b, c, h, w) +=! I0(b, h + kh, w + kw) * Weight1(c, kh, kw)
    O1(b, c, h, w) = fmax(0, O1(b, c, h, w) + Bias1(c))
    # trivial reduction not working
    O2(b, c, h, w) max=! O1(b, c, h, w)
    # this one is of actual interest
    # O2(b, c, h1, w1) max=! O1(b, c, h1 * 2 + kh1, w1 * 2 + kw1) where kh1 in 0:2, kw1 in 0:2
}
"""
N = 100
tensordot = tc.define(lang, name="lenet_convpart")
I0 = torch.randn(100, 32, 32).cuda()
W1 = torch.randn(6, 5, 5).cuda()
B1 = torch.randn(6).cuda()

best_options = tensordot.autotune(I0, W1, B1, cache=True)
out = tensordot(I0, W1, B1, options=best_options)

Here is the output:

[INFO]: Autotuning cache will be saved to: /tmp/lenet_convpart_100_32_32_6_5_5_6_53fe0682-3df7-42ae-9592-f279767ec145.cuda/options
[WARNING]: Using 'naive' type mapping options for autotuning. See help(your_layer.autotune) for how to set mapping options.
[ERROR]: Caught Exception: Could not compile function

Padding/Boundary conditions

We need some way to express padding of input tensors. The current thought is a primitive pad(expr, expr), which returns the first argument if it does no out-of-bounds reads, and otherwise returns the second argument.
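
A hedged sketch of how the proposed primitive might read (this pad(...) syntax is a proposal only and is not implemented):

# Proposed syntax only; pad(expr, default) is not part of the language yet.
lang = """
def conv_same(float(N, C, H, W) I, float(M, C, KH, KW) W1) -> (O) {
    O(n, m, h, w) +=! pad(I(n, c, h + kh - 1, w + kw - 1), 0.0) * W1(m, c, kh, kw)
}
"""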

Support for triangular dependencies in TC language

One of our users reported on the Slack channel that they were trying to translate the following NumPy code

def A_matvec_batch(A, X):
    n, m = X.shape
    Y = np.zeros((n, m))
    for j in range(m):
        for i in range(n):
            # Out of bound values lead to the following rules to define limit in s index    
            # s_min = -1 for all rows except the zeroth and 0 for the zero row
            # s_max = 2 for all rows except the last one and 1 for the last row 
            s_min = max(-1, -i)
            s_max = min(2, n - i)
            for s in range(s_min, s_max):  # this loop simplifies differentiation
                Y[i, j] += A[i, s + 1] * X[i + s, j]
    return Y

to the TC:

lang = """
def A_mv(float(n, 3) A, float(n, size) X) -> (O) {
      output(i, j) +=! A(i, s+1) * X(i+s, j) where s in (-1<=-i ? -i : -1):(2<=n-i ? 2 : n-i) 
}
"""

but this TC fails to compile. The reason is that there are triangular dependencies in the bounds inference, i.e. the bounds of s depend on i. This is currently not supported in TC.
The good news is that it is entirely a language and inference issue, not a polyhedral backend issue.

@abadams has a nice proposal to handle this; we will discuss it further and support such TCs.

Refactor build.sh when having both conda and source mode

#72 introduces a bunch of build.sh scripts with content that is largely copy-pasted from the original build.sh. It would be nice to structure things so that:

  1. the main build.sh calls the auxiliary smaller build scripts with some set of options
  2. the conda recipes call the smaller build scripts with possibly another set of options

[Installation] Problem importing tensor_comprehensions

Hi,

I installed TensorComprehensions in a conda environment and the installation was successful, but I couldn't import tensor_comprehensions:

ImportError: No module named 'tensor_comprehensions'

Then I installed the package from source, but I encountered the same error again. I tried running it with both Python 2 and Python 3.

Exception thrown: "isl.h:3185: NULL input thrown in the test body"

One of our users reported an issue with TC:

def graph2(float(N, C, H, W) I, float(N, C, R, T) J, float(KH, KW) W1) -> (O, Out) {
        O(n, c, h, w) +=! J(n, c, h + kh, w + kw) * W1(kh, kw)
        Out(i, j) +=! I(n, i, h, w) * O(n, j, h, w)
}

which consistently throws this error:

unknown file: Failure
C++ exception with description "/home/prigoyal/local/TensorComprehensions/third-party-install/include/isl/interface/isl.h:3185: NULL input" thrown in the test body.

Ensure types other than float/int32 are supported

  • teach parser about bool, int8, int16, int32, int64, float16, float64
  • make sure Halide types are properly constructed
  • make sure polyhedral dependence analysis behaves correctly for complex types
  • teach memory promoter about byte-sizes of each type (@ftynse)
  • write tests involving all types

Support for RNN loops

There is big interest in having support for RNN loops in TC. Creating this master task to discuss further and track progress.

test_mapper_llvm fails because wrong cilkrts is found

With ac1f3c3, I get

Running build/test/test_mapper_llvm
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from LLVMCodegen
[ RUN      ] LLVMCodegen.Basic
unknown file: Failure
C++ exception with description "Failed to find cilkrts: /usr/lib32/libcilkrts.so.5: wrong ELF class: ELFCLASS32" thrown in the test body.
[  FAILED  ] LLVMCodegen.Basic (12 ms)

The output of ldconfig -p | grep libcilkrts is

        libcilkrts.so.5 (libc6,x32) => /usr/libx32/libcilkrts.so.5
        libcilkrts.so.5 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcilkrts.so.5
        libcilkrts.so.5 (libc6) => /usr/lib32/libcilkrts.so.5
