facebookresearch / tensorcomprehensions

A domain specific language to express machine learning workloads.

Home Page: https://facebookresearch.github.io/TensorComprehensions/

License: Apache License 2.0

CMake 2.40% Shell 1.47% C++ 91.97% C 0.11% Python 3.91% Dockerfile 0.14%
machine-learning domain-specific-language

tensorcomprehensions's Introduction

Tensor Comprehensions

Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. TC additionally provides basic integration with Caffe2 and PyTorch. We provide more details in our paper on arXiv.

This library is designed to be highly portable, machine-learning-framework agnostic and only requires a simple tensor library with memory allocation, offloading and synchronization capabilities.

For now, we have integrated TC with Caffe2 and PyTorch.

A simple example

The following example illustrates a simple but powerful feature of the library: the ability to JIT-compile high-performance machine learning kernels on demand, for specific sizes.

import tensor_comprehensions as tc
import torch
lang = """
def tensordot(float(N, C1, C2, H, W) I0, float(N, C2, C3, H, W) I1) -> (O) {
    O(n, c1, c3, h, w) +=! I0(n, c1, c2, h, w) * I1(n, c2, c3, h, w)
}
"""
N, C1, C2, C3, H, W = 32, 512, 8, 2, 28, 28
tensordot = tc.define(lang, name="tensordot")
I0, I1 = torch.randn(N, C1, C2, H, W).cuda(), torch.randn(N, C2, C3, H, W).cuda()
best_options = tensordot.autotune(I0, I1, cache=True)
out = tensordot(I0, I1, options=best_options)

After a few generations of autotuning on a 2-GPU P100 system, we see results resembling:

Autotuning Sample

In C++ a minimal autotuning example resembles the following:

TEST(TensorDot, SimpleAutotune) {
  // 1. Define and setup the TC compilation unit with CUDA memory
  // management backed by ATen tensors.
  std::string tc = R"TC(
def tensordot(float(N, C1, C2, H, W) I0,
              float(N, C2, C3, H, W) I1)  -> (O)
{
    O(n, c1, c3, h, w) +=! I0(n, c1, r_c2, h, w) * I1(n, r_c2, c3, h, w)
}
  )TC";

  // 2. Allocate tensors with random data.
  at::Tensor I0 = at::CUDA(at::kFloat).rand({32,  8, 16, 17, 25});
  at::Tensor I1 = at::CUDA(at::kFloat).rand({32, 16, 2, 17, 25});

  // 3. Run autotuning with evolutionary search starting from a naive option.
  auto naiveOptions = Backend::MappingOptionsType::makeNaiveMappingOptions();
  tc::aten::ATenAutotuner<tc::CudaBackend, tc::autotune::GeneticSearch>
      geneticAutotuneATen(tc);
  auto bestOption =
      geneticAutotuneATen.tune("tensordot", {I0, I1}, {naiveOptions});

  // 4. Compile and run the TC with the best option after allocating output
  //    tensors.
  auto pExecutor =
      tc::aten::compile<Backend>(tc, "tensordot", {I0, I1}, bestOption[0]);
  auto outputs = tc::aten::prepareOutputs(tc, "tensordot", {I0, I1});
  auto timings = tc::aten::profile(*pExecutor, {I0, I1}, outputs);
  std::cout << "tensordot size I0: " << I0.sizes() << ", "
            << "size I1: " << I1.sizes()
            << " ran in: " << timings.kernelRuntime.toMicroSeconds() << "us\n";
}

Note that we only need to autotune a TC once to obtain reasonable mapping options; for a given TC, these options can also translate to other problem sizes, as the following snippet illustrates:

// 5. Reuse bestOptions from autotuning on another kernel
for (auto sizes : std::vector<std::pair<at::IntList, at::IntList>>{
         {{4, 9, 7, 16, 14}, {4, 7, 3, 16, 14}},
         {{8, 5, 11, 10, 10}, {8, 11, 16, 10, 10}},
     }) {
  at::Tensor I0 = makeATenTensor<Backend>(sizes.first);
  at::Tensor I1 = makeATenTensor<Backend>(sizes.second);
  auto pExecutor =
      tc::aten::compile<Backend>(tc, "tensordot", {I0, I1}, bestOption[0]);
  auto outputs = tc::aten::prepareOutputs(tc, "tensordot", {I0, I1});
  auto timings = tc::aten::profile(*pExecutor, {I0, I1}, outputs);
  std::cout << "tensordot size I0: " << I0.sizes() << ", "
            << "size I1: " << I1.sizes()
            << " ran in: " << timings.kernelRuntime.toMicroSeconds()
            << "us\n";
}

Putting it all together, one may see:

> build$ ./examples/example_simple
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TensorDot
[ RUN      ] TensorDot.SimpleAutotune
Generation 0    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 226/4238/7345
Generation 1    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 220/221/233
Generation 2    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 220/221/234
tensordot size I0: [16, 8, 16, 17, 25], size I1: [16, 16, 2, 17, 25] ran in: 239us
tensordot size I0: [4, 9, 7, 16, 14], size I1: [4, 7, 3, 16, 14] ran in: 56us
tensordot size I0: [8, 5, 11, 10, 10], size I1: [8, 11, 16, 10, 10] ran in: 210us
[       OK ] TensorDot.SimpleAutotune (27812 ms)
[----------] 1 test from TensorDot (27812 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (27812 ms total)
[  PASSED  ] 1 test.

We have not yet characterized the precise fraction of peak performance we obtain, but it is not uncommon to reach 80%+ of peak shared memory bandwidth after autotuning. Solid register-level optimizations are still in the works, but TC in its current form already addresses the productivity gap between the needs of research and the needs of production. This is why we are excited to share it with the entire community and to develop this collaborative effort in the open.

Documentation

General: You can find detailed information about Tensor Comprehensions here.

C++ API: We also provide documentation for our C++ API, which can be found here.

Installation

Binaries

We provide a conda package to make it easy to install and use the TC binaries. Please refer to our documentation here for instructions.

From Source

You can find documentation here with instructions for building TC via docker, conda packages, or in a non-conda environment.

Communication

  • Email: [email protected]
  • GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.

Code of Conduct

See the CODE_OF_CONDUCT.md file for more details.

License

Tensor Comprehensions is distributed under a permissive Apache v2.0 license, see the LICENSE file for more details.

Contributing

See the CONTRIBUTING.md file for more details.

tensorcomprehensions's People

Contributors

abadams, akhti, alok, apaszke, brettkoonce, cbalint13, chr1sj0nes, dchichkov, dingobye, doodlesbykumbi, ftynse, jekbradbury, lvdmaaten, math-fehr, mingzhe09088, mitmul, moore123, moskomule, nicolasvasilache, prigoyal, protonu, salexspb, skimo-openhub, thetheodor, tosaka2, wsmoses, zdevito, zpao


tensorcomprehensions's Issues

Jenkins CI for running GPU tests for CI

Creating this master task for people to track progress. I am actively working on this and will post updates here.

  • Setup the dockerfiles repo where CI can pull docker build info from
  • Setup a tensorcomp bot and add to this repo
  • ssh setup for the tensorcomp bot on Jenkins CI
  • Remove the tiering of the docker images #79 done in #91
  • Groovy skeleton for TC CI https://github.com/pietern/ossci-job-dsl/pull/81
  • setup the anonymous readability of jobs
  • build docker and push to the nexus registry https://registry.pytorch.org/nexus/#browse/search/docker
  • whitelist users for whom builds are kicked off automatically (TC core team + some external users)
  • Docker version handling (proper automatic handling is done)
  • Refactor and install Jenkins in docker files #122
  • Refactor jenkins groovy scripts for Tensor Comprehensions with Ed
  • setup the test in the docker build before pushing it to registry https://github.com/pietern/ossci-job-dsl/pull/85
  • Add tensorcomp to the HUD for monitoring jobs pytorch/ci-hud#1
  • Set tensorcomp docker cleanup job https://github.com/pietern/ossci-job-dsl/pull/93
  • end to end testing of the docker build, push, tc build and test on Jenkins
  • setup the sccache for the docker image building
  • submit Jenkins configuration and test that the configuration is correct
  • test that PRs trigger builds

Semantic analysis fails to catch invalid TC involving temporaries

We don't support temporaries very well right now, i.e. everything has to be either an input or an output. However, I have myself fallen into the trap of forgetting to mark temporary variables as outputs. The semantic analysis doesn't catch such a TC and propagates it further to Halide to get the IR; this should be corrected.
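
For illustration, a hedged sketch (hypothetical kernel) of the kind of TC that currently slips through: tmp is used as a temporary but is neither an input nor a declared output, so semantic analysis should reject it.

# Hypothetical example: tmp is not declared in the output list, so this TC
# is invalid, yet semantic analysis currently lets it propagate to Halide.
lang = """
def scale_by_rowsum(float(N, D) I) -> (O) {
    tmp(n) +=! I(n, d)
    O(n, d) = I(n, d) / tmp(n)
}
"""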

[Build] error: expected type-specifier for llvm::CilkABI()

I'm trying to build the TC project from the docker images and default settings, but I'm stuck with the following error:

/root/TensorComprehensions/src/core/polyhedral/codegen_llvm.cc: In member function 'void tc::polyhedral::{anonymous}::CodeGen_TC::optimize_module()':
/root/TensorComprehensions/src/core/polyhedral/codegen_llvm.cc:331:25: error: expected type-specifier
     b.tapirTarget = new llvm::CilkABI();

I'm using the following docker image to automate the build, using the command docker build -t tc-local . to build:

FROM tensorcomprehensions/linux-xenial-gcc5-tapir5.0-cuda9-cudnn7-py3:x86

ENV DEBIAN_FRONTEND noninteractive

RUN cd /root && \
  git clone https://github.com/facebookresearch/TensorComprehensions.git --recursive

ENV TC_DIR=/root/TensorComprehensions
RUN cd ${TC_DIR} && \
  git submodule update --init --recursive && \
  BUILD_TYPE=Release PYTHON=$(which python3) WITH_CAFFE2=OFF ./build.sh --all

I had to remove the manual CLANG_HOME environment variable from the instructions because in the docker image the LLVM location is different and CLANG_HOME is already set.

Configuration

  • OS: Ubuntu 14.04
  • How you installed TC: docker
  • Python version: Python 3.5.2
  • CUDA/cuDNN version: CUDA v9.1.85, cuDNN 7
  • Conda version (if using conda): None
  • Docker image (if using docker): tensorcomprehensions/linux-xenial-gcc5-tapir5.0-cuda9-cudnn7-py3:x86
  • GCC/GXX version (if compiling from source): 5.4.1
  • LLVM/Tapir git hash used (if compiling from source): 5.0.0

Commit hash of the repo:

32414e3ef77215a0bea834dccdc85b046a66f02f

Commit hash of all the submodules: gist

Writes to non-zero based tensors

Consider:

out(i) = in(10 - i)

It will infer a range for i that does not necessarily start at zero. What should this mean for the domain of out? I don't think we have any tests for this pattern yet, and I don't have a firm idea of what the semantics should be.

Investigate and fix python and C++ flags logging

Right now LOG_IF doesn't play nicely with the user experience of dumping CUDA code from the Python interpreter. Normally users can do

env GLOG_logtostderr=1 python my_test.py -v

but this is not friendly when they have to do things from inside the Python interpreter. For this reason, I had added a separate dump_cuda flag, and GlobalDebugInit from Python was setting it. However, using LOG_IF requires setting the environment variable and doesn't play nicely with logging.

This should be fixed properly.

Protobuf version mismatch from fresh install

  • OS: Fedora 25
  • How you installed TC (docker, conda, source): conda
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • Conda version (if using conda): 4.4.11

repro:

conda clean --all
conda create -y --name tc-install python=3.6
conda activate tc-install
conda install -y -c pytorch -c tensorcomp tensor_comprehensions

# compile whatever TC
> Proto version doesn't match. TC git version is: 8e112e9dccda62c30ef29208a827e783b9a7f156 and Proto version is: ea0c39dcede6850bd3e9553eee0d4f083681c314 .This proto might be incompatible with your TC binary and can break. Please autotune against the correct TC version.

Comparison operators

The front-end doesn't understand ==, <, >, !=, etc. Hilariously, we do have a ternary operator, but I'm not sure what you would use for the first argument :)

Tagging @zdevito to add them to the lexer/parser
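
For context, a hedged sketch (hypothetical, not currently accepted by the front-end) of the kind of comparison one might want to write once the lexer/parser support it:

# Hypothetical TC; ==, <, > are not accepted by the front-end today.
lang = """
def threshold(float(N) I, float(N) T) -> (O) {
    O(n) = I(n) > T(n) ? I(n) : 0.0
}
"""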

Ternary term generates cuda errors and warnings

Hi,

  • OS: Fedora 25
  • How you installed TC (docker, conda, source): conda
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • Conda version (if using conda): 4.4.11

I am trying to write the backward of this TC. The forward works as expected, but autotuning the backward generates weird CUDA errors that are hard to understand.
I noticed that removing the ternary term on the second line of the backward solves the issue, so I guess this is the root cause.
Thanks for your help.

import tensor_comprehensions as tc
import torch


norm_lang = """
def norm_max(float(N, C, H, W) I) -> (O, P, maxVal) {{
    O(n, c, h, w) = exp(I(n, c, h, w) * {beta})
    maxVal(n, h, w) max= O(n, c, h, w)
    P(n, c, h, w) = O(n, c, h, w) / maxVal(n, h, w)
}}

def norm_max_grad(float(N, C, H, W) I, float(N, C, H, W) O_grad, float(N, C, H, W) P_grad, float(N, H, W) maxVal) -> (I_grad){{
    I_grad(n, c, h, w) +=! exp(I(n, c, h, w) * {beta}) * {beta} * maxVal(n, h, w)
    I_grad(n, c, h, w) = I_grad(n, c, h, w) - (exp(I(n, c, h, w) * {beta}) * (I(n, c, h, w) == maxVal(n, h, w) ? 1: 0))
    I_grad(n, c, h, w) = I_grad(n, c, h, w) / pow(maxVal(n, h, w), 2)
    I_grad(n, c, h, w) = I_grad(n, c, h, w) * I_grad(n, c, h, w)
}}
"""

norm_max = tc.define(norm_lang, training=True, name="norm_max",
                     backward="norm_max_grad",
                     constants={'beta': 1})
norm_max.autotune(torch.randn(8, 135, 184, 184).cuda(), generations=1, pop_size=10)

[WARNING]: Autotuning results won't be cached. 'cache' option is not set
[WARNING]: Using 'naive' type mapping options for autotuning. See help(your_layer.autotune) for how to set mapping options.
Generation 0    Jobs(Compiled, GPU)/total  (10, 10)/10   (best/median/worst)us: 21383/66695/86822
[INFO]: Autotuning the backward layer now
[WARNING]: Using 'naive' type mapping options for autotuning. See help(your_layer.autotune) for how to set mapping options.
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0313 14:40:08.700757 14911 rtc.cc:103] Compilation failure for nvrtc(NVRTC_ERROR_COMPILATION): 
default_program(32): error: expected a ")"

default_program(32): error: expected a ")"

default_program(32): error: expected a ")"

default_program(32): error: expected an expression

default_program(18): warning: variable "b0" was declared but never referenced

default_program(18): warning: variable "b2" was declared but never referenced

default_program(19): warning: variable "t2" was declared but never referenced

default_program(22): warning: variable "O_grad" was declared but never referenced

default_program(23): warning: variable "P_grad" was declared but never referenced

4 errors detected in the compilation of "default_program".
 source:
template<typename T> inline __device__ T floord(T n, T d) {
  return n < 0 ? - (-n + d - 1)/d : n / d;
}
#define if_then_else(cond,a,b) (cond) ? (a) : (b);

// Halide type handling
typedef int int32;
typedef long int64;
typedef float float32;
typedef double float64;

#define inff __int_as_float(0x7f800000)
#define inf __longlong_as_double(0x7ff0000000000000LL)

extern "C" {
__global__ void norm_max_grad_135_184_8_184(int32 C, int32 H, int32 N, int32 W, float32* pI_grad, float32* pI, float32* pO_grad, float32* pP_grad, float32* pmaxVal) {
  int b0 = blockIdx.x; int b1 = blockIdx.y; int b2 = blockIdx.z;
  int t0 = threadIdx.x; int t1 = threadIdx.y; int t2 = threadIdx.z;
  float32 (*I_grad)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pI_grad);
  float32 (*I)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pI);
  float32 (*O_grad)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pO_grad);
  float32 (*P_grad)[135][184][184] = reinterpret_cast<float32 (*)[135][184][184]>(pP_grad);
  float32 (*maxVal)[184][184] = reinterpret_cast<float32 (*)[184][184]>(pmaxVal);
  for (int c2 = 0; c2 <= 183; c2 += 32) {
    for (int c4 = 0; c4 <= 7; c4 += 1) {
      for (int c5 = 0; c5 <= min(31, -32 * b1 + 134); c5 += 1) {
        for (int c6 = t1; c6 <= min(31, -c2 + 183); c6 += 8) {
          for (int c7 = t0; c7 <= 183; c7 += 32) {
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = 0.000000f;
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7] + (exp(I[c4][32*b1 + c5][c2 + c6][c7])*maxVal[c4][c2 + c6][c7]));
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7] - (exp(I[c4][32*b1 + c5][c2 + c6][c7])*float32(if_then_else((I[c4][32*b1 + c5][c2 + c6][c7] == maxVal[c4][c2 + c6][c7]), 1, 0))));
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7]/pow(maxVal[c4][c2 + c6][c7], 2));
            I_grad[c4][32*b1 + c5][c2 + c6][c7] = (I_grad[c4][32*b1 + c5][c2 + c6][c7]*I_grad[c4][32*b1 + c5][c2 + c6][c7]);
          }
        }
      }
    }
  }
}
}

/*
Mapping Options:
tc::MappingOptions::makeNaiveMappingOptions()
    .outerScheduleFusionStrategy(tc::FusionStrategy::Preserve3Coincident)
    .outerScheduleAllowSkewing(false)
    .outerSchedulePositiveOrthant(true)
    .intraTileScheduleFusionStrategy(tc::FusionStrategy::Preserve3Coincident)
    .intraTileScheduleAllowSkewing(false)
    .intraTileSchedulePositiveOrthant(true)
    .tile(32, 32, 32)
    .mapToThreads(32, 8)
    .mapToBlocks(256, 256)
    .unroll(1)
    .tileImperfectlyNested(false)
    .useSharedMemory(false)
    .usePrivateMemory(false)
    .unrollCopyShared(false)
    .matchLibraryCalls(false);
TC version: 8e112e9dccda62c30ef29208a827e783b9a7f156
*/
[ERROR]: Raised exception: Could not compile function

[Build] ScheduleTree struct error schedule_tree.h

  • OS: Fedora 28
  • Python version: 3.6
  • CUDA: 9.1
  • GCC: 8.0.1

Hi Folks,

I am trying to package TC as an rpm package on the RedHat/Fedora platform but have issues with tc_executor.cc & schedule_tree.h. I'm not sure what the right declaration would be.

  • In short:
polyhedral/schedule_tree.h:141:44: error: ‘std::ostream& tc::polyhedral::detail::operator<<(std::ostream&, tc::polyhedral::detail::ScheduleTree&)’ should have been declared inside ‘tc::polyhedral::detail’
  • In detail:
/usr/bin/c++  -DCAFFE2_USE_GOOGLE_GLOG -DDMLC_USE_GLOG -Dtc_core_EXPORTS -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/. -I/usr/lib/python2.7/site-packages/torch/lib/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/third-party/caffe2/third_party/eigen -I/usr/include/cuda -I/usr/include/cuda/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/third-party/dlpack/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/third-party/islpp/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/include -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/version -I/home/cbalint/rpmbuild/BUILD/TensorComprehensions/build/src/proto  -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -mcet -fcf-protection -DCUDA_HOME="\"/usr/include/cuda\"" -DCUB_HOME="\"/usr/include\""  -Wall -Wno-sign-compare -fPIC  -o CMakeFiles/tc_core.dir/tc_executor.cc.o -c /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/core/tc_executor.cc -std=gnu++11
In file included from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/halide2isl.h:25,
                 from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/polyhedral/scop.h:27,
                 from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/tc_executor.h:20,
                 from /home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/core/tc_executor.cc:16:
/home/cbalint/rpmbuild/BUILD/TensorComprehensions/src/../include/tc/core/polyhedral/schedule_tree.h:141:44: error: ‘std::ostream& tc::polyhedral::detail::operator<<(std::ostream&, tc::polyhedral::detail::ScheduleTree&)’ should have been declared inside ‘tc::polyhedral::detail’
       tc::polyhedral::detail::ScheduleTree&);

Add support for strided tensors

We currently (incorrectly) treat all tensors as dense, so we need to add support for strided tensors. This can be done by adding the strides as extra parameters and either constructing the tensor views in the emitted source using the strides instead of the sizes, or doing the storage flattening ourselves before or during codegen.

checklist after talking to @ftynse [@prigoyal @protonu is working on it]

  • propagate DLTensor stride information to Halide (it has similar struct and already takes bits, lanes info, so should be easy)
  • codegen, where we emit "tensor views" by casting a pointer into a multidimensional array; the cast should be aware of strides (see the sketch after this list). Also, in the Halide->polyhedral lowering there is a place that creates constraints on tensor sizes; it may or may not need to be adapted for strides.
  • add the stride info to the tuner proto
  • make proper changes to the frontend (aten, pytorch, caffe2 etc.)
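
For intuition on the codegen item above, here is a hedged NumPy sketch (illustrative shapes only; the real change lives in the CUDA/LLVM pointer casts) of the difference between a size-derived dense view and an explicit stride-aware view:

import numpy as np

# A flat buffer viewed densely (strides derived from sizes) vs. with explicit
# strides: the strided view reads every other column, which a size-only cast
# would misinterpret.
buf = np.arange(24, dtype=np.float32)
dense = buf.reshape(4, 6)
strided = np.lib.stride_tricks.as_strided(
    buf, shape=(4, 3), strides=(6 * 4, 2 * 4))  # strides in bytes: row=24, col=8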

[Build] build on Ubuntu16.04 and fail when building third-party

  • OS: Ubuntu 16.04

  • How you installed TC (docker, conda, source): source

  • Python version: 3.5.2

  • CUDA/cuDNN version: 8.0

  • GCC/GXX version (if compiling from source): 5.4.0

  • LLVM/Tapir git hash used (if compiling from source): clang version 5.0.0 (get from the command below)

    To get the hash, run: $HOME/clang+llvm-tapir5.0/bin/clang --version

  • Commit hash of our repo and submodules (if compiling from source): ea0c39d

In addition, including the following information will also be very helpful for us to diagnose the problem:

I followed "Building from Source in Non-Conda Environment"
of https://facebookresearch.github.io/TensorComprehensions/installation_non_conda.html and problems happened when installing TC with caffe2 intergration. While the build.sh is trying to build some third-party such as Halide I got some error:

/home/jimmyoic/TensorComprehensions/third-party/halide/src/Error.cpp: In destructor 'Halide::Internal::ErrorReport::~ErrorReport()':
/home/jimmyoic/TensorComprehensions/third-party/halide/src/Error.cpp:121:15: error: exception handling disabled, use -fexceptions to enable
throw err;
^
At global scope:
cc1plus: error: unrecognized command line option '-Wno-unknown-warning-option' [-Werror]
cc1plus: all warnings being treated as errors

Note that since I already had protobuf 2.6 on my server, I built protobuf 3.4 and changed the path in build.sh where it finds protobuf.

BTW, is it possible to build those third-party dependencies myself rather than through the build script? I want to make sure I configure this correctly (if it works).

Thanks!

Unable to use same variable for reduction twice or more in TC

Do we not allow people to use the same variable twice? A repro is also available in #61.

FAILS:
def softmax(float(N, D) I) -> (O, tmp) {
        tmp(n) max=! I(n, d)
        O(n, d) = exp(I(n, d) - tmp(n))
        tmp(n) +=! O(n, d)
        O(n, d) = O(n, d) / tmp(n)
}

PASSES:
def softmax(float(N, D) I) -> (O, expsum, maxVal) {
        maxVal(n) max= I(n, d)
        expsum(n) +=! exp(I(n, d) - maxVal(n))
        O(n, d) = exp(I(n, d) - maxVal(n)) / expsum(n)
 }

Scattering

We currently support computed expressions in the indices on the RHS. We should plumb through computed expressions on the LHS too, e.g.:

hist(im(i)) +=! 1
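
For reference, a hedged NumPy sketch of the scatter semantics this implies (hypothetical sizes):

import numpy as np

# Equivalent of hist(im(i)) +=! 1: every pixel value scatters an increment
# into the histogram bucket it indexes; duplicate indices must accumulate.
im = np.random.randint(0, 256, size=1000)
hist = np.zeros(256, dtype=np.int64)
np.add.at(hist, im, 1)  # unbuffered scatter-add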

Improve SIGINT/SIGTERM during autotuning

When the signal handling logic receives a SIGINT/SIGTERM it dumps the compilation caches, waits for the tuning sessions to finish and then throws an exception.

Maybe something got lost during refactoring, but the current logic is not very useful:

  • The caches are stored after autotuning even if a signal was not raised.
  • If the user has to wait until autotuning exits normally, why is an exception thrown?
  • Not being able to immediately stop the autotuner with a SIGTERM is annoying.

@prigoyal what are the constraints imposed by the python bindings?

Reference group belongs to two error

Generation 0    Job[Compiled, GPU] (10, 8)/10   Time (us): best: 97 median: 347 worst: 427
F0223 14:03:31.068509 111930 codegen_cuda.cc:730] Check failed: !promotionInfo.groupId reference __tc_ref_10 belongs to two groups: _centered_0 and _var_0
*** Check failure stack trace: ***
    @     0x7fce13cf2c3d  google::LogMessage::Fail()
    @     0x7fce13cf4b88  google::LogMessage::SendToLog()
    @     0x7fce13cf2723  google::LogMessage::Flush()
    @     0x7fce13cf54be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fce135fdfae  tc::polyhedral::detail::emitMappedTensorAccess()
    @     0x7fce135fa860  tc::polyhedral::(anonymous namespace)::emitUserStmt()
    @     0x7fce135fc01a  tc::polyhedral::(anonymous namespace)::AstPrinter::emitStmt()
    @     0x7fce135fc586  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f805f  tc::polyhedral::(anonymous namespace)::AstPrinter::emitFor()
    @     0x7fce135fc389  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f805f  tc::polyhedral::(anonymous namespace)::AstPrinter::emitFor()
    @     0x7fce135fc389  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f830e  tc::polyhedral::(anonymous namespace)::AstPrinter::emitIf()
    @     0x7fce135fc3cf  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135fc474  tc::polyhedral::(anonymous namespace)::AstPrinter::emitAst()
    @     0x7fce135f6bcb  tc::polyhedral::(anonymous namespace)::AstPrinter::emit()
    @     0x7fce1360065a  tc::polyhedral::emitCudaKernel()
    @     0x7fce1364c6b2  tc::polyhedral::MappedScop::codegen()
    @     0x7fce1356c73b  tc::TcExecutor::compileWithTcMapper()
    @     0x7fce1356b7af  tc::TcExecutor::compile()
    @     0x7fce13554725  tc::ExecutionEngine::compile()
    @     0x7fce14777746  tc::autotune::detail::GeneticTunerHarness::doCompile()
    @     0x7fce147790bc  _ZZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmENKUlvE1_clEv
    @     0x7fce1477cb6e  _ZNSt12_Bind_simpleIFZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmEUlvE1_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7fce1477ca1d  _ZNSt12_Bind_simpleIFZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmEUlvE1_vEEclEv
    @     0x7fce1477c94e  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN2tc8autotune6detail19GeneticTunerHarness16runOneGenerationEmEUlvE1_vEEE6_M_runEv
    @     0x7fce260a6c5c  execute_native_thread_routine_compat
    @     0x7fce54fcce25  start_thread

Typo in Start example

in Docs: Let’s start with a simple example is a matrix vector product:
def mv(float(R,C) A, float(C) x) -> (o) {
o(i) += A(i,j) * b(j)
}

should be:
o(i) += A(i,j) * x(j)

Variable tensor sizes support for TC

Is there a way for us to support variable tensor sizes? Right now, if the tensor size changes, we have to recompile and cache again. But in computer vision and NLP, people often have models where the tensor size for a layer changes at every step of training.

for example:

def sum_reduce_dims023_4d(float(N, C, H, W) X) -> (Y) {
   Y(c) +=! X(n, c, h, w)
}

Should the user be writing this for all input sizes, compiling and caching every time?

This is a big use case for computer vision and NLP models, and having to compile at every step is going to slow things down quite a bit.
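
For illustration, a hedged sketch (hypothetical helper, not part of the TC API) of the per-size recompilation this issue describes, caching the autotuned options keyed by input shape:

import tensor_comprehensions as tc
import torch

lang = """
def sum_reduce_dims023_4d(float(N, C, H, W) X) -> (Y) {
    Y(c) +=! X(n, c, h, w)
}
"""
layer = tc.define(lang, name="sum_reduce_dims023_4d")
options_by_shape = {}  # hypothetical per-shape cache

def run(X):
    key = tuple(X.shape)
    if key not in options_by_shape:  # autotune once per unseen shape
        options_by_shape[key] = layer.autotune(X, cache=True)
    return layer(X, options=options_by_shape[key])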

TC conda package for CUDA9

There is interest in having a TC conda package that is compatible with CUDA 9. I'll look into packaging that.

Indirections on LHS [for indexing grad]

The reason for wanting LHS indirection is that, say, someone has a lookup table:

def lut(float(B, R) M, int32(B, N) I) -> (O) {
    O(b, n) +=! M(I(b, n), r)
}

Now they want to define the gradient for this; the most intuitive way to write the backward TC would be

M(I(b, n), r) = O_grad(b, r)

from @ftynse

  • without other uses of M, we cannot infer its size in the first dimension, even with a where clause; this is not a problem for input tensors whose size is specified
  • Halide->Polyhedral lowering pass should differentiate between must writes (all writes are must-writes now) and may writes
  • so should polyhedral dependence analysis
  • advanced transformations (reductions, shared memory) should be disabled
  • codegen should properly emit LHS indirections; hopefully this just works thanks to Halide
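
For intuition, a rough PyTorch sketch (hypothetical shapes; the exact gradient formula is the one sketched in the issue above) of the scatter-style accumulation such an LHS indirection would have to lower to:

import torch

B, N, R = 4, 6, 8                      # hypothetical sizes
M = torch.randn(B, R)                  # lookup table indexed indirectly
I = torch.randint(0, B, (B, N))        # integer indices
O_grad = torch.randn(B, N)             # gradient flowing back from O(b, n)

# Each (b, n) scatters its incoming gradient into row I(b, n) of M_grad,
# accumulating when indices repeat (a may-write with reduction semantics).
M_grad = torch.zeros_like(M)
source = O_grad.reshape(-1, 1).expand(-1, R).contiguous()
M_grad.index_add_(0, I.reshape(-1), source)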

[Build] Using protobuf installation at non-standard location

I want to use a protobuf installation within $HOME to build TC. CMake picks up the path to the protoc application but seems to fail to pick up the right include directories:

In file included from /home/sanjay/Projects/TensorComprehensions/build/src/proto/compcache.pb.cc:5:0:
/home/sanjay/Projects/TensorComprehensions/build/src/proto/compcache.pb.h:9:42: fatal error: google/protobuf/stubs/common.h: No such file or directory
compilation terminated.
In file included from /home/sanjay/Projects/TensorComprehensions/build/src/proto/mapping_options.pb.cc:5:0:
/home/sanjay//Projects/TensorComprehensions/build/src/proto/mapping_options.pb.h:9:42: fatal error: google/protobuf/stubs/common.h: No such file or directory
compilation terminated.
src/proto/CMakeFiles/tc_proto.dir/build.make:110: recipe for target 'src/proto/CMakeFiles/tc_proto.dir/compcache.pb.cc.o' failed
make[2]: *** [src/proto/CMakeFiles/tc_proto.dir/compcache.pb.cc.o] Error 1

with tc_proto.dir/build.make having the line,

cd /home/sanjay/Projects/TensorComprehensions/build/src/proto && /home/sanjay/Projects/proto-install/protobuf-3.4.0/install/bin/protoc --cpp_out /home/sanjay/Projects/TensorComprehensions/build/src/proto -I /home/sanjay/Projects/TensorComprehensions/src/proto /home/sanjay/Projects/TensorComprehensions/src/proto/compcache.proto

What can I do to ensure that the requisite include paths are used?

Setup:

  • OS: Ubuntu 16.04
  • How you installed TC (docker, conda, source): build from source within a conda environment
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 8
  • Conda version (if using conda): 4.4.11
  • Docker image (if using docker):
  • GCC/GXX version (if compiling from source): 5.4
  • LLVM/Tapir git hash used (if compiling from source): 5.0.0git-195a877
  • Commit hash of our repo and submodules (if compiling from source): recursively clone from 8c345a5 of TC.

Improved Layer Fusion Support

It seems like one of the huge benefits of a tool like TC is the ability to fuse layers and optimize the schedule jointly. There may be significant speed-ups from fusing all the layers of a full model and auto-scheduling for example. Unless I'm reading the docs wrong, the only way to currently do this fusion is to define an entire model in one TC language. However, since there are already many standard layers in tc.database, it would be much cleaner to define the model as a series of TC layers. This could be implemented by adding a fuse function that takes a list of individual languages and produces one large one that can then be optimized. If something like this feature already exists, adding a fusion tutorial to the docs would prevent other people like me from getting confused.
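
A hedged sketch of the proposal (a hypothetical fuse helper, not part of the current API), which simply concatenates individual TC definitions into one language string that can then be defined and tuned as a single unit:

def fuse(*langs):
    """Hypothetical helper: concatenate several TC definition strings into one
    language string so the kernels can be compiled and autotuned together."""
    return "\n\n".join(l.strip() for l in langs)

# Usage sketch (assuming conv_lang and relu_lang are existing TC strings):
# fused_lang = fuse(conv_lang, relu_lang)
# model = tc.define(fused_lang, name="conv_relu")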

TC build support for PyTorch master

There is user interest in making TC compatible with PyTorch master. This also means that users want to use and develop TC against current PyTorch master.

cc @soumith

  • Bump aten submodule and make necessary changes #155
  • add instructions to build pytorch from source with/without conda deps

Bitwise Operation Support

Tensor Comprehensions seem like a great way to implement binary approximation of layers as in BinaryNet. However, TC does not currently support bitwise operations such as xor (^), or (|), and and (&). Additionally, implementing these approximations would require the popc CUDA intrinsic, which is not supported. Getting these functions integrated would enable really cool and fast applications that could highlight the strengths of TC.
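
For context, a hedged plain-Python sketch (hypothetical 32-bit packing) of the xnor-plus-popcount inner product that bitwise operators and a popc intrinsic would enable inside a TC kernel:

def binary_dot(a_bits, b_bits, nbits=32):
    """Dot product of two {-1, +1} vectors packed as bits."""
    xnor = ~(a_bits ^ b_bits) & ((1 << nbits) - 1)  # 1 where the signs agree
    matches = bin(xnor).count("1")                  # popcount
    return 2 * matches - nbits                      # agreements minus disagreements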

Nested Function Calls in TC

There is some interest in being able to do nested function calls, i.e. define one TC and call it from inside another one.

lang = """
def max() {
...
}
"""

lang1 = """
def elementwise() -> (output) {
    output = max()
}
"""

This also makes a lot of sense for cases where you want to use a predefined database of TCs and just make calls into it.

[EDIT]: pasting here @leonbottou 's remarks:

I'd love to see the ability to invoke a TC from another. Not as a procedure call, but as type-aware inlining. Here is an example for a possible syntax:
def weirdbatchedmatmul( float(N,W,H) x, float(N,M,W) k )
    -> float(M,W,H) y 
{
  y(*,i,j) = matmul( x(*,i,j), k(*,*,i) )
}

Then we can reimplement select and narrow
def select0( float(A,B,C) x, int S) -> o {
   o(i,j) = x(S,i,j)
}
def narrow0( float(A,B,C) x, int L, int F) -> o {
   o(k,i,j) = x(k+F,i,j) where k in 0:L-1
}

Benefits:

  • Making the TC language a little less write-only!
  • Bridging the gap between the TC style and the Lush/Torch/TF/PyTorch style of tensor programming.
  • Encapsulating layers into a library of ready-to-inline TCs.

max=! fails with inff not known to nvrtc

FAILS
def softmax(float(N, D) I) -> (O, expsum, maxVal) {
        maxVal(n) max=! I(n, d)
        expsum(n) +=! exp(I(n, d) - maxVal(n))
        O(n, d) = exp(I(n, d) - maxVal(n)) / expsum(n)
}
Error: https://gist.github.com/prigoyal/0318a964368f86b3a6a09228c7bfbf71

PASSES (bang removed from max=)
def softmax(float(N, D) I) -> (O, expsum, maxVal) {
        maxVal(n) max= I(n, d)
        expsum(n) +=! exp(I(n, d) - maxVal(n))
        O(n, d) = exp(I(n, d) - maxVal(n)) / expsum(n)
}

Please support Tensor Comprehensions on macOS


Please consider the following information:

  • OS: macOS 10.13.3
  • How you installed TC (docker, conda, source): could not install (not supported)
  • Python version: 2.7.x
  • Conda version (if using conda): 4.4.11

Allow using scalars for bounds inference

We can't use scalar inputs in bounds inference right now. For example:

LANG="""
def avgpool(float(B, C, H, W) input, float kH, float kW, float sH, float sW) -> (output) {
    output(b, c, h, w) += input(b, c, h * sH + kh, w * sW + kw) where kh in 0:kH, kw in 0:kW
}
"""

LANG="""
def avgpool(float(B, C, H, W) input, float kH, float kW) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:kH, kw in 0:kW
}
"""

avgpool = tc.define(LANG, name="avgpool")
inp = torch.ones(1, 1, 4, 4).cuda()
kH = torch.randn(1).fill_(2.0).cuda()
kW = torch.randn(1).fill_(2.0).cuda()
sH = torch.randn(1).fill_(1.0).cuda()
sW = torch.randn(1).fill_(1.0).cuda()
out = avgpool(inp, kH, kW, sH, sW)

This will fail.

The workaround right now is to substitute those scalars into the TC text before passing it to the backend, as sketched below.
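
A hedged sketch of that workaround, using the constants= mechanism (as in the norm_max issue above) to bake the scalars into the TC text before it reaches the backend:

import tensor_comprehensions as tc
import torch

# Workaround sketch: pass kernel/stride scalars as substituted constants
# rather than as tensor inputs.
LANG = """
def avgpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) += input(b, c, h * {sH} + kh, w * {sW} + kw)
        where kh in 0:{kH}, kw in 0:{kW}
}}
"""
avgpool = tc.define(LANG, name="avgpool",
                    constants={"kH": 2, "kW": 2, "sH": 1, "sW": 1})
inp = torch.ones(1, 1, 4, 4).cuda()
out = avgpool(inp)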

Handle temporaries

Right now, everything is either an input or an output, the reason being that TC does not do any allocation by itself. We should find a better way to handle this.

cc Albert Cohen who is interested in this

Uniformize docker images

All our trusty docker images used to inherit from FROM nvidia/cuda:8.0-cudnn6-devel-ubuntu14.04 and all our xenial docker images used to inherit from FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04.

This has changed drastically: I see that we now start from a bare Ubuntu image unvetted by NVIDIA and install CUDA/cuDNN with a custom script.

Still, when providing the Docker image for the conda_recipes we use the NVIDIA image again.

I would prefer that we only use the NVIDIA images, as we did for months, but let's discuss whether something else makes more sense in a different CI context.

Can't compile max reduction

When submitting a bug report, please include the following information (where relevant):

  • OS: ubuntu 16.04
  • How you installed TC (docker, conda, source): conda
  • Python version: 3.6
  • CUDA/cuDNN version: 9 / 7
  • Conda version (if using conda): conda 4.3.30
  • Docker image (if using docker): nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
  • GCC/GXX version (if compiling from source): 5.4.0
  • LLVM/Tapir git hash used (if compiling from source): NA
  • Commit hash of our repo and submodules (if compiling from source): NA

Hi, I'm experimenting, and in particular I'm trying to implement the convolutional part of LeNet as a single TC function. However, max pooling doesn't work for me.

Here is the script:

lang = """
def lenet_convpart(float(B, 32, 32) I0, 
                   float(6, 5, 5) Weight1, float(6) Bias1) -> (O1, O2) {

    O1(b, c, h, w) +=! I0(b, h + kh, w + kw) * Weight1(c, kh, kw)
    O1(b, c, h, w) = fmax(0, O1(b, c, h, w) + Bias1(c))
    # trivial reduction not working
    O2(b, c, h, w) max=! O1(b, c, h, w)
    # this one is of actual interest
    # O2(b, c, h1, w1) max=! O1(b, c, h1 * 2 + kh1, w1 * 2 + kw1) where kh1 in 0:2, kw1 in 0:2
}
"""
N = 100
tensordot = tc.define(lang, name="lenet_convpart")
I0 = torch.randn(100, 32, 32).cuda()
W1 = torch.randn(6, 5, 5).cuda()
B1 = torch.randn(6).cuda()

best_options = tensordot.autotune(I0, W1, B1, cache=True)
out = tensordot(I0, W1, B1, options=best_options)

Here is the output:

[INFO]: Autotuning cache will be saved to: /tmp/lenet_convpart_100_32_32_6_5_5_6_53fe0682-3df7-42ae-9592-f279767ec145.cuda/options
[WARNING]: Using 'naive' type mapping options for autotuning. See help(your_layer.autotune) for how to set mapping options.
[ERROR]: Caught Exception: Could not compile function

Padding/Boundary conditions

We need some way to express padding of input tensors. The current thought is a primitive pad(expr, expr), which returns the first argument if it does no out-of-bounds reads, and otherwise returns the second argument.
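
A hedged sketch of how the proposed primitive might read (this pad(...) syntax is a proposal only and is not implemented):

# Proposed syntax only; pad(expr, default) is not part of the language yet.
lang = """
def conv_same(float(N, C, H, W) I, float(M, C, KH, KW) W1) -> (O) {
    O(n, m, h, w) +=! pad(I(n, c, h + kh - 1, w + kw - 1), 0.0) * W1(m, c, kh, kw)
}
"""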

Support for triangular dependencies in TC language

One of our users reported on the Slack channel that they were trying to translate the following NumPy code

def A_matvec_batch(A, X):
    n, m = X.shape
    Y = np.zeros((n, m))
    for j in range(m):
        for i in range(n):
            # Out of bound values lead to the following rules to define limit in s index    
            # s_min = -1 for all rows except the zeroth and 0 for the zero row
            # s_max = 2 for all rows except the last one and 1 for the last row 
            s_min = max(-1, -i)
            s_max = min(2, n - i)
            for s in range(s_min, s_max):  # this loop simplifies differentiation
                Y[i, j] += A[i, s + 1] * X[i + s, j]
    return Y

to the TC:

lang = """
def A_mv(float(n, 3) A, float(n, size) X) -> (O) {
      output(i, j) +=! A(i, s+1) * X(i+s, j) where s in (-1<=-i ? -i : -1):(2<=n-i ? 2 : n-i) 
}
"""

but this TC fails to compile. The reason is that there are triangular dependencies in the bounds inference, i.e. the bounds of s depend on i. This is currently not supported in TC.
The good news is that it is entirely a language and inference issue, not a polyhedral backend issue.

@abadams has a nice proposal to handle this; we will discuss it further and support such TCs.

Refactor build.sh when having both conda and source mode

#72 introduces a bunch of build.sh scripts with content that is largely copy-pasted from the original build.sh. It would be nice to structure things so that:

  1. the main build.sh calls the auxiliary smaller build scripts with some set of options
  2. the conda recipes call the smaller build scripts with possibly another set of options

[Installation] Problem importing tensor_comprehensions

Hi,

I installed TensorComprehensions in a conda environment and the installation was successful, but I couldn't import tensor_comprehensions:

ImportError: No module named 'tensor_comprehensions'

Then I installed the package from source, but I encountered the same error again. I tried running it with both Python 2 and Python 3.

Exception thrown: "isl.h:3185: NULL input thrown in the test body"

One of our users reported an issue with TC:

def graph2(float(N, C, H, W) I, float(N, C, R, T) J, float(KH, KW) W1) -> (O, Out) {
        O(n, c, h, w) +=! J(n, c, h + kh, w + kw) * W1(kh, kw)
        Out(i, j) +=! I(n, i, h, w) * O(n, j, h, w)
}

which consistently throws this error:

unknown file: Failure
C++ exception with description "/home/prigoyal/local/TensorComprehensions/third-party-install/include/isl/interface/isl.h:3185: NULL input" thrown in the test body.

Ensure types other than float/int32 are supported

  • teach parser about bool, int8, int16, int32, int64, float16, float64
  • make sure Halide types are properly constructed
  • make sure polyhedral dependence analysis behaves correctly for complex types
  • teach memory promoter about byte-sizes of each type (@ftynse)
  • write tests involving all types

Support for RNN loops

There is big interest in having support for RNN loops in TC. Creating this master task to discuss further and track progress.

test_mapper_llvm fails because wrong cilkrts is found

With ac1f3c3, I get

Running build/test/test_mapper_llvm
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from LLVMCodegen
[ RUN      ] LLVMCodegen.Basic
unknown file: Failure
C++ exception with description "Failed to find cilkrts: /usr/lib32/libcilkrts.so.5: wrong ELF class: ELFCLASS32" thrown in the test body.
[  FAILED  ] LLVMCodegen.Basic (12 ms)

The output of ldconfig -p | grep libcilkrts is

        libcilkrts.so.5 (libc6,x32) => /usr/libx32/libcilkrts.so.5
        libcilkrts.so.5 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcilkrts.so.5
        libcilkrts.so.5 (libc6) => /usr/lib32/libcilkrts.so.5
