tiramisu-compiler / tiramisu
A polyhedral compiler for expressing fast and portable data parallel algorithms
Home Page: http://tiramisu-compiler.org
License: MIT License
Tiramisu depends on LLVM/Clang 5 because its Halide branch is outdated.
The same applies to ISL, which should be updated to 0.20.
Any idea when this update would happen?
If ENABLE_DEBUG is turned OFF and make tests is run, the following is printed:
100010001000 + + + Lowering after final simplification
This should not be printed.
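A minimal sketch of the kind of guard that would keep this quiet when the flag is off; the function name and flag handling below are illustrative, not the actual Tiramisu internals:
#include <iostream>
#include <string>

// Illustrative only: dump lowering stages only when debugging is enabled.
void print_lowering_stage(const std::string &msg)
{
#ifdef ENABLE_DEBUG
    std::cout << msg << std::endl;
#else
    (void) msg;  // stay silent when ENABLE_DEBUG is OFF
#endif
}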
When there is an ISL syntax error we get something like "syntax error (1, 53): unknown identifier",
and there is no clue which ISL statement is causing the error. We should print the erroneous ISL statement or the line number.
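A minimal sketch of one way to do this, assuming the maps are parsed from strings with the standard ISL reader (the wrapper name below is illustrative, not existing Tiramisu code):
#include <iostream>
#include <string>
#include <isl/ctx.h>
#include <isl/map.h>

// Illustrative wrapper: report the offending ISL string when parsing fails.
isl_map *read_map_checked(isl_ctx *ctx, const std::string &str)
{
    isl_map *m = isl_map_read_from_str(ctx, str.c_str());
    if (m == nullptr)
        std::cerr << "ISL syntax error while parsing: " << str << std::endl;
    return m;
}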
I see Halide has an autodiff feature based on this paper: https://people.csail.mit.edu/tzumao/gradient_halide/gradient_halide.pdf
Do you plan to add autodiff to Tiramisu?
Halide buffer indices are reversed compared to the C convention, and thus to Tiramisu. For example, if we define a buffer buf(10, 20), it is passed to Tiramisu as an array of shape buf[20][10]. Consequently, the Halide statement buf(x, y) = z sets buf[y][x] instead of buf[x][y]. This creates unexpected problems.
The issue does not come up in the tutorials because either the outputs are constant, so the index flip does not matter, or we don't check the outputs for validity at all.
I will fix the tutorials and use one tutorial to explain this behavior.
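A small self-contained sketch of the flip, assuming Halide's convention that the first buffer dimension is the innermost (fastest-varying) one:
#include "Halide.h"

int main() {
    // Extents (10, 20): in Halide the first extent (10) is the innermost dimension.
    Halide::Buffer<float> buf(10, 20);
    // buf(x, y) touches the element that row-major C code, and hence Tiramisu,
    // addresses as buf_c[y][x] for an array declared float buf_c[20][10].
    buf(3, 5) = 1.0f;  // same memory location as buf_c[5][3]
    return 0;
}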
The __FILE__ and __LINE__ macros point to this file (tiramisu_debug.cpp:41) instead of where the function is called. We should implement it as a macro instead of a function; a sketch follows after the lines referenced below.
tiramisu/src/tiramisu_debug.cpp
Lines 38 to 47 in fed7a55
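A minimal sketch of the macro-based approach with illustrative names (not the actual Tiramisu error API); the macro makes __FILE__ and __LINE__ expand at the call site:
#include <cstdlib>
#include <iostream>
#include <string>

// Illustrative implementation function: receives the caller's location explicitly.
inline void tiramisu_error_impl(const std::string &msg, bool fatal,
                                const char *file, int line)
{
    std::cerr << file << ":" << line << " Error: " << msg << std::endl;
    if (fatal)
        std::exit(1);
}

// The macro expands at the call site, so __FILE__/__LINE__ report the caller.
#define TIRAMISU_ERROR(msg, fatal) \
    tiramisu_error_impl((msg), (fatal), __FILE__, __LINE__)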
I would be very interested in an RNN implementation example using Tiramisu.
Unfortunately, due to #217 I cannot explore that myself at the moment.
Also, on the tiramisu.github.io website you claim that Halide cannot represent RNNs, but time is just another loop/tensor dimension. Case in point, this seems to be an actual implementation of an LSTM in Halide: https://github.com/misaka-10032/Halstm/blob/master/src/layer/lstm.cpp.
One thing I would be very interested in is the wavefront optimisation on stacked RNNs, as mentioned in optimization 3 of Nvidia's blog post.
They even provide the CUDA source code, which can serve as a reference benchmark.
In
Shouldn't A(i,j) be A(i,k) instead?
Hello Tiramisu developers,
I am a software developer interested in the Tiramisu project.
I am currently reading the Tiramisu code and find the usage of ISL for the polyhedral analysis complex.
I wonder how you developers mastered ISL so fluently; as far as I could find on the Internet, there are only a few documents about ISL, with very few demos and examples of the API.
After reading that documentation, I still found it hard to understand the ISL usage in the Tiramisu project.
Can you suggest some materials for mastering the ISL APIs?
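For context, a tiny standalone example of the string-based ISL pattern that Tiramisu relies on throughout (an illustration, not code from the Tiramisu repository):
#include <isl/ctx.h>
#include <isl/set.h>
#include <isl/map.h>

int main() {
    isl_ctx *ctx = isl_ctx_alloc();
    // Build an iteration domain and a schedule from their textual representations.
    isl_set *domain   = isl_set_read_from_str(ctx, "{ S[i] : 0 <= i < 10 }");
    isl_map *schedule = isl_map_read_from_str(ctx, "{ S[i] -> [i] }");
    // Apply the schedule to the domain (both arguments are consumed by ISL).
    isl_set *times = isl_set_apply(domain, schedule);
    isl_set_dump(times);  // prints the scheduled range: { [i] : 0 <= i <= 9 }
    isl_set_free(times);
    isl_ctx_free(ctx);
    return 0;
}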
When Travis tests are enabled in:
Line 22 in 22c6c09
src/tiramisu_core.cpp has over 8800 lines, which makes it very difficult to navigate. It would be convenient to split it into per-class definition files such as tiramisu_computation.cpp, tiramisu_buffer.cpp, etc. This might be difficult to do in one shot since the system has many interconnected parts, so we can start moving things around function by function.
As your paper says, FPGAs are supported by now, right?
I am quite new to the polyhedral model and may still be unfamiliar with related concepts, so please point out if I made any mistakes.
I would like to know whether there are any methods for parallelizing or loop tiling that automatically resolve data dependences. To be more specific, consider the following one-dimensional stencil computation:
for (t = 1; t < T; t += 1)
    for (i = 1; i < N - 1; i += 1)
        A[t][i] = 0.25 * (A[t - 1][i + 1] - 2.0 * A[t - 1][i] + A[t - 1][i - 1]);
Since computing A[t][i] needs to read A[t - 1][i + 1], the statement instance (t, i) has to be executed after the statement instance (t - 1, i + 1). So the two-level loop cannot simply be tiled, otherwise the data dependence will be violated.
However, the Computation::tile function in Tiramisu does not seem to make any effort to resolve the data dependence:
#include <tiramisu/tiramisu.h>
using namespace tiramisu;

int main() {
    tiramisu::init("stencil");
    const int SIZE_T = 200, SIZE_N = 100;
    constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));
    var t("t", 1, T), i("i", 1, N - 1);
    computation A("A", {t, i}, p_float32);
    A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
    // var t0("t0"), i0("i0"), t1("t1"), i1("i1");
    // A.tile(t, i, 32, 32, t0, i0, t1, i1);
    buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_input);
    A.store_in(&b_A);
    tiramisu::codegen({&b_A}, "stencil.o");
}
Uncomment the two lines related to loop tiling, and the output Halide IR changes from:
Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce {
allocate _A_b0[float32 * 98 * 199]
for (c1, 1, 199) {
for (c3, 1, 98) {
b_A[(c3 + int32((int64(c1)*(int64)100)))] = (((b_A[(int32((int64(c3) + (int64)1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))] - (b_A[(c3 + int32(((int64(c1)*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(c3) + (int64)-1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))])*0.250000f)
}
}
}
to:
Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce {
allocate _A_b0[float32 * 98 * 199]
for (c1, 0, 7) {
for (c3, 0, 4) {
for (c5, (1 - min((c1*32), 1)), ((min((c1*32), 1) - max((c1*32), 168)) + 199)) {
for (c7, (1 - min((c3*32), 1)), ((min((c3*32), 1) - max((c3*32), 67)) + 98)) {
b_A[(((c3*32) + c7) + int32((int64(((c1*32) + c5))*(int64)100)))] = (((b_A[(int32((int64(((c3*32) + c7)) + (int64)1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))] - (b_A[(((c3*32) + c7) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(((c3*32) + c7)) + (int64)-1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))])*0.250000f)
}
}
}
}
}
where the data dependence is violated, and the test fails as well.
I have heard that the Pluto algorithm can be applied in such a scenario; it automatically skews the iteration domain to resolve the data dependence:
for (t = 1; t < T; t += 1)
    for (i = 1 + t; i < N - 1 + t; i += 1)
        A[t][i - t] = 0.25 * (A[t - 1][i - t + 1] - 2.0 * A[t - 1][i - t] + A[t - 1][i - t - 1]);
and the loop can be safely tiled. It is also possible to skew the iteration domain and perform loop tiling in Tiramisu:
#include <tiramisu/tiramisu.h>
using namespace tiramisu;

int main() {
    tiramisu::init("stencil");
    const int SIZE_T = 200, SIZE_N = 100;
    constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));
    var t("t", 1, T), i("i", 1, N - 1);
    computation A("A", {t, i}, p_float32);
    A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
    var nt("nt"), ni("ni");
    A.skew(t, i, 1, nt, ni);
    var t0("t0"), i0("i0"), t1("t1"), i1("i1");
    A.tile(nt, ni, 32, 32, t0, i0, t1, i1);
    buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_output);
    A.store_in(&b_A);
    tiramisu::codegen({&b_A}, "stencil.o");
}
This passes the test. But it requires my own observation of the loop pattern to make such a transformation. Moreover, if I want to parallelize the loop, it requires not only skewing but also synchronization and communication between parallel computation units. This seems complicated to me, but it can theoretically be automated through the Pluto algorithm. This is why I would like to know: are there any methods in Tiramisu for parallelizing or loop tiling that automatically resolve data dependences?
tiramisu/src/tiramisu_codegen_cuda.cpp
Lines 772 to 774 in 0051562
Host code generation for CUDA has a subtle bug. The wrapper copies the kernel's arguments as its own arguments (kernel->get_arguments() above). However, if the kernel drops an unused argument while the wrapper still needs it for the block dimensions, the generated code becomes faulty:
int32_t _kernel_0_wrapper(int32_t K, int32_t N) {
    dim3 blocks(((M / 96) + 1), ((N / 256) + 1), 1);
    dim3 threads((15 + 1), (15 + 1), 1);
    _kernel_0<<<blocks, threads>>>(K, N);
    return 0;
}
M is undefined above and should have been passed to the wrapper as an argument (a corrected wrapper is sketched below). I'll work on this ASAP since it blocks the GEMM benchmark.
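What the generated wrapper should roughly look like once the dropped argument is kept, assuming M is only needed for the block dimensions (a sketch, not the actual generator output):
int32_t _kernel_0_wrapper(int32_t M, int32_t K, int32_t N) {
    // M is kept as a wrapper argument even though the kernel itself dropped it,
    // because the block dimensions still depend on it.
    dim3 blocks(((M / 96) + 1), ((N / 256) + 1), 1);
    dim3 threads((15 + 1), (15 + 1), 1);
    _kernel_0<<<blocks, threads>>>(K, N);
    return 0;
}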
The CUDA code generator classes are not documented.
Examples of undocumented classes:
https://tiramisu-compiler.github.io/doc/namespacetiramisu_1_1cuda__ast.html
https://tiramisu-compiler.github.io/doc/structtiramisu_1_1_halide_codegen_output.html
https://tiramisu-compiler.github.io/doc/structtiramisu_1_1isl__ast__expr__deleter.html
The following tests are significantly slower with Tiramisu than with Halide:
Other failures:
This is on a MacBook Pro 2018, macOS Mojave, with the latest Xcode and Homebrew up to date.
I downloaded the virtual machine image from http://groups.csail.mit.edu/commit/software/TiramisuVM.zip to try it out, and when I unzip it I get the following:
bash-3.2$ unzip TiramisuVM.zip
Archive: TiramisuVM.zip
warning [TiramisuVM.zip]: 4294967296 extra bytes at beginning or within zipfile
(attempting to process anyway)
file #1: bad zipfile offset (local header sig): 4294967296
(attempting to re-compensate)
creating: TiramisuVM/
inflating: TiramisuVM/.DS_Store
creating: __MACOSX/
creating: __MACOSX/TiramisuVM/
inflating: __MACOSX/TiramisuVM/._.DS_Store
creating: TiramisuVM/Logs/
inflating: TiramisuVM/Logs/VBox.log
inflating: TiramisuVM/TiramisuVM.vbox
inflating: TiramisuVM/TiramisuVM.vbox-prev
inflating: TiramisuVM/TiramisuVM.vdi
error: invalid compressed data to inflate
Could you please provide a working VM image?
We need to add GPU tests to Travis CI.
The class 'computation' in the file tiramisu/include/tiramisu/core.h is missing a virtual destructor.
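A minimal sketch of the fix, assuming computation objects may be deleted through base-class pointers (the rest of the class is elided):
namespace tiramisu {

class computation {
public:
    // ... existing interface elided ...

    // Virtual destructor so that derived computations are destroyed correctly
    // when deleted through a computation*.
    virtual ~computation() = default;
};

}  // namespace tiramisu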
Copying C from host to GPU and copying A and B from GPU to host are not really needed.
The CI build is failing. It seems to have something to do with Halide's ABI not matching LLVM's; I'm not sure how to fix that.
This seems to be an error due to MPI, maybe because this benchmark is not disabled when WITH_MPI is not set.
Assuming that WITH_MPI=false, if the user runs make benchmarks, we get the following error messages:
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:165:32: error: too few arguments to function call, single argument 'level' was not specified
init_even.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:166:31: error: too few arguments to function call, single argument 'level' was not specified
init_odd.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:167:36: error: too few arguments to function call, single argument 'level' was not specified
out_comp_even.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:168:35: error: too few arguments to function call, single argument 'level' was not specified
out_comp_odd.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:191:12: error: use of undeclared identifier 'TOTAL_ITERATIONS'
assert(TOTAL_ITERATIONS % 2 == 0); // so that the output buffer is the last odd computation
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:261:21: error: 'lift_dist_comps' is a protected member of 'tiramisu::function'
heat2d_tiramisu.lift_dist_comps(); // MUST go before gen_isl_ast
^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:680:10: note: declared protected here
void lift_dist_comps();
^
I tried to follow the instructions on an up-to-date Arch Linux distro with GCC 8.3.0 and Clang/LLVM 8.0.0 but had several issues with both LLVM and Halide, and had to give up trying Tiramisu in the end.
Note that I can successfully build Halide from source. My last build was as of Feb 9 (https://github.com/halide/Halide/commits/d02247b3021549fde4bec8e600dced90f5d9a87c).
I tried to use the install submodule script at https://github.com/Tiramisu-Compiler/tiramisu/blob/2ee529439fbfccf82f7351ee2e3c01f10387af26/utils/scripts/install_submodules.sh
It fails for LLVM after reaching 100%.
[100%] Built target libclang
[100%] Built target c-arcmt-test
[100%] Built target c-index-test
In file included from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/LinkAllCodegenComponents.h:20,
from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/tools/lli/lli.cpp:22:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/SchedulerRegistry.h: In constructor ‘llvm::RegisterScheduler::RegisterScheduler(const char*, const char*, llvm::RegisterScheduler::FunctionPassCtor)’:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/SchedulerRegistry.h:40:52: warning: cast between incompatible function types from ‘llvm::RegisterScheduler::FunctionPassCtor’ {aka ‘llvm::ScheduleDAGSDNodes* (*)(llvm::SelectionDAGISel*, llvm::CodeGenOpt::Level)’} to ‘llvm::MachinePassCtor’ {aka ‘void* (*)()’} [-Wcast-function-type]
: MachinePassRegistryNode(N, D, (MachinePassCtor)C)
^
In file included from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/tools/lli/lli.cpp:30:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/ExecutionEngine/Orc/OrcRemoteTargetClient.h: In member function ‘llvm::Expected<std::vector<char> > llvm::orc::remote::OrcRemoteTargetClient<ChannelT>::readMem(char*, llvm::JITTargetAddress, uint64_t)’:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/ExecutionEngine/Orc/OrcRemoteTargetClient.h:722:26: error: could not convert ‘((llvm::orc::remote::OrcRemoteTargetClient<ChannelT>*)this)->callB<llvm::orc::remote::OrcRemoteTargetRPCAPI::ReadMem>(Src, Size)’ from ‘Expected<vector<unsigned char,allocator<unsigned char>>>’ to ‘Expected<vector<char,allocator<char>>>’
return callB<ReadMem>(Src, Size);
The script launches the default compiler, GCC 8 on my machine (note that this works for the current Halide master branch). But GCC 8 brought a lot of new warnings that need to be silenced with:
-Wno-stringop-truncation (/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/Halide/src/Introspection.cpp:36:12: error: ‘char* strncpy(char*, const char*, size_t)’ specified bound 2048 equals destination size [-Werror=stringop-truncation])
-Wno-catch-value (some polymorphic exceptions caught by value)
-Wno-format (/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/Halide/test/correctness/extern_producer.cpp:47:12: error: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Werror=format=])
-fpermissive
It would be good to force the usage of Clang 5.0 if that is the only supported compiler.
This part of the Halide Makefile deletes the objects that are supposed to be put in the static lib, and the make command fails with "file not found": https://github.com/jrayzero/Halide/blob/0e2cac7a2e3982d1a51d82932b185f75af05f4c2/Makefile#L670-L697
$(BUILD_DIR)/llvm_objects/list: $(OBJECTS) $(INITIAL_MODULES)
# Determine the relevant object files from llvm with a dummy
# compilation. Passing -t to the linker gets it to list which
# object files in which archives it uses to resolve
# symbols. We only care about the libLLVM ones.
@mkdir -p $(@D)
$(CXX) -o /dev/null -shared $(OBJECTS) $(INITIAL_MODULES) -Wl,-t $(LLVM_STATIC_LIBS) $(COMMON_LD_FLAGS) | egrep "libLLVM" > $(BUILD_DIR)/llvm_objects/list.new
# if the list has changed since the previous build, or there
# is no list from a previous build, then delete any old object
# files and re-extract the required object files
cd $(BUILD_DIR)/llvm_objects; \
if cmp -s list.new list; \
then \
echo "No changes in LLVM deps"; \
touch list; \
else \
rm -f llvm_*.o*; \
cat list.new | sed = | sed "N;s/[()]/ /g;s/\n /\n/;s/\([0-9]*\)\n\([^ ]*\) \([^ ]*\)/ar x \2 \3; mv \3 llvm_\1_\3/" | bash -; \
mv list.new list; \
fi
$(LIB_DIR)/libHalide.a: $(OBJECTS) $(INITIAL_MODULES) $(BUILD_DIR)/llvm_objects/list
# Archive together all the halide and llvm object files
@mkdir -p $(@D)
@rm -f $(LIB_DIR)/libHalide.a
# ar breaks on MinGW with all objects at the same time.
echo $(OBJECTS) $(INITIAL_MODULES) $(BUILD_DIR)/llvm_objects/llvm_*.o* | xargs -n200 ar q $(LIB_DIR)/libHalide.a
ranlib $(LIB_DIR)/libHalide.a
When building Halide, it doesn't search for the LLVM headers in the proper place, and I had to explicitly add 3rdParty/llvm/include to the include path.
There is a CUDNN_LOCATION variable here:
Line 19 in 2ee5294
Looking at this code:
Lines 41 to 42 in 11395ca
either CUDNN_LOCATION should be split into LIB/INCLUDE variables similar to ISL and Halide:
Lines 40 to 46 in 2ee5294
or the CUDNN_LOCATION description ("Change with the cudnn library location") should be made clearer: it is the parent directory of lib64/libcudnn.so and include/cudnn.h, since people will naively set it to /usr/lib or /usr/lib64 instead of /usr.
Add an error when the thread block size is bigger than 1024. Currently, the compiler fails silently, which makes it hard to figure out what's wrong.
Declaring a constant using the following constructor:
tiramisu::constant N("N", tiramisu::expr((int32_t) size), p_int32);
makes the constant non-function-wide, while the following constructor:
tiramisu::constant N_CONST("N", tiramisu::expr((int32_t) size));
declares the constant to be function-wide. The two behaviors are not coherent. Either they should be made consistent, or a different class should be created for non-function-wide constants.
The current distributed code generator is not fully documented.
Examples of undocumented classes:
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1communicator.html
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1send__recv.html
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1sync.html
https://tiramisu-compiler.github.io/doc/structtiramisu_1_1xfer.html
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1wait.html
The name of the function is defined in init:
tiramisu::init("sgemm");
..
tiramisu::codegen({&arg1, &arg2}, "build/generated_fct_sgemm.o");
but it would be more convenient if we provided the name in codegen along with the file name:
tiramisu::init();
..
tiramisu::codegen("sgemm", {&arg1, &arg2}, "build/generated_fct_sgemm.o");
This also seems to be the Halide convention (with the opposite argument order):
Res.compile_to_object("build/generated_fct_sgemm.o", {Img}, "sgemm");
The words "virtual machine disk image" link to http://groups.csail.mit.edu/commit/software/TiramisuVM.zip, which 404s.
I searched the Internet but cannot find a comparison.
AFAIK, they are both tensor compilers.
Unrolling does not work when combined with tiling and vectorization. In this example, it works when combined with one of them, but not with both.
Error message: "If Then Else is unsupported in bound extraction."
For some reason, the recfilter benchmark segfaults on Linux. It was fine on Mac. The segfault seems to happen after the benchmark is run. Commenting out the save_image call seems to make the issue disappear, although I am not sure it is the direct cause of the segfault.
Here is the error message:
Kernel : Tiramisu ; Halide ;
recfilter : 13.795732 ; 12.503011 ;
/bin/sh: line 1: 70723 Segmentation fault (core dumped) LD_LIBRARY_PATH=:/usr/local/Halide/lib:/usr/local/isl/installed/lib:Halide/lib:/usr/local/lib:/usr/local/tiramisu/build/ DYLD_LIBRARY_PATH=:Halide/lib:/usr/local/tiramisu/build/ ${tt}
There is no store_in equivalent of:
A.set_access("[offset]->{A[i]->buf_A[i + offset]}");
where offset is a tiramisu::constant. The following statement:
A.store_in(buf_A, {i + offset})
generates the ISL map {A[i]->buf_A[i + offset]}, which gives an error since offset is not defined. We need a way to infer or pass the initial part ([offset]->) of the map.
There is a scheduling issue in the new API that causes all computations to be placed in the innermost loop nest. Check the generated code of tutorial_04A.
A Conv or Matmul can easily be fused with elementwise operations. I want to know whether two Convs, or a Conv and a Matmul, can be fused. Fusion reduces the memory footprint, so it would benefit some tensor accelerators (e.g. TVA) a lot if cascaded Convs could be fused. If this kind of fusion is available, can anyone provide an example?
gpu_tile is missing overloads that let you specify the inner and outer iterator variables, like:
gpu_tile(i, j, bsize, bsize, i0, j0, i1, j1);
as in the CPU tile.
The method implemented in tiramisu/src/tiramisu_auto_scheduler.cpp:26 is missing a return value of type 'computation_graph'.
Unrolling is not implemented in the CUDA AST. The loop is split, but the "#pragma unroll" directive is not added.
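What the emitted CUDA source should roughly contain once unrolling is honored (a sketch with illustrative identifiers, not actual generator output):
__global__ void sum4(const float *in, float *out) {
    float acc = 0.0f;
    // After splitting, the inner loop has a fixed trip count of 4,
    // so the unroll directive can be attached to it.
    #pragma unroll
    for (int c9 = 0; c9 < 4; ++c9)
        acc += in[threadIdx.x * 4 + c9];
    out[threadIdx.x] = acc;
}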
Thanks for the great work!
I read the paper "TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning" from the LearningSys workshop and found the following statement:
"TIRAMISU is integrated in high level deep learning frameworks such as Pytorch and therefore can be used transparently by end-users"
Are you going to open-source this part? I checked the official PyTorch repo but found nothing. Also, there is a repo under Tiramisu-Compiler named tiramisu-pytorch, but that repo is empty.
Thanks a lot!
We should add a new argument type for buffers that are used as both input and output (a sketch of the proposal follows the referenced lines below). The documentation for the buffer constructor also needs to be updated, and the code generator should be modified as well.
In the CPU code generator that uses Halide there is no difference between input, output, and input/output buffers; they are all treated the same way. But for generating GPU code we want to be able to differentiate between them, so that we can decide which buffers to copy in to the GPU and which buffers to copy out from the GPU.
tiramisu/include/tiramisu/core.h
Lines 1034 to 1036 in f355e7f
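A sketch of what the proposal could look like; a_input, a_output, and a_temporary are the existing Tiramisu argument types, and a_input_output is the hypothetical addition:
// Illustrative sketch: the existing Tiramisu argument types plus the proposed one.
enum argument_t {
    a_input,          // existing: read by the generated function
    a_output,         // existing: written by the generated function
    a_temporary,      // existing: allocated and freed inside the function
    a_input_output    // proposed: both read and written, so a GPU backend would
                      // copy it to the device before and back after the kernel
};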
I ran the tests and the following failed: 133, 134, 135, 139, 140. I checked them and they are skewing tests; they give an output different from the expected one.
I tried running them individually a few times and they don't always fail, which means that perhaps a few other skewing tests could fail as well.
Currently we use custom scripts; we had better use CMake to unify the building of all benchmarks.
How can we add autodiff like Halide does?
https://people.csail.mit.edu/tzumao/gradient_halide/gradient_halide.pdf
I tried the following example in Tiramisu master (c40b004).
#include <tiramisu/tiramisu.h>
using namespace tiramisu;

int main(int argc, char** argv) {
    tiramisu::init("test_after");
    static const int N = 48;
    static const int M = 48;
    var x("x", 0, N);
    var y("y", 0, M);
    computation f("f", {y, x}, x + y);
    computation g("g", {y, x}, f(y, 48 - 1 - x));
    if (std::getenv("AFTER")) g.after(f, x);
    else g.after(f, computation::root);
    buffer buf_f("buf_f", {M, N}, p_int32, a_temporary);
    buffer buf_g("buf_g", {M, N}, p_int32, a_output);
    f.store_in(&buf_f, {y, x});
    g.store_in(&buf_g, {y, x});
    tiramisu::codegen({&buf_g}, "test_after.o");
    return 0;
}
I expected this to cause an error when AFTER is set, because the schedule g.after(f, x) breaks the dependency between f and g. However, it compiles and returns a broken result.
The Tiramisu paper "Tiramisu: A Code Optimization Framework for High Performance Systems" says that "TIRAMISU does not have this restriction since it checks transformation legality using dependence analysis [18].", but I cannot find a way to check "transformation legality".
How can we check the legality of a transformation?
Your paper mentions a Julia interface. Is it going to be released?
In the following code: https://github.com/abdouskamel/tiramisu/blob/master/benchmarks/DNN/blocks/Resize-Conv/resize_conv_generator_tiramisu.cpp, I tag the loop level fin of the computation resize to be unrolled (line 79). But when I generate the code, it is the loop level fin of the computation init_resized_input that gets unrolled. Here is the generated code:
parallel (c1, 0, 32) {
for (c3, 0, 226) {
for (c5, 0, 226) {
unrolled (c7, 0, 3) { // Loop level that gets unrolled
input_resized_buf[(((c7 + int32((int64(c5)*(int64)3))) + int32((int64(c3)*(int64)678))) + int32((int64(c1)*(int64)153228)))] = 0.000000f
}
}
}
for (c3, 0, 224) {
for (c5, 0, 28) {
vectorized (c7, 0, 8) {
for (c9, 0, 3) { // Loop level that should be unrolled
let t27.s = int32(floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f)))
let t26.s = int32(floor_f32(((float32(c3)*1.785714f) + 0.392857f)))
let t25.s = t27.s
let t24 = t26.s
let t23 = t25.s
let t22.s = t24
let t21 = t25.s
let t20 = t24
input_resized_buf[(((c9 + int32(((int64(((c5*8) + c7))*(int64)3) + (int64)3))) + int32(((int64(c3)*(int64)678) + (int64)678))) + int32((int64(c1)*(int64)153228)))] = ((((_c_input_b0[(((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000))]*((floor_f32(((float32(c3)*1.785714f) + 0.392857f)) - (float32(c3)*1.785714f)) + 0.607143f)) + (_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 1800)]*(((float32(c3)*1.785714f) - floor_f32(((float32(c3)*1.785714f) + 0.392857f))) + 0.392857f)))*((floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f)) - (float32(((c5*8) + c7))*2.678571f)) + 0.160714f)) + (((_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 3)]*((floor_f32(((float32(c3)*1.785714f) + 0.392857f)) - (float32(c3)*1.785714f)) + 0.607143f)) + (_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 1803)]*(((float32(c3)*1.785714f) - floor_f32(((float32(c3)*1.785714f) + 0.392857f))) + 0.392857f)))*(((float32(((c5*8) + c7))*2.678571f) - floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f))) + 0.839286f)))
}
}
}
}
for (c3, 0, 4) {
for (c5, 0, 224) {
for (c7, 0, 224) {
for (c9, 0, 8) {
output_buf[((((c9 + int32((int64(c7)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] = _conv_bias_b2[(c9 + (c3*8))]
}
}
for (c7, 0, 3) {
for (c9, 0, 3) {
for (c11, 0, 224) {
for (c13, 0, 3) {
vectorized (c15, 0, 8) {
output_buf[((((c15 + int32((int64(c11)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] = (output_buf[((((c15 + int32((int64(c11)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] + (input_resized_buf[(((c13 + int32((int64((c9 + c11))*(int64)3))) + int32((int64((c5 + c7))*(int64)678))) + int32((int64(c1)*(int64)153228)))]*_conv_filter_b1[((((c15 + (c13*8)) + (c9*24)) + (c7*72)) + (c3*216))]))
}
}
}
}
}
}
}
}
I don't know if it's a bug or if there's an error in my code.
Hello,
Actually, this is not an issue but a question about Tiramisu. I don't know if this is the right place for asking questions, so I'm sorry if it's inconvenient.
I want to implement a computation that looks like this with Tiramisu:
for (i = 0; i < N; ++i)
    for (j = 0; j < i; ++j)
        S0
        S1
I have done it following the tutorials, but now I want to apply vectorization to the loop j with vector length v.
The problem is that, as stated in the documentation, the vectorization of a loop that has fewer than v iterations is not correct.
Thus, for the first iterations where i < v, the vectorization will be incorrect because the loop j iterates fewer than v times. I think this can be fixed by splitting the outer loop like this:
for (i = 0; i < v; ++i)
    for (j = 0; j < i; ++j)
        S0
        S1
for (i = v; i < N; ++i)
    for (j = 0; j < i; ++j)
        S0
        S1
and applying vectorization only to the second loop nest. This could be done easily in the declaration of the algorithm (layer 1), but I think I must resolve it in layer 2, because it is an issue related to the optimization of the algorithm.
I searched the documentation but didn't find a way to do that, so I am asking for your help.
Thank you.
When compiling Tiramisu, the CUDA backend emits warnings:
make tiramisu
....
....
....
/data/scratch/baghdadi/tiramisu/src/tiramisu_codegen_cuda.cpp: In member function ‘bool tiramisu::cuda_ast::compiler::compile(const string&) const’:
/data/scratch/baghdadi/tiramisu/src/tiramisu_codegen_cuda.cpp:1757:25: warning: ignoring return value of ‘char* getcwd(char*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
getcwd(cwd, 500);
^
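A minimal sketch of silencing the warning by actually checking the return value (the helper name and error handling below are illustrative):
#include <unistd.h>
#include <cstdio>

// Illustrative: getcwd's result must be checked because of warn_unused_result.
static bool get_current_dir(char *cwd, size_t size)
{
    if (getcwd(cwd, size) == nullptr) {
        std::perror("getcwd");
        return false;
    }
    return true;
}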
I would like to point out that an identifier like "_H_TIRAMISU_CORE_" does not conform to the naming rules of the C++ standard: identifiers that begin with an underscore followed by an uppercase letter are reserved for the implementation.
Would you like to adjust your choice of unique names?
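A conforming include guard would avoid the reserved prefix; the guard name below is just an illustration:
// Illustrative replacement for the reserved "_H_TIRAMISU_CORE_" guard.
#ifndef TIRAMISU_CORE_H
#define TIRAMISU_CORE_H

// ... declarations ...

#endif  // TIRAMISU_CORE_H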
The skewing operation doesn't seem to have any effect. Say you have a nested loop:
for i in [0, 4):
    for j in [0, 4):
        A[i, j] = x
This traverses the loop range as in:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
After the skewing operation A.skew(i, j, 1), one would expect diagonal traversal:
0 1 3 6
2 4 7 10
5 8 11 13
9 12 14 15
But the resulting Tiramisu program still has the previous traversal. One can see this in the debug output as well:
Generated Halide stmt:
produce {
for (c1, 0, 4) {
for (c3, c1, 4) {
A[((c3 - c1) + (c1*4))] = x
}
}
}
Skew only shifts the inner loop by c1 and subtracts c1 back in the access, so the resulting program is equivalent to the one without skewing.
Vectorizing inner loops of distributed code removes the distributed tag.