tiramisu-compiler / tiramisu
A polyhedral compiler for expressing fast and portable data parallel algorithms
Home Page: http://tiramisu-compiler.org
License: MIT License
Tiramisu depends on LLVM/Clang 5 because its Halide branch is outdated.
The same applies to ISL, which should be updated to 0.20.
Any idea when this update would happen?
If ENABLE_DEBUG is turned OFF and make tests is run, the following is printed:
100010001000 + + + Lowering after final simplification
This should not be printed.
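A minimal sketch of the kind of guard that would keep this quiet when the flag is off; the function name and flag handling below are illustrative, not the actual Tiramisu internals:
#include <iostream>
#include <string>

// Illustrative only: dump lowering stages only when debugging is enabled.
void print_lowering_stage(const std::string &msg)
{
#ifdef ENABLE_DEBUG
    std::cout << msg << std::endl;
#else
    (void) msg;  // stay silent when ENABLE_DEBUG is OFF
#endif
}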
When there is an ISL syntax error we get something like "syntax error (1, 53): unknown identifier",
and there is no clue which ISL statement is causing the error. We should print the erroneous ISL statement or the line number.
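A minimal sketch of one way to do this, assuming the maps are parsed from strings with the standard ISL reader (the wrapper name below is illustrative, not existing Tiramisu code):
#include <iostream>
#include <string>
#include <isl/ctx.h>
#include <isl/map.h>

// Illustrative wrapper: report the offending ISL string when parsing fails.
isl_map *read_map_checked(isl_ctx *ctx, const std::string &str)
{
    isl_map *m = isl_map_read_from_str(ctx, str.c_str());
    if (m == nullptr)
        std::cerr << "ISL syntax error while parsing: " << str << std::endl;
    return m;
}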
I see Halide has an autodiff feature based on this paper: https://people.csail.mit.edu/tzumao/gradient_halide/gradient_halide.pdf
Do you plan to add autodiff to Tiramisu?
Halide buffer indices are reversed compared to the C convention, and thus to Tiramisu. For example, if we define a buffer buf(10, 20), it is passed to Tiramisu as an array of shape buf[20][10]. Consequently, the Halide statement buf(x, y) = z sets buf[y][x] instead of buf[x][y]. This creates unexpected problems.
The issue does not come up in the tutorials because either the outputs are constant, so the index flip does not matter, or we don't check the outputs for validity at all.
I will fix the tutorials and use one tutorial to explain this behavior.
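A small self-contained sketch of the flip, assuming Halide's convention that the first buffer dimension is the innermost (fastest-varying) one:
#include "Halide.h"

int main() {
    // Extents (10, 20): in Halide the first extent (10) is the innermost dimension.
    Halide::Buffer<float> buf(10, 20);
    // buf(x, y) touches the element that row-major C code, and hence Tiramisu,
    // addresses as buf_c[y][x] for an array declared float buf_c[20][10].
    buf(3, 5) = 1.0f;  // same memory location as buf_c[5][3]
    return 0;
}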
The __FILE__ and __LINE__ macros point to this file (tiramisu_debug.cpp:41) instead of where the function is called. We should implement it as a macro instead of a function; a sketch follows after the lines referenced below.
tiramisu/src/tiramisu_debug.cpp
Lines 38 to 47 in fed7a55
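A minimal sketch of the macro-based approach with illustrative names (not the actual Tiramisu error API); the macro makes __FILE__ and __LINE__ expand at the call site:
#include <cstdlib>
#include <iostream>
#include <string>

// Illustrative implementation function: receives the caller's location explicitly.
inline void tiramisu_error_impl(const std::string &msg, bool fatal,
                                const char *file, int line)
{
    std::cerr << file << ":" << line << " Error: " << msg << std::endl;
    if (fatal)
        std::exit(1);
}

// The macro expands at the call site, so __FILE__/__LINE__ report the caller.
#define TIRAMISU_ERROR(msg, fatal) \
    tiramisu_error_impl((msg), (fatal), __FILE__, __LINE__)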
I would be very interested in an RNN implementation example using Tiramisu.
Unfortunately, due to #217 I cannot explore that myself at the moment.
Also, on the tiramisu.github.io website you claim that Halide cannot represent RNNs, but time is just another loop/tensor dimension. Case in point, this seems to be an actual implementation of an LSTM in Halide: https://github.com/misaka-10032/Halstm/blob/master/src/layer/lstm.cpp.
One thing I would be very interested in is the wavefront optimisation on stacked RNNs, as mentioned in optimization 3 of Nvidia's blog post.
They even provide the CUDA source code, which can serve as a reference benchmark.
In
Shouldn't A(i,j) be A(i,k) instead?
Hello Tiramisu developers,
I am a software developer interested in the Tiramisu project.
I am currently reading the Tiramisu code and find the usage of ISL for the polyhedral analysis complex.
I wonder how you developers mastered ISL so fluently; as far as I could find on the Internet, there are only a few documents about ISL, with very few demos and examples of the API.
After reading that documentation, I still found it hard to understand the ISL usage in the Tiramisu project.
Can you suggest some materials for mastering the ISL APIs?
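For context, a tiny standalone example of the string-based ISL pattern that Tiramisu relies on throughout (an illustration, not code from the Tiramisu repository):
#include <isl/ctx.h>
#include <isl/set.h>
#include <isl/map.h>

int main() {
    isl_ctx *ctx = isl_ctx_alloc();
    // Build an iteration domain and a schedule from their textual representations.
    isl_set *domain   = isl_set_read_from_str(ctx, "{ S[i] : 0 <= i < 10 }");
    isl_map *schedule = isl_map_read_from_str(ctx, "{ S[i] -> [i] }");
    // Apply the schedule to the domain (both arguments are consumed by ISL).
    isl_set *times = isl_set_apply(domain, schedule);
    isl_set_dump(times);  // prints the scheduled range: { [i] : 0 <= i <= 9 }
    isl_set_free(times);
    isl_ctx_free(ctx);
    return 0;
}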
When Travis tests are enabled in:
Line 22 in 22c6c09
src/tiramisu_core.cpp has over 8800 lines, which makes it very difficult to navigate. It would be convenient to split it into per-class definition files such as tiramisu_computation.cpp, tiramisu_buffer.cpp, etc. This might be difficult to do in one shot since the system has many interconnected parts, so we can start moving things around function by function.
As your paper says, FPGAs are supported by now, right?
I am quite new to the polyhedral model and may still be unfamiliar with related concepts, so please point out if I made any mistakes.
I would like to know whether there are any methods for parallelizing or loop tiling that automatically resolve data dependences. To be more specific, consider the following one-dimensional stencil computation:
for (t = 1; t < T; t += 1)
    for (i = 1; i < N - 1; i += 1)
        A[t][i] = 0.25 * (A[t - 1][i + 1] - 2.0 * A[t - 1][i] + A[t - 1][i - 1]);
Since computing A[t][i] needs to read A[t - 1][i + 1], the statement instance (t, i) has to be executed after the statement instance (t - 1, i + 1). So the two-level loop cannot simply be tiled, otherwise the data dependence will be violated.
However, the Computation::tile function in Tiramisu does not seem to make any effort to resolve the data dependence:
#include <tiramisu/tiramisu.h>
using namespace tiramisu;

int main() {
    tiramisu::init("stencil");
    const int SIZE_T = 200, SIZE_N = 100;
    constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));
    var t("t", 1, T), i("i", 1, N - 1);
    computation A("A", {t, i}, p_float32);
    A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
    // var t0("t0"), i0("i0"), t1("t1"), i1("i1");
    // A.tile(t, i, 32, 32, t0, i0, t1, i1);
    buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_input);
    A.store_in(&b_A);
    tiramisu::codegen({&b_A}, "stencil.o");
}
Uncomment the two lines related to loop tiling, and the output Halide IR changes from:
Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce {
allocate _A_b0[float32 * 98 * 199]
for (c1, 1, 199) {
for (c3, 1, 98) {
b_A[(c3 + int32((int64(c1)*(int64)100)))] = (((b_A[(int32((int64(c3) + (int64)1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))] - (b_A[(c3 + int32(((int64(c1)*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(c3) + (int64)-1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))])*0.250000f)
}
}
}
to:
Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce {
allocate _A_b0[float32 * 98 * 199]
for (c1, 0, 7) {
for (c3, 0, 4) {
for (c5, (1 - min((c1*32), 1)), ((min((c1*32), 1) - max((c1*32), 168)) + 199)) {
for (c7, (1 - min((c3*32), 1)), ((min((c3*32), 1) - max((c3*32), 67)) + 98)) {
b_A[(((c3*32) + c7) + int32((int64(((c1*32) + c5))*(int64)100)))] = (((b_A[(int32((int64(((c3*32) + c7)) + (int64)1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))] - (b_A[(((c3*32) + c7) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(((c3*32) + c7)) + (int64)-1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))])*0.250000f)
}
}
}
}
}
where the data dependence is violated, and the test fails as well.
I have heard that the Pluto algorithm can be applied in such a scenario; it automatically skews the iteration domain to resolve the data dependence:
for (t = 1; t < T; t += 1)
    for (i = 1 + t; i < N - 1 + t; i += 1)
        A[t][i - t] = 0.25 * (A[t - 1][i - t + 1] - 2.0 * A[t - 1][i - t] + A[t - 1][i - t - 1]);
and the loop can be safely tiled. It is also possible to skew the iteration domain and perform loop tiling in Tiramisu:
#include <tiramisu/tiramisu.h>
using namespace tiramisu;

int main() {
    tiramisu::init("stencil");
    const int SIZE_T = 200, SIZE_N = 100;
    constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));
    var t("t", 1, T), i("i", 1, N - 1);
    computation A("A", {t, i}, p_float32);
    A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
    var nt("nt"), ni("ni");
    A.skew(t, i, 1, nt, ni);
    var t0("t0"), i0("i0"), t1("t1"), i1("i1");
    A.tile(nt, ni, 32, 32, t0, i0, t1, i1);
    buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_output);
    A.store_in(&b_A);
    tiramisu::codegen({&b_A}, "stencil.o");
}
This passes the test. But it requires my own observation of the loop pattern to make such a transformation. Moreover, if I want to parallelize the loop, it requires not only skewing but also synchronization and communication between parallel computation units. This seems complicated to me, but it can theoretically be automated through the Pluto algorithm. This is why I would like to know: are there any methods in Tiramisu for parallelizing or loop tiling that automatically resolve data dependences?
tiramisu/src/tiramisu_codegen_cuda.cpp
Lines 772 to 774 in 0051562
Host code generation for CUDA has a subtle bug. The wrapper copies the kernel's arguments as its own arguments (kernel->get_arguments() above). However, if the kernel drops an unused argument while the wrapper still needs it for the block dimensions, the generated code becomes faulty:
int32_t _kernel_0_wrapper(int32_t K, int32_t N) {
    dim3 blocks(((M / 96) + 1), ((N / 256) + 1), 1);
    dim3 threads((15 + 1), (15 + 1), 1);
    _kernel_0<<<blocks, threads>>>(K, N);
    return 0;
}
M is undefined above and should have been passed to the wrapper as an argument (a corrected wrapper is sketched below). I'll work on this ASAP since it blocks the GEMM benchmark.
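What the generated wrapper should roughly look like once the dropped argument is kept, assuming M is only needed for the block dimensions (a sketch, not the actual generator output):
int32_t _kernel_0_wrapper(int32_t M, int32_t K, int32_t N) {
    // M is kept as a wrapper argument even though the kernel itself dropped it,
    // because the block dimensions still depend on it.
    dim3 blocks(((M / 96) + 1), ((N / 256) + 1), 1);
    dim3 threads((15 + 1), (15 + 1), 1);
    _kernel_0<<<blocks, threads>>>(K, N);
    return 0;
}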
The CUDA code generator classes are not documented.
Examples of undocumented classes:
https://tiramisu-compiler.github.io/doc/namespacetiramisu_1_1cuda__ast.html
https://tiramisu-compiler.github.io/doc/structtiramisu_1_1_halide_codegen_output.html
https://tiramisu-compiler.github.io/doc/structtiramisu_1_1isl__ast__expr__deleter.html
The following tests are significantly slower with Tiramisu than with Halide:
Other failures:
This is on a MacBook Pro 2018, macOS Mojave, with the latest Xcode and Homebrew up to date.
I downloaded the virtual machine image from http://groups.csail.mit.edu/commit/software/TiramisuVM.zip to try it out, and when I unzip it I get the following:
bash-3.2$ unzip TiramisuVM.zip
Archive: TiramisuVM.zip
warning [TiramisuVM.zip]: 4294967296 extra bytes at beginning or within zipfile
(attempting to process anyway)
file #1: bad zipfile offset (local header sig): 4294967296
(attempting to re-compensate)
creating: TiramisuVM/
inflating: TiramisuVM/.DS_Store
creating: __MACOSX/
creating: __MACOSX/TiramisuVM/
inflating: __MACOSX/TiramisuVM/._.DS_Store
creating: TiramisuVM/Logs/
inflating: TiramisuVM/Logs/VBox.log
inflating: TiramisuVM/TiramisuVM.vbox
inflating: TiramisuVM/TiramisuVM.vbox-prev
inflating: TiramisuVM/TiramisuVM.vdi
error: invalid compressed data to inflate
Could you please provide a working VM image?
We need to add GPU tests to Travis CI.
The class 'computation' in the file tiramisu/include/tiramisu/core.h is missing a virtual destructor.
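A minimal sketch of the fix, assuming computation objects may be deleted through base-class pointers (the rest of the class is elided):
namespace tiramisu {

class computation {
public:
    // ... existing interface elided ...

    // Virtual destructor so that derived computations are destroyed correctly
    // when deleted through a computation*.
    virtual ~computation() = default;
};

}  // namespace tiramisu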
Copying C from host to GPU and copying A and B from GPU to host are not really needed.
The CI build is failing. It seems to have something to do with Halide's ABI not matching LLVM's; I'm not sure how to fix that.
This seems to be an error due to MPI, maybe because this benchmark is not disabled when WITH_MPI is not set.
Assuming that WITH_MPI=false, if the user runs make benchmarks, we get the following error messages:
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:165:32: error: too few arguments to function call, single argument 'level' was not specified
init_even.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:166:31: error: too few arguments to function call, single argument 'level' was not specified
init_odd.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:167:36: error: too few arguments to function call, single argument 'level' was not specified
out_comp_even.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:168:35: error: too few arguments to function call, single argument 'level' was not specified
out_comp_odd.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:191:12: error: use of undeclared identifier 'TOTAL_ITERATIONS'
assert(TOTAL_ITERATIONS % 2 == 0); // so that the output buffer is the last odd computation
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:261:21: error: 'lift_dist_comps' is a protected member of 'tiramisu::function'
heat2d_tiramisu.lift_dist_comps(); // MUST go before gen_isl_ast
^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:680:10: note: declared protected here
void lift_dist_comps();
^
I tried to follow the instructions on an up-to-date Arch Linux distro with GCC 8.3.0 and Clang/LLVM 8.0.0 but had several issues with both LLVM and Halide, and had to give up trying Tiramisu in the end.
Note that I can successfully build Halide from source. My last build was as of Feb 9 (https://github.com/halide/Halide/commits/d02247b3021549fde4bec8e600dced90f5d9a87c).
I tried to use the install submodule script at https://github.com/Tiramisu-Compiler/tiramisu/blob/2ee529439fbfccf82f7351ee2e3c01f10387af26/utils/scripts/install_submodules.sh
It fails for LLVM after reaching 100%.
[100%] Built target libclang
[100%] Built target c-arcmt-test
[100%] Built target c-index-test
In file included from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/LinkAllCodegenComponents.h:20,
from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/tools/lli/lli.cpp:22:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/SchedulerRegistry.h: In constructor ‘llvm::RegisterScheduler::RegisterScheduler(const char*, const char*, llvm::RegisterScheduler::FunctionPassCtor)’:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/SchedulerRegistry.h:40:52: warning: cast between incompatible function types from ‘llvm::RegisterScheduler::FunctionPassCtor’ {aka ‘llvm::ScheduleDAGSDNodes* (*)(llvm::SelectionDAGISel*, llvm::CodeGenOpt::Level)’} to ‘llvm::MachinePassCtor’ {aka ‘void* (*)()’} [-Wcast-function-type]
: MachinePassRegistryNode(N, D, (MachinePassCtor)C)
^
In file included from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/tools/lli/lli.cpp:30:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/ExecutionEngine/Orc/OrcRemoteTargetClient.h: In member function ‘llvm::Expected<std::vector<char> > llvm::orc::remote::OrcRemoteTargetClient<ChannelT>::readMem(char*, llvm::JITTargetAddress, uint64_t)’:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/ExecutionEngine/Orc/OrcRemoteTargetClient.h:722:26: error: could not convert ‘((llvm::orc::remote::OrcRemoteTargetClient<ChannelT>*)this)->callB<llvm::orc::remote::OrcRemoteTargetRPCAPI::ReadMem>(Src, Size)’ from ‘Expected<vector<unsigned char,allocator<unsigned char>>>’ to ‘Expected<vector<char,allocator<char>>>’
return callB<ReadMem>(Src, Size);
The script launches the default compiler, GCC 8 on my machine (note that this works for the current Halide master branch). But GCC 8 brought a lot of new warnings that need to be silenced with:
-Wno-stringop-truncation (/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/Halide/src/Introspection.cpp:36:12: error: ‘char* strncpy(char*, const char*, size_t)’ specified bound 2048 equals destination size [-Werror=stringop-truncation])
-Wno-catch-value (some polymorphic exceptions caught by value)
-Wno-format (/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/Halide/test/correctness/extern_producer.cpp:47:12: error: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Werror=format=])
-fpermissive
It would be good to force the usage of Clang 5.0 if that is the only supported compiler.
This part of the Halide Makefile deletes the objects that are supposed to be put in the static lib, and the make command fails with "file not found": https://github.com/jrayzero/Halide/blob/0e2cac7a2e3982d1a51d82932b185f75af05f4c2/Makefile#L670-L697
$(BUILD_DIR)/llvm_objects/list: $(OBJECTS) $(INITIAL_MODULES)
# Determine the relevant object files from llvm with a dummy
# compilation. Passing -t to the linker gets it to list which
# object files in which archives it uses to resolve
# symbols. We only care about the libLLVM ones.
@mkdir -p $(@D)
$(CXX) -o /dev/null -shared $(OBJECTS) $(INITIAL_MODULES) -Wl,-t $(LLVM_STATIC_LIBS) $(COMMON_LD_FLAGS) | egrep "libLLVM" > $(BUILD_DIR)/llvm_objects/list.new
# if the list has changed since the previous build, or there
# is no list from a previous build, then delete any old object
# files and re-extract the required object files
cd $(BUILD_DIR)/llvm_objects; \
if cmp -s list.new list; \
then \
echo "No changes in LLVM deps"; \
touch list; \
else \
rm -f llvm_*.o*; \
cat list.new | sed = | sed "N;s/[()]/ /g;s/\n /\n/;s/\([0-9]*\)\n\([^ ]*\) \([^ ]*\)/ar x \2 \3; mv \3 llvm_\1_\3/" | bash -; \
mv list.new list; \
fi
$(LIB_DIR)/libHalide.a: $(OBJECTS) $(INITIAL_MODULES) $(BUILD_DIR)/llvm_objects/list
# Archive together all the halide and llvm object files
@mkdir -p $(@D)
@rm -f $(LIB_DIR)/libHalide.a
# ar breaks on MinGW with all objects at the same time.
echo $(OBJECTS) $(INITIAL_MODULES) $(BUILD_DIR)/llvm_objects/llvm_*.o* | xargs -n200 ar q $(LIB_DIR)/libHalide.a
ranlib $(LIB_DIR)/libHalide.a
When building Halide, it doesn't search for the LLVM headers in the proper place, and I had to explicitly add 3rdParty/llvm/include to the include path.
There is a CUDNN_LOCATION variable here:
Line 19 in 2ee5294
Looking at this code:
Lines 41 to 42 in 11395ca
either CUDNN_LOCATION should be split into LIB/INCLUDE variables similar to ISL and Halide:
Lines 40 to 46 in 2ee5294
or the CUDNN_LOCATION description ("Change with the cudnn library location") should be made clearer: it is the parent directory of lib64/libcudnn.so and include/cudnn.h, since people will naively set it to /usr/lib or /usr/lib64 instead of /usr.
Add an error when the thread block size is bigger than 1024. Currently, the compiler fails silently, which makes it hard to figure out what's wrong.
Declaring a constant using the following constructor:
tiramisu::constant N("N", tiramisu::expr((int32_t) size), p_int32);
makes the constant non-function-wide, while the following constructor:
tiramisu::constant N_CONST("N", tiramisu::expr((int32_t) size));
declares the constant to be function-wide. The two behaviors are not coherent. Either they should be made consistent, or a different class should be created for non-function-wide constants.
The current distributed code generator is not fully documented.
Examples of undocumented classes:
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1communicator.html
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1send__recv.html
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1sync.html
https://tiramisu-compiler.github.io/doc/structtiramisu_1_1xfer.html
https://tiramisu-compiler.github.io/doc/classtiramisu_1_1wait.html
The name of the function is defined in init:
tiramisu::init("sgemm");
..
tiramisu::codegen({&arg1, &arg2}, "build/generated_fct_sgemm.o");
but it would be more convenient if we provided the name in codegen along with the file name:
tiramisu::init();
..
tiramisu::codegen("sgemm", {&arg1, &arg2}, "build/generated_fct_sgemm.o");
This also seems to be the Halide convention (with the opposite argument order):
Res.compile_to_object("build/generated_fct_sgemm.o", {Img}, "sgemm");
The words "virtual machine disk image" link to http://groups.csail.mit.edu/commit/software/TiramisuVM.zip, which 404s.
I searched the Internet but cannot find a comparison.
AFAIK, they are both tensor compilers.
Unrolling does not work when combined with tiling and vectorization. In this example, it works when combined with one of them, but not with both.
Error message: "If Then Else is unsupported in bound extraction."
For some reason, the recfilter benchmark segfaults on Linux. It was fine on Mac. The segfault seems to happen after the benchmark is run. Commenting out the save_image call seems to make the issue disappear, although I am not sure it is the direct cause of the segfault.
Here is the error message:
Kernel : Tiramisu ; Halide ;
recfilter : 13.795732 ; 12.503011 ;
/bin/sh: line 1: 70723 Segmentation fault (core dumped) LD_LIBRARY_PATH=:/usr/local/Halide/lib:/usr/local/isl/installed/lib:Halide/lib:/usr/local/lib:/usr/local/tiramisu/build/ DYLD_LIBRARY_PATH=:Halide/lib:/usr/local/tiramisu/build/ ${tt}
There is no store_in equivalent of:
A.set_access("[offset]->{A[i]->buf_A[i + offset]}");
where offset is a tiramisu::constant. The following statement:
A.store_in(buf_A, {i + offset})
generates the ISL map {A[i]->buf_A[i + offset]}, which gives an error since offset is not defined. We need a way to infer or pass the initial part ([offset]->) of the map.
There is a scheduling issue in the new API that causes all computations to be placed in the innermost loop nest. Check the generated code of tutorial_04A.
A Conv or Matmul can easily be fused with elementwise operations. I want to know whether two Convs, or a Conv and a Matmul, can be fused. Fusion reduces the memory footprint, so it would benefit some tensor accelerators (e.g. TVA) a lot if cascaded Convs could be fused. If this kind of fusion is available, can anyone provide an example?
gpu_tile is missing overloads that let you specify the inner and outer iterator variables, like:
gpu_tile(i, j, bsize, bsize, i0, j0, i1, j1);
as in the CPU tile.
The method implemented in tiramisu/src/tiramisu_auto_scheduler.cpp:26 is missing a return value of type 'computation_graph'.
Unrolling is not implemented in the CUDA AST. The loop is split, but the "#pragma unroll" directive is not added.
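What the emitted CUDA source should roughly contain once unrolling is honored (a sketch with illustrative identifiers, not actual generator output):
__global__ void sum4(const float *in, float *out) {
    float acc = 0.0f;
    // After splitting, the inner loop has a fixed trip count of 4,
    // so the unroll directive can be attached to it.
    #pragma unroll
    for (int c9 = 0; c9 < 4; ++c9)
        acc += in[threadIdx.x * 4 + c9];
    out[threadIdx.x] = acc;
}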
Thanks for the great work!
I read the paper "TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning" from the LearningSys workshop and found the following statement:
"TIRAMISU is integrated in high level deep learning frameworks such as Pytorch and therefore can be used transparently by end-users"
Are you going to open-source this part? I checked the official PyTorch repo but found nothing. Also, there is a repo under Tiramisu-Compiler named tiramisu-pytorch, but that repo is empty.
Thanks a lot!
We should add a new argument type for buffers that are used as both input and output (a sketch of the proposal follows the referenced lines below). The documentation for the buffer constructor also needs to be updated, and the code generator should be modified as well.
In the CPU code generator that uses Halide there is no difference between input, output, and input/output buffers; they are all treated the same way. But for generating GPU code we want to be able to differentiate between them, so that we can decide which buffers to copy in to the GPU and which buffers to copy out from the GPU.
tiramisu/include/tiramisu/core.h
Lines 1034 to 1036 in f355e7f
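A sketch of what the proposal could look like; a_input, a_output, and a_temporary are the existing Tiramisu argument types, and a_input_output is the hypothetical addition:
// Illustrative sketch: the existing Tiramisu argument types plus the proposed one.
enum argument_t {
    a_input,          // existing: read by the generated function
    a_output,         // existing: written by the generated function
    a_temporary,      // existing: allocated and freed inside the function
    a_input_output    // proposed: both read and written, so a GPU backend would
                      // copy it to the device before and back after the kernel
};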
I ran the tests and the following failed: 133, 134, 135, 139, 140. I checked them and they are skewing tests; they give an output different from the expected one.
I tried running them individually a few times and they don't always fail, which means that perhaps a few other skewing tests could fail as well.
Currently we use custom scripts; we had better use CMake to unify the building of all benchmarks.
How can we add autodiff like Halide does?
https://people.csail.mit.edu/tzumao/gradient_halide/gradient_halide.pdf
I tried the following example in Tiramisu master (c40b004).
#include <tiramisu/tiramisu.h>
using namespace tiramisu;

int main(int argc, char** argv) {
    tiramisu::init("test_after");
    static const int N = 48;
    static const int M = 48;
    var x("x", 0, N);
    var y("y", 0, M);
    computation f("f", {y, x}, x + y);
    computation g("g", {y, x}, f(y, 48 - 1 - x));
    if (std::getenv("AFTER")) g.after(f, x);
    else g.after(f, computation::root);
    buffer buf_f("buf_f", {M, N}, p_int32, a_temporary);
    buffer buf_g("buf_g", {M, N}, p_int32, a_output);
    f.store_in(&buf_f, {y, x});
    g.store_in(&buf_g, {y, x});
    tiramisu::codegen({&buf_g}, "test_after.o");
    return 0;
}
I expected this to cause an error when AFTER is set, because the schedule g.after(f, x) breaks the dependency between f and g. However, it compiles and returns a broken result.
The Tiramisu paper "Tiramisu: A Code Optimization Framework for High Performance Systems" says that "TIRAMISU does not have this restriction since it checks transformation legality using dependence analysis [18].", but I cannot find a way to check "transformation legality".
How can we check the legality of a transformation?
Your paper mentions a Julia interface. Is it going to be released?
In the following code: https://github.com/abdouskamel/tiramisu/blob/master/benchmarks/DNN/blocks/Resize-Conv/resize_conv_generator_tiramisu.cpp, I tag the loop level fin of the computation resize to be unrolled (line 79). But when I generate the code, it is the loop level fin of the computation init_resized_input that gets unrolled. Here is the generated code:
parallel (c1, 0, 32) {
for (c3, 0, 226) {
for (c5, 0, 226) {
unrolled (c7, 0, 3) { // Loop level that gets unrolled
input_resized_buf[(((c7 + int32((int64(c5)*(int64)3))) + int32((int64(c3)*(int64)678))) + int32((int64(c1)*(int64)153228)))] = 0.000000f
}
}
}
for (c3, 0, 224) {
for (c5, 0, 28) {
vectorized (c7, 0, 8) {
for (c9, 0, 3) { // Loop level that should be unrolled
let t27.s = int32(floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f)))
let t26.s = int32(floor_f32(((float32(c3)*1.785714f) + 0.392857f)))
let t25.s = t27.s
let t24 = t26.s
let t23 = t25.s
let t22.s = t24
let t21 = t25.s
let t20 = t24
input_resized_buf[(((c9 + int32(((int64(((c5*8) + c7))*(int64)3) + (int64)3))) + int32(((int64(c3)*(int64)678) + (int64)678))) + int32((int64(c1)*(int64)153228)))] = ((((_c_input_b0[(((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000))]*((floor_f32(((float32(c3)*1.785714f) + 0.392857f)) - (float32(c3)*1.785714f)) + 0.607143f)) + (_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 1800)]*(((float32(c3)*1.785714f) - floor_f32(((float32(c3)*1.785714f) + 0.392857f))) + 0.392857f)))*((floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f)) - (float32(((c5*8) + c7))*2.678571f)) + 0.160714f)) + (((_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 3)]*((floor_f32(((float32(c3)*1.785714f) + 0.392857f)) - (float32(c3)*1.785714f)) + 0.607143f)) + (_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 1803)]*(((float32(c3)*1.785714f) - floor_f32(((float32(c3)*1.785714f) + 0.392857f))) + 0.392857f)))*(((float32(((c5*8) + c7))*2.678571f) - floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f))) + 0.839286f)))
}
}
}
}
for (c3, 0, 4) {
for (c5, 0, 224) {
for (c7, 0, 224) {
for (c9, 0, 8) {
output_buf[((((c9 + int32((int64(c7)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] = _conv_bias_b2[(c9 + (c3*8))]
}
}
for (c7, 0, 3) {
for (c9, 0, 3) {
for (c11, 0, 224) {
for (c13, 0, 3) {
vectorized (c15, 0, 8) {
output_buf[((((c15 + int32((int64(c11)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] = (output_buf[((((c15 + int32((int64(c11)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] + (input_resized_buf[(((c13 + int32((int64((c9 + c11))*(int64)3))) + int32((int64((c5 + c7))*(int64)678))) + int32((int64(c1)*(int64)153228)))]*_conv_filter_b1[((((c15 + (c13*8)) + (c9*24)) + (c7*72)) + (c3*216))]))
}
}
}
}
}
}
}
}
I don't know if it's a bug or if there's an error in my code.
Hello,
Actually, this is not an issue but a question about Tiramisu. I don't know if this is the right place for asking questions, so I'm sorry if it's inconvenient.
I want to implement a computation that looks like this with Tiramisu:
for (i = 0; i < N; ++i)
    for (j = 0; j < i; ++j)
        S0
        S1
I have done it following the tutorials, but now I want to apply vectorization to the loop j with vector length v.
The problem is that, as stated in the documentation, the vectorization of a loop that has fewer than v iterations is not correct.
Thus, for the first iterations where i < v, the vectorization will be incorrect because the loop j iterates fewer than v times. I think this can be fixed by splitting the outer loop like this:
for (i = 0; i < v; ++i)
    for (j = 0; j < i; ++j)
        S0
        S1
for (i = v; i < N; ++i)
    for (j = 0; j < i; ++j)
        S0
        S1
and applying vectorization only to the second loop nest. This could be done easily in the declaration of the algorithm (layer 1), but I think I must resolve it in layer 2, because it is an issue related to the optimization of the algorithm.
I searched the documentation but didn't find a way to do that, so I am asking for your help.
Thank you.
When compiling Tiramisu, the CUDA backend emits warnings:
make tiramisu
....
....
....
/data/scratch/baghdadi/tiramisu/src/tiramisu_codegen_cuda.cpp: In member function ‘bool tiramisu::cuda_ast::compiler::compile(const string&) const’:
/data/scratch/baghdadi/tiramisu/src/tiramisu_codegen_cuda.cpp:1757:25: warning: ignoring return value of ‘char* getcwd(char*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
getcwd(cwd, 500);
^
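A minimal sketch of silencing the warning by actually checking the return value (the helper name and error handling below are illustrative):
#include <unistd.h>
#include <cstdio>

// Illustrative: getcwd's result must be checked because of warn_unused_result.
static bool get_current_dir(char *cwd, size_t size)
{
    if (getcwd(cwd, size) == nullptr) {
        std::perror("getcwd");
        return false;
    }
    return true;
}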
I would like to point out that an identifier like "_H_TIRAMISU_CORE_" does not conform to the naming rules of the C++ standard: identifiers that begin with an underscore followed by an uppercase letter are reserved for the implementation.
Would you like to adjust your choice of unique names?
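A conforming include guard would avoid the reserved prefix; the guard name below is just an illustration:
// Illustrative replacement for the reserved "_H_TIRAMISU_CORE_" guard.
#ifndef TIRAMISU_CORE_H
#define TIRAMISU_CORE_H

// ... declarations ...

#endif  // TIRAMISU_CORE_H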
The skewing operation doesn't seem to have any effect. Say you have a nested loop:
for i in [0, 4):
    for j in [0, 4):
        A[i, j] = x
This traverses the loop range as in:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
After the skewing operation A.skew(i, j, 1), one would expect diagonal traversal:
0 1 3 6
2 4 7 10
5 8 11 13
9 12 14 15
But the resulting Tiramisu program still has the previous traversal. One can see this in the debug output as well:
Generated Halide stmt:
produce {
for (c1, 0, 4) {
for (c3, c1, 4) {
A[((c3 - c1) + (c1*4))] = x
}
}
}
Skew only shifts the inner loop by c1 and subtracts c1 back in the access, so the resulting program is equivalent to the one without skewing.
Vectorizing inner loops of distributed code removes the distributed tag.