triton-lang / triton
Development repository for the Triton language and compiler
Home Page: https://triton-lang.org/
License: MIT License
Hi all,
Is there any support for using GPU tensor cores in sparse-dense matrix multiplication (SpMM) or sampled dense-dense matrix multiplication (SDDMM)?
Thanks!
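For context, the semantics of SDDMM can be sketched densely in a few lines (a NumPy illustration of what the operation computes, not Triton code; the shapes and sampling pattern are made up):

```python
import numpy as np

# SDDMM: sample the dense-dense product A @ B.T only at the
# nonzero positions of a sparse pattern S.
A = np.full((4, 3), 1.0)
B = np.full((5, 3), 1.0)
S = np.zeros((4, 5))
S[0, 1] = S[2, 3] = 1.0          # the "sampled" positions
out = (A @ B.T) * (S != 0)       # dense reference; a real SDDMM kernel skips the zeros
print(out[0, 1], out[0, 0])      # 3.0 0.0
```

A tensor-core implementation would compute only the sampled blocks, which is exactly where block-sparse kernels come in.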
The instructions for installing from source say to test "by running the einsum.py example". However, this example no longer exists.
I got a problem when I run python einsum.py. How can I fix this error?
python einsum.py
Traceback (most recent call last):
  File "einsum.py", line 195, in <module>
    triton.ops.einsum(expr, ta, tb, tc, arrays = arrays, bench = True)
TypeError: einsum() got an unexpected keyword argument 'arrays'
I'd like to do some performance evaluation of the ISAAC BLAS library.
For example, clBLAS provides clBLAS-client that produces the following output.
./clBLAS-client
StatisticalTimer:: Pruning 0 samples from clfunc
StatisticalTimer:: Pruning 0 samples from clGemm
BLAS kernel execution time < ns >: 116170
BLAS kernel execution Gflops < 2.0*M*N*K/time >: 36.1049
On the other hand, it is not very obvious how to do a quick performance evaluation using ISAAC.
Even if there is no tool like clBLAS-client, can you provide info on how to run some evaluations of SGEMM and DGEMM for some sizes? If you have some C++ code that does this, could you provide it or send it to [email protected]?
Thank you, I appreciate it.
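Absent a ready-made client, the Gflops figure above can be reproduced by timing any GEMM yourself. Here is a minimal sketch of the metric in pure Python with a toy triple-loop matmul (so the absolute numbers are meaningless; substitute your ISAAC or clBLAS call for the loop):

```python
import time

def matmul_gflops(M, N, K):
    # toy GEMM, just to demonstrate the metric; swap in the library call you want to time
    A = [[1.0] * K for _ in range(M)]
    B = [[1.0] * N for _ in range(K)]
    t0 = time.perf_counter()
    C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
         for i in range(M)]
    dt = time.perf_counter() - t0
    return 2.0 * M * N * K / dt / 1e9   # same 2*M*N*K/time formula clBLAS-client prints

print(matmul_gflops(32, 32, 32))
```

For meaningful numbers, run the kernel several times and report the best or median time, and make sure any asynchronous GPU work is synchronized before stopping the clock.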
The repertoire of available Triton ops has grown, but there are still important ones missing. In particular, a generic permute for 2/3/4D tensors would be nice, not only for practitioners (e.g., layout conversion in CNNs) but also as a benchmark/test for the compiler.
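As a concrete reference for what such an op would compute, here is the host-side semantics of a 4D permute sketched in NumPy (illustrative only; a Triton implementation would do this tiled, on-device):

```python
import numpy as np

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # e.g. an NCHW activation
y = np.transpose(x, (0, 2, 3, 1))                  # NCHW -> NHWC layout conversion
print(y.shape)                                     # (2, 4, 5, 3)
assert y[0, 1, 2, 0] == x[0, 0, 1, 2]              # y[n, h, w, c] == x[n, c, h, w]
```

The interesting part for the compiler is that the memory access pattern of a permute is maximally unfriendly on at least one side, which makes it a good stress test for the layout machinery.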
I installed Triton with pip install triton==0.2.3, but when I run the code it returns an ImportError:
triton/_C/libtriton.so: undefined symbol: compress2
I have libz.so.1 in /usr/lib64 and /usr/lib64 in my $LD_LIBRARY_PATH.
Hi,
I am trying to run trans.py in python/examples/tutorials, but I get the error below:
The docker environment I use is nvcr.io/nvidia/pytorch:20.12-py3
Traceback (most recent call last):
  File "trans.py", line 1, in <module>
    import torch
  File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 484, in <module>
    from .serialization import save, load
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 8, in <module>
    import tarfile
  File "/opt/conda/lib/python3.8/tarfile.py", line 47, in <module>
    import copy
  File "/data/tmp/triton/python/examples/tutorials/copy.py", line 4, in <module>
    class _copy(torch.autograd.Function):
AttributeError: partially initialized module 'torch' has no attribute 'autograd' (most likely due to a circular import)
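The root cause is visible in the traceback: tarfile does import copy, and Python resolves it to python/examples/tutorials/copy.py instead of the stdlib module, because the script's directory comes first on sys.path. A self-contained reproduction (the temporary directory and marker text are made up):

```python
import os, subprocess, sys, tempfile

# A local file named copy.py shadows the stdlib `copy` module whenever the
# interpreter runs with that directory first on sys.path -- exactly what
# happens when trans.py is launched from the tutorials directory.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "copy.py"), "w") as f:
        f.write("print('local copy.py imported')\n")
    out = subprocess.run(
        [sys.executable, "-c", "import copy"],
        cwd=d, capture_output=True, text=True,
    ).stdout.strip()
    print(out)  # the shadow module runs, not the stdlib one
```

Renaming or moving tutorials/copy.py (or running the script from a different working directory) should make the circular import go away.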
Hi, I'm trying to do some inference on M60/K80 but I always get an issue like this:
"triton.code_gen.OutOfResources: out of resource: shared memory, Required: 65536, Hardware limit: 49152"
Training on A100 is fine because of its bigger shared memory.
I just want to know: is there a hard limit imposed by the GPU hardware? Are there any suggestions for how we can avoid this issue and enable inference on M60/K80/T4?
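To make the numbers concrete: the 65536 bytes requested correspond, for example, to a double-buffered pair of 128x64 fp16 tiles (this particular tiling is an assumption for illustration, not read from the failing kernel):

```python
BYTES_FP16 = 2
tile = 128 * 64 * BYTES_FP16   # one operand tile: 16384 bytes
required = tile * 2 * 2        # two operands (A and B), double-buffered
print(required)                # 65536
print(required <= 49152)       # False: exceeds the 48 KB per-block limit
```

In practice the usual workaround on pre-Ampere parts is to pick smaller block sizes (or fewer pipeline stages) so the tiles fit under 48 KB.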
The symbolic engine is outdated. Parts are still written in C++03, and the dirty code structure makes it hard to add new kernel templates.
In particular, the new symbolic engine should allow for the transparent handling of index modifiers (row, col, trans, reshape, diag, etc.), the support of which is too sloppy for now -- thereby leading to many bugs in the C++ API.
I was trying to use the GEMM for a program that has to multiply a large number of matrices, and I found a memory leak. I then profiled test-gemm and found that the leak is also there, as you can see in this Valgrind Memcheck output:
==11320== LEAK SUMMARY:
==11320== definitely lost: 323,624 bytes in 166 blocks
==11320== indirectly lost: 300 bytes in 16 blocks
==11320== possibly lost: 121,368 bytes in 1,088 blocks
==11320== still reachable: 14,228,175 bytes in 17,865 blocks
==11320== suppressed: 0 bytes in 0 blocks
==11320== Reachable blocks (those to which a pointer was found) are not shown.
==11320== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==11320==
The problem seems to be here:
==11320== 26,144 bytes in 6 blocks are definitely lost in loss record 1,429 of 1,488
==11320== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11320== by 0x4E93112: std::vector<float, std::allocator<float> >::_M_default_append(unsigned long) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4E8EFAA: isaac::runtime::Dense::Dense(unsigned char*&) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4E8F094: isaac::runtime::Layer::read(unsigned char*&) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4E8F16F: isaac::runtime::Network::Network(unsigned char*) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4E8F661: isaac::runtime::Profile::Profile(unsigned char*, unsigned long) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4E8D20E: std::__shared_ptr<isaac::runtime::GEMMProfile, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<isaac::runtime::GEMMProfile>, unsigned char*>(std::_Sp_make_shared_tag, std::allocator<isaac::runtime::GEMMProfile> const&, unsigned char*&&) (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4E6A0EB: __static_initialization_and_destruction_0(int, int) [clone .constprop.78] (in /home/rogerpt32/isaac/build/libisaac.so)
==11320== by 0x4010732: call_init (dl-init.c:72)
==11320== by 0x4010732: _dl_init (dl-init.c:119)
==11320== by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)
With Valgrind Massif we can also see the memory leak, but this time it points to:
isaac::driver::Kernel::Kernel(isaac::driver::Module const&, char const*)
Memory leaks on every call to GEMM. This is not a problem for small programs, but for anybody running a program that makes a very large number of GEMM calls through ISAAC, it could be problematic.
Hi, I am trying to reproduce your SC results.
I got the following errors among others:
In file included from /home/flavio/isaac/lib/templates/gemm.cpp:26:0:
/home/flavio/isaac/include/isaac/templates/common.hpp:82:16: error: ‘vector’ in namespace ‘std’ does not name a template type
  virtual std::vector<param_t> tuning_params() const = 0;
In file included from /home/flavio/isaac/lib/templates/gemm.cpp:27:0:
/home/flavio/isaac/include/isaac/templates/gemm.h:60:8: error: ‘vector’ in namespace ‘std’ does not name a template type
  std::vector<param_t> tuning_params() const;
In file included from /home/flavio/isaac/lib/templates/pool.cpp:26:0:
/home/flavio/isaac/include/isaac/templates/common.hpp:82:16: error: ‘vector’ in namespace ‘std’ does not name a template type
  virtual std::vector<param_t> tuning_params() const = 0;
In file included from /home/flavio/isaac/lib/templates/pool.cpp:27:0:
/home/flavio/isaac/include/isaac/templates/pool.h:42:34: error: ‘std::vector’ has not been declared
  void init_constant_memory(std::vector<int32_t>& delta, std::vector<uint32_t> &masks, size_t nlut, int32_t strideIc, int32_t strideIw, int32_t strideIh, int32_t strideId);
I am using CUDA 9 and gcc/g++ 7.
If I run
python imagenet.py --arch resnet152 /data/imagenet_data
then I get:
raise ImportError("torch.utils.ffi is deprecated. Please use cpp extensions instead.")
ImportError: torch.utils.ffi is deprecated. Please use cpp extensions instead.
It seems it only supports https://github.com/pytorch/pytorch/tree/v0.4.1.
Hi - would it be possible to reinstate triton==1.0.0.dev20210329 on pip, or make it clear how to update to the latest dev branch?
The API seems to have changed significantly in the latest nightlies, and some functions in https://github.com/microsoft/DeepSpeed rely on that particular interface.
If not, how can this repo support TensorFlow? Thanks!
Hi,
I am running the Caffe OpenCL branch (https://github.com/BVLC/caffe/tree/opencl) with the ISAAC master branch on an Intel Broadwell platform with the command below:
./build/test/test.testbin --gtest_filter=NetTest/2.TestLossWeight
and the test fails.
When I comment out lines 94 to 141 of https://github.com/ptillet/isaac/blob/master/lib/runtime/profiles.cpp, the test passes.
Can you reproduce the failure? It seems there is some problem with the copy operation in the predict_ logic.
Hi,
I tried on an Intel OpenCL-only platform and found that, from commit f1a636f, the master branch fails to build with CMAKE_BUILD_TYPE=Debug but passes with Release or RelWithDebInfo.
It looks like this was introduced along with the cuBLAS libraries; the build errors are shown below:
../lib/libisaac.so: undefined reference to `cublasGetStream_v2'
../lib/libisaac.so: undefined reference to `cublasSetStream_v2'
collect2: error: ld returned 1 exit status
tests/CMakeFiles/test-fusion.dir/build.make:95: recipe for target 'tests/test-fusion' failed
make[2]: *** [tests/test-fusion] Error 1
CMakeFiles/Makefile2:1608: recipe for target 'tests/CMakeFiles/test-fusion.dir/all' failed
make[1]: *** [tests/CMakeFiles/test-fusion.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
I was looking at this test function: https://github.com/ptillet/triton/blob/master/python/test/test_blocksparse.py#L13.
When the block size is 16 or greater, it works fine. However, when the block size is set to 4 or 8, the error message looks like this:
Traceback (most recent call last):
File "test_blocksparse.py", line 41, in <module>
test_matmul("dds", TRANS_A=False, TRANS_B=False, BLOCK=8, DTYPE="float16", Z=3, H=2, M=512, N=384, K=256)
File "test_blocksparse.py", line 30, in test_matmul
rc = op(ra, rb)
File "/home/ubuntu/triton/python/triton/ops/blocksparse/matmul.py", line 651, in __call__
db_lut, db_num_locks, db_width, db_packs = self.make_lut(a.dtype, a.device)
File "/home/ubuntu/triton/python/triton/ops/blocksparse/matmul.py", line 595, in make_lut
c_lut, c_num_locks, c_width, c_packs = _matmul.make_dxx_lut(layout, block, step, self.trans_b, device)
File "/home/ubuntu/triton/python/triton/ops/blocksparse/matmul.py", line 391, in make_dxx_lut
xincs[:, 0] -= (div - 1) * step
IndexError: index 0 is out of bounds for dimension 1 with size 0
Any ideas what the reason might be? To reproduce the error, you can run:
test_matmul("dds", TRANS_A=False, TRANS_B=False, BLOCK=4, DTYPE="float16", Z=3, H=2, M=512, N=384, K=256)
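The failing line can be reproduced in isolation. My reading of the traceback (not of the actual make_dxx_lut code) is that for BLOCK=4/8 the LUT array ends up with zero columns, so any index into dimension 1 is out of bounds:

```python
import numpy as np

xincs = np.empty((4, 0), dtype=np.int64)  # the shape the traceback implies
try:
    xincs[:, 0] -= 1                      # same expression as matmul.py line 391
except IndexError as e:
    print(e)                              # index 0 is out of bounds for axis 1 with size 0
```

If that is right, the fix would be to make the LUT construction handle the small-block case (or reject it with a clear error) before indexing.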
Do you plan to add support for a HIP backend in addition to the CUDA backend? HIP is a C++ runtime API and kernel language that lets developers create portable applications for AMD and NVIDIA GPUs from a single source code.
Adding support for the HIP backend would enable the Triton library to also support AMD GPUs. The HIP API very closely resembles the CUDA API, and there are tools such as hipify that make it easy to "translate" most CUDA sources into HIP sources.
Many files, for example https://github.com/ptillet/triton/blob/master/lib/ir/constant.cc, raise a compilation error on Arch Linux: 25:16: error: no member named 'runtime_error' in namespace 'std'.
When I manually add #include <stdexcept> it starts working, but then fails with the same error in other files.
I installed the project via pip, but failed to import it.
Python 3.7.3 (default, Dec 20 2019, 18:57:59)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import triton
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/rd/.local/lib/python3.7/site-packages/triton/__init__.py", line 1, in <module>
from .kernel import *
File "/home/rd/.local/lib/python3.7/site-packages/triton/kernel.py", line 1, in <module>
import triton._C.libtriton as libtriton
ImportError: /home/rd/.local/lib/python3.7/site-packages/triton/_C/libtriton.so: undefined symbol: setupterm
However, ldd shows that every dependency of this .so is resolved.
ldd /home/rd/.local/lib/python3.7/site-packages/triton/_C/libtriton.so
linux-vdso.so.1 (0x00007ffc7c375000)
libc10.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libc10.so (0x00007fbae6190000)
libc10_cuda.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libc10_cuda.so (0x00007fbae5f5f000)
libtorch.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch.so (0x00007fbae5d49000)
libtorch_cuda.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so (0x00007fbaab149000)
libtorch_cpu.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so (0x00007fba9af88000)
libtorch_python.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so (0x00007fba99dd6000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fba99c42000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fba99abf000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fba99aa3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fba998e2000)
/lib64/ld-linux-x86-64.so.2 (0x00007fbae929b000)
libgomp-7c85b1e2.so.1 => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libgomp-7c85b1e2.so.1 (0x00007fba996b8000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fba99697000)
libcudart-80664282.so.10.2 => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libcudart-80664282.so.10.2 (0x00007fba99416000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fba9940a000)
libnvToolsExt-3965bdd0.so.1 => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libnvToolsExt-3965bdd0.so.1 (0x00007fba99200000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fba991fb000)
libtensorpipe.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libtensorpipe.so (0x00007fba98d32000)
libshm.so => /home/rd/.local/lib/python3.7/site-packages/torch/lib/libshm.so (0x00007fba98b2a000)
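Note that ldd only checks that each library dependency resolves; it does not check individual symbols, so an undefined setupterm (normally provided by ncurses/tinfo) can still surface at import time. dlopen-ing the .so directly reproduces the interpreter's failure; the path below is the one from the report and is illustrative only:

```python
import ctypes

def try_load(path):
    # dlopen with default flags, the same thing `import` does for extension modules
    try:
        ctypes.CDLL(path)
        return "loaded ok"
    except OSError as e:
        return "load failed: %s" % e   # e.g. "... undefined symbol: setupterm"

print(try_load("/home/rd/.local/lib/python3.7/site-packages/triton/_C/libtriton.so"))
```

To see which library should provide a missing symbol, nm -D over the candidate libraries (here, libncurses/libtinfo) is the usual next step.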
I have just checked out your blog and I must congratulate you on the release!
Maybe I'm asking a dumb question here, but how feasible is it to replace CUDA in PyTorch with Triton kernels? Your tutorials seem to indicate that auto-optimized code can be better than hand-optimized Torch CUDA code. So I wonder if we could actually rewrite the Torch CUDA kernels in Triton.
Many symbols of the BLAS ABI are currently not supported, although the current ABI is enough to ensure compatibility with e.g. caffe.
Currently supported functions are:
xAXPY
xCOPY
xSCAL
xDOT
xASUM
xGEMV
xGER
xGEMM
x = {S, D}
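Spelled out, the x = {S, D} convention in the list above expands to these 16 entry points:

```python
# Expand the supported-symbol table: each op exists in single (S) and
# double (D) precision.
prefixes = ["S", "D"]
ops = ["AXPY", "COPY", "SCAL", "DOT", "ASUM", "GEMV", "GER", "GEMM"]
symbols = [p + op for p in prefixes for op in ops]
print(len(symbols))              # 16
print(", ".join(symbols[:4]))
```

Anything outside this set (e.g. the complex C/Z variants, or TRSM) would currently fall through to an unresolved symbol.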
Hi!
Is there a way to get the underlying PTX for a given kernel? I recall there being an API for this at one point, but it looks like it no longer exists in the latest Triton.
Thanks!
Trevor
hi -
Can you please clarify the compiler requirements? I am mainly interested in the Python module, but it needs to build the extension anyway. Is LLVM a must, and is there a version requirement? I ran into a bunch of errors; I just want to confirm the basics before I post the logs.
Thanks
Oliver
It seems like some kernels tend to corrupt memory. This problem does not show up in the benchmarks, but it breaks the BLAS-3 test (despite computing the correct matrix).
In setup.py the LLVM version is checked here.
But my llvm-config is just called llvm-config (no version number).
Plain llvm-config is listed as an option here.
Changing the lines in setup.py to:
versions = ['','-9.0', '-9', '-90', '-8.0', '-8', '-80']
supported = ['llvm-config{v}'.format(v=v) for v in versions]
worked for me. This maybe defeats the point of having the check, but you could also throw a warning asking the user to check that their LLVM version is >= 8.
If that fix is OK, I'm happy to open a PR.
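The proposed change amounts to probing the unversioned name first and degrading to a warning; a sketch (the helper function and message wording are mine, not from setup.py):

```python
import shutil

versions = ['', '-9.0', '-9', '-90', '-8.0', '-8', '-80']
candidates = ['llvm-config{v}'.format(v=v) for v in versions]

def find_llvm_config():
    # Return the first llvm-config variant found on PATH, or None.
    return next((c for c in candidates if shutil.which(c)), None)

found = find_llvm_config()
if found is None:
    print("warning: llvm-config not found; please check that LLVM >= 8 is installed")
else:
    print("using", found)
```

An alternative that keeps the version check would be to run the found binary with --version and warn only when it reports something below 8.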
hi! :) Thank you for such an amazing project!
I have been trying to use your sparse matmul kernels in TensorFlow by making a custom op and calling the function instance in the overridden Compute function.
However, after successfully launching the kernel, I encounter illegal memory access errors when I try to print the result (matrix c) at the Python level.
Considering this may be due to conflicts with TensorFlow itself, I referenced the tutorials/01-matmul.cc file and tried using the aforementioned kernel with only C++ and the Triton APIs, but I still get the same 'illegal memory access' error when trying to read the result into a CPU vector using stream->read.
Is there some reason why I might be encountering such an error only with the aforementioned kernel and not the one in 01-matmul.cc (src::dot)?
It would be great if you could give some advice about what may have gone wrong, or maybe a potential reason why using a tf custom op is inappropriate.
Thank you :)
//--------------------- c++ only file -------------------------
// tried to read from stream at the end, but gets illegal memory access error
void compute(drv::context* context,
drv::stream* stream,
bool trans_a, // false
bool trans_b, // true
bool trans_c, // false
int block,
int num_blocks)
{
size_t dt_nbytes = sizeof(float);
drv::device* device = context->device();
std::vector<half_float::half> ha(1 * 4 * 128 * 64);
std::vector<half_float::half> hb(1 * 4 * 128 * 64);
for(size_t i = 0; i < ha.size(); i++) {
ha[i] = (half_float::half)((float)rand() / RAND_MAX); // divide first; casting rand() to half directly overflows the half range
}
for(size_t i = 0; i < hb.size(); i++) {
hb[i] = (half_float::half)((float)rand() / RAND_MAX);
}
std::vector<float> widths = {24, 40};
std::vector<float> spdims = {4, 8, 8};
std::vector<float> packs = {2,1};
int AS0 = 1; int AS1 = 4 ; int AS2 = 128 ; int AS3 = 64;
int BS0 = 1; int BS1 = 4 ; int BS2 = 64 ; int BS3 = 128;
bool is_16_multiple = (AS3 % 16 == 0);
bool is_32_multiple = (AS3 % 32 == 0);
bool is_64_multiple = (AS3 % 64 == 0);
int total_width = 0;
int num_elements = widths.size();
for(int i = 0 ; i < num_elements ; i++) {
total_width += widths[i] * packs[i] * packs[i];
}
std::vector<half_float::half> hc(AS0 * total_width * block * block);
auto da = std::shared_ptr<drv::buffer>(drv::buffer::create(context, AS0*AS1*AS2*AS3*dt_nbytes));
auto db = std::shared_ptr<drv::buffer>(drv::buffer::create(context, BS0*BS1*BS2*BS3*dt_nbytes));
auto dc = std::shared_ptr<drv::buffer>(drv::buffer::create(context, AS0*total_width*block*block*dt_nbytes));
stream->write(&*da, true, 0, ha);
stream->write(&*db, true, 0, hb);
int pack = packs[0];
rt::options_t opt;
opt.defines["TM"] = "32"; //std::to_string(block * pack);
opt.defines["TN"] = "32"; //std::to_string(block * pack);
opt.defines["TMN"] = "1024"; //std::to_string(block * block * pack * pack);
opt.defines["BLOCK"] = "16"; //std::to_string(block);
opt.defines["TK"] = "32";
opt.defines["TYPE"] = "float";
opt.defines["STRIDE_AM"] = "lda"; //trans_a ? "1" : "lda"; //TODO:
opt.defines["STRIDE_AK"] = "1"; //trans_a ? "lda" : "1";
opt.defines["STRIDE_BN"] = "ldb"; //trans_b ? "ldb" : "1";
opt.defines["STRIDE_BK"] = "1"; //trans_b ? "1" : "ldb";
opt.defines["STRIDE_CM"] = "ldc";
opt.defines["STRIDE_CN"] = "1";
opt.defines["SDD"] = "True";
opt.defines["TZ"] = "1";
opt.defines["NAME"] = "sdd_kernel";
opt.num_warps = 4;
int num_lock = 1;
int width = widths[0];
std::vector<int> locks(2 * width * AS0 * num_lock, 0);
auto locks_buf = std::shared_ptr<drv::buffer>(drv::buffer::create(context, 2*width*AS0*num_lock*sizeof(int)));
stream->write(&*locks_buf, true, 0, locks);
std::vector<int> lut = {
0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 8, 0, 1,
1, 9, 0, 0, 2, 2, 0, 0, 3, 3, 0, 2, 2, 13,
0, 2, 3, 14, 0, 3, 0, 15, 0, 3, 3, 17, 0, 4,
0, 19, 0, 4, 3, 20, 0, 0, 4, 4, 0, 0, 5, 5,
0, 4, 4, 21, 0, 4, 5, 22, 0, 5, 0, 23, 0, 5,
5, 25, 0, 6, 0, 27, 0, 6, 5, 28, 0, 0, 6, 6,
0, 0, 7, 7, 0, 6, 6, 29, 0, 6, 7, 30, 1, 0,
0, 34, 1, 0, 1, 35, 1, 1, 0, 42, 1, 1, 1, 43,
1, 0, 2, 36, 1, 0, 3, 37, 1, 2, 2, 47, 1, 2,
3, 48, 1, 3, 0, 49, 1, 3, 3, 51, 1, 4, 0, 53,
1, 4, 3, 54, 1, 0, 4, 38, 1, 0, 5, 39, 1, 4,
4, 55, 1, 4, 5, 56, 1, 5, 0, 57, 1, 5, 5, 59,
1, 6, 0, 61, 1, 6, 5, 62, 1, 0, 6, 40, 1, 0,
7, 41, 1, 6, 6, 63, 1, 6, 7, 64, 2, 0, 0, 68,
2, 0, 1, 69, 2, 1, 0, 76, 2, 1, 1, 77, 2, 0,
2, 70, 2, 0, 3, 71, 2, 2, 2, 81, 2, 2, 3, 82,
2, 3, 0, 83, 2, 3, 3, 85, 2, 4, 0, 87, 2, 4,
3, 88, 2, 0, 4, 72, 2, 0, 5, 73, 2, 4, 4, 89,
2, 4, 5, 90, 2, 5, 0, 91, 2, 5, 5, 93, 2, 6,
0, 95, 2, 6, 5, 96, 2, 0, 6, 74, 2, 0, 7, 75,
2, 6, 6, 97, 2, 6, 7, 98, 3, 0, 0, 102, 3, 0,
1, 103, 3, 1, 0, 110, 3, 1, 1, 111, 3, 0, 2, 104,
3, 0, 3, 105, 3, 2, 2, 115, 3, 2, 3, 116, 3, 3,
0, 117, 3, 3, 3, 119, 3, 4, 0, 121, 3, 4, 3, 122,
3, 0, 4, 106, 3, 0, 5, 107, 3, 4, 4, 123, 3, 4,
5, 124, 3, 5, 0, 125, 3, 5, 5, 127, 3, 6, 0, 129,
3, 6, 5, 130, 3, 0, 6, 108, 3, 0, 7, 109, 3, 6,
6, 131, 3, 6, 7, 132
};
auto lut_buf = std::shared_ptr<drv::buffer>(drv::buffer::create(context, lut.size()*sizeof(int)));
stream->write(&*lut_buf, true, 0, lut); // was `luts_i`, which is not declared anywhere in this function
std::stringstream oss;
rt::add_arg(oss, *da->cu());
rt::add_arg(oss, *db->cu());
rt::add_arg(oss, *dc->cu());
rt::add_arg(oss, 64);
rt::add_arg(oss, 64);
rt::add_arg(oss, block);
rt::add_arg(oss, 32768);
rt::add_arg(oss, 32768);
rt::add_arg(oss, 34816);
rt::add_arg(oss, 8192);
rt::add_arg(oss, 8192);
rt::add_arg(oss, 34816);
rt::add_arg(oss, AS2);
rt::add_arg(oss, AS2);
rt::add_arg(oss, AS3);
rt::add_arg(oss, 0);
rt::add_arg(oss, *lut_buf->cu());
rt::add_arg(oss, *locks_buf->cu());
rt::add_arg(oss, num_lock);
rt::function function(src::matmul, opt, device); // src::matmul is source code in matmul.c, passed over like the tutorial
int max_width = 49152;
auto grid = [max_width, width, AS0](const rt::options_t& x) {
return rt::grid_t{(size_t)x.D<int>("TZ"),
(size_t)std::min(max_width, width),
(size_t)AS0};
};
function((void**)oss.str().data(), oss.str().size(), grid, stream);
stream->read(&*dc, true, 0, hc); // ERROR HAPPENS HERE!!!!
}
int main() {
auto context = triton::driver::backend::contexts::get_default();
triton::driver::stream* stream = triton::driver::stream::create(context->backend());
compute(context, stream, false, true, false, 16, 0);
}
The code below shows the TF custom-op version of using the sparse matmul kernel, and the second code block shows how this op is called at the Python level; the error happens when printing tensor c.
//--------------------------------------- using tf custom op ----------------------------------------
using namespace tensorflow;
// op without caching ver
//************************ OP INTERFACE *********************************
REGISTER_OP("SddMatmul")
.Attr("T: {int32, float, float16}")
.Input("a: T")
.Input("b: T")
.Input("c: T")
.Attr("trans_a: bool")
.Attr("trans_b: bool")
.Attr("trans_c: bool")
.Attr("block: int")
.Input("luts_i: T")
.Attr("widths_i: int")
.Attr("packs_i: int")
.Input("locks: T")
.Attr("a_stride0: int")
.Attr("a_stride1: int")
.Attr("a_stride2: int")
.Attr("b_stride0: int")
.Attr("b_stride1: int")
.Attr("b_stride2: int")
.Attr("c_stride0: int");
//****************************** KERNEL **********************************
template <typename T>
class SddMatmulOp : public OpKernel {
private:
bool trans_a, trans_b, trans_c;
int block, width, pack;
int a_stride0, a_stride1, a_stride2, b_stride0, b_stride1, b_stride2, c_stride0;
public:
explicit SddMatmulOp(OpKernelConstruction* context) : OpKernel(context) {
OP_REQUIRES_OK(context, context->GetAttr("trans_a", &trans_a));
OP_REQUIRES_OK(context, context->GetAttr("trans_b", &trans_b));
OP_REQUIRES_OK(context, context->GetAttr("trans_c", &trans_c));
OP_REQUIRES_OK(context, context->GetAttr("block", &block));
OP_REQUIRES_OK(context, context->GetAttr("widths_i", &width));
OP_REQUIRES_OK(context, context->GetAttr("packs_i", &pack));
OP_REQUIRES_OK(context, context->GetAttr("a_stride0", &a_stride0));
OP_REQUIRES_OK(context, context->GetAttr("a_stride1", &a_stride1));
OP_REQUIRES_OK(context, context->GetAttr("a_stride2", &a_stride2));
OP_REQUIRES_OK(context, context->GetAttr("b_stride0", &b_stride0));
OP_REQUIRES_OK(context, context->GetAttr("b_stride1", &b_stride1));
OP_REQUIRES_OK(context, context->GetAttr("b_stride2", &b_stride2));
OP_REQUIRES_OK(context, context->GetAttr("c_stride0", &c_stride0));
}
void Compute(OpKernelContext* context) override {
auto triton_ctx = drv::backend::contexts::get_default(); // get context
drv::stream* stream = triton::driver::stream::create(triton_ctx->backend()); // get stream
drv::device* device = triton_ctx->device(); // get device
// get inputs
OpInputList luts;
const Tensor& a = context->input(0);
const Tensor& b = context->input(1);
const Tensor& c = context->input(2);
const Tensor& lut = context->input(3);
const Tensor& locks = context->input(4);
// change to pointers
auto a_ = reinterpret_cast<const T*>(a.tensor_data().data());
auto b_ = reinterpret_cast<const T*>(b.tensor_data().data());
auto c_ = reinterpret_cast<const T*>(c.tensor_data().data());
auto lut_ = reinterpret_cast<const T*>(lut.tensor_data().data());
auto locks_ = reinterpret_cast<const T*>(locks.tensor_data().data());
uintptr_t a_uintptr = (uintptr_t)a_;
uintptr_t b_uintptr = (uintptr_t)b_;
uintptr_t c_uintptr = (uintptr_t)c_;
uintptr_t lut_uintptr = (uintptr_t)lut_;
uintptr_t locks_uintptr = (uintptr_t)locks_;
uint64 a_addr = a_uintptr;
uint64 b_addr = b_uintptr;
uint64 c_addr = c_uintptr;
uint64 lut_addr = lut_uintptr;
uint64 locks_addr = locks_uintptr;
std::cout << "a addr: " << a_addr << std::endl;
std::cout << "b addr: " << b_addr << std::endl;
std::cout << "c addr: " << c_addr << std::endl;
std::cout << "lut addr: " << lut_addr << std::endl;
std::cout << "locks addr: " << locks_addr << std::endl;
// define macros
rt::options_t opt;
opt.defines["TM"] = std::to_string(block * pack);
opt.defines["TN"] = std::to_string(block * pack);
opt.defines["TMN"] = std::to_string(block * block * pack * pack);
opt.defines["BLOCK"] = std::to_string(block);
opt.defines["TK"] = "32";
opt.defines["TYPE"] = "half"; //std::to_string(dtype); //std::to_string<T>::value; TODO:
opt.defines["STRIDE_AM"] = trans_a ? "1" : "lda";
opt.defines["STRIDE_AK"] = trans_a ? "lda" : "1";
opt.defines["STRIDE_BN"] = trans_b ? "ldb" : "1";
opt.defines["STRIDE_BK"] = trans_b ? "1" : "ldb";
opt.defines["STRIDE_CM"] = "ldc";
opt.defines["STRIDE_CN"] = "1";
opt.defines["SDD"] = "True"; // TODO: bool type
opt.defines["TZ"] = "1";
opt.defines["NAME"] = "sdd_kernel";
opt.num_warps = 4;
// adding arguments
std::stringstream oss;
rt::add_arg(oss, a_addr);
rt::add_arg(oss, b_addr);
rt::add_arg(oss, c_addr);
rt::add_arg(oss, a_stride2);
rt::add_arg(oss, b_stride2);
rt::add_arg(oss, block);
rt::add_arg(oss, a_stride0);
rt::add_arg(oss, b_stride0);
rt::add_arg(oss, c_stride0);
rt::add_arg(oss, a_stride1);
rt::add_arg(oss, b_stride1);
rt::add_arg(oss, c_stride0);
rt::add_arg(oss, 128); // AS2
rt::add_arg(oss, 128); // AS2
rt::add_arg(oss, 64); // AS3
rt::add_arg(oss, 0);
rt::add_arg(oss, lut_addr);
rt::add_arg(oss, locks_addr);
rt::add_arg(oss, 1); // num_lock
rt::function function(src::matmul, opt, device);
auto grid = [this](const rt::options_t& x) {
return rt::grid_t{(size_t)x.D<int>("TZ"),
(size_t)std::min(49152, width),
(size_t)1};
};
function((void**)oss.str().data(), oss.str().size(), grid, stream);
std::cout << "a addr after: " << a_addr << std::endl;
std::cout << "b addr after: " << b_addr << std::endl;
std::cout << "c addr after: " << c_addr << std::endl;
std::cout << "lut addr after: " << lut_addr << std::endl;
std::cout << "locks addr after: " << locks_addr << std::endl;
} // Compute
}; // SddMatmulOp class
REGISTER_KERNEL_BUILDER(Name("SddMatmul").Device(DEVICE_GPU).TypeConstraint<Eigen::half>("T"), SddMatmulOp<Eigen::half>);
sdd_matmul = _load_library('c_lib' + get_ext_suffix()).sdd_matmul
# go around for loop once
with tf.device("/GPU:0"):
a = tf.random.uniform([1, 4, 128, 64], dtype=tf.float16)
b = tf.random.uniform([1, 4, 128, 64], dtype=tf.float16)
c = tf.ones([1, 136, 16, 16], dtype=tf.float16)
luts_i = tf.random.uniform([384], minval=0, maxval=130, dtype=tf.float16)
locks = tf.zeros([2,24,1,1], dtype=tf.float16)
sdd_matmul(a=a,
b=b,
c=c,
trans_a=False,
trans_b=True,
trans_c=False,
block=16,
luts_i=luts_i,
widths_i=24,
packs_i=2,
locks=locks,
a_stride0=32768,
a_stride1=8192,
a_stride2=64,
b_stride0=32768,
b_stride1=8192,
b_stride2=64,
c_stride0=34816
)
print("C: ", c) # ERROR HAPPENS HERE
The following is the specification of my environment:
n1-standard-8 instance
I'd like to build this on Windows with VS2015, but I got hundreds of errors. Can this project be built on Windows?
I am having trouble installing an old version of this package, which another package needs.
If extra environment setup is needed to compile this, pip is not a good idea for users.
Could you please provide detailed instructions on how to install the program and how to test it?
The build process looks correct except for some warnings, but after that all tests fail.
$ ./bench/bench-blas
Devices available:
------------------
[x] - gfx803 on AMD Accelerated Parallel Processing
------------------
BENCH M N K AT BT ISAAC
Segmentation fault (core dumped)
$ gdb $_
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
......
Reading symbols from ./bench/bench-blas...(no debugging symbols found)...done.
(gdb) run
Starting program: /tmp/isaac/build/bench/bench-blas
Devices available:
------------------
Warning: couldn't activate thread debugging using libthread_db: Cannot find new threads: generic error
warning: File "/lib/x86_64-linux-gnu/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /lib/x86_64-linux-gnu/libthread_db-1.0.so
line to your configuration file "/var/data0/sandbox/user1/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/var/data0/sandbox/user1/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
[x] - gfx803 on AMD Accelerated Parallel Processing
------------------
BENCH M N K AT BT ISAAC
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7750a1d in isaac::runtime::profiles::value_type::init(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
(gdb) where
#0 0x00007ffff7750a1d in isaac::runtime::profiles::value_type::init(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
#1 0x00007ffff7751440 in isaac::runtime::profiles::value_type::execute(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
#2 0x00007ffff775b71e in isaac::runtime::execute(isaac::runtime::execution_handler const&, std::map<std::pair<isaac::expression_type, isaac::numeric_type>, std::shared_ptr<isaac::runtime::profiles::value_type>, std::less<std::pair<isaac::expression_type, isaac::numeric_type> >, std::allocator<std::pair<std::pair<isaac::expression_type, isaac::numeric_type> const, std::shared_ptr<isaac::runtime::profiles::value_type> > > >&) () from /tmp/isaac/build/lib/libisaac.so
#3 0x00007ffff7741200 in isaac::array_base::operator=(isaac::runtime::execution_handler const&) () from /tmp/isaac/build/lib/libisaac.so
#4 0x00007ffff77419a7 in isaac::array_base::operator=(isaac::expression_tree const&) () from /tmp/isaac/build/lib/libisaac.so
#5 0x0000000000409c7e in double bench<void bench<float>(isaac::numeric_type, std::string)::{lambda()#8}, void bench<float>(isaac::numeric_type, std::string)::{lambda()#1}>(void bench<float>(isaac::numeric_type, std::string)::{lambda()#8} const&, void bench<float>(isaac::numeric_type, std::string)::{lambda()#1} const&) ()
#6 0x000000000040d4f8 in void bench<float>(isaac::numeric_type, std::string) ()
#7 0x0000000000405caf in main ()
(gdb)
After upgrading from triton 0.3 to 1.1, triton.kernel throws an error saying it requires a "device" argument. Is this the same as torch.device? Thanks!
I followed the build instructions in the README:
git clone https://github.com/ptillet/isaac.git
mkdir -p isaac/build && cd isaac/build
cmake ../ && make -j4
...and I get this build error:
cathal@thinkum:~/Projects/isaac/build$ make -j4
[ 2%] Built target bin2cpp
[ 5%] Built target headers
[ 5%] Built target database_amd
[ 8%] Built target database_unknown
[ 8%] Built target database_intel
[ 14%] Built target database_nvidia
[ 71%] Built target isaac
[ 77%] Built target test-element-1d
[ 77%] Built target test-blas-3
[ 83%] Built target test-blas-2
[ 83%] Built target test-blas-1
[ 85%] Built target test-element-2d
[ 88%] Built target test-reduce-2d
[ 91%] Built target test-reduce-1d
[ 94%] Built target test-fusion
[ 98%] Built target example-indexing
[ 98%] Linking CXX executable bench-blas
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_destroy'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_rdlock'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_init'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_unlock'
/home/cathal/torch-cl/install/lib/libclBLAS.so: undefined reference to `pthread_rwlock_wrlock'
collect2: error: ld returned 1 exit status
bench/CMakeFiles/bench-blas.dir/build.make:96: recipe for target 'bench/bench-blas' failed
make[2]: *** [bench/bench-blas] Error 1
CMakeFiles/Makefile2:1663: recipe for target 'bench/CMakeFiles/bench-blas.dir/all' failed
make[1]: *** [bench/CMakeFiles/bench-blas.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2
Grepping for 'pthread', I don't see a reference in any non-binary file, but it's matched in lots of object files. So perhaps a link flag is just missing somewhere in the Makefiles...but I don't grok the C build system. :/
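A common workaround for missing pthread symbols at link time is to force -pthread onto the link line when configuring. This is only a sketch, not verified against this project's CMakeLists:

```shell
# Reconfigure with -pthread on the link line; CMAKE_EXE_LINKER_FLAGS is a
# standard CMake variable, but whether it resolves this clBLAS issue is untested.
cmake ../ -DCMAKE_EXE_LINKER_FLAGS="-pthread" -DCMAKE_SHARED_LINKER_FLAGS="-pthread"
make -j4
```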
Triton's caching mechanism fails to detect when Torch has been upgraded. This means that ~/.triton/torch/*.so will be compiled for an old version of torch and trigger a link error on import. This can be worked around by deleting ~/.triton/torch, but needs to be fixed properly.
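The proper fix presumably needs the cache key to incorporate the installed torch version, so a torch upgrade invalidates old binaries automatically. A minimal sketch of that idea (cache_key is a hypothetical name, not Triton's actual API):

```python
import hashlib

def cache_key(kernel_source: str, torch_version: str) -> str:
    # Hash the kernel source together with the torch version, so the key
    # (and hence the compiled .so filename) changes whenever torch changes.
    h = hashlib.sha256()
    h.update(kernel_source.encode())
    h.update(torch_version.encode())
    return h.hexdigest()

# Same source, different torch version -> different key -> forced recompile.
assert cache_key("kernel src", "1.7.0") != cache_key("kernel src", "1.8.0")
assert cache_key("kernel src", "1.7.0") == cache_key("kernel src", "1.7.0")
```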
We found that if we change the num_warps on this line to any value besides 16, the numerics become incorrect:
https://github.com/ptillet/triton/pull/53/files#diff-e5b8f91da04da23cb0d29b0d334fb41a185033e4974531de7494cfa38dd97880R58
My understanding is that num_warps should not affect the numerics at all, but it could be that we have a bug in our code that violates a constraint needed for that guarantee!
(we are running on an A100 GPU)
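For context: num_warps typically changes how partial results are grouped and reduced, and floating-point addition is not associative, so small last-bit differences between configurations can occur even in a correct kernel. Errors well beyond rounding noise, as described above, would indeed point to a bug. A pure-Python illustration of order-dependent summation:

```python
# Floating-point addition is not associative: grouping the same terms
# differently (as a different warp partitioning would) changes the result.
vals = [1e16, 1.0, -1e16, 1.0]

order_a = sum(vals)                                  # the first 1.0 is absorbed by 1e16
order_b = sum([vals[0], vals[2], vals[1], vals[3]])  # cancel 1e16 first

assert order_a == 1.0
assert order_b == 2.0
```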
Hi all,
I tried to install the versions 0.3.0 and 0.2.0 and got some build errors. Any ideas what might be going on? The error points me to a CMake log, but that log doesn't exist on my computer.
For what it's worth, I have CUDA 11.1 and Python 3.9.1.
Thanks!
-- Configuring incomplete, errors occurred!
See also "/tmp/pip-req-build-ll5zoboi/build/temp.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-ll5zoboi/setup.py", line 106, in <module>
setup(
File "/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/setuptools/__init__.py", line 163, in setup
return distutils.core.setup(**attrs)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/command/install.py", line 546, in run
self.run_command('build')
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-req-build-ll5zoboi/setup.py", line 55, in run
self.build_extension(ext)
File "/tmp/pip-req-build-ll5zoboi/setup.py", line 93, in build_extension
subprocess.check_call(['cmake', sourcedir] + cmake_args, cwd=self.build_temp, env=env)
File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-req-build-ll5zoboi/src', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-req-build-ll5zoboi/build/lib.linux-x86_64-3.9/triton/_C', '-DBUILD_TESTS=OFF', '-DBUILD_PYTHON_MODULE=ON', '-DPYTHON_INCLUDE_DIRS=/home/fishy/.pyenv/versions/3.9.1/include/python3.9;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/torch/csrc/api/include;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/TH;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/THC;/usr/local/cuda-11.1/include', '-DPYTHON_LINK_DIRS=/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/lib;/usr/local/cuda-11.1/lib64', '-DTORCH_LIBRARIES=c10;c10_cuda;torch;torch_cuda;torch_cpu;torch_python;triton', '-DLLVM_CONFIG=/usr/bin/llvm-config', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
----------------------------------------
ERROR: Command errored out with exit status 1: /home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ll5zoboi/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ll5zoboi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8fxaw_pr/install-record.txt --single-version-externally-managed --compile --install-headers /home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/include/site/python3.9/triton
Check the logs for full command output.
The latest isaac code triggers many test failures with caffe's opencl branch. The good commit is:
Templates/Reduce1D: now properly loading 2D scalars commit 6ac5e1f
Since that commit, both "General: Internal code generator overhaul" and "JIT: No longer using fallbacks for stride[0] > 1" introduce regressions.
It's easy to build Caffe's opencl branch as below:
Then you will see many new failures with the above two commits.
BTW, it's better to use the latest beignet driver as the OCL compiler. The good commit works great with beignet.
@ptillet Could you look at this issue? Thanks.
Hi,
I built triton on a 3090, and find that I cannot execute the generated binaries:
root@2841f7a92187:/data/triton/build/tests/bench# ./bench_conv
terminate called after throwing an instance of 'triton::driver::exception::cuda::illegal_address'
what(): CUDA: Error- illegal address
Aborted (core dumped)
root@2841f7a92187:/data/triton/build/tests/bench# ./bench_copy
// {16777216}, {0}, {0}
terminate called after throwing an instance of 'triton::driver::exception::cuda::launch_out_of_resources'
what(): CUDA: Error- launch out of resources
Aborted (core dumped)
So the current 3090/A100 are not supported?
Thx,
Lei
Hello,
I am running Isaac on an AMD Fiji Nano GPU, using ROCm 1.6
https://github.com/RadeonOpenCompute/ROCm
A memory error is being detected for an sgemm problem with:
tA = no
tB = no
colMaj = yes
m = 35
n = 8457
k = 4096
lda = 35
ldb = 4096
ldc = 35
This corresponds to one of the problems in the DeepBench suite.
The error reported by the runtime is
Memory access fault by GPU node-1 on address 0x901e22000. Reason: Page not present or supervisor privilege.
I'm getting the same error on one of the DeepBench problems in Isaac's bench-blas.
AMD's previous Catalyst drivers just let these memory errors through.
I don't think this is a compiler issue, since the problem persists with
export AMD_OCL_BUILD_OPTIONS_APPEND="-cl-opt-disable"
but I'm not certain.
Thanks.
When I develop a new operator, writing the backward pass always takes a lot of time. If Triton could support autodiff, it would be very nice!
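Until such support exists, backward passes have to be derived and written by hand. Purely as an illustration of what reverse-mode autodiff computes (a toy scalar sketch, not Triton code):

```python
class Var:
    """Toy reverse-mode autodiff: each Var records its inputs and the local
    derivative with respect to each; backward() accumulates gradients by
    walking these edges (naive recursion, fine for small examples)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
x, y = Var(3.0), Var(4.0)
f = x * y + x
f.backward()
assert f.value == 15.0
assert (x.grad, y.grad) == (5.0, 3.0)
```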
Hi,
Is there a test script to generate the data comparing cublas/triton/tvm, as your paper states?
What I care about is whether Triton can provide some benefit that TensorRT cannot, like more efficient conv kernels,
or auto-generating kernels for memory movement like nn.shuffle in pytorch.
I think more examples would benefit this project.
Thx,
Lei
Hi,
I have installed ISAAC and clCaffe (https://github.com/01org/caffe/wiki/clCaffe) with beignet 1.1.1.2,
but it throws errors from ISAAC when forwarding AlexNet.
./tools/caffe time -model ../models/bvlc_alexnet/deploy.prototxt -gpu 0
ISAAC: unknow Intel CPU ID: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz
ISAAC: use SKYLAKE by default.
Device : Intel(R) HD Graphics Bay Trail-T
Build Status = -2
Build Log = stringInput.cl:5827:1: warning: implicit declaration of function 'intel_sub_group_block_read8' is invalid in C99
stringInput.cl:5812:39: note: expanded from macro 'GEMM_NN'
stringInput.cl:5674:50: note: expanded from macro 'SUBGROUP_BLOCK_READ8'
stringInput.cl:5827:1: error: call to 'as_float8' is ambiguous
stringInput.cl:5812:28: note: expanded from macro 'GEMM_NN'
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:492:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:502:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:512:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:522:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:533:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:544:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//include/ocl_as.h:555:21: note: candidate function
/usr/lib/x86_64-linux-gnu/beignet//
terminate called after throwing an instance of 'isaac::exception::ocl::invalid_program_executable'
what(): OpenCL: Error- invalid program executable
*** Aborted at 1516592801 (unix time) try "date -d @1516592801" if you are using GNU date ***
PC: @ 0x7f5f7765a428 gsignal
*** SIGABRT (@0x3e800004d42) received by PID 19778 (TID 0x7f5f79919ac0) from PID 19778; stack trace: ***
@ 0x7f5f7765a4b0 (unknown)
@ 0x7f5f7765a428 gsignal
@ 0x7f5f7765c02a abort
@ 0x7f5f77c9484d __gnu_cxx::__verbose_terminate_handler()
@ 0x7f5f77c926b6 (unknown)
@ 0x7f5f77c92701 std::terminate()
@ 0x7f5f77c92919 __cxa_throw
@ 0x7f5f755f05de isaac::driver::check()
@ 0x7f5f755f317d isaac::driver::Kernel::Kernel()
@ 0x7f5f756285cb isaac::templates::intelblas_gemm_image::enqueue()
@ 0x7f5f756b3098 isaac::runtime::profiles::value_type::execute()
@ 0x7f5f756be949 isaac::runtime::execute()
@ 0x7f5f75699e6e execute
@ 0x7f5f756aaa86 clblasSgemm
@ 0x7f5f7910f2e5 caffe::greentea_gpu_gemm<>()
@ 0x7f5f78ee4efa caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
@ 0x7f5f78feacaa caffe::ConvolutionLayerSpatial<>::Forward_gpu()
@ 0x7f5f79195cd1 caffe::Net<>::ForwardFromTo()
@ 0x7f5f79195dc7 caffe::Net<>::Forward()
@ 0x5596234ec338 time()
@ 0x5596234e6652 main
@ 0x7f5f77645830 __libc_start_main
@ 0x5596234e70f9 _start
@ 0x0 (unknown)
Aborted (core dumped)
Here are the steps to install beignet:
sudo apt-get install beignet
The attachment is my clinfo.
beignet_clinfo.txt
Any suggestion?
thank you
https://github.com/ptillet/triton/blob/triton-ops/python/triton/ops/matmul.py
https://github.com/ptillet/triton/blob/triton-ops/python/examples/tutorials/matmul.py
https://github.com/ptillet/triton/blob/triton-ops/tests/common/src/dot.h
I think these are the same operator definitions, right? If they are, why are there multiple copies of the same operator definition?
Hi,
Apologies for this kind of question, but I can't get the first matmul example (from here) to work.
Here is a full compilable example:
import torch
import triton
class _dot(torch.autograd.Function):
    src = """
__global__ void dot(float *A, float *B, float *C, int M, int N, int K,
                    int lda __multipleof(8), int ldb __multipleof(8), int ldc __multipleof(8)) {
  int pm = get_program_id(0);
  int pn = get_program_id(1);
  // ranges
  int rm[TM] = pm * TM + 0 ... TM;
  int rn[TN] = pn * TN + 0 ... TN;
  int rk[TK] = 0 ... TK;
  // accumulator
  float c[TM, TN] = 0;
  // pointers
  float* pa[TM, TK] = A + rk[newaxis, :] * 1 + rm[:, newaxis] * lda;
  float* pb[TK, TN] = B + rk[:, newaxis] * ldb + rn[newaxis, :] * 1;
  for(int k=K; k>0; k-=TK) {
    float a[TM, TK] = *pa;
    float b[TK, TN] = *pb;
    c += dot(a,b);
    pa = pa + TK * 1;
    pb = pb + TK * ldb;
  }
  float* pc[TM,TN] = C + rn[newaxis, :] + rm[:,newaxis] * ldc;
  *pc = c;
}
"""

    @staticmethod
    def forward(ctx, a, b):
        c = _dot._call(a, b)
        return c

    @staticmethod
    def _call(a, b):
        M, K = a.shape
        K, N = b.shape
        lda = M
        ldb = K
        ldc = M
        dtype = a.dtype
        c = triton.empty([M, N], dtype=dtype)
        grid = lambda opt: [triton.cdiv(M, opt.d('TM')), triton.cdiv(N, opt.d('TN'))]
        defines = {
            'TYPE': dtype,
            'TM': [32, 64, 128],
            'TN': [32, 64, 128],
            'TK': [8],
        }
        _dot.kernel = triton.kernel(_dot.src, defines=defines)
        _dot.kernel(a, b, c, M, N, K, lda, ldb, ldc,
                    grid=grid, num_warps=4, defines=defines)
        return c

dot = _dot.apply

torch.manual_seed(0)
M, N, K = 128, 512, 256
a = torch.rand((M, K)).cuda()
b = torch.rand((K, N)).cuda()
zc = torch.matmul(a, b)
zc_ = dot(a, b)
print(zc)
print(zc_)
print(zc == zc_)
dot = _dot.apply
torch.manual_seed(0)
M, N, K = 128, 512, 256
a = torch.rand((M, K)).cuda()
b = torch.rand((K, N)).cuda()
zc = torch.matmul(a,b)
zc_ = dot(a,b)
print(zc)
print(zc_)
print(zc == zc_)
The use of dot throws an error:
(null):50:22: error: tile with more than one element cannot be casted to scalar
c += dot(a,b);
^
But a @ b also seems wrong.
I think I'm possibly misunderstanding the use of leading dimensions and/or missing something simple ! Any help much appreciated, but since this isn't really an issue feel free to close if you don't have time at the moment.
Thanks,
Jack
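One thing worth double-checking in the example above: the kernel addresses element (m, k) of A at offset m*lda + k, i.e. it treats lda as the row stride, which for a contiguous row-major torch tensor of shape (M, K) would be K rather than the M used in _call. A pure-Python sketch of that indexing (sizes are made up):

```python
# Row-major M x K matrix: element (m, k) sits at flat offset m*lda + k,
# with lda = K (the number of elements skipped to move down one row).
M, K = 4, 6                                   # hypothetical sizes
a = [[10 * m + k for k in range(K)] for m in range(M)]
flat = [v for row in a for v in row]          # contiguous row-major storage

lda = K
for m in range(M):
    for k in range(K):
        assert flat[m * lda + k] == a[m][k]
```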
Hi, thanks for the open-source contribution. I see there is a block-sparse implementation in the tests.
Is this the block-sparse attention mechanism specific to transformers, or is it general SpMM?
Thanks
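Whatever the intended scope of those tests, a block-sparse × dense product (SpMM) driven by a boolean block layout can be sketched in pure Python as follows (block size, layout, and values are made up for illustration):

```python
# M x K block-sparse operand: nonzeros come in B x B dense blocks, and a
# (M//B) x (K//B) boolean layout says which blocks are present.
B = 2
layout = [[1, 0],
          [0, 1]]
blocks = {(0, 0): [[1, 2], [3, 4]],   # dense data for each present block
          (1, 1): [[5, 6], [7, 8]]}
dense = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # 4x4 identity

M = K = N = 4
out = [[0] * N for _ in range(M)]
for bi, row in enumerate(layout):
    for bj, present in enumerate(row):
        if not present:
            continue                  # skip absent blocks entirely
        blk = blocks[(bi, bj)]
        for i in range(B):
            for j in range(N):
                for k in range(B):
                    out[bi * B + i][j] += blk[i][k] * dense[bj * B + k][j]

# Multiplying by the identity just densifies the sparse operand.
assert out[0] == [1, 2, 0, 0]
assert out[3] == [0, 0, 7, 8]
```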
If I just copy and paste the program in the tutorial: https://docs.triton-lang.org/tutorials/custom-operation.html
$ python custom-op.py
Traceback (most recent call last):
File "custom-op.py", line 4, in <module>
class _add(triton.Function):
AttributeError: module 'triton' has no attribute 'Function'
Changing:
class _add(triton.Function):
to
class _add(torch.autograd.Function):
Seems to work (print(diff) outputs tensor(0., device='cuda:0')). Is this correct? (P.S. sorry for opening multiple issues in one day ;) )
Hello,
I would like to use Isaac in OpenCL. After installing isaac and linking to libisaac.so, there is a link failure
undefined reference to `clblasSgemm'
From README
Link against libisaac.so instead of libcublas.so or libclblas.so, and you're good to go!
but grepping through the source I can't find implementations of the clBLAS.h functions. Isaac's GEMM executes on my machine via the example bench/blas.cpp, but that example doesn't make clear how an end user is supposed to call Isaac's GEMM (there are calls to a function dot (?)).
So my question is: can I just link to libisaac.so and get sgemm working, or do I need to use a different API?
Thanks.
Sort of a continuation of this issue: https://github.com/ptillet/triton/issues/99
For a while the fix above worked fine and I was able to use triton version 0.4.0, but recently I haven't been able to install it.
I still see it listed as a valid version on PyPI, but when I try to install it with pip I get this error:
>>> pip3 install triton==0.4.0
ERROR: Could not find a version that satisfies the requirement triton==0.4.0 (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0, 0.4.1, 1.0.0.dev20210502, 1.0.0.dev20210506, 1.0.0.dev20210508, 1.0.0.dev20210509, 1.0.0.dev20210510, 1.0.0.dev20210515, 1.0.0.dev20210516, 1.0.0.dev20210520, 1.0.0.dev20210521, 1.0.0.dev20210525, 1.0.0.dev20210601)
ERROR: No matching distribution found for triton==0.4.0
I also can't install the version that's listed, 0.4.1, getting the following error
>>> pip3 install triton==0.4.1
Collecting triton==0.4.1
Downloading triton-0.4.1-cp37-cp37m-manylinux2010_x86_64.whl (14.9 MB)
|████████████████████████████████| 14.9 MB 13.1 MB/s
WARNING: Discarding https://files.pythonhosted.org/packages/74/00/990e44a74bba0735f55780ede9c717271887e7b48e4cedb1ab72086af169/triton-0.4.1-cp37-cp37m-manylinux2010_x86_64.whl#sha256=ced9481babfc33f4abfcdc983be20e12085dd396030fa147074b6c7202df78c0 (from https://pypi.org/simple/triton/). Requested triton==0.4.1 from https://files.pythonhosted.org/packages/74/00/990e44a74bba0735f55780ede9c717271887e7b48e4cedb1ab72086af169/triton-0.4.1-cp37-cp37m-manylinux2010_x86_64.whl#sha256=ced9481babfc33f4abfcdc983be20e12085dd396030fa147074b6c7202df78c0 has inconsistent version: filename has '0.4.1', but metadata has '0.4.0'
ERROR: Could not find a version that satisfies the requirement triton==0.4.1 (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0, 0.4.1, 1.0.0.dev20210502, 1.0.0.dev20210506, 1.0.0.dev20210508, 1.0.0.dev20210509, 1.0.0.dev20210510, 1.0.0.dev20210515, 1.0.0.dev20210516, 1.0.0.dev20210520, 1.0.0.dev20210521, 1.0.0.dev20210525, 1.0.0.dev20210601)
ERROR: No matching distribution found for triton==0.4.1
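As a possible workaround (untested), the error message itself lists the versions pip can resolve, so installing one of the 1.0.0 dev builds it names may unblock you:

```shell
# Pick a version pip actually lists as available,
# e.g. the most recent dev build from the error message above.
pip3 install triton==1.0.0.dev20210601
```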
Hi @ptillet ,
I want to add kernels to the isaac library, but have some doubts about the implementation of the tuning mechanism:
1) There are some JSON files in the database folder; how do you generate them, and how can I modify them to fit new kernels?
2) What is the mechanism behind the predictor, and what is the relationship among expression_tree, random forest, feature, threshold, and values? How can I use this mechanism?
Looking forward to your reply, thx!