Comments (10)
Unfortunately, I've been unable to reproduce this issue so far. I've just pushed a few shot-in-the-dark changes that'll hopefully help narrow things down. If you have the time, it'd be very helpful if you could try the following things:
- In `src/cpp_api.cu`, try removing the three occurrences of `SyncedMultiStream synced_stream{stream, 2};` and replacing all occurrences of `synced_stream.get(1)` with `stream`.
- Just to clarify: do I understand correctly that your previous answer means `TCNN_MIN_GPU_ARCH=70` worked for `mlp_learning_an_image`, or was that with default architecture settings?
- Does `config_oneblob.json`, as opposed to the default `config_hash.json`, work?
  `py .\samples\mlp_learning_an_image_pytorch.py .\data\images\albert.jpg .\data\config_oneblob.json`
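For the first bullet, the mechanical shape of the suggested edit can be sketched like this — on a stand-in snippet, not the real `src/cpp_api.cu` (the actual occurrences may differ slightly):

```shell
# Demo of the substitution on a fabricated two-line stand-in file.
cat > /tmp/demo.cu <<'EOF'
SyncedMultiStream synced_stream{stream, 2};
kernel<<<1, 1, 0, synced_stream.get(1)>>>();
EOF

# Delete the SyncedMultiStream declaration; route work onto `stream` directly.
sed -i '/SyncedMultiStream synced_stream{stream, 2};/d; s/synced_stream\.get(1)/stream/g' /tmp/demo.cu

cat /tmp/demo.cu   # prints: kernel<<<1, 1, 0, stream>>>();
```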
Thanks again!
from tiny-cuda-nn.
I have upgraded my CUDA version to 11.3 and then met the same issue as in #48 when running `python setup.py install`. I referred to the solution in #49, modified `setup.py` to the latest version (fb8f845), and tinycudann was finally installed successfully. `mlp_learning_an_image_pytorch` seems to be running fine now. Many thanks!
Hi,
I had the same issue, although I had to use `CUDA_LAUNCH_BLOCKING=1` to see the error. It points to this check.
My environment:
- Ubuntu 18.04 in Docker container, Ubuntu 20.04.3 on host
- CUDA 11.5
- Drivers version: 495.29.05
- GPU: Tesla T4
- Python 3.9
- PyTorch 1.9.1+cu111
- I installed `tinycudann` through `python3.9 setup.py bdist_wheel` with the latest `master` commit (e1ccb40)
- Command used: `CUDA_LAUNCH_BLOCKING=1 gdb --args python3.9 samples/mlp_learning_an_image_pytorch.py data/images/albert.jpg data/config_hash.json 1000`
Here's a full backtrace (code compiled in debug here):
================================================================
This script replicates the behavior of the native CUDA example
mlp_learning_an_image.cu using tiny-cuda-nn's PyTorch extension.
This extension >> runs ~2x slower than native << as of now.
================================================================
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
NetworkWithInputEncoding(n_input_dims=2, n_output_dims=1, seed=1337, dtype=torch.float16, hyperparams={'encoding': {'base_resolution': 16, 'interpolation': 'Linear', 'log2_hashmap_size': 15, 'n_features_per_level': 2, 'n_levels': 16, 'otype': 'Grid', 'per_level_scale': 1.5, 'type': 'Hash'}, 'network': {'activation': 'ReLU', 'n_hidden_layers': 2, 'n_neurons': 64, 'otype': 'CutlassMLP', 'output_activation': 'None'}, 'otype': 'NetworkWithInputEncoding'})
Writing 'reference.jpg'... done.
Beginning optimization with 1000 training steps.
/mnt/tiny_cuda_nn/samples/mlp_learning_an_image_pytorch.py:68: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
xs = xs * torch.tensor([shape[1], shape[0]], device=xs.device).float()
Got cutlass error: Error Internal at: 363
Thread 1 "python3.9" hit Breakpoint 2, __GI_exit (status=1) at exit.c:139
139 exit.c: No such file or directory.
(gdb) bt
#0 __GI_exit (status=1) at exit.c:139
#1 0x00007f19f9d67db4 in tcnn::fc_multiply_impl<cutlass::gemm::device::Gemm<cutlass::half_t, cutlass::layout::RowMajor, cutlass::half_t, cutlass::layout::RowMajor, cutlass::half_t, cutlass::layout::ColumnMajor, cutlass::half_t, cutlass::arch::OpClassTensorOp, cutlass::arch::Sm70, cutlass::gemm::GemmShape<128, 128, 32, false>, cutlass::gemm::GemmShape<64, 64, 32, false>, cutlass::gemm::GemmShape<8, 8, 4, false>, tcnn::ActivationEpilogue<cutlass::half_t, 8, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2>, cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>, 2, 8, 8, false, cutlass::arch::OpMultiplyAdd, false> > (stream=0xb2131300, args=...) at /mnt/tiny_cuda_nn/include/tiny-cuda-nn/cutlass_matmul.h:363
#2 0x00007f19f9d5c24d in tcnn::fc_multiply<tcnn::LayerConfig<cutlass::gemm::GemmShape<128, 128, 32, false>, cutlass::gemm::GemmShape<64, 64, 32, false> >, __half, (tcnn::MatrixLayout)0, __half, (tcnn::MatrixLayout)0, __half, (tcnn::MatrixLayout)1, __half, (tcnn::MatrixLayout)1> (stream=0xb2131300, A=..., B=..., C=..., D=..., act=tcnn::Activation::ReLU, transfer=false, sum_source=false) at /mnt/tiny_cuda_nn/include/tiny-cuda-nn/cutlass_matmul.h:444
#3 0x00007f19f9d566f6 in tcnn::fc_multiply<tcnn::LayerConfig<cutlass::gemm::GemmShape<128, 128, 32, false>, cutlass::gemm::GemmShape<64, 64, 32, false> >, __half, (tcnn::MatrixLayout)0, __half, __half, __half> (stream=0xb2131300, A=..., B=..., C=..., D=..., act=tcnn::Activation::ReLU, transfer=false, sum_source=false) at /mnt/tiny_cuda_nn/include/tiny-cuda-nn/cutlass_matmul.h:476
#4 0x00007f19f9d51f17 in tcnn::fc_multiply<tcnn::LayerConfig<cutlass::gemm::GemmShape<128, 128, 32, false>, cutlass::gemm::GemmShape<64, 64, 32, false> >, __half, (tcnn::MatrixLayout)0, __half, __half> (stream=0xb2131300, A=..., B=..., D=..., act=tcnn::Activation::ReLU) at /mnt/tiny_cuda_nn/include/tiny-cuda-nn/cutlass_matmul.h:482
#5 0x00007f19f9d4d5d2 in tcnn::compute_layer<tcnn::LayerConfig<cutlass::gemm::GemmShape<128, 128, 32, false>, cutlass::gemm::GemmShape<64, 64, 32, false> >, __half> (stream=0xb2131300, is_inference=false, activation=tcnn::Activation::ReLU, weights=..., input=..., output=..., activation_output=...) at /mnt/tiny_cuda_nn/src/cutlass_mlp.cu:146
#6 0x00007f19f9d49bc9 in tcnn::CutlassMLP<__half>::forward (this=0xb0216020, stream=0xb2131300, input=..., output=0x7ffe92a07700, use_inference_matrices=false, prepare_input_gradients=false) at /mnt/tiny_cuda_nn/src/cutlass_mlp.cu:241
#7 0x00007f19f9d250f0 in tcnn::NetworkWithInputEncoding<__half>::forward (this=0x5d38290, stream=0xb2131300, input=..., output=0x7ffe92a07700, use_inference_matrices=false, prepare_input_gradients=false) at /mnt/tiny_cuda_nn/include/tiny-cuda-nn/network_with_input_encoding.h:99
#8 0x00007f19f9d22333 in tcnn::cpp::NetworkWithInputEncoding::forward (this=0x5d38250, stream=0x0, n_elements=65536, input=0x7f19ef407800, output=0x7f190ec00000, params=0x7f19ee2ba600, prepare_input_gradients=false) at /mnt/tiny_cuda_nn/src/cpp_api.cu:83
#9 0x00007f19f9e7185f in Module::fwd (this=0x5bb26b0, input=..., params=...) at /mnt/tiny_cuda_nn/bindings/torch/tinycudann/bindings.cpp:99
#10 0x00007f19f9e8ac9d in pybind11::cpp_function::cpp_function<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}::operator()(Module*, at::Tensor, at::Tensor) const (this=0x53c0fd8, c=0x5bb26b0, args#0=..., args#1=...)
at /usr/local/lib/python3.9/dist-packages/torch/include/pybind11/pybind11.h:84
#11 0x00007f19f9ea74ba in pybind11::detail::argument_loader<Module*, at::Tensor, at::Tensor>::call_impl<std::tuple<tcnn::cpp::Context, at::Tensor>, pybind11::cpp_function::cpp_function<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}&, 0ul, 1ul, 2ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, pybind11::detail::void_type&&) && (this=0x7ffe92a079d0, f=...) at /usr/local/lib/python3.9/dist-packages/torch/include/pybind11/cast.h:2042
#12 0x00007f19f9ea084f in pybind11::detail::argument_loader<Module*, at::Tensor, at::Tensor>::call<std::tuple<tcnn::cpp::Context, at::Tensor>, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}&>(pybind11::cpp_function::cpp_function<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}&) && (this=0x7ffe92a079d0, f=...) at /usr/local/lib/python3.9/dist-packages/torch/include/pybind11/cast.h:2014
#13 0x00007f19f9e945b2 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}, std::tuple<tcnn::cpp::Context, at::Tensor>, Module*, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}&&, std::tuple<tcnn::cpp::Context, at::Tensor> (*)(Module*, at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (this=0x0, call=...) at /usr/local/lib/python3.9/dist-packages/torch/include/pybind11/pybind11.h:192
#14 0x00007f19f9e94864 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}, std::tuple<tcnn::cpp::Context, at::Tensor>, Module*, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<std::tuple<tcnn::cpp::Context, at::Tensor>, Module, at::Tensor, at::Tensor, pybind11::name, pybind11::is_method, pybind11::sibling>(std::tuple<tcnn::cpp::Context, at::Tensor> (Module::*)(at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(Module*, at::Tensor, at::Tensor)#1}&&, std::tuple<tcnn::cpp::Context, at::Tensor> (*)(Module*, at::Tensor, at::Tensor), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () at /usr/local/lib/python3.9/dist-packages/torch/include/pybind11/pybind11.h:170
#15 0x00007f19f9e699ad in pybind11::cpp_function::dispatcher (self=0x7f19ffa24a80, args_in=0x7f19f068aa40, kwargs_in=0x0) at /usr/local/lib/python3.9/dist-packages/torch/include/pybind11/pybind11.h:767
#16 0x00000000005ef5c6 in ?? ()
#17 0x00000000005110d9 in _PyObject_MakeTpCall ()
#18 0x00000000005e7f5b in ?? ()
#19 0x0000000000573b7c in _PyEval_EvalFrameDefault ()
#20 0x00000000005117d4 in _PyFunction_Vectorcall ()
#21 0x00007f1af1b7416b in THPFunction_apply(_object*, _object*) () from /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so
#22 0x00000000005ef5e8 in ?? ()
#23 0x00000000005110d9 in _PyObject_MakeTpCall ()
#24 0x0000000000574af2 in _PyEval_EvalFrameDefault ()
#25 0x00000000005117d4 in _PyFunction_Vectorcall ()
#26 0x00000000005e7ff8 in ?? ()
#27 0x0000000000423f31 in ?? ()
#28 0x00000000005717ef in _PyEval_EvalFrameDefault ()
#29 0x000000000056ee02 in ?? ()
#30 0x0000000000512269 in _PyObject_Call_Prepend ()
#31 0x000000000054f353 in ?? ()
#32 0x00000000005110d9 in _PyObject_MakeTpCall ()
#33 0x0000000000573aa6 in _PyEval_EvalFrameDefault ()
#34 0x000000000056ee02 in ?? ()
#35 0x000000000056eb2b in _PyEval_EvalCodeWithName ()
#36 0x000000000056ead3 in PyEval_EvalCode ()
#37 0x0000000000644f79 in ?? ()
#38 0x0000000000644821 in ?? ()
#39 0x0000000000470edb in ?? ()
#40 0x000000000064458c in PyRun_SimpleFileExFlags ()
#41 0x0000000000613a0a in Py_RunMain ()
#42 0x00000000006135a9 in Py_BytesMain ()
#43 0x00007f1b0a030bf7 in __libc_start_main (main=0x50d0b0, argc=5, argv=0x7ffe92a08cf8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe92a08ce8) at ../csu/libc-start.c:310
#44 0x00000000006134ba in _start ()
Thanks for reporting this!
To help narrow this down: does the native mlp_learning_an_image
sample run correctly?
Also: which specific GPU(s) are you using?
@Tom94: thanks for the quick feedback, and for the entire project! `mlp_learning_an_image` seems to be running fine, and I'm on a Tesla T4. I'll try to see if I can reproduce it on other GPUs.
EDIT: In my case I had to play a bit with the `TCNN_MIN_GPU_ARCH` variable to have a single wheel that I can run on different GPUs (`TCNN_MIN_GPU_ARCH=70` here), but since @OctoberKat seems to be using the default `setup.py`, that may not be the cause of this.
Some more tests seem to indicate that this happens with `TCNN_MIN_GPU_ARCH=70` but not `TCNN_MIN_GPU_ARCH=75`. @OctoberKat's GeForce GTX 1080 Ti has a CC of 6.1 if I'm not mistaken, so maybe the problem comes from there? Does `cutlass` have a minimum supported CC? Or maybe the older architectures are not as well tested with recent `cutlass` releases? 🤷♂️
@Tom94 you might be able to reproduce the error if you force `TCNN_MIN_GPU_ARCH` in `setup.py`.
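One way to force this when building the PyTorch extension — a sketch, assuming the `TCNN_CUDA_ARCHITECTURES` environment variable that the repository's README documents (from which `TCNN_MIN_GPU_ARCH` is derived; verify against your checkout):

```shell
# Hypothetical reproduction recipe: pin the compiled architecture to 70 so
# TCNN_MIN_GPU_ARCH ends up at 70 even when building on a newer GPU.
cd bindings/torch
TCNN_CUDA_ARCHITECTURES=70 python3.9 setup.py bdist_wheel
```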
I also tried to reproduce this with the CMake build and `TCNN_CUDA_ARCHITECTURES="70;75"` (so that `TCNN_MIN_GPU_ARCH=70`), but I could not reproduce the error.
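For reference, the CMake reproduction attempt described above can be sketched as follows (exact paths and build type are assumptions, not taken from the report):

```shell
# Configure with both architectures so the minimum compiled arch is 70,
# then build the native samples.
cmake . -B build -DTCNN_CUDA_ARCHITECTURES="70;75"
cmake --build build --config RelWithDebInfo -j
```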
- This does not seem to make a difference, even with the latest commits.
- Yes, `TCNN_MIN_GPU_ARCH=70` worked with `mlp_learning_an_image` on a CC 7.5 GPU.
- Same error with `config_oneblob.json`.
But since everything worked with the CMake build, I went back to my modified `setup.py` that allows me to generate debug builds. For debug builds I use `-G` for `nvcc`, but apparently it can cause miscompilations with `cutlass`, and that seems to be the cause of my problems: if I add it to the `nvcc` flags in the CMake build, I can reproduce the problem. With the current master branch (f32361e) and no `-G`, I'm no longer seeing the error with the PyTorch build. I was really not expecting that to be a source of problems!
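If some source-level correlation is still wanted for debugging, a commonly suggested middle ground is `nvcc`'s `-lineinfo` flag, which embeds line mapping without `-G`'s unoptimized device-debug code generation. A sketch (the file name is just illustrative):

```shell
# -G        : full device-side debug info, disables device optimizations --
#             the flag implicated in the cutlass miscompilation above.
# -lineinfo : keeps source/line correlation for profilers and backtraces
#             while leaving device code optimized.
nvcc -O2 -lineinfo -c src/cutlass_mlp.cu -o cutlass_mlp.o
```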
Still, the original bug report uses the original `setup.py` without all my shenanigans, so there is probably another problem. @OctoberKat can run the various tests you suggested to see if that leads to different results.
I'll continue to run some tests today, and get back to you if anything strange happens again. Thanks again for the help and the fast feedback, greatly appreciated!
Likewise, thank you very much for helping out! Unfortunately, I can't think of anything beyond what you've found based on your latest results...
Knowing about `-G` being problematic is good to say the least (and, frankly, shocking).
@OctoberKat one more idea that could help with your setup would be upgrading CUDA (I see you have 11.0) to a more recent version (such as 11.6).
Ok, I will give it a try. Thank you, everyone.