Hi: The reported "CUDA kernel failed : invalid device function void svo

CUDA kernel failed : invalid device function (nsvf issue, closed)

facebookresearch commented on June 28, 2024

CUDA kernel failed : invalid device function

from nsvf.

Comments (20)

MultiPath commented on June 28, 2024

It seems the code was not compiled successfully. Could you post your system details, CUDA version, PyTorch version, etc.?

NNNNAI commented on June 28, 2024

Thanks for your reply! The configuration is shown below:
Driver Version: 440.44 CUDA Version: 10.2 PyTorch 1.4.0 GPU : RTX2080TI 12G

(NSVF) gdp@gdp:~/harddisk/Data4/lny/NSVF$ CUDA_VISIBLE_DEVICES=2 python -u train.py /home/gdp/harddisk/Data4/lny/NSVFDATASET/Synthetic_NSVF/Robot \
    --user-dir fairnr \
    --task single_object_rendering \
    --train-views "0..100" --view-resolution "800x800" \
    --max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 \
    --no-preload \
    --sampling-on-mask 1.0 --no-sampling-at-reader \
    --valid-views "100..200" --valid-view-resolution "400x400" \
    --valid-view-per-batch 1 \
    --transparent-background "1.0,1.0,1.0" --background-stop-gradient \
    --arch nsvf_base \
    --initial-boundingbox /home/gdp/harddisk/Data4/lny/NSVFDATASET/Synthetic_NSVF/Robot/bbox.txt \
    --use-octree \
    --raymarching-stepsize-ratio 0.125 \
    --discrete-regularization \
    --color-weight 128.0 --alpha-weight 1.0 \
    --optimizer "adam" --adam-betas "(0.9, 0.999)" \
    --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
    --criterion "srn_loss" --clip-norm 0.0 \
    --num-workers 0 \
    --seed 2 \
    --save-interval-updates 500 --max-update 150000 \
    --virtual-epoch-steps 5000 --save-interval 1 \
    --half-voxel-size-at "5000,25000,75000" \
    --reduce-step-size-at "5000,25000,75000" \
    --pruning-every-steps 2500 \
    --keep-interval-updates 5 --keep-last-epochs 5 \
    --log-format simple --log-interval 1 \
    --save-dir checkpoints/robot \
    --tensorboard-logdir checkpoints/robot/tensorboard \
    | tee -a checkpoints/robot/train.log

MultiPath commented on June 28, 2024

I don't think PyTorch 1.4.0 was ever built against CUDA 10.2... The CUDA version used to compile this code should match the CUDA version PyTorch was built with. You can check the latter with python -c "import torch; print(torch.version.cuda)"

NNNNAI commented on June 28, 2024

Sorry for the mistake. The CUDA version is actually 10.1.

MultiPath commented on June 28, 2024

OK. Just in case, did you run python setup.py build_ext --inplace, and did everything complete without errors?

NNNNAI commented on June 28, 2024

(NSVF) gdp@gdp:~/harddisk/Data4/lny/NSVF$ python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.7/fairnr/clib/_ext.cpython-37m-x86_64-linux-gnu.so -> fairnr/clib

MultiPath commented on June 28, 2024

To force a recompile, I think you need to delete the build folder under NSVF first.
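
For anyone repeating this step, here is a minimal Python sketch of the cleanup. It assumes the layout seen in this thread: a build directory at the repository root and the compiled _ext*.so copied into fairnr/clib. The function name clean_extension_build is hypothetical and not part of NSVF, and removing the copied .so is an extra precaution that was not requested in the thread.

```python
import shutil
from pathlib import Path


def clean_extension_build(repo_root):
    """Remove stale build outputs so the next build_ext recompiles everything.

    Assumes the layout seen in this thread: <repo_root>/build and
    <repo_root>/fairnr/clib/_ext*.so. Returns the list of removed paths.
    """
    root = Path(repo_root)
    removed = []

    # The build folder holds the cached object files and the linked library.
    build_dir = root / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        removed.append(build_dir)

    # Extra precaution (an assumption, not from the thread): drop the
    # in-place copy so an old binary cannot be imported by accident.
    clib_dir = root / "fairnr" / "clib"
    if clib_dir.is_dir():
        for shared_object in sorted(clib_dir.glob("_ext*.so")):
            shared_object.unlink()
            removed.append(shared_object)

    return removed
```

After calling clean_extension_build(".") from the NSVF folder, rerun python setup.py build_ext --inplace; the output should then list the compile steps instead of only the copy step.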

NNNNAI commented on June 28, 2024

I got such a report after deleting the build folder under NSVF and running python setup.py build_ext --inplace.

running build_ext
building 'fairnr.clib._ext' extension
creating build
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/fairnr
creating build/temp.linux-x86_64-3.7/fairnr/clib
creating build/temp.linux-x86_64-3.7/fairnr/clib/src
gcc -pthread -B /home/gdp/.conda/envs/NSVF/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/binding.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/gdp/.conda/envs/NSVF/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/intersect.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/gdp/.conda/envs/NSVF/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/octree.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/usr/local/cuda-10.0/bin/nvcc -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/intersect_gpu.cu -o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -std=c++11
/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign

/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign

creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/fairnr
creating build/lib.linux-x86_64-3.7/fairnr/clib
g++ -pthread -shared -B /home/gdp/.conda/envs/NSVF/compiler_compat -L/home/gdp/.conda/envs/NSVF/lib -Wl,-rpath=/home/gdp/.conda/envs/NSVF/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o -L/usr/local/cuda-10.0/lib64 -lcudart -o build/lib.linux-x86_64-3.7/fairnr/clib/_ext.cpython-37m-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.7/fairnr/clib/_ext.cpython-37m-x86_64-linux-gnu.so -> fairnr/clib
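
A build log like the one above records which nvcc binary was invoked and which GPU targets it compiled for. The following is a minimal sketch for pulling those two facts out of a saved log. The function names summarize_build_log and covers_device are hypothetical helpers, not part of NSVF or PyTorch. The capability tuple would typically come from torch.cuda.get_device_capability(), which reports (7, 5) for an RTX 2080 Ti.

```python
import re


def summarize_build_log(log_text):
    """Extract the nvcc binaries and GPU targets mentioned in a build log."""
    # Paths such as /usr/local/cuda-10.0/bin/nvcc reveal the toolkit in use.
    nvcc_paths = sorted(set(re.findall(r"(\S*/bin/nvcc)\b", log_text)))
    # Flags such as -gencode=arch=compute_75,code=sm_75 list compiled targets.
    sm_targets = sorted(set(re.findall(r"code=sm_(\d+)", log_text)))
    return {"nvcc_paths": nvcc_paths, "sm_targets": sm_targets}


def covers_device(sm_targets, capability):
    """Return True if a compiled sm_XY target equals the capability (X, Y)."""
    major, minor = capability
    return f"{major}{minor}" in sm_targets
```

A matching target only shows that the kernels were built for the right architecture. It does not show that the toolkit version matches the one PyTorch was built with, so the nvcc path is worth reading as well.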

MultiPath commented on June 28, 2024

It looks OK. I just tried this data on my side and it did not show any errors. Could you try reducing --view-per-batch to 1 to see whether this is caused by running out of memory?

NNNNAI commented on June 28, 2024

I still get 'CUDA kernel failed : invalid device function' even after reducing --view-per-batch to 1, so it does not look like an OOM issue.

MultiPath commented on June 28, 2024

Do you have the full log file?

MultiPath commented on June 28, 2024

Also, could you try removing "--use-octree" and see whether the other method works?

NNNNAI commented on June 28, 2024

train.log
Here is the full log file. I also removed "--use-octree" and ran the code again, but I still got the same error.
The error output (after removing "--use-octree") is shown below:
2020-10-21 13:52:27 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2020-10-21 13:52:27 | INFO | fairnr_cli.train | training on 1 GPUs
2020-10-21 13:52:27 | INFO | fairnr_cli.train | max tokens per GPU = None and max sentences per GPU = 1
2020-10-21 13:52:27 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/robot/checkpoint_last.pt
2020-10-21 13:52:27 | INFO | fairseq.trainer | loading train data for epoch 1
2020-10-21 13:52:27 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
CUDA kernel failed : invalid device function
void aabb_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, int*, float*, float*) at L:371 in fairnr/clib/src/intersect_gpu.cu

MultiPath commented on June 28, 2024

I searched for your error a bit, and it seems to be a CUDA or GPU setup issue, although I don't know exactly what caused it.
What does nvcc --version show?

NNNNAI commented on June 28, 2024

It shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

MultiPath commented on June 28, 2024

So your nvcc CUDA is 10.0 instead of 10.1?

NNNNAI commented on June 28, 2024

Yes, nvcc --version indicates that CUDA is 10.0, but when I ran python -c "import torch; print(torch.version.cuda)", it printed 10.1.

MultiPath commented on June 28, 2024

python -c "import torch; print(torch.version.cuda)" reports the CUDA version your PyTorch was built with, which is 10.1. The CUDA toolkit installed on your machine (the one nvcc comes from) also needs to be 10.1, so that the extension is compiled with the same version as PyTorch.
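
The comparison described here can be sketched in Python as follows. This is a minimal sketch, not part of NSVF: the function names are hypothetical, and it assumes that the nvcc found on PATH is the one setup.py uses, as was the case in this thread. torch.version.cuda and nvcc --version are the same two sources queried above.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_major_minor(version_a, version_b):
    """Compare two version strings on their first two components only."""
    def key(version):
        return tuple(version.split(".")[:2])
    return key(version_a) == key(version_b)


def check_toolkit_against_torch():
    """Compare the nvcc on PATH with the CUDA version PyTorch was built with."""
    import torch  # imported lazily so the helpers above work without PyTorch

    output = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    nvcc_release = parse_nvcc_release(output)
    torch_cuda = torch.version.cuda  # None for CPU-only builds
    ok = (
        nvcc_release is not None
        and torch_cuda is not None
        and same_major_minor(nvcc_release, torch_cuda)
    )
    return nvcc_release, torch_cuda, ok
```

With the values reported in this thread, the nvcc release parses as 10.0 while PyTorch reports 10.1, so the check would come back false until the toolkit is changed and the extension rebuilt.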

NNNNAI commented on June 28, 2024

Sorry for the late reply. The problem was solved after I installed CUDA 10.1 on my machine. Thanks for your help!

MultiPath commented on June 28, 2024

Glad to see it was solved!
