Comments (20)
It seems that the code did not compile successfully. Could you post your OS, CUDA version, PyTorch version, etc.?
from nsvf.
Thanks for your reply! The configuration is shown below:
Driver version: 440.44, CUDA version: 10.2, PyTorch: 1.4.0, GPU: RTX 2080 Ti 12G
(NSVF) gdp@gdp:~/harddisk/Data4/lny/NSVF$ CUDA_VISIBLE_DEVICES=2 python -u train.py /home/gdp/harddisk/Data4/lny/NSVFDATASET/Synthetic_NSVF/Robot \
    --user-dir fairnr \
    --task single_object_rendering \
    --train-views "0..100" --view-resolution "800x800" \
    --max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 \
    --no-preload \
    --sampling-on-mask 1.0 --no-sampling-at-reader \
    --valid-views "100..200" --valid-view-resolution "400x400" \
    --valid-view-per-batch 1 \
    --transparent-background "1.0,1.0,1.0" --background-stop-gradient \
    --arch nsvf_base \
    --initial-boundingbox /home/gdp/harddisk/Data4/lny/NSVFDATASET/Synthetic_NSVF/Robot/bbox.txt \
    --use-octree \
    --raymarching-stepsize-ratio 0.125 \
    --discrete-regularization \
    --color-weight 128.0 --alpha-weight 1.0 \
    --optimizer "adam" --adam-betas "(0.9, 0.999)" \
    --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
    --criterion "srn_loss" --clip-norm 0.0 \
    --num-workers 0 \
    --seed 2 \
    --save-interval-updates 500 --max-update 150000 \
    --virtual-epoch-steps 5000 --save-interval 1 \
    --half-voxel-size-at "5000,25000,75000" \
    --reduce-step-size-at "5000,25000,75000" \
    --pruning-every-steps 2500 \
    --keep-interval-updates 5 --keep-last-epochs 5 \
    --log-format simple --log-interval 1 \
    --save-dir checkpoints/robot \
    --tensorboard-logdir checkpoints/robot/tensorboard \
    | tee -a checkpoints/robot/train.log
I don't think PyTorch 1.4.0 was ever compiled with CUDA 10.2... The CUDA version used to compile this code should match PyTorch's CUDA version. You can check it with python -c "import torch; print(torch.version.cuda)".
Sorry for the mistake. The CUDA version is actually 10.1.
OK. Just in case, did you run python setup.py build_ext --inplace, and did everything look OK?
(NSVF) gdp@gdp:~/harddisk/Data4/lny/NSVF$ python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.7/fairnr/clib/_ext.cpython-37m-x86_64-linux-gnu.so -> fairnr/clib
To try recompiling, I think you need to delete the build folder under NSVF first.
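A minimal sketch of that clean rebuild, assuming it is run from the NSVF checkout. The helper names `clean_build_artifacts` and `rebuild` are hypothetical, and removing the previously copied `_ext*.so` from fairnr/clib is an extra precaution not mentioned in this thread; only deleting the build folder was suggested.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def clean_build_artifacts(repo_root):
    """Delete build/ and any compiled _ext*.so under fairnr/clib.

    Returns the list of paths that were removed. Removing the .so is an
    assumption of this sketch; the thread only asks to delete build/.
    """
    root = Path(repo_root)
    removed = []
    build_dir = root / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        removed.append(build_dir)
    clib_dir = root / "fairnr" / "clib"
    if clib_dir.is_dir():
        for shared_object in clib_dir.glob("_ext*.so"):
            shared_object.unlink()
            removed.append(shared_object)
    return removed


def rebuild(repo_root):
    """Clean cached artifacts, then rerun the build command from the thread."""
    clean_build_artifacts(repo_root)
    subprocess.run(
        [sys.executable, "setup.py", "build_ext", "--inplace"],
        cwd=repo_root,
        check=True,  # raise if the compiler reports an error
    )
```

Calling `rebuild(".")` from the repository root should print the full gcc/nvcc command lines instead of only copying a cached .so file.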
I got the following output after deleting the build folder under NSVF and running python setup.py build_ext --inplace:
running build_ext
building 'fairnr.clib._ext' extension
creating build
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/fairnr
creating build/temp.linux-x86_64-3.7/fairnr/clib
creating build/temp.linux-x86_64-3.7/fairnr/clib/src
gcc -pthread -B /home/gdp/.conda/envs/NSVF/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/binding.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/gdp/.conda/envs/NSVF/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/intersect.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/gdp/.conda/envs/NSVF/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/octree.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/usr/local/cuda-10.0/bin/nvcc -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/TH -I/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/gdp/.conda/envs/NSVF/include/python3.7m -c fairnr/clib/src/intersect_gpu.cu -o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=sm_75 -std=c++11
/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/gdp/.conda/envs/NSVF/lib/python3.7/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/fairnr
creating build/lib.linux-x86_64-3.7/fairnr/clib
g++ -pthread -shared -B /home/gdp/.conda/envs/NSVF/compiler_compat -L/home/gdp/.conda/envs/NSVF/lib -Wl,-rpath=/home/gdp/.conda/envs/NSVF/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o -L/usr/local/cuda-10.0/lib64 -lcudart -o build/lib.linux-x86_64-3.7/fairnr/clib/_ext.cpython-37m-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.7/fairnr/clib/_ext.cpython-37m-x86_64-linux-gnu.so -> fairnr/clib
It looks OK. I tried this data on my side just now and it did not show any errors. Could you try reducing --view-per-batch to 1 to see whether the failure is caused by running out of memory?
I still got 'CUDA kernel failed : invalid device function' even after reducing --view-per-batch to 1, so it does not seem to be an OOM issue.
Do you have the full log file?
Also, could you try removing "--use-octree" and see how the other method works?
train.log
Here is the full log file. I also removed "--use-octree" and ran the code again, but I still got the same error.
The error report (after removing "--use-octree") is shown below:
2020-10-21 13:52:27 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2020-10-21 13:52:27 | INFO | fairnr_cli.train | training on 1 GPUs
2020-10-21 13:52:27 | INFO | fairnr_cli.train | max tokens per GPU = None and max sentences per GPU = 1
2020-10-21 13:52:27 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/robot/checkpoint_last.pt
2020-10-21 13:52:27 | INFO | fairseq.trainer | loading train data for epoch 1
2020-10-21 13:52:27 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
CUDA kernel failed : invalid device function
void aabb_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, int*, float*, float*) at L:371 in fairnr/clib/src/intersect_gpu.cu
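One common cause of "invalid device function" is a kernel compiled for a GPU architecture other than the one it runs on, so it can help to rule that out first. The sketch below, with hypothetical helper names `gencode_archs` and `device_capability`, reads the `-gencode` targets from a build log and compares them with the capability PyTorch reports. The exact-match comparison is a simplification, and the sample flag is copied from the nvcc line in the build output earlier in this thread.

```python
import re


def gencode_archs(build_log):
    """Return the set of (major, minor) targets named as sm_XY in -gencode flags."""
    digits = re.findall(r"code=sm_(\d{2,3})", build_log)
    # "75" -> (7, 5); the last digit is the minor version.
    return {(int(d[:-1]), int(d[-1])) for d in digits}


def device_capability(index=0):
    """Return the GPU compute capability via PyTorch, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(index)


if __name__ == "__main__":
    sample_flags = "-gencode=arch=compute_75,code=sm_75 -std=c++11"
    targets = gencode_archs(sample_flags)
    capability = device_capability()
    print("compiled for:", sorted(targets))
    print("device capability:", capability)
    if capability is not None and capability not in targets:
        print("The extension was not built for this GPU architecture.")
```

If the targets and the device capability agree, as sm_75 does for an RTX 2080 Ti, the architecture is probably not the culprit and the toolkit version is the next thing to check.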
I searched for your error a bit; it seems to be a CUDA or GPU configuration issue, although I don't know exactly what caused it.
What does nvcc --version show?
It shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
So your nvcc CUDA is 10.0 instead of 10.1?
Yep, nvcc --version indicates that CUDA is 10.0. But when I ran python -c "import torch; print(torch.version.cuda)", it printed 10.1.
python -c "import torch; print(torch.version.cuda)" reports your PyTorch CUDA version: your PyTorch was compiled with CUDA 10.1. Your machine also needs CUDA 10.1 installed, so that the toolkit used to compile this code matches PyTorch.
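The comparison described here can be sketched as a small script. The helper names (`parse_nvcc_release`, `same_major_minor`, `local_nvcc_release`, `torch_cuda_version`) are hypothetical, and the sketch assumes the `nvcc` found on PATH is the one the build picks up, which may not hold if the build locates the toolkit some other way.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_major_minor(version_a, version_b):
    """Compare two version strings on their major and minor parts only."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


def local_nvcc_release():
    """Run the nvcc found on PATH and return its release, or None."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """Return torch.version.cuda, or None if PyTorch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    toolkit = local_nvcc_release()
    pytorch = torch_cuda_version()
    print("nvcc release:", toolkit)
    print("torch.version.cuda:", pytorch)
    if toolkit and pytorch and not same_major_minor(toolkit, pytorch):
        print("Mismatch: install the CUDA toolkit that matches PyTorch, then rebuild.")
```

With the values reported in this thread, `parse_nvcc_release` yields "10.0" for the nvcc output and `same_major_minor("10.0", "10.1")` is false, which is the mismatch being described.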
Sorry for the late reply. The problem was solved after I installed CUDA 10.1 on my machine. Thanks for your help~
Glad to see it was solved!