nvidia-ai-iot / cupcl
A project demonstrating how to use the libs of cuPCL.
License: MIT License
Is the .so library in cuda-pcl a wrapper around the PCL CUDA implementation?
https://github.com/PointCloudLibrary/pcl/blob/master/cuda/filters/include/pcl/cuda/filters/passthrough.h
When I use the CUDA VoxelGrid filter and then pass the filtered data on to clustering, constructing the cudaExtractCluster object makes the program report this error:
------------checking CUDA VoxelGrid----------------
Cuda failure: an illegal memory access was encountered at line 138 in file cudaFilter.cpp error status: 700
Aborted
The CUDA VoxelGrid function works fine when no cudaExtractCluster object is created. This is my first time programming with this library. What is the reason for this error, and how can I combine several of the library's functions in one program? Thanks, I appreciate your help. Here is one of the versions of the code that I have tried.
void testCUDA(pcl::PointCloud<pcl::PointXYZ>::Ptr cloudSrc,
pcl::PointCloud<pcl::PointXYZ>::Ptr cloudDst) {
std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();
std::chrono::duration<double, std::ratio<1, 1000>> time_span =
std::chrono::duration_cast<
std::chrono::duration<double, std::ratio<1, 1000>>>(t2 - t1);
cudaStream_t stream = NULL;
cudaStreamCreate(&stream);
unsigned int nCount = cloudSrc->width * cloudSrc->height;
float *inputData = (float *)cloudSrc->points.data();
cloudDst->width = nCount;
cloudDst->height = 1;
cloudDst->resize(cloudDst->width * cloudDst->height);
float *outputData = (float *)cloudDst->points.data();
memset(outputData, 0, sizeof(float) * 4 * nCount);
std::cout << "\n------------checking CUDA ---------------- " << std::endl;
std::cout << "CUDA Loaded " << cloudSrc->width * cloudSrc->height
<< " data points from PCD file with the following fields: "
<< pcl::getFieldsList(*cloudSrc) << std::endl;
float *input = NULL;
cudaMallocManaged(&input, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, input);
cudaMemcpyAsync(input, inputData, sizeof(float) * 4 * nCount,
cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);
float *output = NULL;
cudaMallocManaged(&output, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, output);
cudaStreamSynchronize(stream);
cudaFilter filterTest;
FilterParam_t setP;
FilterType_t type;
unsigned int countLeft = 0;
std::cout << "\n------------checking CUDA VoxelGrid---------------- "
<< std::endl;
type = VOXELGRID;
setP.type = type;
setP.voxelX = 0.02;
setP.voxelY = 0.02;
setP.voxelZ = 0.02;
filterTest.set(setP);
int status = 0;
cudaDeviceSynchronize();
t1 = std::chrono::steady_clock::now();
status = filterTest.filter(output, &countLeft, input, nCount);
cudaDeviceSynchronize();
t2 = std::chrono::steady_clock::now();
if (status != 0)
return;
time_span = std::chrono::duration_cast<
std::chrono::duration<double, std::ratio<1, 1000>>>(t2 - t1);
std::cout << "CUDA VoxelGrid by Time: " << time_span.count() << " ms."
<< std::endl;
std::cout << "CUDA VoxelGrid before filtering: " << nCount << std::endl;
std::cout << "CUDA VoxelGrid after filtering: " << countLeft << std::endl;
pcl::PointCloud<pcl::PointXYZ>::Ptr cloudNew(
new pcl::PointCloud<pcl::PointXYZ>);
cloudNew->width = countLeft;
cloudNew->height = 1;
cloudNew->points.resize(cloudNew->width * cloudNew->height);
int check = 0;
for (std::size_t i = 0; i < cloudNew->size(); ++i) {
cloudNew->points[i].x = output[i * 4 + 0];
cloudNew->points[i].y = output[i * 4 + 1];
cloudNew->points[i].z = output[i * 4 + 2];
}
pcl::io::savePCDFileASCII("after-cuda-VoxelGrid.pcd", *cloudNew);
{
cudaStream_t stream2;
cudaStreamCreate(&stream2);
float *input2Data = (float *)cloudNew->points.data();
float *input2 = NULL;
cudaMallocManaged(&input2, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream2, input2);
cudaMemcpyAsync(input2, input2Data, sizeof(float) * 4 * nCount,
cudaMemcpyHostToDevice, stream2);
cudaStreamSynchronize(stream2);
float *output2 = NULL;
cudaMallocManaged(&output2, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream2, output2);
cudaStreamSynchronize(stream2);
cudaExtractCluster cudaec;
extractClusterParam_t ecp;
ecp.minClusterSize = 100;
ecp.maxClusterSize = 2500000;
ecp.voxelX = 0.05;
ecp.voxelY = 0.05;
ecp.voxelZ = 0.05;
ecp.countThreshold = 20;
cudaec.set(ecp);
unsigned int *indexEC = NULL;
cudaMallocManaged(&indexEC, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream2, indexEC);
cudaMemsetAsync(indexEC, 0, sizeof(float) * 4 * nCount, stream2);
cudaStreamSynchronize(stream2);
}
cudaFree(input);
cudaFree(output);
cudaStreamDestroy(stream);
}
According to your demo code, each point takes 4 floats. Does this mean the only supported point types are PointXYZ, PointXYZRGB, and PointXYZI? Could you add support for PointXYZINormal in the future?
Thanks
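One possible workaround for richer point types, sketched under assumptions: the demos size their buffers as `sizeof(float) * 4 * nCount`, which suggests (but does not confirm) that the library reads only x, y, z plus one padding float per point. If that holds, the coordinates could be packed into such a buffer by hand. `RichPoint` and `packXYZ4` are hypothetical names used only for illustration; `RichPoint` stands in for a type like PointXYZINormal so the sketch compiles without PCL.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical point type with extra fields (not a PCL type).
struct RichPoint {
    float x, y, z;
    float intensity;
    float normal_x, normal_y, normal_z;
};

// Pack x, y, z into a flat buffer with 4 floats per point.
// The 4th float of each point is padding and is left at 0.
template <typename PointT>
std::vector<float> packXYZ4(const std::vector<PointT>& points) {
    std::vector<float> packed(points.size() * 4, 0.0f);
    for (std::size_t i = 0; i < points.size(); ++i) {
        packed[i * 4 + 0] = points[i].x;
        packed[i * 4 + 1] = points[i].y;
        packed[i * 4 + 2] = points[i].z;
    }
    return packed;
}
```

The extra fields (intensity, normals) are dropped by this packing, so they would have to be matched back to the surviving points separately.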
When I run "./demo [*.pcd]" in the cuda-icp directory, I only get this result:
GPU has cuda devices: 1
----device id: 0 info----
GPU : Xavier
Capbility: 7.2
Global memory: 31918MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
Loaded 7000 data points for P with the following fields: x y z
Loaded 7000 data points for Q with the following fields: x y z
iter.Maxiterate 0
iter.threshold 1e-12
iter.acceptrate 1
Target rigid transformation : cloud_in -> cloud_icp
Rotation matrix :
| 0.923880 -0.382683 0.000000 |
R = | 0.382683 0.923880 0.000000 |
| 0.000000 0.000000 1.000000 |
Translation vector :
t = < 0.000000, 0.000000, 0.200000 >
------------checking CUDA ICP(GPU)----------------
CUDA ICP by Time: 0.797888 ms.
CUDA ICP fitness_score: 0.777453
matrix_icp calculated Matrix by Class ICP
Rotation matrix :
| 1.000000 0.000000 -0.000000 |
R = | -0.000000 1.000000 0.000000 |
| -0.000000 0.000000 1.000000 |
Translation vector :
t = < -0.000000, 0.000000, -0.000000 >
------------checking PCL ICP(CPU)----------------
PCL icp.align Time: 38.2758 ms.
has converged: 1 score: 0.651369
CUDA ICP fitness_score: 0.651369
transformation_matrix:
0.999905 0.00279406 0.0134922 0.0161865
-0.00265722 0.999945 -0.010151 0.00527596
-0.0135198 0.0101141 0.999858 0.0133578
0 0 0 1
------------checking PCL GICP(CPU)----------------
PCL Gicp.align Time: 144.663 ms.
has converged: 1 score: 0.541552
transformation_matrix:
0.99874 0.00468762 0.0499603 -0.0427716
-0.00344507 0.999683 -0.0249281 0.0265501
-0.0500613 0.0247246 0.99844 0.148036
0 0 0 1
So I would really like to know how to display the result, as in https://developer.nvidia.com/zh-cn/blog/cuda-pcl-1-0-jetson/
~/cuPCL/cuNDT$ ./demo
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA A800 80GB PCIe
Capbility: 8.0
Global memory: 81085MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
Loaded 7000 data points for P with the following fields: x y z
Loaded 7000 data points for Q with the following fields: x y z
Target rigid transformation : cloud_P -> cloud_Q
Rotation matrix :
| 0.923880 -0.382683 0.000000 |
R = | 0.382683 0.923880 0.000000 |
| 0.000000 0.000000 1.000000 |
Translation vector :
t = < 0.000000, 0.000000, 0.200000 >
------------checking PCL NDT(CPU)----------------
PCL align Time: 27.1937 ms.
Normal Distributions Transform has converged: 1 score: 0.648334
Rotation matrix :
| 0.999894 0.004857 0.013688 |
R = | -0.004680 0.999905 -0.012931 |
| -0.013750 0.012865 0.999823 |
Translation vector :
t = < 0.015418, 0.056840, 0.078443 >
------------checking CUDA NDT(GPU)----------------
CUDA NDT by Time: 0.777725 ms.
CUDA NDT fitness_score: 0.349491
Rotation matrix :
| 0.000000 0.000000 0.000000 |
R = | 0.000000 0.000000 0.000000 |
| 0.000000 0.000000 0.000000 |
Translation vector :
t = < 0.000000, 0.000000, 0.000000 >
------------checking CUDA PassThrough ----------------
Cuda failure: no kernel image is available for execution on the device at line 102 in file cudaFilter.cpp error status: 209
Aborted (core dumped)
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA Tegra X2
Capbility: 6.2
Global memory: 7851MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
------------checking CUDA ----------------
CUDA Loaded 119978 data points from PCD file with the following fields: x y z
------------checking CUDA PassThrough ----------------
Cuda failure: no kernel image is available for execution on the device at line 102 in file cudaFilter.cpp error status: 209
Aborted (core dumped)
This looks interesting, but there seems to be no source included in the repo, only .so files.
Are there any plans to release the source?
More comments would make the code clearer (as in cuda-pcl); cuda-octree is harder to follow.
I found that the distance returned by approxNearestSearch is 1e9 times the real squared distance, whereas radiusSearch returns the squared distance directly.
As in the title: errors during compilation of cuda-segmentation and cuda-ndt.
The makefile was prepared by CMake.
Configuration:
Hello,
I get a segmentation fault only for some point clouds and with small voxel size parameters passed to cuCluster. It does not seem related to the point cloud size. If I make the voxel sizes larger, however, the segmentation fault disappears. I thought the number of voxels needed for the volume might exceed INT_MAX and cause an integer overflow, but my calculation shows the required number of voxels is nowhere near the limit.
I could provide some point cloud data that produces the error, if you wish to reproduce it.
Because the library is not open source, I cannot really debug it. It looks like the cudaExtractClusterImpl::extract function tries to write into the output array, but the position it writes to is out of bounds.
// call stack inside the library
libcudacluster.so!cudaExtractClusterImpl::extract(float*, int, float*, unsigned int*)
libcudacluster.so!cudaExtractCluster::extract(float*, int, float*, unsigned int*)
// The limits of the point cloud
// is [0m,0m,0m] to ~[3m, 2m, 6m] meters.
// I see segmentation faults for voxel size 0.05.
// for 0.1, I do not.
ecp.voxelX = 0.05;
ecp.voxelY = 0.05;
ecp.voxelZ = 0.05;
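The overflow check described above can be reproduced with a few lines of arithmetic. This is only a sketch of the asker's calculation: it assumes a dense grid over the quoted bounding box, and says nothing about how the closed-source library actually lays out its voxels. `cellsAlongAxis` and `voxelCount` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Voxels needed along one axis to cover `extent` metres with cells of size `voxel`.
std::int64_t cellsAlongAxis(double extent, double voxel) {
    assert(extent >= 0.0 && voxel > 0.0);
    return static_cast<std::int64_t>(std::ceil(extent / voxel));
}

// Total voxel count for a box, computed in 64 bits so that a value above
// INT_MAX would be representable instead of silently wrapping.
std::int64_t voxelCount(double extentX, double extentY, double extentZ, double voxel) {
    return cellsAlongAxis(extentX, voxel) *
           cellsAlongAxis(extentY, voxel) *
           cellsAlongAxis(extentZ, voxel);
}
```

For the approximate 3 m x 2 m x 6 m extent and a 0.05 m voxel this gives 60 * 40 * 120 = 288000 cells, several orders of magnitude below INT_MAX, which is consistent with the asker's conclusion that a plain voxel-count overflow is unlikely to be the cause.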
Hello there! I am trying to use the cuPCL repository (https://github.com/NVIDIA-AI-IOT/cuPCL) to preprocess a point cloud with a voxel downsampling filter before running the provided clusterer. The program runs smoothly without the voxel downsampling, but the problem appears as soon as I merely create an instance of the filter, as shown below:
cudaExtractCluster cudaec(stream);
cudaFilter filterTest(stream);
So using just one of them works, but creating both produces: Cuda failure: an illegal memory access was encountered at line 138 in file cudaFilter.cpp error status: 700
After some debugging with CUDA-GDB and CUDA-MEMCHECK I got the following results, but I am not sure whether the problem can be solved, since the classes are implemented in precompiled .so files:
Thread 1 "collision_avoid" received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching focus to CUDA kernel 0, grid 6, block (3,0,0), thread (160,0,0), device 0, sm 6, warp 4, lane 0]
0x0000555555d50eb0 in cudaFillVoxelGirdKernel(float4*, int4*, int4*, float4*, unsigned int, float, float, float) ()
Invalid __global__ write of size 4
Illegal access to address (@global)0x8007b0800c60 detected
(cuda-gdb) print *0x8007b0800c60
Error: Failed to read local memory at address 0x8007b0800c60 on device 0 sm 0 warp 9 lane 0, error=CUDBG_ERROR_INVALID_MEMORY_ACCESS(0x8).
warning: Cuda API error detected: cuGetProcAddress returned (0x1f4)
This indicates that a named symbol was not found. Examples of symbols are global/constant variable names, driver function names, texture names, and surface names.
What I do not understand is that, from the thread's scope, the address is treated as a local address, although it actually seems to be a global one. I also wonder whether the CUDA API error could be a lead of some sort.
Note that cudaMallocManaged (unified memory) was used for the memory transfers, and even switching to explicit memory transfers did not solve the issue.
Another attempt to solve the issue was to set every CUDA limit to match the device limits, as follows:
size_t limit = 0;
cudaDeviceGetLimit(&limit, cudaLimitStackSize);
std::cout << "Stack limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitStackSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitPrintfFifoSize);
std::cout << "cudaLimitPrintfFifoSize limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitMallocHeapSize);
std::cout << "cudaLimitMallocHeapSize limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, limit);
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimeSyncDepth);
std::cout << "cudaLimitDevRuntimeSyncDepth limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, limit);
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
std::cout << "cudaLimitDevRuntimePendingLaunchCount limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, limit);
cudaDeviceGetLimit(&limit, cudaLimitMaxL2FetchGranularity);
std::cout << "cudaLimitMaxL2FetchGranularity limit is: " << limit << std::endl;
cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, limit);
But this made no difference.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 56C P8 18W / N/A | 123MiB / 7982MiB | 32% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Dev PCI Bus/Dev ID Name Description SM Type
* 0 01:00.0 NVIDIA GeForce RTX 2080 Super with Max-Q Design TU104-A
SMs Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
sm_75 48 32 32 256 0x00000000000000000000ffffffffffff
Using ROS Noetic and Ubuntu 20.04.
There are some errors when I use it.
I want to change some source code to suit my program.
thanks.
When I run the cuda-icp.cpp example with PCD data:
the output is:
Cuda failure: the launch timed out and was terminated at line 59 in file cudaICP.cpp error status: 702
Aborted (core dumped)
How can I fix that?
Thank you.
Is it possible to use the cuda PCL with ros? And if so, how?
Thanks in Advance.
Hi,
Thrust gives us C++ STL-style CUDA programming, which makes CUDA programming easier and safer.
Will Thrust support be added to this project?
Thanks!
Cuda failure: invalid argument at line 126 in file cudaFilter.cpp error status: 1. Has anyone run into this problem? The fault appears when I use my camera to generate the point cloud. If you have some free time and happen to see my question, I hope you can tell me how to solve it.
Best wishes!
I need to use this library with the above point type. Is that possible?
GPU has cuda devices: 1
----device id: 0 info----
GPU : Xavier
Capbility: 7.2
Global memory: 31918MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
[pcl::PCDReader::readHeader] Could not find file '[.pcd]'.
Error:can not open the file: [.pcd]
I cannot find where the problem is. Can you help me?
localization_node: /usr/include/eigen3/Eigen/src/SVD/SVDBase.h:85: const MatrixUType& Eigen::SVDBase::matrixU() const [with Derived = Eigen::JacobiSVD<Eigen::Matrix<float, 3, 3>, 2>; Eigen::SVDBase::MatrixUType = Eigen::Matrix<float, 3, 3>; typename Eigen::internal::traits::MatrixType::Scalar = float]: Assertion `m_isInitialized && "SVD is not initialized."' failed.
cuda version : 11.4
pcl version : 1.10
system : ubuntu 20.04
eigen version : 3.3.7
platform: Orin
Thank you for providing CUDA implementation with PCL.
Can you please share the .so library for the AMD64 platform?
cuda version : 10.02
pcl version : 1.8
system : ubuntu 18.04
@Haoyu-NV
FilterParam_t setPx, setPy, setPz; // filter parameters for each axis
FilterType_t type = PASSTHROUGH; // only the passthrough filter is implemented in the cuPCL library for now
// these filter constraints are being applied to only one axis...
setPx.type = type;
setPx.dim = 0; // 0 // it will be 0,1,2 for x,y,z axes
setPx.upFilterLimits = 1.5;
setPx.downFilterLimits = -1.5;
setPx.limitsNegative = false;
filterTest.set(setPx);
setPy.type = type;
setPy.dim = 1; // 0 // it will be 0,1,2 for x,y,z axes
setPy.upFilterLimits = 1.5;
setPy.downFilterLimits = -1.5;
setPy.limitsNegative = false;
filterTest.set(setPy);
setPz.type = type;
setPz.dim = 2; // 0 // it will be 0,1,2 for x,y,z axes
setPz.upFilterLimits = 2.0;
setPz.downFilterLimits = 0.0;
setPz.limitsNegative = false;
filterTest.set(setPz);
filterTest.filter(output, &countLeft, input, nCount);
Hi everyone,
I added a timespan calculation to measure the time consumed by input and output memory allocation.
This kind of memory allocation is needed before every CUDA function call.
The code is below.
t1 = std::chrono::steady_clock::now();
cudaMallocManaged(&input, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync (stream, input );
cudaMemcpyAsync(input, inputData, sizeof(float) * 4 * nCount, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);
float *output = NULL;
cudaMallocManaged(&output, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync (stream, output );
cudaStreamSynchronize(stream);
t2 = std::chrono::steady_clock::now();
auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double, std::ratio<1, 1000>>>(t2 - t1);
And here is my test result: the allocation and memcpy take 4.6285 ms.
So, counting the real per-frame cost of the passthrough filter,
cuda-pcl (4.6285 + 0.456927 = 5.085427 ms) is not better than PCL (4.25133 ms).
So what is the best practice for programming with cuda-pcl?
Thanks.
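A common general pattern for this kind of overhead is to allocate buffers once and reuse them for every frame, so that only the copy and the kernel are paid per frame. Whether cuPCL tolerates reused buffers is an assumption here, not something the thread confirms. The sketch below shows a timing helper and a simple amortization model; `timeMs` and `amortizedMs` are hypothetical names, and the one-time cost passed in would have to be measured separately from the per-frame copy.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Run `work` once and return the elapsed wall-clock time in milliseconds.
template <typename F>
double timeMs(F&& work) {
    const auto t1 = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto t2 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t2 - t1).count();
}

// Average cost per frame when `setupMs` is paid once and `perFrameMs`
// is paid on every one of `frames` frames.
double amortizedMs(double setupMs, double perFrameMs, unsigned frames) {
    assert(frames > 0);
    return setupMs / frames + perFrameMs;
}
```

With a single frame, `amortizedMs(4.6285, 0.456927, 1)` reproduces the 5.085427 ms total from the measurement above; over many frames the one-time part shrinks toward zero, provided the setup really can be done once.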
The same or a similar error appears in all applications.
Configuration:
root@linux:/cuda-pcl/cuda-icp# ./cuda_segmentation test_Q.pcd
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA Tegra X2
Capbility: 6.2
Global memory: 3833MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
Loaded 7000 data points for P with the following fields: x y z
Loaded 7000 data points for Q with the following fields: x y z
iter.Maxiterate 0
iter.threshold 1e-12
iter.acceptrate 1
Target rigid transformation : cloud_in -> cloud_icp
Rotation matrix :
| 0.923880 -0.382683 0.000000 |
R = | 0.382683 0.923880 0.000000 |
| 0.000000 0.000000 1.000000 |
Translation vector :
t = < 0.000000, 0.000000, 0.200000 >
------------checking CUDA ICP(GPU)----------------
Cuda failure: initialization error at line 189 in file cudaICP.cpp error status: 3
Aborted (core dumped)
Just curious whether something fundamentally different prevents this from running on JP 4.4, or whether it has simply only been tested on JP 4.4.1, per the README.md.
Is there support for filtering X,Y, and Z simultaneously to extract a bounding box/cube from a point cloud?
For instance, in PCL this is achieved using either CropBox or ConditionOr settings, as shown here:
https://stackoverflow.com/questions/45790828/remove-points-outside-defined-3d-box-inside-pcl-visualizer
Does similar functionality exist for this? The example only shows filtering 1 dimension ("X") at a time.
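For comparison, the behaviour being asked for can be written as a CPU reference in a few lines. This is not cuPCL's API and makes no claim about what the library supports; it is a plain C++ sketch over the 4-floats-per-point layout the demos use, useful for checking the output of any GPU approach. `Box` and `cropBox4` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned box given by inclusive lower and upper corners (x, y, z).
struct Box {
    float min[3];
    float max[3];
};

// Keep only the points (4 floats each: x, y, z, padding) that lie inside
// the box on all three axes at once.
std::vector<float> cropBox4(const std::vector<float>& points, const Box& box) {
    assert(points.size() % 4 == 0);
    std::vector<float> kept;
    const std::size_t count = points.size() / 4;
    for (std::size_t i = 0; i < count; ++i) {
        const float* p = &points[i * 4];
        bool inside = true;
        for (int axis = 0; axis < 3; ++axis) {
            inside = inside && p[axis] >= box.min[axis] && p[axis] <= box.max[axis];
        }
        if (inside) {
            kept.insert(kept.end(), p, p + 4);
        }
    }
    return kept;
}
```

A box of x and y in [-1.5, 1.5] and z in [0, 2] would be written as `Box{{-1.5f, -1.5f, 0.0f}, {1.5f, 1.5f, 2.0f}}`.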
Hi! I am trying to use CUDA-PCL on Jetson TX2 (with Jetpack 4.5, CUDA-10.2 and PCL 1.8.1), but I have encountered a CUDA failure problem. It would be great if you can help me out. Here is the output when I run ./demo in cuda-pcl/cuda-segmentation:
nvidia@nvidia-tx2:~/Downloads/cuda-pcl/cuda-segmentation$ ./demo sample.pcd
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA Tegra X2
Capbility: 6.2
Global memory: 7850MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
Cuda failure: no kernel image is available for execution on the device at line 310 in file cudaSegmentation.cpp error status: 209
Aborted (core dumped)
I guess the lib*.so binaries were not compiled for sm_62, so they cannot be executed on Jetson TX2. I would appreciate it if you could fix this for us.
DRIVE AGX Orin is the Tegra platform specifically for autonomous driving.
Can this library be used on DRIVE AGX Orin?
thanks.
The file cudaFilter.cpp is not provided. It seems difficult to locate this issue.
Hi all,
I am trying to make cuda-filter, but g++ crashes. I figured out that statistical_outlier_removal.h was causing the error, so I commented it out. Do you know why this is happening? I had a similar issue in another program, where pcl/common/transforms.h caused the same problem. I had to fix it using an alternative library from ROS, tf2.
I have a Jetson Nano Developer Kit with 4 GB of memory and 2 GB of swap (I have already tried increasing the swap size).
Hi, I tried to run your demo, but something went wrong. The printed output is below:
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA Tegra X2
Capbility: 6.2
Global memory: 7850MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
------------checking CUDA ----------------
CUDA Loaded 119978 data points from PCD file with the following fields: x y z
------------checking CUDA PassThrough ----------------
Cuda failure: no kernel image is available for execution on the device at line 102 in file cudaFilter.cpp error status: 209
Aborted (core dumped)
What should I do? Please help me.
I'm getting the following error when I try to make the cuda-filter code:
(base) ➜ cuda-filter make
USE Default CUDA DIR: /usr/local/cuda
TARGET_ARCH: x86_64
CUDA_VERSION: 10020
SMS: 30 35 50 53 60 61 70 72
g++ -I/usr/local/cuda/include -I/include -I/usr/local/include -I/usr/include/eigen3/ -I/usr/include/pcl-1.8/ -I/usr/include/vtk-6.3/ -D_REENTRANT -std=c++11 -O2 -fPIC -o obj/main.o -c main.cpp
g++ -D_REENTRANT -std=c++11 -O2 -o demo obj/main.o -L/usr/lib -L/usr/local/lib -L/usr/local/cuda/lib64 -lcudart_static -lrt -ldl -lpthread -lcudart -L/lib64 -lcudnn -lpthread -L/usr/lib/aarch64-linux-gnu/ -lboost_system -lpcl_common -lpcl_io -lpcl_recognition -lpcl_features -lpcl_sample_consensus -lpcl_octree -lpcl_search -lpcl_filters -lpcl_kdtree -lpcl_segmentation -lpcl_visualization ./lib/libcudafilter.so
./lib/libcudafilter.so: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status
Makefile:173: recipe for target 'demo' failed
make: *** [demo] Error 1
I wonder if it has anything to do with my computer architecture or am I missing something I need to do?
Thanks!
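The linker message "File in wrong format" typically indicates that a library was built for a different CPU architecture than the one being linked. The log above shows TARGET_ARCH x86_64 together with a path under aarch64-linux-gnu, so one plausible (unconfirmed) reading is that the prebuilt .so targets aarch64. A minimal sketch for checking this reads the `e_machine` field of the ELF header; `elfMachine` is a hypothetical helper, and the caller is assumed to have already read the first 20 bytes of the file into a buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint16_t kEmX86_64 = 62;    // EM_X86_64 in the ELF specification
constexpr std::uint16_t kEmAArch64 = 183;  // EM_AARCH64 in the ELF specification

// Return the e_machine field from the first bytes of an ELF file, or 0 if
// the buffer is shorter than 20 bytes or lacks the ELF magic number.
std::uint16_t elfMachine(const unsigned char* data, std::size_t size) {
    assert(data != nullptr);
    if (size < 20) return 0;
    if (data[0] != 0x7f || data[1] != 'E' || data[2] != 'L' || data[3] != 'F') return 0;
    // EI_DATA (offset 5): 1 = little-endian, 2 = big-endian.
    const bool littleEndian = (data[5] == 1);
    // e_machine is a 2-byte field at offset 18.
    const std::uint16_t lo = data[littleEndian ? 18 : 19];
    const std::uint16_t hi = data[littleEndian ? 19 : 18];
    return static_cast<std::uint16_t>(lo | (hi << 8));
}
```

If the value for ./lib/libcudafilter.so is 183 while the host is x86_64, the library cannot be linked on that machine and a build for the matching architecture would be needed.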
Hello.
Thank you for your efforts in providing CUDA-optimized PCL algorithms.
With the cuda-segmentation demo, I get a CUDA error in the cudaExtractCluster part, while the cudaSegmentation part runs fine.
Other demos such as cuda-filter run fine.
NvMapReserveOp 0x80000003 failed [22]
NvMapReserveOp 0x80000001 failed [22]
NvMapReserveOp 0x80000000 failed [22]
Do you know why this happens?
And can you help me use the Cuda-segmentation?
Thank you!
Jetson AGX Xavier Developer Kit
PCL 1.8
CUDA 10.2
Hi, the repo perception_cuda_pcl is a ROS interface for cuda_pcl. It is inspired by perception_open3d and depends on perception_pcl.
So anyone who is using cuda_pcl through ROS C++ can have a look.
Thanks.
When I use cuda-cluster on a down-sampled RoboSense point cloud, cudaExtractCluster always returns zero clusters, while the normal pcl::EuclideanClusterExtraction works fine.
env:
jetson xavier nx
jetpack 4.6
ubuntu18.04
cuda10.2
sample pcd file: sample_pcd.zip
run the cuda-cluster demo with modified 'extractClusterParam' in testCUDA
ecp.minClusterSize = 5;
ecp.maxClusterSize = 2500000;
ecp.voxelX = 0.2;
ecp.voxelY = 0.2;
ecp.voxelZ = 0.2;
ecp.countThreshold = 20;
I tried different cluster parameters several times; none succeeded :(
I can build the cuICP project and generate the demo, but when I run the demo this issue appears:
---------checking cUDA ICP(GPU)----------------
cuda failure: initialization error at line 196 in file cudaICP.cpp error status:3
My operating environment:
Xavier AGX 8G
cuda 10.2
Jetpack 4.5.1
pcl 1.8
vtk 6.3
eigen3
I keep getting linker errors for the libcudasegmentation file.
With the error "/usr/bin/ld: ./lib/libcudasegmentation.so: error adding symbols: file in wrong format"
What could be causing this?
Does it work with JetPack 4.3?
Hello, thanks for the repository.
Whenever I execute the extract method of cuCluster, it prints the following:
LINE:178 18696
The number after 178 changes across calls. This happens with the demo code and also other places. What is the meaning of this output and how can I get rid of it? The output of the demo code in my Jetson AGX Xavier is below. I checked out jp5.x branch, which is compatible with the jetpack version I have.
GPU has cuda devices: 1
----device id: 0 info----
GPU : Xavier
Capbility: 7.2
Global memory: 14907MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
-------------- test CUDA lib -----------
-------------- cudaExtractCluster -----------
LINE:178 18696
CUDA extract by Time: 14.91 ms.
PointCloud representing the Cluster: 162152 data points.
PointCloud representing the Cluster: 7098 data points.
PointCloud representing the Cluster: 1263 data points.
PointCloud representing the Cluster: 257 data points.
-------------- test PCL lib -----------
PCL(CPU) cluster kd-tree by Time: 92.0657 ms.
PCL(CPU) cluster extracted by Time: 5042.35 ms.
PointCloud cluster_indices: 4.
PointCloud representing the Cluster: 166789 data points.
PointCloud representing the Cluster: 7410 data points.
PointCloud representing the Cluster: 1318 data points.
PointCloud representing the Cluster: 427 data points.
If anybody has tested this, please let me know.
If possible, I would like to get the cuda-pcl source code for study. Best wishes.
Hi guys!
I want to use it on Windows. Is there any solution?
I would very much appreciate any reply!
I am trying to use clustering, but due to no documentation or any description it is nearly impossible to tune the parameters. There should be at least some note in documentation so user can know what each parameter does.
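I modified the cuPCL segmentation code and added a ROS interface that subscribes to point clouds played back from a bag and performs ground segmentation. Sometimes the ground points are removed well, and sometimes they cannot be removed. My lidar platform is stationary relative to the ground, but the ground points flicker at a certain frequency. I would like to know what is going wrong.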