celebrandil / cudasift

A CUDA implementation of SIFT for NVidia GPUs (1.2 ms on a GTX 1060)

License: MIT License

Cuda 88.79% C++ 7.80% C 2.41% CMake 1.00%

gpu nvidia cuda sift vision

cudasift's Issues

problem in extractSift method

Hello, and thanks for your great code.
When I built the code in Visual Studio 2015, many errors appeared saying that scaleDown and other functions of cudaSiftD.cu were already defined. I added declarations for them in cudaSiftD.h and then commented out the #include "cudaSiftD.cu" in cudaSiftH.cu.
However, when I run the project with left.pgm and right.pgm as inputs, 0 points are found as features in both images once the extractSift method finishes. What should I do?

Invalid configuration argument

Hi,

I wanted to try out your code for a project of mine, but I get the following error:

checkMsg() CUDA error: LowPass() execution failed
 in file <cudaSiftH.cu>, line 416 : invalid configuration argument.

I have a Titan Xp, and use Cuda 9.2 (Ubuntu 16.04).
Any ideas?

Why I can't get any match correspondence even with the same images?

Thank you for sharing this implementation!

Environment

Platform:
Windows 10 64-bit, VS2015
OpenCV3.4.1
CUDA8
1060

Current Behavior:

I have successfully built this repository with no errors.
I have tested mainSift.cpp with the same image (limg and rimg are the same picture) many times, and none of the runs produced any match correspondence!
I never changed the code of mainSift.cpp.

Here is My output of mainSift.cpp:

Image size = (812,725)
Initializing data...
Device Number: 0
  Device name: GeForce GTX 1060
  Memory Clock Rate (MHz): 4004
  Memory Bus Width (bits): 192
  Peak Memory Bandwidth (GB/s): 192.2

SIFT extraction time =        1.99 ms
Incl prefiltering & memcpy =  3.09 ms 2623

SIFT extraction time =        1.82 ms
Incl prefiltering & memcpy =  3.06 ms 2624

MatchSiftData time =          6.16 ms
...
MatchSiftData time =          4.85 ms
Number of original features: 2623 2624
Number of matching features: 0 0 0% 1 3

Expected Behavior:

I expect CudaSift to find the correct correspondences and show them between the model image and the test image.

Besides, finding a homography is based on both the model image's features and the test image's features. Why is siftData1 the only input to the functions FindHomography and ImproveHomography in mainSift.cpp?

for (int i = 0; i<500; i++)
  MatchSiftData(siftData1, siftData2);
float homography[9];
int numMatches;
FindHomography(siftData1, homography, &numMatches, 10000, 0.00f, 0.80f, 5.0);
int numFit = ImproveHomography(siftData1, homography, 5, 0.00f, 0.80f, 3.0);

I have been struggling with the above two problems for several days. Can you help with any suggestions?
Thank you!
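
A possible explanation, offered as a sketch rather than a statement about the actual implementation: the FindMaxCorr excerpt quoted in another issue on this page writes match_xpos and match_ypos into each point of the first set. If that is what MatchSiftData does, the first set alone already carries both ends of every correspondence, which would explain why only siftData1 is passed on. The struct and function below are hypothetical stand-ins that reuse those field names to show how a homography can be scored from a single array.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a matched feature; field names follow the
// FindMaxCorr excerpt (xpos, ypos, match_xpos, match_ypos, ambiguity).
struct MatchedPoint {
  float xpos, ypos;              // position in the first image
  float match_xpos, match_ypos;  // position of the best match in the second image
  float ambiguity;               // second-best score divided by best score
};

// Count points whose stored match lies within `thresh` pixels of where the
// row-major 3x3 homography `h` maps them. Only one array is needed.
int countInliers(const std::vector<MatchedPoint>& pts, const float h[9],
                 float maxAmbiguity, float thresh) {
  assert(thresh >= 0.0f);
  int inliers = 0;
  for (const MatchedPoint& p : pts) {
    if (p.ambiguity > maxAmbiguity) continue;  // skip ambiguous matches
    const float w = h[6] * p.xpos + h[7] * p.ypos + h[8];
    if (std::fabs(w) < 1e-12f) continue;       // point maps to infinity
    const float x = (h[0] * p.xpos + h[1] * p.ypos + h[2]) / w;
    const float y = (h[3] * p.xpos + h[4] * p.ypos + h[5]) / w;
    const float dx = x - p.match_xpos;
    const float dy = y - p.match_ypos;
    if (dx * dx + dy * dy <= thresh * thresh) ++inliers;
  }
  return inliers;
}
```

With an identity homography and identical input images, every unambiguous point whose match has the same coordinates would count as an inlier.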

ERROR : safeCall() Runtime API error in file </home/kk/Documents/CudaSift-Maxwell/cudaSiftH.cu>, line 42 : invalid device symbol.

Following is the output when the object file is executed.
[kk@localhost CudaSift-Maxwell]$ ./cudasift
Image size = (1280,960)
Initializing data...
Device Number: 0
Device name: GeForce 210
Memory Clock Rate (MHz): 600
Memory Bus Width (bits): 64
Peak Memory Bandwidth (GB/s): 9.6

Download time = 0.92 ms
Download time = 0.89 ms
safeCall() Runtime API error in file </home/pict/Documents/CudaSift-Maxwell/cudaSiftH.cu>, line 42 : invalid device symbol.
[kk@localhost CudaSift-Maxwell]$ C
What is the problem?

Suggestions for future versions?

After the latest commits I'm running out of ideas of what to improve and would like to hear if anyone has any suggestions for future versions. For further speed improvements, I can see the possibility of adding functionality for uploading images that are not necessarily in floats, using half-precision floats for storage and matching of SIFT vectors, as well as projecting vectors to a lower dimension, similar to PCA-SIFT. In most practical scenarios though, gaining a fraction of a millisecond doesn't help much, since there is much more around it that is more important. Thus the nature of the end application becomes more important than the actual feature extraction code.

checkMsg() CUDA error: LowPass() execution failed

When I try and run cudasift I get the following output:

Image size = (0,0)
Initializing data...
Device Number: 0
Device name: GeForce GTX 860M
Memory Clock Rate (MHz): 2505
Memory Bus Width (bits): 128
Peak Memory Bandwidth (GB/s): 80.2

Failed to allocate device data
Failed to allocate device data
Failed to allocate device data
checkMsg() CUDA error: LowPass() execution failed in file </home/francois/Desktop/P3_CV3/CudaSift/cudaSiftH.cu>, line 416 : invalid configuration argument.

Does anyone know a solution to the problem?
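
One plausible reading of the log, not confirmed anywhere on this page: "Image size = (0,0)" suggests the input image was never loaded (for example because the file was not found from the current working directory). A zero-sized image makes the device allocations fail, and a kernel launched with a zero-sized grid is rejected with "invalid configuration argument". The helper names below are hypothetical; the sketch only shows a guard that stops before any device work.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Ceiling division, as commonly used to size a CUDA grid.
int divUp(int a, int b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Hypothetical guard: throw before any device work if the image is empty.
void requireNonEmptyImage(int width, int height, const std::string& path) {
  if (width <= 0 || height <= 0)
    throw std::runtime_error("could not load image (size 0x0): " + path);
}
```

Note that divUp(0, 16) is 0, so an empty image leads directly to a grid with zero blocks.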

What does corrData stand for in function MatchSiftPoints2

Hi,

The function MatchSiftPoints2 contains the core code of the SIFT matching process. Typically, given two sets of SIFT descriptors t1 and t2, for any descriptor in t1, desc_1 for example, we find its two closest descriptors sift1 and sift2 in t2 under the L2 norm. If the ratio L2(desc_1, sift1)/L2(desc_1, sift2) is smaller than delta, sift1 is accepted as the correspondence of desc_1.

I do not understand the meaning of the following lines of code in function MatchSiftPoints2.

const float *pt1 = &siftPoints1[ty*128];
const float *pt2 = &siftPoints2[tx*128];
float sum = 0.0f;
for (int i=0;i<128;i++) {
  int itx = (i + tx)&127; // avoid bank conflicts
  sum += pt1[itx]*pt2[itx];
}
if (p1<numPts1)
  corrData[p1*gridDim.y*16 + p2] = (p2<numPts2 ? sum : -1.0f);

Could you explain this code, or the principle used in the SIFT matching process? Many thanks!
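
The ratio test described in this question can be sketched on the CPU as follows. This is a brute-force illustration of the textbook procedure, not the CudaSift kernel; the names Descriptor, l2Distance and ratioTestMatch are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;

// Euclidean (L2) distance between two descriptors of equal length.
float l2Distance(const Descriptor& a, const Descriptor& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float d = a[i] - b[i];
    sum += d * d;
  }
  return std::sqrt(sum);
}

// Returns the index in t2 of the accepted match for `query`, or -1 if the
// nearest neighbour is not sufficiently better than the second nearest.
int ratioTestMatch(const Descriptor& query, const std::vector<Descriptor>& t2,
                   float delta) {
  float best = std::numeric_limits<float>::max();
  float second = std::numeric_limits<float>::max();
  int bestIdx = -1;
  for (std::size_t j = 0; j < t2.size(); ++j) {
    const float d = l2Distance(query, t2[j]);
    if (d < best) {
      second = best;
      best = d;
      bestIdx = static_cast<int>(j);
    } else if (d < second) {
      second = d;
    }
  }
  if (t2.size() < 2) return -1;  // the ratio is undefined with one candidate
  return (best < delta * second) ? bestIdx : -1;
}
```

A query that is equally close to two candidates is rejected, which is the purpose of the test.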

cc: error: unrecognized command line option ‘-msse2’

nvidia@tegra-ubuntu:/usr/local/CudaSift-Maxwell/build$ make
[ 14%] Building NVCC (Device) object CMakeFiles/cudasift.dir/cudasift_generated_matching.cu.o
cc: error: unrecognized command line option ‘-msse2’
CMake Error at cudasift_generated_matching.cu.o.cmake:207 (message):
Error generating
/usr/local/CudaSift-Maxwell/build/CMakeFiles/cudasift.dir//./cudasift_generated_matching.cu.o

CMakeFiles/cudasift.dir/build.make:77: recipe for target 'CMakeFiles/cudasift.dir/cudasift_generated_matching.cu.o' failed
make[2]: *** [CMakeFiles/cudasift.dir/cudasift_generated_matching.cu.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/cudasift.dir/all' failed
make[1]: *** [CMakeFiles/cudasift.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

Licensing and attribution

Sorry to bother again so soon, but someone reached out to us asking about commercial licensing our fork of CudaSift (link here). They are aware of David Lowe's original patent, but I wanted to a) properly attribute the work you did and b) protect both of us from liability. Do you have thoughts on how you would license CudaSift?

I was thinking of using of using the MIT License or even the Unlicense, but wanted to check with you first. Let me know!

what is the difference between cudasift and lowe's sift

Hello. I have a question about the difference between the 128-dimensional descriptors of CudaSift and Lowe's SIFT. For a SIFT feature at the same position, is the ordering of the 128 descriptor elements the same? If not, what is the correspondence? Thank you very much.

set parameters similar to opencv

hello sir.
I want to use the OpenCV library on the CPU, i.e. cv::sift with these parameters: nFeatures = 40000 octaves = 3
contrast_threshold = 0.01 edge_threshold = 10 sigma = 1.6

I cannot find edge_threshold in your project.

How should I set the CudaSift parameters to get results similar to the above?

thanks.

descriptor of keypoint seems not correct

Hi,
I used two images to test feature matching, and the matching result seems wrong: the number of matches is far from what the OpenCV version produces.
The two images are below. I get 0 matches from CudaSift/mainSift.cpp, but more than 100 matches from the OpenCV version.

Potential Memory Leak in cudaSiftH.cu

Hello,

Thanks for keeping this project active. There is, however, a potential memory leak in cudaSiftH.cu. In particular, the host memory is allocated twice using malloc() then cudaMallocHost() at lines 226 and 227, respectively. Either one of them is enough and the other one may need to be removed.

Identifier "totPts" and "fstPts" are undefined

Hi,

I am trying to compile this project on Windows 7 with Cmake.
But I get the following error message:
identifier "totPts" is undefined File: cudaSiftH.cu Line 226
identifier "fstPts" is undefined File: cudaSiftH.cu Line 229

Any help is appreciated :)

Systematic Point Density

First, many thanks for keeping this project live.

Below, I have an example image that I am attempting to extract keypoints for. I am getting a very strange systematic point distribution where the top of the image has a significantly higher number of points. In the example image, keypoints are still identified across the rest of the image. In some cases, no points are found outside of the top of the image.

I am not sure if the total number of points (SiftData.maxPts) is being exceeded here and therefore breaking early?

Any insight appreciated.

Hello, could you add some comments to the code?

Recently I have wanted to optimize my SIFT feature code with CUDA, and I am very pleased that you shared this code. However, I have some trouble reading it because there are no explanations. Could you add some comments to the important parts of the code? Thanks very much. A stranger from across the ocean.

Out of Memory error

Hi @Celebrandil ,
I get an out-of-memory error when I try to run MatchSiftData() on nearly 20000 features from each image. Is it related to a memory allocation error?
Thanks

Orientation and descriptors are always extracted in the same Gaussian image inside one octave

Hi, I tried to read your code and found that you do not store the Gaussian smoothed images. In the original SIFT paper, orientation and descriptors are extracted from the Gaussian smoothed image whose scale is closest to that of the keypoint. In your implementation, I see they are extracted from the first image of the octave (i.e. the only saved one). I understand this gives speed benefits, but did you thoroughly test scale/rotation invariance with this simplification?

Regards,
Yao

cudaMallocPitch error!!!

cv::imread("image.png",0).convertTo(img1,CV_32FC1);
this works well.... no issue

however, in my application the image is ROI of another image so I can't simply read it from file. So,
img2 = image2(ROI)
then, cv::cvtColor(img2,img2,CV_BGR2GRAY);
then, img2.convertTo(img2,CV_32FC1);

This causes segmentation fault. I also checked if they belong to the same data type.
img1.type() = img2.type() = 5. I also printed the cv::Mat, and are similar in size and content

The error seems to be caused by

safeCall(cudaMallocPitch((void **)&d_data, (size_t *)&pitch, (size_t)(sizeof(float)*width), (size_t)height));
under cudaImage.cu

I can't get my head around this. Any such error in the past....?

I'm sorry, how do I run this code?

Build error “cmd.exe” exit with code 1

Hi, Celebrandil
I want to build this project on my PC under the following conditions:
Branch: Maxwell
IDE: vs2015
cuda version: v 8.0
GPU: NVidia Titan X
but I got the following error in vs2015:

Building NVCC (Device) object CMakeFiles/cudasift.dir/Release/cudasift_generated_cudaImage.cu.obj
1> nvcc fatal : redefinition of argument 'gpu-architecture'
1> CMake Error at cudasift_generated_cudaImage.cu.obj.Release.cmake:222 (message):
1> Error generating
F:/MillerWorkPath/VSProject/CudaSift/Bin/CMakeFiles/cudasift.dir//Release/cudasift_generated_cudaImage.cu.obj
and the error message MSB6006: “cmd.exe” exit with code 1.
I am not familiar with GPU programming. Can you give me some tips to fix this problem? Thanks.
Miller

Are you using CUDA8.0 or CUDA9.0?

How do you copy SiftData structure ?

Dear author,

First of all, thanks for developing such nice software.

I am having a problem that you might be able to help with.
I am trying to do SIFT extraction on many images and match them in a different function/scope. For example, I want to extract SIFT features from 100 images. Later on, I would like to do SIFT matching between image 53 and image 96.
In order to do that, I have to keep the SiftData structures somewhere else in memory. So the question is: how do you copy a SiftData structure to another variable?
The problem is that I can copy "h_data" with memcpy, but I don't know what to do with "d_data". It seems that this pointer points to somewhere on my GPU device. Ignoring "d_data" gives me an error when I call MatchSiftData().
Correct me if I am wrong, since I am not a GPU guy.

Could you give me some advice?

Descriptors convertible to OpenCV?

Am I right to assume that data in SiftPoint contains a descriptor? Can I just take these arrays from all SiftPoints and convert them to an OpenCV descriptor Mat? Are they normalized to 0 to 255? Or do I need the values like scale, sharpness and so on in some way too?

I won't have time to build and test it myself the next few days, so it would be cool to know.

Problem in matching.cu

Cannot find -lopencv_dep_cudart error

The following error shows up while doing sudo make install after cmake inside the build folder

[ 14%] Linking CXX executable cudasift
/usr/bin/ld: cannot find -lopencv_dep_cudart
collect2: error: ld returned 1 exit status
CMakeFiles/cudasift.dir/build.make:757: recipe for target 'cudasift' failed
make[2]: *** [cudasift] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/cudasift.dir/all' failed
make[1]: *** [CMakeFiles/cudasift.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

Possible unnecessary __syncthreads using for kernel function FindMaxCorr in matching.cu

Hello there!

Currently I am studying the CudaSift source code and may have found some unnecessary uses of __syncthreads() in some kernel functions in matching.cu.

For kernel function FindMaxCorr,

__global__ void FindMaxCorr(float *corrData, SiftPoint *sift1, SiftPoint *sift2, int numPts1, int corrWidth, int siftSize)
{
  .........
  if (tx==6)
    sift1[p1].score = maxScore[ty*16];
  if (tx==7)
    sift1[p1].ambiguity = maxScor2[ty*16] / (maxScore[ty*16] + 1e-6);
  if (tx==8)
    sift1[p1].match = maxIndex[ty*16];
  if (tx==9)
    sift1[p1].match_xpos = sift2[maxIndex[ty*16]].xpos;
  if (tx==10)
    sift1[p1].match_ypos = sift2[maxIndex[ty*16]].ypos;
  __syncthreads();
}

At line 160, just before the kernel function finishes, FindMaxCorr calls __syncthreads(). What confuses me is that line 160 is the last statement the kernel executes, so shouldn't it be unnecessary to synchronize threads here?

The same issue applies to FindMaxCorr1, FindMaxCorr2 and FindMaxCorr3.

Thanks very much! :)

May I ask what is Lowpass (host) function doing? Didn't find that in original paper

Hi, first of all thank you for providing the CUDA version of SIFT, as it's very helpful for me while teaching myself computer vision. I tried to follow the code all the way down but got stuck somewhere in the middle. I couldn't figure out what the Lowpass function (the host one) is doing before the SIFT descriptors are extracted, or what the global variables beginning with "LOWPASS_" are. I tried to find an explanation in the paper but failed (or maybe I didn't read it carefully enough). So may I ask for some guidance on that? Thank you. Since this is not a bug I will close it as soon as it is answered.

the sequence of descriptor

Hello, what is the difference between the 128-dimensional descriptor of CudaSift and that of Lowe's SIFT? For the same position in the same image, are the descriptors the same? If not, what is the difference? Thank you very much.

ubuntu run cudasift failure with LowPass() execution failed

I clone the newest code to run on ubuntu 16.04 (Sep 11, 2018), failed with the following error

Image size = (0,0)
Initializing data...
Device Number: 0
  Device name: GeForce GTX 1050 Ti
  Memory Clock Rate (MHz): 3504
  Memory Bus Width (bits): 128
  Peak Memory Bandwidth (GB/s): 112.1

Failed to allocate device data
Failed to allocate device data
Failed to allocate device data
checkMsg() CUDA error: LowPass() execution failed
 in file </home/aqrose/siqin_all/siqin_project/CudaSift/cudaSiftH.cu>, line 416 : invalid configuration argument.

Could anyone tell me how to handle this problem ?

dense sift?

Is there an implementation for dense SIFT?

Matching.cu

Hi,

In the matching.cu, to my understanding the following kernel code in function FindMaxCorr is used to do the 'reduce job' of finding the best 2 matches. However, I am sort of confused about the outcome. Since in one block all threads with tx < 8 are concurrent threads how can we guarantee that the best match result is eventually swapped into the first entry? I would appreciate a lot if you can explain this part a little bit.
__syncthreads();
for (int len=8;len>0;len/=2) {
if (tx<8) {
float val = maxScore[idx+len];
int i = maxIndex[idx+len];
if (val>maxScore[idx]) {
maxScor2[idx] = maxScore[idx];
maxScore[idx] = val;
maxIndex[idx] = i;
} else if (val>maxScor2[idx])
maxScor2[idx] = val;
float va2 = maxScor2[idx+len];
if (va2>maxScor2[idx])
maxScor2[idx] = va2;
}
__syncthreads();
I tried two images of my own and found all matches have a very high ambiguity.
Should the following be correct?
sift1[p1].ambiguity = (1 - maxScore[ty * 16]) / (1 - maxScore2[ty * 16] + 1e-6);

Thanks!
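
To make the reduction easier to follow, here is a serial C++ emulation of the halving loop quoted above, merging (best, second best, index) triples across 16 lanes. It is a sketch under assumptions: it uses `lane < len` at each step instead of the `tx<8` test in the excerpt, it does not model GPU thread scheduling, and the names Top2, mergeTop2 and reduceTop2 are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-lane partial result: best score, second-best score, index of the best.
struct Top2 {
  float best;
  float second;
  int index;
};

// Fold `other` into `self`, keeping the two largest scores seen by either.
void mergeTop2(Top2& self, const Top2& other) {
  if (other.best > self.best) {
    self.second = self.best;  // the old best becomes a runner-up candidate
    self.best = other.best;
    self.index = other.index;
  } else if (other.best > self.second) {
    self.second = other.best;
  }
  if (other.second > self.second) self.second = other.second;
}

// Serial emulation of the halving loop: after the last step lane 0 holds
// the overall result. The lane count must be a power of two.
Top2 reduceTop2(std::vector<Top2> lanes) {
  assert(!lanes.empty() && (lanes.size() & (lanes.size() - 1)) == 0);
  for (std::size_t len = lanes.size() / 2; len > 0; len /= 2)
    for (std::size_t lane = 0; lane < len; ++lane)
      mergeTop2(lanes[lane], lanes[lane + len]);
  return lanes[0];
}
```

Each step folds lane `lane + len` into lane `lane`, so whichever lane held the maximum is pulled into a lower lane at every step until it reaches lane 0. On the GPU, the barrier between steps is what makes the values read in one step the ones written in the previous step.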

LaplaceMulti() execution failed

Hello,

I'm trying to run the mainSift.cpp program on a Tesla M2090. First of all I had to change all sm_35 occurrences into sm_20 in CMakeLists.txt. After that, this message is returned during the execution of LaplaceMulti():

checkMsg() CUDA error: LaplaceMulti() execution failed
 in file </ghome/rzhengac/Downloads/CudaSift-Maxwell/cudaSiftH.cu>, line 324 : unknown error.

Why does this happen?

Rotation invariance

Hi,
After replacing OpenCV's implementation of SIFT with CudaSift in my code I noticed that it could no longer handle much rotation between the images being matched. Is this a known limitation to CudaSift, or am I missing something?

When testing on mainSift.cpp and the sample data I get the following on the original images img1.png and img2.png:
Number of original features: 1818 1978
Number of matching features: 792 801 43.5644% 1 3

When rotating one of the images 180 degrees I get the following:
Number of original features: 1818 2040
Number of matching features: 261 400 14.3564% 1 3

CudaSift 3.0?

Hi @Celebrandil,

Thanks for the sweet library, with an agreeable license! It's quick to compile and easy to use, and I've started in on some changes that you may or may not want. If I'm heading in a direction you would rather not, simply ignore. My fork is here.

It's not PR ready at this point, but the main points are in there / ready to be changed if you have specific preferences.

CMake Changes

Separate out into library and executable.
- Goal is to support parent CMake projects
- Library does not need OpenCV, only application.
- Default action: no parent CMake -> build demo, if parent CMake do not.
Make VERBOSE an option
More explicit options for CUDA architectures, as well as overrides available if users so choose.
- Maybe the branch Maxwell would not be appropriate anymore.
TODO: revisit the install stuff given layout changes.
- Note to self: rpath of executable built needs to be explicit
There were some sections that gave different flags for Windows vs Apple vs Linux that I did not really understand. No SSE2 for Windows, and linux got a "-lineinfo" added (as opposed to all platforms). Did the reinterpretation given that Verbose is an option now make sense / should it remain this way?

Layout Changes

Since I desire to build this as a submodule, the split was to take *.h and put them in include/cudaSift, the two .cpp demo files in a demo/ folder, and the implementations went in src/. I have a very strong preference for the include/ and src split, because I like to have #include <cudaSift/sift.h> over #include <cudaSift.h>.

To be clear, this is very much a personal preference. I did it not as a suggestion, but because I was already going to be changing a lot.

Note: things like cudaSiftH.h -> cudaSift/sift_host.h are completely open to change. I couldn't decide whether your preference would be camel case or underscore, etc. E.g. cudaSift/SiftH.h or cudaSift/siftH.h etc. Happy to change the names back to whatever!

Code Changes

CUDA 9 deprecated __shfl_xor. This fix seems to work as expected, but it may not be completely correct. The matches seem to be consistent with and without that code in there, but image diffs (my comparison method) in this regard are generally useless.
Everything is in namespace cudaSift. Promotion to a library build makes introducing the namespace appropriate in my opinion.
cudaSift::CudaImage felt a little awkward to me, so I changed it to just be cudaSift::Image. Everything else stayed the same.

Future Work

Edit:

These global symbols:

///////////////////////////////////////////////////////////////////////////////
// Kernel configuration
///////////////////////////////////////////////////////////////////////////////

__constant__ float d_Threshold[2];
__constant__ float d_Scales[8], d_Factor;
__constant__ float d_EdgeLimit;
__constant__ int d_MaxNumPoints;

__device__ unsigned int d_PointCounter[1];
__constant__ float d_Kernel1[5];
__constant__ float d_Kernel2[12*16];

were changed to be encapsulated in a struct SiftKernelParams that is managed by SiftData. There is a host and device copy, items are copied back and forth as needed. Basically, cudaMemcpy{To,From}Symbol calls were changed to just cudaMemcpyAsync. After numerous runs, there doesn't appear to be any noticeable performance change. Each of these are small enough that they probably don't have too many cache misses?

This had to be done since multiple streams cannot necessarily share any of these parameters.

Non-Deterministic result of CudaSift

Thank you for this implementation!

When I tested CudaSift, I found that the result is non-deterministic: for the same image, the program outputs different features. Is this abnormal?

For example, I compiled and ran mainSift.cpp with the max SIFT count set to 3, and the results are:
Run1
xpos, ypos, score, descriptor[0], descriptor[1], descriptor[2]
1663.008179 215.823730 149.000000 0.001161 0.047922 0.060676
1235.375366 263.208405 111.000000 0.000000 0.059338 0.047047
1590.166992 1039.981323 201.000000 0.007714 0.001538 0.000000

Run2
xpos, ypos, score, descriptor[0], descriptor[1], descriptor[2]
1663.008179 215.823730 149.000000 0.001161 0.047922 0.060676
1660.701904 254.537323 111.000000 0.001048 0.023971 0.046532
1590.166992 1039.981323 201.000000 0.007714 0.001538 0.000000

matching.cu does not compute the L2 norm?

Is matching.cu computing the L2 norm? The line:

sum += siftPoint[16*j+tx] * ptr2[16*j+tx];

seems to be just an element-by-element multiplication of the feature vectors. I am not sure how this computes a distance between the two SIFT descriptors?

Additionally, the sums are being done many times for the same element of the feature vector, since in this setting tx would range from 0 to 15...
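
One way to relate the two quantities, assuming the descriptors are normalized to unit length (an assumption about CudaSift that this page does not confirm): for unit vectors, ||a - b||^2 = 2 - 2(a . b), so the candidate with the largest dot product is also the one with the smallest L2 distance. The sketch below checks the identity numerically; all function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

float dot(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float s = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Scale a non-zero vector to unit L2 norm.
std::vector<float> normalized(std::vector<float> v) {
  const float n = std::sqrt(dot(v, v));
  assert(n > 0.0f);
  for (float& x : v) x /= n;
  return v;
}

// Squared L2 distance computed directly from the element differences.
float squaredL2(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float s = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return s;
}

// Squared L2 distance recovered from the dot product; valid only when both
// inputs have unit norm.
float squaredL2FromDot(const std::vector<float>& a, const std::vector<float>& b) {
  return 2.0f - 2.0f * dot(a, b);
}
```

On the second point: if j runs from 0 to 7 (not shown in the quoted line), the index 16*j+tx with tx in 0..15 visits each of the 128 elements exactly once, so the threads would be sharing the work rather than repeating it.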

Some proposal for less gpu memory cost

First of all, thank you very much for this excellent work.
While working on my cuda-affine-sift project based on this, it ran out of memory when the SIFT point count grew to more than 25600.
I resolved this problem with these steps:
in file matching.cu, function MatchSiftPoints2:

I removed the memory copy code (from sift1, sift2 to siftPoints1, siftPoints2), because I think pt1 and pt2 can point directly to sift1[n1].data and sift2[n2].data.
We only need to know the max sum score, the max sum index and the second max sum score for each point in sift1. If we define the grid and block like those for function FindMaxCorr:
dim3 blocksMax(iDivUp(numPts1, 16));
dim3 threadsMax(16, 16);
then these arrays are enough:
__shared__ float maxScore[16 * 16];
__shared__ float maxScor2[16 * 16];
__shared__ int maxIndex[16 * 16];

Yes, as you may guess, I combined the functions MatchSiftPoints2 and FindMaxCorr, and compare the sum values as soon as they are calculated. My code:

__global__ void MatchSiftPoints(float *corrData, SiftPoint *sift1, SiftPoint *sift2, int numPts1, int numPts2)
{
int block_dim = blockDim.x; // blockDim.x == 16
const int tx = threadIdx.x;
const int ty = threadIdx.y;
const int p1 = blockIdx.x * block_dim + ty;
const int idx = ty * 16 + tx;

__shared__ int maxIndex[16 * 16];
maxIndex[idx] = 0;
__syncthreads();

float *corrs = NULL;

if (p1 < numPts1) {
	corrs = &corrData[p1 * block_dim * 2];
	corrs[tx] = 0.0f;
	corrs[tx + 16] = 0.0f;
	const float *pt1 = sift1[p1].data;
	for (int p2 = tx; p2 < numPts2; p2 += 16) {
		float *pt2 = sift2[p2].data;
		float sum = 0.0f;
		for (int i = 0; i < 128; i++) {
			//int itx = (i + tx) & 127; // avoid bank conflicts
			sum += pt1[i] * pt2[i];
		}
		if (sum > corrs[tx]) {
			corrs[tx + 16] = corrs[tx];
			corrs[tx] = sum;
			maxIndex[idx] = p2;
		}
		else if (sum > corrs[tx + 16])
			corrs[tx + 16] = sum;
	}
}
__syncthreads();

//if (p1==1)
//  printf("tx = %d, score = %.2f, scor2 = %.2f, index = %d\n", 
//	   tx, maxScore[idx], maxScor2[idx], maxIndex[idx]);

if (p1 < numPts1) {
	for (int len = 8; len > 0; len /= 2) {
		if (tx < len) {
			float val = corrs[tx + len];
			int i = maxIndex[idx + len];
			if (val > corrs[tx]) {
				corrs[tx + 16] = corrs[tx];
				corrs[tx] = val;
				maxIndex[idx] = i;
			}
			else if (val > corrs[tx + 16])
				corrs[tx + 16] = val;
			float va2 = corrs[tx + 16 + len];
			if (va2 > corrs[tx + 16])
				corrs[tx + 16] = va2;
		}
		__syncthreads();
		//if (p1 == 1 && tx<len)
		//	printf("tx = %d, score = %.2f, scor2 = %.2f, index = %d\n",
		//		tx, corrs[tx], corrs[tx + 16], maxIndex[idx]);
	}
	if (tx == 6)
		sift1[p1].score = corrs[0];
	else if (tx == 7)
		sift1[p1].ambiguity = corrs[16] / (corrs[0] + 1e-6);
	else if (tx == 8)
		sift1[p1].match = maxIndex[ty << 4];
	else if (tx == 9)
		sift1[p1].match_xpos = sift2[maxIndex[ty << 4]].xpos;
	else if (tx == 10)
		sift1[p1].match_ypos = sift2[maxIndex[ty << 4]].ypos;
}
__syncthreads();
//if (p1 == 1 && tx == 0)
//	printf("index = %d/%d, score = %.2f, ambiguity = %.2f, match = %d\n",
//		p1, numPts1, sift1[p1].score, sift1[p1].ambiguity, sift1[p1].match);

}

Code affected in function MatchSiftData:

int block_dim = 16;
float *d_corrData;
int corrSize = numPts1 * block_dim * 2;
safeCall(cudaMalloc((void **)&d_corrData, sizeof(float) * corrSize));
//double allocTime = timer.read();

dim3 blocks(iDivUp(numPts1, block_dim));
dim3 threads(block_dim, block_dim); // each block: 1 points x 16 points
//std::cout << "numPts1=" << numPts1 << ",numPts2=" << numPts2 << ",corrSize=" << corrSize << std::endl;
MatchSiftPoints<<<blocks, threads>>>(d_corrData, sift1, sift2, numPts1, numPts2);
safeCall(cudaThreadSynchronize());
checkMsg("MatchSiftPoints() execution failed\n");

//double matchTime = timer.read();
//printf("MatchSiftPoints time =          %.2f ms\n", (matchTime - allocTime));

safeCall(cudaFree(d_corrData));

It works for me, as I can match more than 32000 points for each image on GTX1060 6G.
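
The memory saving can be estimated with a little arithmetic. The reduced size follows the proposal's own corrSize = numPts1 * block_dim * 2. The full-matrix size is an assumption: one float per (point1, point2) pair with the second dimension padded to a multiple of 16, inferred from the corrData indexing quoted in an earlier issue. Function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Round a up to the next multiple of b.
std::size_t alignUp(std::size_t a, std::size_t b) {
  assert(b > 0);
  return (a + b - 1) / b * b;
}

// Assumed original layout: one float per (point1, point2) pair, with the
// second dimension padded to a multiple of 16.
std::size_t fullCorrBytes(std::size_t numPts1, std::size_t numPts2) {
  return sizeof(float) * numPts1 * alignUp(numPts2, 16);
}

// Layout from the proposal: 16 best and 16 second-best scores per point.
std::size_t reducedCorrBytes(std::size_t numPts1) {
  const std::size_t blockDim = 16;
  return sizeof(float) * numPts1 * blockDim * 2;
}
```

Under these assumptions and with 4-byte floats, 25600 points per image would need about 2.6 GB for the full matrix versus about 3.3 MB for the reduced buffer.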

Error while executing ./cudasift

Message got after executing ./cudasift

Image size = (0,0)
Initializing data...
Device Number: 0
Device name: GeForce GTX 1060 6GB
Memory Clock Rate (MHz): 4004
Memory Bus Width (bits): 192
Peak Memory Bandwidth (GB/s): 192.2

Failed to allocate device data
Failed to allocate device data
Download time = 0.01 ms
Download time = 0.00 ms
Failed to allocate device data
checkMsg() CUDA error: LowPass() execution failed
in file </opt/CudaSift/cudaSiftH.cu>, line 351 : invalid configuration argument.

why descriptor doesn't work

Hello, first of all thank you very much for your contribution of a CUDA version of SIFT. I am now working on an image search project, but when I use CudaSift to extract the SIFT features of both the query image and the database images, the features of the query image don't work. When I use the CPU with the vlfeat library to extract SIFT features, I can achieve an accuracy of 62.6%, but with CudaSift I can only achieve 6.2%. What is even stranger is that if I just generate the 128-dimensional descriptors randomly, I also achieve 6%. Can I get some suggestions about this? Thank you very much.

matching siftPoints of images with different sizes

hello and thanks for your great code.
When I run the project to detect points and compare two images with different sizes, the project crashes.
What should I do?

thanks.

So what is the "right" value for initBlur ?

Hello,
I am used to the matlab vlfeat to manage sift features. The command line is :

edge_thresh = 2.5;
[xy,descritor id] = vl_sift(single(rgb2gray(image_i)), 'edgethresh', edge_thresh) ;

With Cuda Sift it is :

upscale = false.
ExtractSift(siftData1, img1, 7, initBlur, thresh, 0.0f, upscale);

A priori, i know nothing about the image quality. So what is the "right" value for initBlur ?

Thank you

Copyright.txt, Version, and sm_52 architecture ...

I'm playing with CudaSift. Cool job...
But, in order to have it installed easily, you may have to:

Add a "Copyright.txt" file under the root folder
Better to have a version file, now 4.0.0?
sm_52 is for my computer now. So maybe you can let users choose their own CUDA architecture?

Cheers
Pei

Hello, my graphics card has the Pascal architecture. Can it run your program?

Windows 10, GTX 1060

Pascal cards

Has anyone tried the code on other Pascal cards? I would really like to see the performance on an NVidia Titan X. Unfortunately, I haven't yet been able to order one.

Wrong Sift points coordinates

Hi, thanks a lot for your great work

I am having an issue with the code
Whenever I try to read the SIFT points after extracting them from the image, their x and y coordinates aren't where they should be.

I am simply drawing these points over the image with OpenCV and displaying them in real time, and the SIFT points seem to form the shapes of objects in the image, but at a larger scale and with a big offset.

I tried multiplying and dividing the x and y coordinates by the scale attribute of the corresponding SIFT point, but it didn't work.

Any idea how this might be solved?

Note: i am running the code on a GT 750m (Kepler w/ cc of 3.0, 4GB DDR3, 12ms to run)

error MSB6006: "cmd.exe" exited with code 1.

I built this project with CMake 3.9.1 and chose Visual Studio 2013. After generating the solution file, I opened it and rebuilt, but it always shows error MSB6006: "cmd.exe" exited with code 1. I have double-checked that there is no space in the file path, and I also ran it as administrator. It still has the same problem.

Non-deterministically getting SiftPoint x, y positions outside image boundary?

Hello!

Thanks for the great library. We were using version three and are now trying version four. However, we notice a problem:

Occasionally, when we run the following code, we get SIFT points that live outside our image boundary (e.g., (341, 491) in an image that is (640, 480)). We aren't sure why this is happening because we would expect that given identical images, we should get identical SIFT results (barring floating point errors, but 10+ pixels seems large).

For now, we simply throw out such SIFT points, but are curious to know if you've seen this issue before or if you have a hunch as to why it might be happening. We don't believe we're doing anything special, but please let us know if you think we might be doing something wrong!

Many thanks again for the work and any help!
Daniel

// gray is a 640x480 grayscale image read from a 16-bit PNG
gray.convertTo(gray, CV_32FC1);
unsigned int w = gray.cols;
unsigned int h = gray.rows;

InitCuda();
CudaImage cudaImage;
cudaImage.Allocate(w, h, iAlignUp(w, 128), false, NULL, (float*) gray.data);
cudaImage.Download();

ExtractSift(siftData, cudaImage, 5, 0.0f, 3.0f, 0.0f, 1.0f);
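
The workaround mentioned above (throwing out points that fall outside the image) could look like the following host-side sketch. The Keypoint struct is a hypothetical stand-in with only the position fields, and dropOutOfBounds is an invented helper; neither is part of CudaSift.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal keypoint; only the position is needed here.
struct Keypoint {
  float xpos, ypos;
};

// Remove keypoints that fall outside [0, width) x [0, height) and return
// how many were dropped. The negated test also drops NaN coordinates.
std::size_t dropOutOfBounds(std::vector<Keypoint>& pts, int width, int height) {
  assert(width > 0 && height > 0);
  const std::size_t before = pts.size();
  pts.erase(std::remove_if(pts.begin(), pts.end(),
                           [=](const Keypoint& p) {
                             return !(p.xpos >= 0.0f && p.xpos < width &&
                                      p.ypos >= 0.0f && p.ypos < height);
                           }),
            pts.end());
  return before - pts.size();
}
```

Logging the returned count per frame would also show how often the problem occurs, which may help when reporting it.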

drawMatches method

Hi, I just want to ask if there is any method to draw matches between a pair of images just like OpenCV's drawMatches ?

saving descriptors

hello.
In a project I want to save keypoints and descriptors in a database.
Given d_data and h_data in SiftData, how can I separate keypoints and descriptors and save them?
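
As a rough sketch of one way to do this, assuming the points are already available on the host (another issue on this page notes that h_data can be copied with memcpy): walk the host array once and append the keypoint fields and the 128-float descriptor to two separate flat buffers. HostSiftPoint is a hypothetical struct loosely modelled on fields mentioned on this page (xpos, ypos, scale, data); the real SiftPoint layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side point: a few keypoint fields plus a 128-float
// descriptor. Not the actual CudaSift struct.
struct HostSiftPoint {
  float xpos, ypos, scale;
  float data[128];
};

struct SplitFeatures {
  std::vector<float> keypoints;    // numPts x 3, row-major: x, y, scale
  std::vector<float> descriptors;  // numPts x 128, row-major
};

// Copy keypoint fields and descriptors into two contiguous buffers that
// can be written to a file or database independently.
SplitFeatures splitFeatures(const HostSiftPoint* pts, std::size_t numPts) {
  assert(pts != nullptr || numPts == 0);
  SplitFeatures out;
  out.keypoints.reserve(numPts * 3);
  out.descriptors.reserve(numPts * 128);
  for (std::size_t i = 0; i < numPts; ++i) {
    out.keypoints.push_back(pts[i].xpos);
    out.keypoints.push_back(pts[i].ypos);
    out.keypoints.push_back(pts[i].scale);
    out.descriptors.insert(out.descriptors.end(), pts[i].data, pts[i].data + 128);
  }
  return out;
}
```

Row i of each buffer belongs to the same feature, so the two tables can be joined again by index when loading.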