nepluno / lbfgsb-gpu
An open source library for the GPU implementation of the L-BFGS-B algorithm
Home Page: http://dx.doi.org/10.1016/j.cag.2014.01.002
License: Mozilla Public License 2.0
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 21809c0..06ef45d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -15,6 +15,10 @@ add_definitions (-DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE})
find_package(CUDA REQUIRED)
+#foreach(ComputeCapability 60 61 70 72 75 86)
+# set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode;arch=compute_${ComputeCapability},code=sm_${ComputeCapability})
+#endforeach()
+
if (CMAKE_BUILD_TYPE MATCHES Debug)
add_definitions (-DDEBUG)
endif (CMAKE_BUILD_TYPE MATCHES Debug)
Refer to your GPU's compute capability when enabling the gencode flags above.
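For example, to build only for a single architecture, the commented block in the diff above can be enabled with just the matching value (7.5 here is an assumption; substitute your own GPU's compute capability):

```cmake
# Sketch: generate code for one compute capability only.
# 75 is assumed (e.g. an RTX 2060 Super); replace it with your GPU's CC.
foreach(ComputeCapability 75)
  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode;arch=compute_${ComputeCapability},code=sm_${ComputeCapability})
endforeach()
```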
Hi, I have some issues. I followed your install instructions and changed the CUDA Toolkit version to 10.1.
The program didn't report any errors. However, sometimes it only displayed the Voronoi diagram of the initial sites (not the CVT), which means the L-BFGS-B part didn't work and returned immediately. At other times it simply blocked in some "copy" or "multiply" function (such as cudaMemcpy and lbfgsbcuda::minimize::vdot_vv).
I found that the kernel functions launched before these "copy" or "multiply" functions have some issue, because the program couldn't print some variables on the device after it called the kernel function lbfgsbcuda::cmprlb::prog1 in lbfgsbcmprlb.
It was fine if col == 1, but the kernel function seems unable to terminate if col > 1, and then x, g, and d can't be printed.
I tried changing the "2" in line 143 to "1", and then it worked, but I'm not sure whether that affects the result.
The kernel functions prog2 in subsm.cu and prog1 in cmprlb all seem to have the same issue.
I'm not very familiar with CUDA programming and kernel functions.
Could you give me some ideas or a solution, please?
If n (i.e. N_elements) is less than 9, the CPU version doesn't work when minimizing a Rosenbrock function.
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstring>
#include <limits>
#include <culbfgsb.h>

using namespace std;

const int n = 9;
// if n < 9, lbfgsbcuda::lbfgsbminimize will not work and returns -1
// const int n = 4;

void rosenbrock(const float* x, float& fx, float* grad)
{
  fx = 0.0f;
  for (int i = 0; i < n; i += 2) {
    float t1 = 1.0f - x[i];
    float t2 = 10.0f * (x[i + 1] - x[i] * x[i]);
    grad[i + 1] = 20.0f * t2;
    grad[i] = -2.0f * (x[i] * grad[i + 1] + t1);
    fx += t1 * t1 + t2 * t2;
  }
  // cout << fx << endl;
}
int main()
{
  LBFGSB_CUDA_OPTION<float> lbfgsb_options;
  lbfgsbcuda::lbfgsbdefaultoption<float>(lbfgsb_options);
  lbfgsb_options.mode = LCM_NO_ACCELERATION;
  lbfgsb_options.eps_f = static_cast<float>(1e-8);
  lbfgsb_options.eps_g = static_cast<float>(1e-8);
  lbfgsb_options.eps_x = static_cast<float>(1e-8);
  lbfgsb_options.max_iteration = 1000;

  // initialize the L-BFGS-B state
  LBFGSB_CUDA_STATE<float> state;
  memset(&state, 0, sizeof(state));
  float* assist_buffer_cpu = nullptr;
  float minimal_f = std::numeric_limits<float>::max();

  // set up the callback that evaluates the function value and its gradient
  state.m_funcgrad_callback = [&assist_buffer_cpu, &minimal_f](
      float* x, float& f, float* g,
      const cudaStream_t& stream,
      const LBFGSB_CUDA_SUMMARY<float>& summary) {
    rosenbrock(x, f, g);
    if (summary.num_iteration % 100 == 0) {
      std::cout << "CPU iteration " << summary.num_iteration << " F: " << f
                << std::endl;
    }
    minimal_f = fmin(minimal_f, f);
    return 0;
  };

  // initialize CPU buffers
  int N_elements = n;
  float* x = new float[N_elements];
  float* g = new float[N_elements];
  float* xl = new float[N_elements];
  float* xu = new float[N_elements];
  memset(x, 0, N_elements * sizeof(x[0]));  // starting point x = 0

  // in this example, we don't have boundaries
  memset(xl, 0, N_elements * sizeof(xl[0]));
  memset(xu, 0, N_elements * sizeof(xu[0]));

  // initialize the number of bounds (0 for this example)
  int* nbd = new int[N_elements];
  memset(nbd, 0, N_elements * sizeof(nbd[0]));

  LBFGSB_CUDA_SUMMARY<float> summary;
  memset(&summary, 0, sizeof(summary));

  // call the optimizer
  auto start_time = std::chrono::steady_clock::now();
  lbfgsbcuda::lbfgsbminimize<float>(N_elements, state, lbfgsb_options, x, nbd,
                                    xl, xu, summary);
  auto end_time = std::chrono::steady_clock::now();
  std::cout << "Timing: "
            << (std::chrono::duration<float, std::milli>(end_time - start_time)
                    .count() /
                static_cast<float>(summary.num_iteration))
            << " ms / iteration" << std::endl;

  // release allocated memory
  delete[] x;
  delete[] g;
  delete[] xl;
  delete[] xu;
  delete[] nbd;
  delete[] assist_buffer_cpu;
  return 0;
}
Hi,
first of all, thank you for making this code available!
In the README you mention one should allocate a buffer for g. But there is no way to actually make the library use that buffer; a new buffer will always be allocated inside lbfgsbminimize.
The example allocates a buffer outside, but it is never used.
Do you maybe have a version of the code that allows a user to hand over the gradient buffer to use, or would you accept a PR changing the API of lbfgsbminimize to accept a buffer?
-Felix
Hello,
I am looking to use this library in a project, so I compiled it and ran the test dsscfg program. It failed with what looks like an out-of-bounds write to a global memory array somewhere in the matupd::kernel20 kernel function.
This is on a GeForce RTX 2060 Super (CC 7.5) with CUDA 11.2.
$ ./bin/dsscfg
Begin testing DSSCFG on the CPU (double precision)
CPU iteration 0 F: -0.349732
CPU iteration 0 F: 0.0192875
CPU iteration 0 F: -0.496264
CPU iteration 100 F: -0.888434
CPU iteration 200 F: -1.00093
CPU iteration 300 F: -1.01115
CPU iteration 400 F: -1.01126
CPU iteration 500 F: -1.01129
Timing: 9.94557 ms / iteration
Begin testing DSSCFG with CUDA (double precision)
CUDA iteration 0 F: -0.349732
CUDA iteration 0 F: -0.199307
CUDA iteration 0 F: -0.496264
lbfgsb failure: 700, /home/alex/lbfgsb-gpu/culbfgsb/./cauchy.cu, 594
Here's the cuda-memcheck output:
========= Invalid __global__ write of size 8
========= at 0x000003d0 in void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const )
========= by thread (0,0,0) in block (135,0,0)
========= Address 0x7f7999e22038 is out of bounds
========= Device Frame:void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const ) (void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const ) : 0x3d0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x222dd8]
========= Host Frame:./bin/dsscfg [0x2136b]
========= Host Frame:./bin/dsscfg [0x6de20]
========= Host Frame:./bin/dsscfg [0xd0c25]
========= Host Frame:./bin/dsscfg [0x9b098]
========= Host Frame:./bin/dsscfg [0x9e633]
========= Host Frame:./bin/dsscfg [0x102a4]
========= Host Frame:./bin/dsscfg [0x9db6]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
========= Host Frame:./bin/dsscfg [0xdf4e]
=========
bin/dsscfg
Begin testing DSSCFG on the CPU (double precision)
CPU iteration 0 F: -0.349732
CPU iteration 0 F: 0.0192875
CPU iteration 0 F: -0.496264
CPU iteration 100 F: -0.888434
CPU iteration 200 F: -1.00093
CPU iteration 300 F: -1.01115
CPU iteration 400 F: -1.01126
CPU iteration 500 F: -1.0113
Timing: 34.2264 ms / iteration
Begin testing DSSCFG with CUDA (double precision)
CUDA iteration 0 F: -0.349732
CUDA iteration 0 F: -0.199307
CUDA iteration 0 F: -0.496264
lbfgsb failure: 700, /home/dongjun/lbfgsb-gpu-master/culbfgsb/./cauchy.cu, 594
In lines 34 to 40 of active.cu at commit f3522f6, the projection logic differs from the reference implementation. In Nocedal et al.'s Fortran code (http://users.iems.northwestern.edu/~nocedal/lbfgsb.html) it is:
if (nbd(i) .gt. 0) then
   if (nbd(i) .le. 2 .and. x(i) .le. l(i)) then
      if (x(i) .lt. l(i)) then
         prjctd = .true.
         x(i) = l(i)
      endif
      nbdd = nbdd + 1
   else if (nbd(i) .ge. 2 .and. x(i) .ge. u(i)) then
      if (x(i) .gt. u(i)) then
         prjctd = .true.
         x(i) = u(i)
      endif
      nbdd = nbdd + 1
   endif
endif
In the current active.cu code, for nbd[i] == 2 the upper bound is not enforced.
In the Fortran code, the else if (nbd(i) .ge. 2 .and. x(i) .ge. u(i)) branch would be hit. In the CUDA code, however, the first if condition already matches when nbd[i] == 2, the max is a no-op if x > ub > lb, and the else branch is never taken, so the min is never applied and the upper bound is not enforced.
I'm not sure what this means for the rest of the algorithm, though.
Hi @nepluno,
first, thanks a lot for sharing your code with the world!
I would like to use your GPU implementation of L-BFGS-B for my master's thesis project. One of the applications will be a large-scale (L2-)norm approximation with non-negativity constraints on the optimization variables, i.e. minimizing ||Ax - b||_2 subject to x >= 0.
To get familiar with the code, I created an example for such a problem with problem size 1000 (see the output below).
Similar results (very fast termination without getting close to the minimum for the GPU version) could be observed for other problems of this type.
My question is now: Do you think this is a limitation of your GPU algorithm? (I thought this could be possible because I observe the problem only when box bounds are enabled, so maybe your adaptations for finding the GCP do not work for this problem?) Is it a known issue?
If not: do you have an idea of what to change in my example to make things work? (See the examples folder in my fork.)
My CPU is an Intel(R) Core(TM) i7-7820X, and the GPU is a GeForce GTX 1080 Ti with CUDA 11.7. I get the following output:
problem size: 1000, constrained, CPU, single precision
Summary:
Total time: 981.328 ms
Iteration time: 7.85062 ms / iteration
residual f: 4.29088e-09
residual g: 0.000298678
residual x: 1.18977e-05
iterations: 125
info: 1
CPU result 8.92826e-07
problem size: 1000, constrained, GPU, single precision
Summary:
Total time: 17.6567 ms
Iteration time: 1.35821 ms / iteration
residual f: nan
residual g: 0
residual x: nan
iterations: 13
info: 0
GPU result 0.639691
problem size: 1000, unconstrained, CPU, single precision
Summary:
Total time: 1507.06 ms
Iteration time: 7.49779 ms / iteration
residual f: 5.06273e-09
residual g: 9.37757e-05
residual x: 0.00013773
iterations: 201
info: 1
CPU result 2.74595e-07
problem size: 1000, unconstrained, GPU, single precision
Summary:
Total time: 138.902 ms
Iteration time: 0.775991 ms / iteration
residual f: 7.42523e-09
residual g: 0.000114142
residual x: 0.00048579
iterations: 179
info: 0
GPU result 1.28257e-06
Process finished with exit code 0
It would be really nice if you could take a look at this. Thanks!