nepluno / lbfgsb-gpu
An open source library for the GPU implementation of the L-BFGS-B algorithm
Home Page: http://dx.doi.org/10.1016/j.cag.2014.01.002
License: Mozilla Public License 2.0
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 21809c0..06ef45d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -15,6 +15,10 @@ add_definitions (-DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE})
find_package(CUDA REQUIRED)
+#foreach(ComputeCapability 60 61 70 72 75 86)
+# set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode;arch=compute_${ComputeCapability},code=sm_${ComputeCapability})
+#endforeach()
+
if (CMAKE_BUILD_TYPE MATCHES Debug)
add_definitions (-DDEBUG)
endif (CMAKE_BUILD_TYPE MATCHES Debug)
Refer to your GPU's compute capability when enabling the gencode flags above.
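For example, to build only for a single architecture, the commented block in the diff above can be enabled with just the matching value (7.5 here is an assumption; substitute your own GPU's compute capability):

```cmake
# Sketch: generate code for one compute capability only.
# 75 is assumed (e.g. an RTX 2060 Super); replace it with your GPU's CC.
foreach(ComputeCapability 75)
  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode;arch=compute_${ComputeCapability},code=sm_${ComputeCapability})
endforeach()
```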
Hi, I have some issues. I followed your install instructions and changed the CUDA Toolkit version to 10.1.
The program didn't report any errors. However, sometimes it only displayed the Voronoi diagram of the initial sites (not the CVT), which means the L-BFGS-B part didn't work and returned immediately. At other times it simply blocked in some "copy" or "multiply" function (such as cudaMemcpy and lbfgsbcuda::minimize::vdot_vv).
I found that the kernel functions launched before these "copy" or "multiply" functions have some issue, because the program couldn't print some variables on the device after it called the kernel function lbfgsbcuda::cmprlb::prog1 in lbfgsbcmprlb.
It was fine if col == 1, but the kernel function seems unable to terminate if col > 1, and then x, g, and d can't be printed.
I tried changing the "2" in line 143 to "1", and then it worked, but I'm not sure whether that affects the result.
The kernel functions prog2 in subsm.cu and prog1 in cmprlb all seem to have the same issue.
I'm not very familiar with CUDA programming and kernel functions.
Could you give me some ideas or a solution, please?
If n (i.e. N_elements) is less than 9, the CPU version doesn't work when minimizing a Rosenbrock function.
#include <iostream>
#include <chrono>
#include <cmath>
#include <cstring>
#include <limits>
#include <culbfgsb.h>

using namespace std;

const int n = 9;
// if n < 9, lbfgsbcuda::lbfgsbminimize will not work and returns -1
// const int n = 4;

void rosenbrock(const float* x, float& fx, float* grad)
{
  fx = 0.0f;
  for (int i = 0; i < n; i += 2) {
    float t1 = 1.0f - x[i];
    float t2 = 10.0f * (x[i + 1] - x[i] * x[i]);
    grad[i + 1] = 20.0f * t2;
    grad[i] = -2.0f * (x[i] * grad[i + 1] + t1);
    fx += t1 * t1 + t2 * t2;
  }
  // cout << fx << endl;
}
int main()
{
  LBFGSB_CUDA_OPTION<float> lbfgsb_options;
  lbfgsbcuda::lbfgsbdefaultoption<float>(lbfgsb_options);
  lbfgsb_options.mode = LCM_NO_ACCELERATION;
  lbfgsb_options.eps_f = static_cast<float>(1e-8);
  lbfgsb_options.eps_g = static_cast<float>(1e-8);
  lbfgsb_options.eps_x = static_cast<float>(1e-8);
  lbfgsb_options.max_iteration = 1000;

  // initialize the L-BFGS-B state
  LBFGSB_CUDA_STATE<float> state;
  memset(&state, 0, sizeof(state));
  float* assist_buffer_cpu = nullptr;
  float minimal_f = std::numeric_limits<float>::max();

  // set up the callback that evaluates the function value and its gradient
  state.m_funcgrad_callback = [&assist_buffer_cpu, &minimal_f](
      float* x, float& f, float* g,
      const cudaStream_t& stream,
      const LBFGSB_CUDA_SUMMARY<float>& summary) {
    rosenbrock(x, f, g);
    if (summary.num_iteration % 100 == 0) {
      std::cout << "CPU iteration " << summary.num_iteration << " F: " << f
                << std::endl;
    }
    minimal_f = fmin(minimal_f, f);
    return 0;
  };

  // initialize CPU buffers
  int N_elements = n;
  float* x = new float[N_elements];
  float* g = new float[N_elements];
  float* xl = new float[N_elements];
  float* xu = new float[N_elements];
  memset(x, 0, N_elements * sizeof(x[0]));  // starting point x = 0

  // in this example, we don't have boundaries
  memset(xl, 0, N_elements * sizeof(xl[0]));
  memset(xu, 0, N_elements * sizeof(xu[0]));

  // initialize the number of bounds (0 for this example)
  int* nbd = new int[N_elements];
  memset(nbd, 0, N_elements * sizeof(nbd[0]));

  LBFGSB_CUDA_SUMMARY<float> summary;
  memset(&summary, 0, sizeof(summary));

  // call the optimizer
  auto start_time = std::chrono::steady_clock::now();
  lbfgsbcuda::lbfgsbminimize<float>(N_elements, state, lbfgsb_options, x, nbd,
                                    xl, xu, summary);
  auto end_time = std::chrono::steady_clock::now();
  std::cout << "Timing: "
            << (std::chrono::duration<float, std::milli>(end_time - start_time)
                    .count() /
                static_cast<float>(summary.num_iteration))
            << " ms / iteration" << std::endl;

  // release allocated memory
  delete[] x;
  delete[] g;
  delete[] xl;
  delete[] xu;
  delete[] nbd;
  delete[] assist_buffer_cpu;
  return 0;
}
Hi,
first of all, thank you for making this code available!
In the README you mention one should allocate a buffer for g. But there is no way to actually make the library use that buffer; a new buffer will always be allocated inside lbfgsbminimize.
The example allocates a buffer outside, but it is never used.
Do you maybe have a version of the code that allows a user to hand over the gradient buffer to use, or would you accept a PR changing the API of lbfgsbminimize to accept a buffer?
-Felix
Hello,
I am looking to use this library in a project, so I compiled it and ran the test dsscfg program. It failed with what looks like an out-of-bounds write to a global memory array somewhere in the matupd::kernel20 kernel function.
This is on a GeForce RTX 2060 Super (CC 7.5) with CUDA 11.2.
$ ./bin/dsscfg
Begin testing DSSCFG on the CPU (double precision)
CPU iteration 0 F: -0.349732
CPU iteration 0 F: 0.0192875
CPU iteration 0 F: -0.496264
CPU iteration 100 F: -0.888434
CPU iteration 200 F: -1.00093
CPU iteration 300 F: -1.01115
CPU iteration 400 F: -1.01126
CPU iteration 500 F: -1.01129
Timing: 9.94557 ms / iteration
Begin testing DSSCFG with CUDA (double precision)
CUDA iteration 0 F: -0.349732
CUDA iteration 0 F: -0.199307
CUDA iteration 0 F: -0.496264
lbfgsb failure: 700, /home/alex/lbfgsb-gpu/culbfgsb/./cauchy.cu, 594
Here's the cuda-memcheck output:
========= Invalid __global__ write of size 8
========= at 0x000003d0 in void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const )
========= by thread (0,0,0) in block (135,0,0)
========= Address 0x7f7999e22038 is out of bounds
========= Device Frame:void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const ) (void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const ) : 0x3d0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x222dd8]
========= Host Frame:./bin/dsscfg [0x2136b]
========= Host Frame:./bin/dsscfg [0x6de20]
========= Host Frame:./bin/dsscfg [0xd0c25]
========= Host Frame:./bin/dsscfg [0x9b098]
========= Host Frame:./bin/dsscfg [0x9e633]
========= Host Frame:./bin/dsscfg [0x102a4]
========= Host Frame:./bin/dsscfg [0x9db6]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
========= Host Frame:./bin/dsscfg [0xdf4e]
=========
bin/dsscfg
Begin testing DSSCFG on the CPU (double precision)
CPU iteration 0 F: -0.349732
CPU iteration 0 F: 0.0192875
CPU iteration 0 F: -0.496264
CPU iteration 100 F: -0.888434
CPU iteration 200 F: -1.00093
CPU iteration 300 F: -1.01115
CPU iteration 400 F: -1.01126
CPU iteration 500 F: -1.0113
Timing: 34.2264 ms / iteration
Begin testing DSSCFG with CUDA (double precision)
CUDA iteration 0 F: -0.349732
CUDA iteration 0 F: -0.199307
CUDA iteration 0 F: -0.496264
lbfgsb failure: 700, /home/dongjun/lbfgsb-gpu-master/culbfgsb/./cauchy.cu, 594
In lines 34 to 40 of active.cu at commit f3522f6, the projection logic differs from the reference implementation. In Nocedal et al.'s Fortran code (http://users.iems.northwestern.edu/~nocedal/lbfgsb.html) it is:
if (nbd(i) .gt. 0) then
   if (nbd(i) .le. 2 .and. x(i) .le. l(i)) then
      if (x(i) .lt. l(i)) then
         prjctd = .true.
         x(i) = l(i)
      endif
      nbdd = nbdd + 1
   else if (nbd(i) .ge. 2 .and. x(i) .ge. u(i)) then
      if (x(i) .gt. u(i)) then
         prjctd = .true.
         x(i) = u(i)
      endif
      nbdd = nbdd + 1
   endif
endif
In the current active.cu code, for nbd[i] == 2 the upper bound is not enforced.
In the Fortran code, the else if (nbd(i) .ge. 2 .and. x(i) .ge. u(i)) branch would be hit. In the CUDA code, however, the first if condition already matches when nbd[i] == 2, the max is a no-op if x > ub > lb, and the else branch is never taken, so the min is never applied and the upper bound is not enforced.
I'm not sure what this means for the rest of the algorithm, though.
Hi @nepluno,
first, thanks a lot for sharing your code with the world!
I would like to use your GPU implementation of L-BFGS-B for my master's thesis project. One of the applications will be a large-scale (L2-)norm approximation with non-negativity constraints on the optimization variables, i.e. minimizing ||Ax - b||_2 subject to x >= 0.
To get familiar with the code, I created an example for such a problem with problem size 1000 (see the output below).
Similar results (very fast termination without getting close to the minimum for the GPU version) could be observed for other problems of this type.
My question is now: Do you think this is a limitation of your GPU algorithm? (I thought this could be possible because I observe the problem only when box bounds are enabled, so maybe your adaptations for finding the GCP do not work for this problem?) Is it a known issue?
If not: do you have an idea of what to change in my example to make things work? (See the examples folder in my fork.)
My CPU is an Intel(R) Core(TM) i7-7820X, and the GPU is a GeForce GTX 1080 Ti with CUDA 11.7. I get the following output:
problem size: 1000, constrained, CPU, single precision
Summary:
Total time: 981.328 ms
Iteration time: 7.85062 ms / iteration
residual f: 4.29088e-09
residual g: 0.000298678
residual x: 1.18977e-05
iterations: 125
info: 1
CPU result 8.92826e-07
problem size: 1000, constrained, GPU, single precision
Summary:
Total time: 17.6567 ms
Iteration time: 1.35821 ms / iteration
residual f: nan
residual g: 0
residual x: nan
iterations: 13
info: 0
GPU result 0.639691
problem size: 1000, unconstrained, CPU, single precision
Summary:
Total time: 1507.06 ms
Iteration time: 7.49779 ms / iteration
residual f: 5.06273e-09
residual g: 9.37757e-05
residual x: 0.00013773
iterations: 201
info: 1
CPU result 2.74595e-07
problem size: 1000, unconstrained, GPU, single precision
Summary:
Total time: 138.902 ms
Iteration time: 0.775991 ms / iteration
residual f: 7.42523e-09
residual g: 0.000114142
residual x: 0.00048579
iterations: 179
info: 0
GPU result 1.28257e-06
Process finished with exit code 0
It would be really nice if you could take a look at this. Thanks!