nepluno / lbfgsb-gpu

108 stars · 5 watchers · 16 forks · 77 KB

An open-source library for a GPU implementation of the L-BFGS-B algorithm

Home Page: http://dx.doi.org/10.1016/j.cag.2014.01.002

License: Mozilla Public License 2.0

Languages: Cuda 35.82%, C 0.35%, C++ 63.04%, CMake 0.80%

Topics: gpu, optimization-methods, lbfgs, lbfgsb-solver, lbfgsb, nonlinear-optimization, machine-learning-algorithms, machine-learning

lbfgsb-gpu's People

Contributors: nepluno


lbfgsb-gpu's Issues

Add GPU Compute Capability to fix Segmentation fault on RTX3090.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 21809c0..06ef45d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -15,6 +15,10 @@ add_definitions (-DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE})

 find_package(CUDA REQUIRED)

+foreach(ComputeCapability 60 61 70 72 75 86)
+  set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode;arch=compute_${ComputeCapability},code=sm_${ComputeCapability})
+endforeach()
+
 if (CMAKE_BUILD_TYPE MATCHES Debug)
   add_definitions (-DDEBUG)
 endif (CMAKE_BUILD_TYPE MATCHES Debug)

Refer to NVIDIA's CUDA GPUs page (https://developer.nvidia.com/cuda-gpus) to find the compute capability of your card.
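On newer CMake (3.18+, with CUDA enabled as a first-class language rather than through the `find_package(CUDA)` module this project currently uses), the equivalent fix would be a one-line setting — a sketch under that assumption, not the project's current build setup:

```cmake
# Build device code for each listed compute capability (sm_86 covers the RTX 3090).
# Only applies when CUDA is enabled via project(... LANGUAGES CXX CUDA).
set(CMAKE_CUDA_ARCHITECTURES 60 61 70 72 75 86)
```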

blocked in some kernel function

Hi, I have an issue. I followed your install instructions and changed the CUDA Toolkit version to 10.1.
The program didn't report any errors. However, sometimes it only displayed the Voronoi diagram of the initial sites (not the CVT), which means the L-BFGS-B part didn't do any work and returned immediately. Other times, it blocked inside a copy or multiply function (such as cudaMemcpy or lbfgsbcuda::minimize::vdot_vv).

I found that the kernel functions launched before these copy/multiply functions have problems: the program couldn't print some device variables after calling the kernel lbfgsbcuda::cmprlb::prog1 in lbfgsbcmprlb.


It was OK if col == 1, but the kernel seems unable to stop when col > 1, after which x, g, and d can no longer be printed.
I tried changing the "2" on line 143 to "1", and then it worked, but I'm not sure whether that affects the result.

The kernels prog2 in subsm.cu and prog1 in cmprlb.cu seem to share the same issue.

I'm not very familiar with CUDA programming and kernel functions.
Could you give me some ideas or a solution, please?

N_elements must be > 9?

If n (i.e. N_elements) is less than 9, the CPU version doesn't work when minimizing a Rosenbrock function.

#include <iostream>
#include <cstring>   // memset
#include <cmath>     // fmin
#include <limits>    // std::numeric_limits
#include <chrono>
#include <culbfgsb.h>

using namespace std;

const int n = 9;
//if n < 9, lbfgsbcuda::lbfgsbminimize will not work and return -1 
//const int n = 4;

void rosenbrock(const float *x, float& fx, float* grad)
{
    fx = 0.0;
    for(int i = 0; i + 1 < n; i += 2) // i+1 is accessed below, so stop early when n is odd
    {
        float t1 = 1.0 - *(x+i);
        float t2 = 10 * (*(x+i+1) - *(x+i) * *(x+i));
        *(grad+i+1) = 20 * t2;
        *(grad+i)     = -2.0 * (*(x+i) * *(grad+i+1) + t1);
        fx += t1 * t1 + t2 * t2;
    }
    //cout << fx << endl;
}

int main()
{
    LBFGSB_CUDA_OPTION<float> lbfgsb_options;

    lbfgsbcuda::lbfgsbdefaultoption<float>(lbfgsb_options);
    lbfgsb_options.mode = LCM_NO_ACCELERATION;
    lbfgsb_options.eps_f = static_cast<float>(1e-8);
    lbfgsb_options.eps_g = static_cast<float>(1e-8);
    lbfgsb_options.eps_x = static_cast<float>(1e-8);
    lbfgsb_options.max_iteration = 1000;

    // initialize LBFGSB state
    LBFGSB_CUDA_STATE<float> state;
    memset(&state, 0, sizeof(state));
    float* assist_buffer_cpu = nullptr;

    float minimal_f = std::numeric_limits<float>::max();


    // setup callback function that evaluate function value and its gradient
    state.m_funcgrad_callback = [&assist_buffer_cpu, &minimal_f](
                                    float* x, float& f, float* g,
                                    const cudaStream_t& stream,
                                    const LBFGSB_CUDA_SUMMARY<float>& summary) {
    
        //dsscfg_cpu<float>(g_nx, g_ny, x, f, g, &assist_buffer_cpu, 'FG', g_lambda);
        rosenbrock(x, f, g);
        if (summary.num_iteration % 100 == 0) {
            std::cout << "CPU iteration " << summary.num_iteration << " F: " << f
                    << std::endl;
        }

        minimal_f = fmin(minimal_f, f);
        return 0;
    };

    // initialize CPU buffers
    int N_elements = n;

    float* x = new float[N_elements];
    float* g = new float[N_elements];

    // starting point (x was previously left uninitialized)
    for (int i = 0; i < N_elements; ++i) x[i] = 0.0f;

    float* xl = new float[N_elements];
    float* xu = new float[N_elements];

    // in this example, we don't have boundaries
    memset(xl, 0, N_elements * sizeof(xl[0]));
    memset(xu, 0, N_elements * sizeof(xu[0]));

    // initialize number of bounds (0 for this example)
    int* nbd = new int[N_elements];
    memset(nbd, 0, N_elements * sizeof(nbd[0]));

    LBFGSB_CUDA_SUMMARY<float> summary;
    memset(&summary, 0, sizeof(summary));

    // call optimization
    auto start_time = std::chrono::steady_clock::now();
    lbfgsbcuda::lbfgsbminimize<float>(N_elements, state, lbfgsb_options, x, nbd, xl, xu, summary);
    auto end_time = std::chrono::steady_clock::now();
    std::cout << "Timing: "
                << (std::chrono::duration<float, std::milli>(end_time - start_time).count() /
                    static_cast<float>(summary.num_iteration))
                << " ms / iteration" << std::endl;

    // release allocated memory
    delete[] x;
    delete[] g;
    delete[] xl;
    delete[] xu;
    delete[] nbd;
    delete[] assist_buffer_cpu;

    return 0;
}

Allocation of gradient array

Hi,

first of all, thank you for making this code available!

In the readme you mention that one should allocate a buffer for g. But there is no way to actually make the library use that buffer; a new buffer is always allocated inside lbfgsbminimize.
The example allocates a buffer outside, but it is never used.
Do you have a version of the code that allows the user to hand over the gradient buffer, or would you accept a PR changing the API of lbfgsbminimize to accept one?

-Felix

Out of bounds write in kernel20

Hello,

I am looking to use this library in a project, so I compiled it and ran the test dsscfg program. It failed with what looks like an out-of-bounds write to a global memory array somewhere in the matupd::kernel20 kernel function.

This is on a GeForce RTX 2060 Super (CC 7.5) with CUDA 11.2.

$ ./bin/dsscfg 
Begin testing DSSCFG on the CPU (double precision)
CPU iteration 0 F: -0.349732
CPU iteration 0 F: 0.0192875
CPU iteration 0 F: -0.496264
CPU iteration 100 F: -0.888434
CPU iteration 200 F: -1.00093
CPU iteration 300 F: -1.01115
CPU iteration 400 F: -1.01126
CPU iteration 500 F: -1.01129
Timing: 9.94557 ms / iteration
Begin testing DSSCFG with CUDA (double precision)
CUDA iteration 0 F: -0.349732
CUDA iteration 0 F: -0.199307
CUDA iteration 0 F: -0.496264
lbfgsb failure: 700, /home/alex/lbfgsb-gpu/culbfgsb/./cauchy.cu, 594

Here's the cuda-memcheck output:

========= Invalid __global__ write of size 8
=========     at 0x000003d0 in void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const )
=========     by thread (0,0,0) in block (135,0,0)
=========     Address 0x7f7999e22038 is out of bounds
=========     Device Frame:void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const ) (void lbfgsbcuda::cuda::matupd::kernel20<int=256, double>(int, int, int, int, int, int, double const *, lbfgsbcuda::cuda::matupd::kernel20<int=256, double>*, double const ) : 0x3d0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x222dd8]
=========     Host Frame:./bin/dsscfg [0x2136b]
=========     Host Frame:./bin/dsscfg [0x6de20]
=========     Host Frame:./bin/dsscfg [0xd0c25]
=========     Host Frame:./bin/dsscfg [0x9b098]
=========     Host Frame:./bin/dsscfg [0x9e633]
=========     Host Frame:./bin/dsscfg [0x102a4]
=========     Host Frame:./bin/dsscfg [0x9db6]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:./bin/dsscfg [0xdf4e]
=========

A second reproduction of the same failure on another machine:

bin/dsscfg
Begin testing DSSCFG on the CPU (double precision)
CPU iteration 0 F: -0.349732
CPU iteration 0 F: 0.0192875
CPU iteration 0 F: -0.496264
CPU iteration 100 F: -0.888434
CPU iteration 200 F: -1.00093
CPU iteration 300 F: -1.01115
CPU iteration 400 F: -1.01126
CPU iteration 500 F: -1.0113
Timing: 34.2264 ms / iteration
Begin testing DSSCFG with CUDA (double precision)
CUDA iteration 0 F: -0.349732
CUDA iteration 0 F: -0.199307
CUDA iteration 0 F: -0.496264
lbfgsb failure: 700, /home/dongjun/lbfgsb-gpu-master/culbfgsb/./cauchy.cu, 594

Logical difference between fortran version and active.cu: upper bound not enforced in active

In active.cu the projection is:

         if (nbdi > 0) {
           if (nbdi <= 2) {
             xi = maxr(xi, li);
           } else {
             xi = minr(xi, ui);
           }
         }

whereas Nocedal et al.'s Fortran code (http://users.iems.northwestern.edu/~nocedal/lbfgsb.html) has:

         if (nbd(i) .gt. 0) then
            if (nbd(i) .le. 2 .and. x(i) .le. l(i)) then
               if (x(i) .lt. l(i)) then
                  prjctd = .true.
                  x(i) = l(i)
               endif
               nbdd = nbdd + 1
            else if (nbd(i) .ge. 2 .and. x(i) .ge. u(i)) then
               if (x(i) .gt. u(i)) then
                  prjctd = .true.
                  x(i) = u(i)
               endif
               nbdd = nbdd + 1
            endif
         endif

In the current active.cu code, the upper bound is not enforced for nbdi == 2.

In the Fortran code, the branch `else if (nbd(i) .ge. 2 .and. x(i) .ge. u(i))` can still be reached for nbd == 2. In the CUDA code, however, the first condition already matches when nbdi == 2: the max is a no-op if x > ub > lb, the else branch is never taken, so min is never called and the upper bound is not enforced.

I'm not sure what this means for the rest of the algorithm, though.

GPU implementation failing for simple constrained norm approximation problem

Hi @nepluno,

first, thanks a lot for sharing your code with the world!
I would like to use your GPU implementation of L-BFGS-B for my master thesis project. One of the applications will be a large-scale (L2-)norm approximation with non-negativity constraints on the optimization variables, i.e.

$$ \text{min } \frac{1}{m} ||A x - b||_2^2 $$

$$ \text{subject to } x \geq 0 $$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$.
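For reference, the gradient the funcgrad callback has to supply for this objective is the standard least-squares identity (a textbook derivation, not taken from the repository):

$$ \nabla f(x) = \frac{2}{m} A^\top (A x - b) $$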
To get familiar with the code I created an example of such a problem with $m=n=1000$. While everything works very well with the CPU version, the GPU version often jumps to very large or NaN values and terminates after only a few iterations without getting close to a minimum. Removing the box constraints makes both versions work well.
Similar results (very fast termination far from the minimum for the GPU version) occur for other problems of this type.

My question is: do you think this is a limitation of your GPU algorithm? (I suspect it might be, because I only observe the problem when box bounds are enabled, so maybe your adaptations for finding the generalized Cauchy point do not work for this problem.) Is it a known issue?
If not, do you have an idea what to change in my example to make things work? (See the examples folder in my fork.)

My CPU is an Intel(R) Core(TM) i7-7820X, the GPU a GeForce GTX 1080 Ti with CUDA 11.7. I get the following output:

problem size: 1000, constrained, CPU, single precision

Summary: 
Total time: 981.328 ms
Iteration time: 7.85062 ms / iteration
residual f: 4.29088e-09
residual g: 0.000298678
residual x: 1.18977e-05
iterations: 125
info: 1
CPU result 8.92826e-07

problem size: 1000, constrained, GPU, single precision

Summary: 
Total time: 17.6567 ms
Iteration time: 1.35821 ms / iteration
residual f: nan
residual g: 0
residual x: nan
iterations: 13
info: 0
GPU result 0.639691

problem size: 1000, unconstrained, CPU, single precision

Summary: 
Total time: 1507.06 ms
Iteration time: 7.49779 ms / iteration
residual f: 5.06273e-09
residual g: 9.37757e-05
residual x: 0.00013773
iterations: 201
info: 1
CPU result 2.74595e-07

problem size: 1000, unconstrained, GPU, single precision

Summary: 
Total time: 138.902 ms
Iteration time: 0.775991 ms / iteration
residual f: 7.42523e-09
residual g: 0.000114142
residual x: 0.00048579
iterations: 179
info: 0
GPU result 1.28257e-06

Process finished with exit code 0

It would be really nice if you could take a look at this. Thanks!
