
falkonml / falkon


Large-scale, multi-GPU capable, kernel solver

Home Page: https://falkonml.github.io/falkon/

License: MIT License

Shell 3.78% Python 53.82% Cuda 5.61% C++ 3.60% Jupyter Notebook 33.17% C 0.02%
ai kernel kernel-methods large-scale-learning machine-learning python pytorch

falkon's People

Contributors

akatsuki96, giodiro, mathurinm


falkon's Issues

error running Falkon on GPU

I have installed PyTorch 2.0.0 and CUDA 11.7, and installed falkon with pip install falkon -f https://falkon.dibris.unige.it/torch-2.0.0_cu117.html. I can access the GPU and use it for other tasks, but when I try to run Falkon I get the following error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[43], line 4
      1 options = falkon.FalkonOptions(keops_active="no")
      3 kernel = falkon.kernels.GaussianKernel(sigma=1, opt=options)
----> 4 flk = falkon.Falkon(kernel=kernel, penalty=1e-5, M=5000, options=options)

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/models/falkon.py:132, in Falkon.__init__(self, kernel, penalty, M, center_selection, maxiter, seed, error_fn, error_every, weight_fn, options)
    130 self.maxiter = maxiter
    131 self.weight_fn = weight_fn
--> 132 self._init_cuda()
    133 self.beta_ = None

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/models/model_utils.py:71, in FalkonBase._init_cuda(self)
     69 if self.use_cuda_:
     70     torch.cuda.init()
---> 71     self.num_gpus = devices.num_gpus(self.options)

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/utils/devices.py:211, in num_gpus(opt)
    209 global __COMP_DATA
    210 if len(__COMP_DATA) == 0:
--> 211     get_device_info(opt)
    212 return len([c for c in __COMP_DATA if c >= 0])

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/utils/devices.py:199, in get_device_info(opt)
    196     return __COMP_DATA
    198 for g in range(0, tcd.device_count()):
--> 199     __COMP_DATA = _get_gpu_device_info(opt, g, __COMP_DATA)
    201 if len(__COMP_DATA) == 0:
    202     raise RuntimeError("No suitable device found. Enable option 'use_cpu' "
    203                        "if no GPU is available.")

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/utils/devices.py:91, in _get_gpu_device_info(opt, g, data_dict)
     82 # try:
     83 #     from ..cuda.cudart_gpu import cuda_meminfo
     84 # except Exception as e:
   (...)
     88 # Some of the CUDA calls in here may change the current device,
     89 # this ensures it gets reset at the end.
     90 with tcd.device(g):
---> 91     mem_free, mem_total = mem_get_info(g)
     92     mem_used = mem_total - mem_free
     93     # noinspection PyUnresolvedReferences

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/c_ext/__init__.py:15, in _make_lazy_cuda_func.<locals>.call_cuda(*args, **kwargs)
     14 def call_cuda(*args, **kwargs):
---> 15     from ._backend import _assert_has_ext
     16     _assert_has_ext()
     17     return getattr(torch.ops.falkon, name)(*args, **kwargs)

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/falkon/c_ext/_backend.py:76
     73 if not _HAS_EXT:
     74     # try to import the compiled module (via setup.py)
     75     lib_path = _get_extension_path("_C")
---> 76     torch.ops.load_library(lib_path)
     77     _HAS_EXT = True
     79     # Check torch version vs. compilation version
     80     # Copyright (c) 2020 Matthias Fey <[email protected]>
     81     # https://github.com/rusty1s/pytorch_scatter/blob/master/torch_scatter/__init__.py

File /mambaforge/envs/sobolev/lib/python3.9/site-packages/torch/_ops.py:643, in _Ops.load_library(self, path)
    638 path = _utils_internal.resolve_library_path(path)
    639 with dl_open_guard():
    640     # Import the shared library into the process, thus running its
    641     # static (global) initialization code in order to register custom
    642     # operators with the JIT.
--> 643     ctypes.CDLL(path)
    644 self.loaded_libraries.add(path)

File /mambaforge/envs/sobolev/lib/python3.9/ctypes/__init__.py:374, in CDLL.__init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    371 self._FuncPtr = _FuncPtr
    373 if handle is None:
--> 374     self._handle = _dlopen(self._name, mode)
    375 else:
    376     self._handle = handle

OSError: libcusolver.so.11: cannot open shared object file: No such file or directory
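A quick way to check whether this is only a loader-path problem (a sketch, assuming a standard CUDA toolkit layout; the path below is an example, not a prescription):

import ctypes.util

lib = ctypes.util.find_library("cusolver")
if lib is None:
    print("libcusolver is not on the loader path; adding the CUDA toolkit's")
    print("lib64 directory (e.g. /usr/local/cuda/lib64) to LD_LIBRARY_PATH")
    print("before starting Python usually resolves this kind of OSError.")
else:
    print(f"Found {lib}")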

Enable CUDA in CI

We are stuck at 51% code coverage because CI does not have CUDA.

Fixing this requires more effort to build and install the library, with CUDA support, in the CI pipeline.

Compiling with Cuda 11.0

Hi, [Sorry, just realized it might be more relevant in the PR section; can close this and edit this as a PR if you wish]
I had to use Falkon, and it worked great. Thanks for all the work you put in!
Still, I ran into a couple of errors before managing to run it on GPU. Here is how I fixed them, in case you find it useful:

Not compiling with Cuda

If Falkon is not compiled with Cuda (WITH_CUDA in your cmake file), the extensions are not built and running Falkon fails with a ModuleNotFoundError at from falkon.ooc_ops.cuda import parallel_potrf in falkon/ooc_ops/ooc_potrf.py. It's hard to track down that this issue comes from the initial compilation. Maybe catching the exception and providing an error message would be useful, e.g.:

try: 
    from falkon.ooc_ops.cuda import parallel_potrf
except ModuleNotFoundError as e:
    print(f"Got exception {e} when importing `cuda`. Did you compile with Cuda support?")

Patched version of PyKeops

I had trouble compiling your patched version of PyKeops. I only had two Cuda compilers available: versions 10.1 and 11.0.

  1. With Cuda 10.1, I had the issue described here: https://forums.developer.nvidia.com/t/cuda-10-1-nvidia-youre-now-fixing-gcc-bugs-that-gcc-doesnt-even-have/71063. It is apparently a Cuda-specific issue; I did not try solving it.
  2. So I used Cuda 11.0, but ran into this problem: getkeops/keops#122, which is basically CMake not forwarding the C++17 flags to Cuda. The solution is to use:

Then things worked fine; all in all, I would suggest:

  • Providing information when compilation is done without Cuda (e.g. a flag which makes the installation fail if WITH_CUDA is false in setup.py); see the sketch below.
  • Fixing the Cuda 11.0 case by checking the CMake version and applying the commit.
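A minimal sketch of the first suggestion, assuming a WITH_CUDA flag computed in setup.py; FALKON_FORCE_CUDA is a hypothetical environment variable, not an existing one:

import os

WITH_CUDA = False  # stand-in for the real detection logic in setup.py
if not WITH_CUDA and os.environ.get("FALKON_FORCE_CUDA") == "1":
    raise RuntimeError("CUDA requested (FALKON_FORCE_CUDA=1) but no CUDA "
                       "toolkit was found; aborting the build.")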

Anyway, thanks for the package!

FileExistsError: [Errno 17] File exists: '/root/.cache/pykeops-1.5-cpython-37//build-pybind11_template-libKeOps_template_660bc304e8'

Hi, I tried to install falkon in Colab.
The installation was successful, but when running the KernelRidgeRegression demo I get an error on flk.fit().

[pyKeOps] Compiling libKeOpstorch0977e258bf in /root/.cache/pykeops-1.5-cpython-37:
       formula: Sum_Reduction(Exp(SqDist(x1 / g, x2 / g) * IntInv(-2)) * v,0)
       aliases: x1 = Vi(0,13); x2 = Vj(1,13); v = Vj(2,1); g = Pm(3,1); 
       dtype  : float64
... 

--------------------- MAKE DEBUG -----------------
Command '['cmake', '--build', '.', '--target', 'KeOps_formula', '--', 'VERBOSE=1']' returned non-zero exit status 2.

--------------------- ----------- -----------------
[pyKeOps] Compiling pybind11 template libKeOps_template_660bc304e8 in /root/.cache/pykeops-1.5-cpython-37 ... 

---------------------------------------------------------------------------

FileExistsError                           Traceback (most recent call last)

<ipython-input-12-be46bad2abb7> in <module>()
----> 1 flk.fit(Xtrain, Ytrain)

/usr/local/lib/python3.7/dist-packages/falkon/models/falkon.py in fit(self, X, Y, Xts, Yts, warm_start)
    261                 beta = optim.solve(
    262                     X, ny_points, Y, self.penalty, initial_solution=warm_start,
--> 263                     max_iter=self.maxiter, callback=validation_cback)
    264 
    265             self.alpha_ = precond.apply(beta)

/usr/local/lib/python3.7/dist-packages/falkon/optim/conjgrad.py in solve(self, X, M, Y, _lambda, initial_solution, max_iter, callback)
    306                 B = incore_fmmv(Knm, y_over_n, None, transpose=True, opt=self.params)
    307             else:
--> 308                 B = self.kernel.mmv(M, X, y_over_n, opt=self.params)
    309             B = self.preconditioner.apply_t(B)
    310 

/usr/local/lib/python3.7/dist-packages/falkon/kernels/kernel.py in mmv(self, X1, X2, v, out, opt)
    267             params = dataclasses.replace(self.params, **dataclasses.asdict(opt))
    268         mmv_impl = self._decide_mmv_impl(X1, X2, v, params)
--> 269         return mmv_impl(X1, X2, v, self, out, params)
    270 
    271     def _decide_mmv_impl(self,

/usr/local/lib/python3.7/dist-packages/falkon/kernels/distance_kernel.py in _keops_mmv_impl(self, X1, X2, v, kernel, out, opt)
    283         other_vars = [self.sigma.to(device=X1.device, dtype=X1.dtype)]
    284 
--> 285         return self.keops_mmv(X1, X2, v, out, formula, aliases, other_vars, opt)
    286 
    287     def extra_mem(self) -> Dict[str, float]:

/usr/local/lib/python3.7/dist-packages/falkon/kernels/keops_helpers.py in keops_mmv(self, X1, X2, v, out, formula, aliases, other_vars, opt)
     70         return run_keops_mmv(X1=X1, X2=X2, v=v, other_vars=other_vars,
     71                              out=out, formula=formula, aliases=aliases, axis=1,
---> 72                              reduction='Sum', opt=opt)
     73 
     74     def keops_dmmv_helper(self, X1, X2, v, w, kernel, out, differentiable, opt, mmv_fn):

/usr/local/lib/python3.7/dist-packages/falkon/mmv_ops/keops.py in run_keops_mmv(X1, X2, v, other_vars, out, formula, aliases, axis, reduction, opt)
    226     if comp_dev_type == 'cpu' and all([ddev.type == 'cpu' for ddev in data_devs]):  # incore CPU
    227         variables = [X1, X2, v] + other_vars
--> 228         out = fn(*variables, out=out, backend=backend)
    229     elif comp_dev_type == 'cuda' and all([ddev.type == 'cuda' for ddev in data_devs]):  # incore CUDA
    230         variables = [X1, X2, v] + other_vars

/usr/local/lib/python3.7/dist-packages/pykeops/torch/generic/generic_red.py in __call__(self, out, backend, device_id, ranges, *args)
    576             ny,
    577             out,
--> 578             *args
    579         )
    580         if self.dtype in ("float16", "half"):

/usr/local/lib/python3.7/dist-packages/pykeops/torch/generic/generic_red.py in forward(ctx, formula, aliases, backend, dtype, device_id, ranges, optional_flags, rec_multVar_highdim, nx, ny, out, *args)
     45             optional_flags += ['-DMULT_VAR_HIGHDIM=1']
     46         myconv = LoadKeOps(
---> 47             formula, aliases, dtype, "torch", optional_flags, include_dirs
     48         ).import_module()
     49 

/usr/local/lib/python3.7/dist-packages/pykeops/common/keops_io.py in __init__(self, formula, aliases, dtype, lang, optional_flags, include_dirs)
     46             pykeops.config.build_type == "Debug"
     47         ):
---> 48             self._safe_compile()
     49 
     50     @create_and_lock_build_folder()

/usr/local/lib/python3.7/dist-packages/pykeops/common/utils.py in wrapper_filelock(*args, **kwargs)
     75             lock = FileLock(os.path.join(bf, "pykeops_build2.lock"))
     76             with lock:
---> 77                 func_res = func(*args, **kwargs)
     78 
     79             # clean

/usr/local/lib/python3.7/dist-packages/pykeops/common/keops_io.py in _safe_compile(self)
     61             self.optional_flags,
     62             self.include_dirs,
---> 63             self.build_folder,
     64         )
     65 

/usr/local/lib/python3.7/dist-packages/pykeops/common/compile_routines.py in compile_generic_routine(formula, aliases, dllname, dtype, lang, optional_flags, include_dirs, build_folder)
    244 
    245     template_name, is_rebuilt = get_or_build_pybind11_template(
--> 246         dtype, lang, include_dirs, use_prebuilt_formula=True
    247     )
    248 

/usr/local/lib/python3.7/dist-packages/pykeops/common/compile_routines.py in get_or_build_pybind11_template(dtype, lang, include_dirs, use_prebuilt_formula)
     65         # print('(with dtype=',dtype,', lang=',lang,', include_dirs=',include_dirs,')', flush=True)
     66 
---> 67         os.mkdir(template_build_folder)
     68 
     69         command_line += ["-Dtemplate_name=" + "'{}'".format(template_name)]

FileExistsError: [Errno 17] File exists: '/root/.cache/pykeops-1.5-cpython-37//build-pybind11_template-libKeOps_template_660bc304e8'
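A workaround that often helps with stale pykeops build folders (assuming the half-written cache directory is the culprit) is to wipe the compilation cache and rebuild:

import pykeops

pykeops.clean_pykeops()        # delete the cached build folders under ~/.cache/pykeops-*
pykeops.test_torch_bindings()  # force a fresh compilation and sanity-check it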

Failing to reproduce the Laplacian kernel computation

Hi,

I'm trying to reproduce how falkon computes the Laplacian kernel, as I have to outsource some models I've optimized. So far I'm largely unsuccessful: I get consistent results between my python implementation and using sklearn's metrics.pairwise.laplacian_kernel(), but completely different results with falkon.kernels.LaplacianKernel().

It's hard to tell what's different in the falkon implementation, as I didn't find a simple way to access the _sq_dist() method used in laplacian_core(). Is it really a Manhattan distance computed here? Is there any additional normalization? Would you have a simple numpy implementation which would reproduce falkon's results?
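For reference, a hedged numpy sketch of the two conventions that could explain the mismatch: sklearn's laplacian_kernel uses the L1 (Manhattan) distance, while falkon's LaplacianKernel appears to be built on the same Euclidean-distance machinery as its Gaussian kernel (an assumption based on the shared _sq_dist() helper, not a confirmed spec):

import numpy as np

def laplacian_l1(X, Y, sigma):
    # sklearn convention: exp(-||x - y||_1 / sigma)
    d = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)
    return np.exp(-d / sigma)

def laplacian_l2(X, Y, sigma):
    # Euclidean variant: exp(-||x - y||_2 / sigma) (assumed falkon behaviour)
    d = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    return np.exp(-d / sigma)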

thanks a lot,
Arthur

Enable backprop through the kernel

This could allow optimization of kernel parameters with autograd.

Steps:

  • Make the kernel classes torch modules (but modules have a single forward operation, while we have __call__(), mmv, dmmv, ...)
  • Figure out how to backprop with the prepare, apply, finalize op sequence
  • Wrap ops (e.g. norm) in torch modules with backwards implemented
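For illustration, backprop through a dense Gaussian kernel-vector product already works with plain torch autograd; the steps above are about exposing the same behaviour through falkon's blocked mmv/dmmv ops. A minimal sketch (not falkon's API):

import torch

X1, X2 = torch.randn(100, 5), torch.randn(50, 5)
v = torch.randn(50, 1)
sigma = torch.tensor(1.0, requires_grad=True)

K = torch.exp(-torch.cdist(X1, X2) ** 2 / (2 * sigma ** 2))  # dense Gaussian kernel matrix
(K @ v).sum().backward()
print(sigma.grad)  # gradient of the kernel-vector product w.r.t. the bandwidth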

large M cases

I was trying FALKON with M ≈ 256k-512k (number of centers), but the process gets killed. How can I efficiently apply FALKON in these large-M cases?
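Before resorting to a bigger machine, note that the preconditioner alone is an M x M matrix (at M = 256k in float32 that is roughly 260 GB), so the process being killed is expected on most hardware. A sketch of options that can lower the GPU pressure, using FalkonOptions fields that appear elsewhere in these issues:

import falkon

options = falkon.FalkonOptions(
    cpu_preconditioner=True,  # build the M x M Cholesky preconditioner in CPU RAM
    max_gpu_mem=8 * 2**30,    # cap the GPU memory falkon will try to use
)
flk = falkon.Falkon(kernel=falkon.kernels.GaussianKernel(sigma=1.0),
                    penalty=1e-6, M=256_000, options=options)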

Failed to find C-extensions

Hello, I wanted to reinstall Falkon on a Linux server (because it could not find CUDA). I deleted the environment and followed all the installation steps again.

Unfortunately, now I keep having the following error:
"ImportError: Failed to find C-extension. Please recompile Falkon."

  • Is there a way for me to check whether the extension is correctly installed, and where it is located? (See the sketch after this list.)
  • Is there a way to force installation of these extensions? (I have tried deleting and reinstalling falkon with various pip options, but it did not make a difference.)
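For the first question, a sketch that locates the installed package and lists any compiled extensions without importing falkon (importing would fail with the error above):

import glob, os
from importlib.util import find_spec

pkg_dir = os.path.dirname(find_spec("falkon").origin)
print(pkg_dir)
print(glob.glob(os.path.join(pkg_dir, "**", "*.so"), recursive=True))  # compiled extensions, if any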

Thanks a lot,

Clement

Add example for cross entropy loss

Hi,

The present documentation only has an example for the binary cross-entropy loss (logistic regression). I was wondering, would it be possible to include an example for the cross-entropy loss with more than two classes?
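Until such an example exists, a common workaround (a sketch, not an official recipe) is to treat the multiclass problem as regression onto one-hot labels with the plain Falkon estimator and take the argmax of the predictions:

import torch
import falkon

X = torch.randn(1000, 10)                  # toy data: 1000 points, 10 features
y = torch.randint(0, 3, (1000,))           # 3 classes
Y = torch.nn.functional.one_hot(y, num_classes=3).to(X.dtype)

flk = falkon.Falkon(kernel=falkon.kernels.GaussianKernel(sigma=1.0),
                    penalty=1e-6, M=200)
flk.fit(X, Y)
pred = flk.predict(X).argmax(dim=1)        # predicted class per row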

Custom hyperparameter optimization

Hi,

I'd like to implement a hyperparameter optimization procedure based on minimizing a loss function computed on a validation set, in order to preserve transferability as much as possible.

I was previously using the built-in hopt classes in the following way:

model = SGPR(
    kernel=kernel, penalty_init=penalty_init, centers_init=centers_init,
    opt_penalty=True, opt_centers=False)

opt_hp = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(100):
    opt_hp.zero_grad()
    loss = model(X_train, Y_train)
    loss.backward()
    opt_hp.step()

What I'm trying to implement now should probably look like that:

model = SGPR(
    kernel=kernel, penalty_init=penalty_init, centers_init=centers_init,
    opt_penalty=True, opt_centers=False)

opt_hp = torch.optim.Adam(model.parameters(), lr=lr)
loss_fn = torch.nn.L1Loss()

for epoch in range(100):
    opt_hp.zero_grad()
    model(X_train, Y_train)
    loss = loss_fn(model.predict(X_val), Y_val)
    loss.requires_grad = True
    loss.backward()
    opt_hp.step()

But the loss doesn't change during optimization - the hyperparameters are probably not updated at all. Could that be related to the computation of dLoss/dx? Should I use an instance of falkon.Falkon instead of one of the falkon.hopt.objectives to define the model (if I remember correctly, I had issues related to keops or cuda with falkon.Falkon)?
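The symptom is consistent with the loss being detached from the hyperparameters: if predict() does not build an autograd graph back to the kernel parameters, setting requires_grad on the loss afterwards only creates a fresh leaf, and backward() reaches nothing. A minimal illustration of that autograd behaviour (plain torch, independent of falkon's internals):

import torch

p = torch.tensor(2.0, requires_grad=True)  # stand-in hyperparameter
pred = (p * 3.0).detach()                  # detached, like a non-differentiable predict()
loss = (pred - 1.0).abs()
loss.requires_grad = True                  # makes loss a leaf; no graph back to p
loss.backward()
print(p.grad)                              # None: p never receives a gradient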

many thanks,
Arthur

Replace falkon.mmv_ops with homemade mmv_ops

Hi again!

Thanks again for the help last time.

This time, I'd like to replace the falkon.mmv_ops used in the InCoreFalkon solver with homemade mmv_ops for a research project.

What is the "cleanest" and simplest way to do this?

Thank you!

Best regards,
Robert

Fix GaussianKernel in multi-sigma case

It is extremely slow. Example:

#!/usr/bin/env python3
from falkon import Falkon
from falkon.kernels import GaussianKernel
from falkon.options import FalkonOptions
import numpy as np
import time
import torch

def build_dataset():
    X = torch.rand(10000, 28)
    f = lambda x: torch.sin(x)
    Y = f(X)
    return X, Y


def single_sigma():
    sigma = 2.8
    lam = 1e-5
    ITERS = 10
    SEED = 4242
    config = {
        'kernel': GaussianKernel(sigma=sigma),
        'penalty': lam,
        'M': 200,
        'maxiter': ITERS,
        'seed': SEED,
        'options': FalkonOptions()
    }
    return Falkon(**config)


def multi_sigma():
    sigma = torch.tensor([2.8 for _ in range(28)])
    lam = 1e-5
    ITERS = 10
    SEED = 4242
    config = {
        'kernel': GaussianKernel(sigma=sigma),
        'penalty': lam,
        'M': 200,
        'maxiter': ITERS,
        'seed': SEED,
        'options': FalkonOptions()
    }
    return Falkon(**config)


def multi_sigma_matrix():
    sigma = 2.8 * torch.eye(28, 28)
    lam = 1e-5
    ITERS = 10
    SEED = 4242
    config = {
        'kernel': GaussianKernel(sigma=sigma),
        'penalty': lam,
        'M': 200,
        'maxiter': ITERS,
        'seed': SEED,
        'options': FalkonOptions()
    }
    return Falkon(**config)


def test_fit(X, Y, flk):
    st = time.time()
    flk.fit(X, Y)
    end = time.time()
    return end - st


X, Y = build_dataset()
print("[->] Single sigma => dataset fitted in {} seconds".format(test_fit(X, Y, single_sigma())))
print("[->] Multi sigma => dataset fitted in {} seconds".format(test_fit(X, Y, multi_sigma())))
print("[->] Multi sigma (matrix with sigmas on the diagonal) => dataset fitted in {} seconds".format(test_fit(X, Y, multi_sigma_matrix())))

Automatic mmv memory management on CPU

Currently, blockwise splitting on the CPU is not adaptive to free memory; it only follows the max_cpu_mem option.

It should attempt to use max(max_cpu_mem, actual free memory).
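A minimal sketch of the proposed behaviour, assuming psutil (already a falkon dependency) for the free-memory query; the function name is illustrative:

import psutil

def cpu_block_budget(max_cpu_mem: float) -> float:
    # Use the larger of the configured cap and the memory actually available,
    # as proposed above.
    return max(max_cpu_mem, psutil.virtual_memory().available)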

Documentation Plans

This issue collects items of documentation which are missing & should be written.

Example notebooks

  • Sparse data
  • GPU parameters
  • Extending (e.g. extending the center-selection algorithm)

Missing docstrings

  • LogisticFalkon is not documented
  • center_selection does not have a sphinx section
  • sparse module could be documented much better (many methods of SparseTensor are missing)

Running out of memory when using KeOps

Hi again,

Thanks for the help these days. Running into:

cudaSafeCall() failed at /data/greyostrich/not-backed-up/nvme00/rhu/miniconda3/envs/new_nnenv/lib/python3.8/site-packages/pykeops/cmake_scripts/script_keops_formula/../../keops/core/mapreduce/GpuConv1D.cu:432 : out of memory

when running FALKON.

Setup:

  • X: 10^9 x 3
  • Y: 10^9 x 1
  • GaussianKernel with ls=3
  • penalty=1e-5
  • GPU: V100 32GB
  • CPU RAM: 180GB
  • 8 CPU processors

Thank you for the help!

Best regards,
Robert

No module named 'falkon.la_helpers.cuda_la_helpers'

I get this error when trying to work with the GPU (I do not encounter it on CPU); maybe something went wrong during refactoring?

File "falkon/la_helpers/cuda_trsm.py", line 7, in
from falkon.la_helpers.cuda_la_helpers import cuda_transpose
ModuleNotFoundError: No module named 'falkon.la_helpers.cuda_la_helpers'

Do you have any clue how I could circumvent this?

gradient computation for stochastic obj in hopt takes forever

Hi, I am trying to use the stochastic objective function in hopt to do gradient-based hyperparameter optimization. When I tried running it, the first iteration took forever for some reason. My falkon solver works without problems now. I took a look at the code and wrote a small replication script based on how stoch_new_compreg.py is implemented. Did I do anything wrong in the following script?

import numpy as np
import torch
from sklearn import datasets

import falkon
from falkon.center_selection import FixedSelector

# generate a tiny dataset 
n = 100
d = 5
X, Y = datasets.make_regression(n, d, random_state=11)
num_train = int(0.8 * n)
X = X.astype(np.float64)
Y = Y.astype(np.float64).reshape(-1, 1)
X_train, y_train = torch.from_numpy(X[:num_train]), torch.from_numpy(Y[:num_train])
X_test, y_test = torch.from_numpy(X[num_train:]), torch.from_numpy(Y[num_train:])

m = 10
X_centers = X_train[:m, :].clone()
center_selector = FixedSelector(centers=X_centers)

options = falkon.FalkonOptions(keops_active="no", debug=True, cpu_preconditioner=True, max_gpu_mem=12*10**9,
                              chol_force_ooc=True, min_cuda_iter_size_64=300000, cg_tolerance=1e-10)

sigma_init = torch.as_tensor(np.array([np.sqrt(d)]*d), dtype=torch.float64)
kernel = falkon.kernels.GaussianKernel(sigma=sigma_init, opt=options)
ridge = 1e-6
maxiter = 50
def error_fn(t, p):
    return torch.sqrt(torch.mean((t - p) ** 2)).item(), "RMSE"

# solve falkon first before running through gradient
flk = falkon.Falkon(kernel=kernel, center_selection=center_selector,
                    penalty=ridge, M=m, options=options, error_every=1, maxiter=maxiter, error_fn=error_fn)

flk.fit(X_train, y_train, X_train, y_train)



# gist of backward process in stoch_new_compreg.py. 
# Remove trace part and only focus on the derivatives of model fitting term w.r.t. kernel bandwidths
# ridge and centers are set to non-trainable

optimize_centers = False
optimize_ridge = False

def calc_dfit_bwd(zy_knm_solve_zy, zy_solve_knm_knm_solve_zy, zy_solve_kmm_solve_zy, pen_n, t,
                  include_kmm_term):
    """Nystrom regularized data-fit backward"""
    dfit_bwd = -(
        2 * zy_knm_solve_zy[t:].sum() -
        zy_solve_knm_knm_solve_zy[t:].sum()
    )
    print(dfit_bwd)
    print(dfit_bwd.shape)
    if include_kmm_term:
        print(zy_solve_kmm_solve_zy[t:].sum().shape)
        dfit_bwd += pen_n * zy_solve_kmm_solve_zy[t:].sum()
        print(pen_n * zy_solve_kmm_solve_zy[t:].sum())
    return dfit_bwd

solve_zy = flk.alpha_.clone().to("cuda:0", copy=False)
X_centers_dev = X_centers.to("cuda:0", copy=False).requires_grad_(optimize_centers)
solve_zy_dev = solve_zy.to("cuda:0", copy=False)
penalty_dev = torch.as_tensor(ridge).to("cuda:0", copy=False).requires_grad_(optimize_ridge)

sigma_init = torch.as_tensor(np.array([np.sqrt(d)]), dtype=torch.float64).requires_grad_(True)
kernel = falkon.kernels.GaussianKernel(sigma=sigma_init, opt=options)

with torch.autograd.enable_grad():
    kernel_dev = kernel.to("cuda:0")
    kmm_dev = kernel_dev(X_centers_dev, X_centers_dev, opt=options)
    zy_solve_kmm_solve_zy = (kmm_dev @ solve_zy_dev * solve_zy_dev).sum(0) 
        
    k_mn_zy = kernel_dev.mmv(X_centers_dev, X_train, y_train, opt=options)  # M x (T+P)
    zy_knm_solve_zy = k_mn_zy.mul(solve_zy_dev).sum(0)
    zy_solve_knm_knm_solve_zy = kernel_dev.mmv(X_train, X_centers_dev, solve_zy_dev, opt=options).square().sum(0)
    
    pen_n = penalty_dev * num_train
    dfit_bwd = calc_dfit_bwd(
                zy_knm_solve_zy, zy_solve_knm_knm_solve_zy, zy_solve_kmm_solve_zy, pen_n, 0,
                include_kmm_term=True)
    
grads = torch.autograd.grad(
        dfit_bwd, list(kernel_dev.diff_params.values()), retain_graph=False, allow_unused=False)

I am also wondering: if we implement the gradient computation this way, we would not be able to use multiple GPUs in the backward pass. Am I right?

Thanks!

Bandwidth optimization on a log scale

Hi,

I'm currently using the automatic hyperparameter optimization features, and would like to know if the kernel bandwidths can be optimized on a log scale rather than a linear scale.

e.g. outside of the hopt features, I can pass log-scaled bandwidths to a kernel class in the following way:

sigma_exp = torch.randn(X_train.shape[1], dtype=torch.float32)
sigma_ten = torch.pow(torch.full((len(sigma_exp),), 10), sigma_exp).requires_grad_()

kernel = falkon.kernels.GaussianKernel(sigma=sigma_ten, opt=options)

This way, if I want to update the bandwidths, I can operate on the exponents. Can I do something similar when using a falkon.hopt.objectives instance as the central object, along with a torch optimizer?
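The log-space pattern itself works with any differentiable objective; here is a self-contained toy illustration with a plain torch Gaussian kernel (not falkon's API), where the optimizer steps on the exponents and sigma is rebuilt each iteration:

import torch

X = torch.randn(200, 5)
Y = torch.randn(200, 1)

log_sigma = torch.zeros(X.shape[1], requires_grad=True)  # trainable exponents, one per feature
opt = torch.optim.Adam([log_sigma], lr=0.1)

for epoch in range(50):
    opt.zero_grad()
    sigma = torch.pow(10.0, log_sigma)        # bandwidths live on a log scale
    K = torch.exp(-0.5 * torch.cdist(X / sigma, X / sigma) ** 2)
    loss = ((K @ Y - Y) ** 2).mean()          # toy differentiable objective
    loss.backward()                           # gradients arrive at log_sigma
    opt.step()

Whether this composes with the falkon.hopt.objectives classes depends on whether they accept a kernel whose sigma is a non-leaf tensor; that part is best confirmed by the maintainers.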

Using KeOps when calling prepare_, apply_, finalize_

I've extended the DotProductKernel with my own custom kernel and have run into some issues with memory.

It seems that KeOps gets used when evaluating the function obtained from Kernel Ridge Regression.

However, in the fitting phase, prepare_, apply_ and finalize_ get called, and these don't use KeOps. Because of this, I run out of memory when trying to fit very large inputs.

Is there a way to use KeOps for the fitting phase? Is there a reason this isn't being done by default?

Thanks in advance!

Python version requirement?

What's the Python version requirement for Falkon? Is it 3.6? I haven't seen it mentioned anywhere but in setup.py I see that 3.6 is a requirement. Am I right?

Thanks.

Installation Issue from ./falkon/sparse/cpp/sparse_matmul.cpp

I'm trying to install falkon on macOS 10.14.6 for CPU. However, I'm having some issues after running python setup.py develop as suggested in Issue #2. I'm following the installation instructions in a clean virtual environment with pytorch 1.8.1, Python 3.8.9, GCC 10.2.0_4, and CMake 3.20.0. I have already installed keops. The problem seems to be related to the file falkon/sparse/cpp/sparse_matmul.cpp and cmake/gcc. Any suggestions for how to fix this are much appreciated.

running develop
running egg_info
writing falkon.egg-info/PKG-INFO
writing dependency_links to falkon.egg-info/dependency_links.txt
writing requirements to falkon.egg-info/requires.txt
writing top-level names to falkon.egg-info/top_level.txt
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/utils/cpp_extension.py:369: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'falkon.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching 'notebooks/**' found anywhere in distribution
warning: no previously-included files matching 'doc/_build/**' found anywhere in distribution
warning: no previously-included files matching '**/.ipynb_checkpoints/**' found anywhere in distribution
warning: no previously-included files matching '__pycache__' found anywhere in distribution
warning: no previously-included files matching '*.py[co]' found anywhere in distribution
warning: no previously-included files matching 'keops/**' found anywhere in distribution
warning: no previously-included files matching 'benchmark/**' found anywhere in distribution
writing manifest file 'falkon.egg-info/SOURCES.txt'
running build_ext
building 'falkon.sparse.sparse_helpers' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I./falkon/sparse -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/TH -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/THC -I/Users/user/.pyenv/versions/falkon/include -I/Users/user/.pyenv/versions/3.8.9/include/python3.8 -c ./falkon/sparse/sparse_extension.cpp -o build/temp.macosx-10.14-x86_64-3.8/./falkon/sparse/sparse_extension.o -Xpreprocessor -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=sparse_helpers -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I./falkon/sparse -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/TH -I/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/THC -I/Users/user/.pyenv/versions/falkon/include -I/Users/user/.pyenv/versions/3.8.9/include/python3.8 -c ./falkon/sparse/cpp/sparse_matmul.cpp -o build/temp.macosx-10.14-x86_64-3.8/./falkon/sparse/cpp/sparse_matmul.o -Xpreprocessor -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_clang" -DPYBIND11_STDLIB="_libcpp" -DPYBIND11_BUILD_ABI="_cxxabi1002" -DTORCH_EXTENSION_NAME=sparse_helpers -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from ./falkon/sparse/cpp/sparse_matmul.cpp:2:
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: error: redefinition of 'parallel_for'
inline void parallel_for(
            ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelNative.h:34:13: note: previous definition is here
inline void parallel_for(
            ^
In file included from ./falkon/sparse/cpp/sparse_matmul.cpp:2:
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:64:17: error: redefinition of 'parallel_reduce'
inline scalar_t parallel_reduce(
                ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelNative.h:58:17: note: previous definition is here
inline scalar_t parallel_reduce(
                ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<unsigned char>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<signed char>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<double>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<float>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<int>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<long long>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
./falkon/sparse/cpp/sparse_matmul.cpp:14:9: error: no matching function for call to 'parallel_for'
        torch::parallel_for(0, N, 2048, [&](int64_t start, int64_t end) {
        ^~~~~~~~~~~~~~~~~~~
./falkon/sparse/cpp/sparse_matmul.cpp:132:9: note: in instantiation of function template specialization 'run_parallel<short>' requested here
        run_parallel<scalar_t>(
        ^
/Users/user/.pyenv/versions/falkon/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:15:13: note: candidate template ignored: substitution
      failure [with F = (lambda at ./falkon/sparse/cpp/sparse_matmul.cpp:14:41)]
inline void parallel_for(
            ^
9 errors generated.
error: command 'gcc' failed with exit status 1

ModuleNotFoundError: No module named 'falkon.sparse.sparse_helpers' when testing installation

Hello,

I followed the installation steps and did not run into any error while compiling the library.
However, when trying the kernel ridge regression notebook, I cannot load the module.

I get the following error trace:

import falkon
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vignac/falkon/falkon/__init__.py", line 3, in <module>
from . import kernels, sparse, center_selection, preconditioner, optim
File "/home/vignac/falkon/falkon/kernels/__init__.py", line 1, in <module>
from .kernel import Kernel
File "/home/vignac/falkon/falkon/kernels/kernel.py", line 6, in <module>
from falkon.mmv_ops.fmm_cpu import fmm_cpu_sparse, fmm_cpu
File "/home/vignac/falkon/falkon/mmv_ops/fmm_cpu.py", line 10, in <module>
from falkon.sparse.sparse_tensor import SparseTensor
File "/home/vignac/falkon/falkon/sparse/__init__.py", line 2, in <module>
from .sparse_ops import sparse_norm, sparse_square_norm, sparse_matmul
File "/home/vignac/falkon/falkon/sparse/sparse_ops.py", line 4, in <module>
from falkon.sparse.sparse_helpers import norm_sq, norm_
ModuleNotFoundError: No module named 'falkon.sparse.sparse_helpers'

Is it just a path that is incorrect, or do you think it is a bigger problem with the installation?

Thanks,
Clement


Packages version:
nvcc: 11.4
g++: 7.5.0
cmake: 3.18.2

$ pip list
Package Version


certifi 2021.10.8
cycler 0.11.0
falkon 0.6.3
joblib 1.1.0
kiwisolver 1.3.2
matplotlib 3.4.3
numpy 1.21.4
Pillow 8.4.0
pip 21.2.4
psutil 5.8.0
pykeops 1.4.2
pyparsing 3.0.5
python-dateutil 2.8.2
scikit-learn 1.0.1
scipy 1.7.2
setuptools 58.0.4
six 1.16.0
threadpoolctl 3.0.0
torch 1.10.0+cu113 (works fine on gpu)
typing-extensions 3.10.0.2
wheel 0.37.0

Installation traces: everything seemed to be fine:

$ pip install ./keops
Processing ./keops
DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
pip 21.3 will remove support for this functionality. You can find discussion regarding this at pypa/pip#7555.
Requirement already satisfied: numpy in /home/vignac/.conda/envs/falkon/lib/python3.9/site-packages (from pykeops==1.4.2) (1.21.4)
Building wheels for collected packages: pykeops
Building wheel for pykeops (setup.py) ... done
Created wheel for pykeops: filename=pykeops-1.4.2-py3-none-any.whl size=478011 sha256=dac861f7bd93a552854c4566aaf688a1deaaf737518a862f4818e04bc8b8d16d
Stored in directory: /tmp/pip-ephem-wheel-cache-bq1e39qr/wheels/36/47/f5/4be78e0d60dfe330cfb4652a2e21c469d4f6ea7bb0d0d767df
Successfully built pykeops
Installing collected packages: pykeops
Successfully installed pykeops-1.4.2

$ pip install .
Processing /home/vignac/falkon
DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
pip 21.3 will remove support for this functionality. You can find discussion regarding this at pypa/pip#7555.
Requirement already satisfied: torch>=1.4 in /home/vignac/.conda/envs/falkon/lib/python3.9/site-packages (from falkon==0.6.3) (1.10.0+cu113)
Collecting scipy
Downloading scipy-1.7.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.8 MB)
|████████████████████████████████| 39.8 MB 5.5 MB/s
Requirement already satisfied: numpy in /home/vignac/.conda/envs/falkon/lib/python3.9/site-packages (from falkon==0.6.3) (1.21.4)
Collecting scikit-learn
Downloading scikit_learn-1.0.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.7 MB)
|████████████████████████████████| 24.7 MB 108.8 MB/s
Collecting psutil
Downloading psutil-5.8.0-cp39-cp39-manylinux2010_x86_64.whl (293 kB)
|████████████████████████████████| 293 kB 125.6 MB/s
Requirement already satisfied: typing-extensions in /home/vignac/.conda/envs/falkon/lib/python3.9/site-packages (from torch>=1.4->falkon==0.6.3) (3.10.0.2)
Collecting threadpoolctl>=2.0.0
Using cached threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Collecting joblib>=0.11
Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Building wheels for collected packages: falkon
Building wheel for falkon (setup.py) ... done
Created wheel for falkon: filename=falkon-0.6.3-cp39-cp39-linux_x86_64.whl size=1270129 sha256=a55a9db44d82a77908f9c190aae18500e199d29563d74434bbeeff2ed977fe41
Stored in directory: /tmp/pip-ephem-wheel-cache-qj6av2bg/wheels/42/2f/de/817b4dc8ce9bdfe9d8d5b31d82288a66442e83b0509995d8a1
Successfully built falkon
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn, psutil, falkon
Successfully installed falkon-0.6.3 joblib-1.1.0 psutil-5.8.0 scikit-learn-1.0.1 scipy-1.7.2 threadpoolctl-3.0.0

Memory error solved by emptying CUDA cache

Hi FALKON team!

While using Falkon, I stumbled on what looks like a memory bug in the library.

Code to reproduce

import torch
from falkon.kernels import LinearKernel
from falkon import Falkon

n = 50000
d = 51000
l = 10

X = torch.randn((n, d))
y = torch.nn.functional.one_hot(torch.randint(0, l, (n,))).float()
sigma = 1
penalties = [1e-4, 1e-5]
for i in range(2):
    print(f"{i}")
    kernel = LinearKernel(sigma=sigma)
    model = Falkon(
        kernel=kernel,
        penalty=penalties[i],
        M=40000,
        maxiter=10,
        seed=0,
    )
    model.fit(X, y)
    predictions = model.predict(X)
    # torch.cuda.empty_cache()
    # If the line above is commented, FALKON induces a CUDA Out of Memory error.

Expected behavior

Fit different models, no errors.

Actual behavior

RuntimeError: CUDA out of memory. Tried to allocate 9.91 GiB (GPU 1; 31.75 GiB total capacity; 12.38 GiB already allocated; 8.59 GiB free; 21.75 GiB reserved in total by PyTorch)

In the above code,

  • adding the torch.cuda.empty_cache() line eliminates the issue.
  • changing d = 51000 to d = 30000 eliminates the issue.

Environment

  • Ubuntu 18.04 LTS
  • 4x Tesla V100 with 32 GB RAM each, CUDA 11.0
  • Python 3.8.8 with pytorch 1.9
  • Falkon compiled with pip

Let me know if I can provide any further information or assistance in fixing the issue! Thanks!

Get Out-of-memory when selecting m=10^5

Hi again,

And thanks again for the help last time; as mentioned, I will try to make a PR as soon as we confirm our experiment/method is working.
I get:

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1616554786529/work/aten/src/THC/THCCachingHostAllocator.cpp:278
when I try to set M=10^5 (I am on a laptop). Should I be getting this error, and can I expect it to work on a V100 with 32GB?

Thank you!

Best regards,
Robert

Add guards against passing F-contiguous tensors to KeOps

KeOps only works with C-contiguous tensors. This is generally fine, since initializing falkon with C-contiguous tensors (the common case) makes everything work.

But falkon should also work with F-contiguous inputs (possibly by transposing them appropriately?). Currently there is no check for this, so if the KeOps path is chosen, falkon will crash with an error such as

RuntimeError: [Keops] Arg at position 0: is not contiguous. Please provide 'contiguous' dara array, as KeOps does not support strides. If you're getting this error in the 'backward' pass of a code using torch.sum() on the output of a KeOps routine, you should consider replacing 'a.sum()' with 'torch.dot(a.view(-1), torch.ones_like(a).view(-1))'. 
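A minimal sketch of such a guard (illustrative, not falkon's code):

import torch

def ensure_c_contiguous(t: torch.Tensor) -> torch.Tensor:
    # KeOps requires C-contiguous (row-major) inputs; copy only when needed.
    return t if t.is_contiguous() else t.contiguous()

For F-contiguous inputs this makes a copy; transposing and adjusting the formula, as suggested above, would avoid it.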

OSError while running the sample code.

Running the following code on a GPU:

kernel = falkon.kernels.GaussianKernel(sigma=1, opt=options)
flk = falkon.Falkon(kernel=kernel, penalty=1e-5, M=5000, options=options)

gives OSError: /opt/conda/lib/python3.10/site-packages/falkon/c_ext/_C.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv.
Please provide a solution at the earliest.

Getting error

Hi!

Thank you for writing this library. It's very cool to finally be able to scale kernel regression to a billion points!

Running into the following:

RuntimeError: [KeOps] This KeOps shared object has been compiled without cuda support:

  1. to perform computations on CPU, simply set tagHostDevice to 0
  2. to perform computations on GPU, please recompile the formula with a working version of cuda.

Thank you for the help!

Best regards,
Robert

RuntimeError: Not compiled with CUDA support

      1 options = falkon.FalkonOptions(keops_active="force")
      3 kernel = falkon.kernels.GaussianKernel(sigma=1, opt=options)
----> 4 flk = falkon.Falkon(kernel=kernel, penalty=1e-5, M=5000, options=options)

yields this error message which I am unable to debug. Please help.

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/models/falkon.py:132, in Falkon.__init__(self, kernel, penalty, M, center_selection, maxiter, seed, error_fn, error_every, weight_fn, options)
    130 self.maxiter = maxiter
    131 self.weight_fn = weight_fn
--> 132 self._init_cuda()
    133 self.beta_ = None

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/models/model_utils.py:70, in FalkonBase._init_cuda(self)
     68 if self.use_cuda_:
     69     torch.cuda.init()
---> 70     self.num_gpus = devices.num_gpus(self.options)

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/utils/devices.py:212, in num_gpus(opt)
    210 global __COMP_DATA
    211 if len(__COMP_DATA) == 0:
--> 212     get_device_info(opt)
    213 return len([c for c in __COMP_DATA.keys() if c >= 0])

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/utils/devices.py:200, in get_device_info(opt)
    197     return __COMP_DATA
    199 for g in range(0, tcd.device_count()):
--> 200     __COMP_DATA = _get_gpu_device_info(opt, g, __COMP_DATA)
    202 if len(__COMP_DATA) == 0:
    203     raise RuntimeError("No suitable device found. Enable option 'use_cpu' "
    204                        "if no GPU is available.")

File ~/.conda/envs/falkon_env/lib/python3.10/site-packages/falkon/utils/devices.py:92, in _get_gpu_device_info(opt, g, data_dict)
     83 # try:
     84 #     from ..cuda.cudart_gpu import cuda_meminfo
     85 # except Exception as e:
   (...)
     89 # Some of the CUDA calls in here may change the current device,
     90 # this ensures it gets reset at the end.
     91 with tcd.device(g):
---> 92     mem_free, mem_total = cuda_mem_get_info(g)
     93     mem_used = mem_total - mem_free
     94     # noinspection PyUnresolvedReferences

RuntimeError: Not compiled with CUDA support

Bug with python 3.8, torch 2.0.0, cu117

With this combination, the wheel expects torch_cuda_cu.so and torch_cuda_cpp.so (i.e. the pytorch used for building had the split-libraries option enabled), but when installing torch with conda only torch_cuda.so is available (pytorch built without the split-libraries option).
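A quick hedged check for the mismatch described above: list the torch_cuda* shared libraries that the installed pytorch actually ships (the lib directory location is standard for pip/conda builds):

import glob, os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
print(sorted(glob.glob(os.path.join(lib_dir, "*torch_cuda*"))))  # split vs. monolithic CUDA libraries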

installing falkon seems to be successful but fails when calling import falkon

Hi

On my system I can use keops on the GPU without problems, and I have already installed CUDA 11.6. I have a 3090 Ti on Ubuntu 22.04.

When I installed falkon as instructed, using the command pip install git+https://github.com/falkonml/falkon.git, everything went fine without warnings/errors. But when I test it in a notebook using import falkon, I get the following error:


OSError                                   Traceback (most recent call last)
/tmp/ipykernel_1096354/295832182.py in <module>
      6 plt.style.use('ggplot')
      7
----> 8 import falkon

~/anaconda3/envs/repo/lib/python3.8/site-packages/falkon/__init__.py in <module>
      8     "c_ext", [os.path.dirname(__file__)])
      9 if spec is not None:
---> 10     torch.ops.load_library(spec.origin)
     11 else:
     12     raise ImportError("Failed to find C-extension. Please recompile Falkon.")

~/anaconda3/envs/repo/lib/python3.8/site-packages/torch/_ops.py in load_library(self, path)
    571     # static (global) initialization code in order to register custom
    572     # operators with the JIT.
--> 573     ctypes.CDLL(path)
    574     self.loaded_libraries.add(path)
    575

~/anaconda3/envs/repo/lib/python3.8/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    371
    372     if handle is None:
--> 373         self._handle = _dlopen(self._name, mode)
    374     else:
    375         self._handle = handle

OSError: /home/mc/anaconda3/envs/repo/lib/python3.8/site-packages/falkon/c_ext.so: undefined symbol: _ZN2at4cuda28getCurrentCUDASolverDnHandleEv

It looks like something is wrong with CUDA?

Using custom kernels in Falkon

Hey,

After checking kernel.py, I think it should be possible to write a custom kernel and use it in Falkon, but I'm not sure about the conditions under which I'll get proper speedups using my custom kernel. Let me elaborate on the kernel that I have:

Suppose that given a pair of datapoints (x1, x2), my kernel is deterministic, meaning that I have a function to compute k(x1, x2) directly (not super fast, and not trivial to compute, but deterministic). Thus, I think in this case no training (.fit) is needed in Falkon, am I right? Moreover, I'm not able to write a KeOps routine to compute my kernel (thus, if I use DiffKernel as my parent class, I cannot implement _keops_mmv_impl, and I have to disable KeOps).

I'd like to know whether, in such a case, I need to compute the full n^2 kernel matrix to obtain the KRR predictions, or whether my space and time complexity will be O(n sqrt(n)).

P.S.: In my case kernel computation is expensive, and I'd like to minimize the number of kernel evaluations.

Thanks for the great work!

Mac support

I tried installing falkon on a MacBook Pro 14 (M2 Pro) from source. It installs without any errors, but at runtime I run into the following error (see call stack) when running fit(). Is support for Mac planned?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/Users/ag2435/repos/falkon/notebooks/FalkonRegression.ipynb Cell 11 line 1
----> 1 model.fit(Xtr, Ytr)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/models/falkon.py:229, in Falkon.fit(self, X, Y, Xts, Yts, warm_start)
    227     if self.weight_fn is not None:
    228         ny_weight_vec = self.weight_fn(Y[ny_indices], X[ny_indices], ny_indices)
--> 229     precond.init(ny_points, weight_vec=ny_weight_vec)
    231 if _use_cuda_mmv:
    232     # Cache must be emptied to ensure enough memory is visible to the optimizer
    233     torch.cuda.empty_cache()

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/preconditioner/flk_preconditioner.py:101, in FalkonPreconditioner.init(self, X, weight_vec)
     99     else:  # If sparse tensor we need fortran for kernel calculation
    100         C = create_fortran((M, M), dtype=dtype, device=dev, pin_memory=self._use_cuda)
--> 101     self.kernel(X, X, out=C, opt=self.params)
    102 if not is_f_contig(C):
    103     C = C.T

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/kernels/kernel.py:173, in Kernel.__call__(self, X1, X2, diag, out, opt)
    171     params = dataclasses.replace(self.params, **dataclasses.asdict(opt))
    172 mm_impl = self._decide_mm_impl(X1, X2, diag, params)
--> 173 return mm_impl(self, params, out, diag, X1, X2)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/mmv_ops/fmm.py:554, in fmm(kernel, opt, out, diag, X1, X2)
    551 import falkon.kernels
    553 if isinstance(kernel, falkon.kernels.DiffKernel):
--> 554     return KernelMmFnFull.apply(kernel, opt, out, diag, X1, X2, *kernel.diff_params.values())
    555 else:
    556     return KernelMmFnFull.apply(kernel, opt, out, diag, X1, X2)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/torch/autograd/function.py:506, in Function.apply(cls, *args, **kwargs)
    503 if not torch._C._are_functorch_transforms_active():
    504     # See NOTE: [functorch vjp and autograd interaction]
    505     args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506     return super().apply(*args, **kwargs)  # type: ignore[misc]
    508 if cls.setup_context == _SingleLevelFunction.setup_context:
    509     raise RuntimeError(
    510         'In order to use an autograd.Function with functorch transforms '
    511         '(vmap, grad, jvp, jacrev, ...), it must override the setup_context '
    512         'staticmethod. For more details, please see '
    513         'https://pytorch.org/docs/master/notes/extending.func.html')

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/mmv_ops/fmm.py:480, in KernelMmFnFull.forward(ctx, kernel, opt, out, diag, X1, X2, *kernel_params)
    478     out = KernelMmFnFull.run_diag(X1, X2, out, kernel, False, is_sparse)
    479 elif comp_dev_type == "cpu" and data_dev.type == "cpu":
--> 480     out = KernelMmFnFull.run_cpu_cpu(X1, X2, out, kernel, comp_dtype, opt, False)
    481 elif comp_dev_type == "cuda" and data_dev.type == "cuda":
    482     out = KernelMmFnFull.run_gpu_gpu(X1, X2, out, kernel, comp_dtype, opt, False)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/mmv_ops/fmm.py:354, in KernelMmFnFull.run_cpu_cpu(X1, X2, out, kernel, dtype, options, diff)
    342 @staticmethod
    343 def run_cpu_cpu(X1, X2, out, kernel, dtype, options, diff):
    344     args = ArgsFmm(
    345         X1=X1,
    346         X2=X2,
   (...)
    352         differentiable=diff,
    353     )
--> 354     out = _call_direct(mm_run_starter, (args, -1))
    355     return out

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/mmv_ops/utils.py:86, in _call_direct(target, arg)
     84 args_queue.put(arg[0])
     85 new_args_tuple = (-1, args_queue, arg[1])
---> 86 return target(*new_args_tuple)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/mmv_ops/fmm.py:131, in mm_run_starter(proc_idx, queue, device_id)
    129     return sparse_mm_run_thread(X1, X2, out, kernel, n, m, computation_dtype, dev, tid=proc_idx)
    130 else:
--> 131     return mm_run_thread(X1, X2, out, kernel, n, m, computation_dtype, dev, tid=proc_idx)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/mmv_ops/fmm.py:291, in mm_run_thread(m1, m2, out, kernel, n, m, comp_dt, dev, tid)
    288 c_dev_out.fill_(0.0)
    290 # Compute kernel sub-matrix
--> 291 kernel.compute(c_dev_m1, c_dev_m2, c_dev_out, diag=False)
    293 # Copy back to host
    294 if has_gpu_bufs:

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/kernels/diff_kernel.py:91, in DiffKernel.compute(self, X1, X2, out, diag)
     90 def compute(self, X1: torch.Tensor, X2: torch.Tensor, out: torch.Tensor, diag: bool):
---> 91     return self.core_fn(X1, X2, out, **self.diff_params, diag=diag, **self._other_params)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/kernels/distance_kernel.py:163, in rbf_core(mat1, mat2, out, diag, sigma)
    161 mat1_div_sig = mat1 / sigma
    162 mat2_div_sig = mat2 / sigma
--> 163 norm_sq_mat1 = square_norm(mat1_div_sig, -1, True)  # b*n*1 or n*1
    164 norm_sq_mat2 = square_norm(mat2_div_sig, -1, True)  # b*m*1 or m*1
    166 out = _sq_dist(mat1_div_sig, mat2_div_sig, norm_sq_mat1, norm_sq_mat2, out)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/la_helpers/wrapper.py:129, in square_norm(mat, dim, keepdim)
    128 def square_norm(mat: torch.Tensor, dim: int, keepdim: Optional[bool] = None) -> torch.Tensor:
--> 129     return c_ext.square_norm(mat, dim, keepdim)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/c_ext/__init__.py:15, in _make_lazy_cuda_func.<locals>.call_cuda(*args, **kwargs)
     14 def call_cuda(*args, **kwargs):
---> 15     from ._backend import _assert_has_ext
     17     _assert_has_ext()
     18     return getattr(torch.ops.falkon, name)(*args, **kwargs)

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/c_ext/_backend.py:86
     84 lib_path = _get_extension_path("_C")
     85 try:
---> 86     torch.ops.load_library(lib_path)
     87 except OSError as e:
     88     # Hack: usually ld can't find torch_cuda_linalg.so which is in TORCH_LIB_PATH
     89     # if we load it first, then load_library will work.
     90     # TODO: This will only work on linux.
     91     if (missing_lib := lib_from_oserror(e)).startswith("libtorch_cuda_linalg"):

File ~/anaconda3/envs/falkon/lib/python3.10/site-packages/torch/_ops.py:643, in _Ops.load_library(self, path)
    638 path = _utils_internal.resolve_library_path(path)
    639 with dl_open_guard():
    640     # Import the shared library into the process, thus running its
    641     # static (global) initialization code in order to register custom
    642     # operators with the JIT.
--> 643     ctypes.CDLL(path)
    644 self.loaded_libraries.add(path)

File ~/anaconda3/envs/falkon/lib/python3.10/ctypes/__init__.py:374, in CDLL.__init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    371 self._FuncPtr = _FuncPtr
    373 if handle is None:
--> 374     self._handle = _dlopen(self._name, mode)
    375 else:
    376     self._handle = handle

OSError: dlopen(/Users/ag2435/anaconda3/envs/falkon/lib/python3.10/site-packages/falkon/c_ext/_C.so, 0x0006): symbol not found in flat namespace '__ZN2at6native14lapackCholeskyIdEEvciPT_iPi'

cuda runtime error (304) : OS call failed or operation not supported on this OS

When running the following example with Falkon, I run into a cuda runtime error.

Example:
```
from sklearn import datasets, model_selection
import numpy as np
import torch
import falkon
from falkon.models import Falkon
from falkon.kernels import GaussianKernel
from falkon.options import FalkonOptions

Xtrain = np.random.randn(80000, 1536)
Xtest = np.random.randn(10000, 1536)

Ytrain = np.random.randn(80000, 20)
Ytest = np.random.randn(10000, 20)

Xtrain = torch.from_numpy(Xtrain)
Xtest = torch.from_numpy(Xtest)
Ytrain = torch.from_numpy(Ytrain)
Ytest = torch.from_numpy(Ytest)

print("X TRAIN SHAPE: ", Xtrain.shape, Ytrain.shape, "TEST SHAPES: ", Xtest.shape, Ytest.shape)

kernel = GaussianKernel(sigma=5)
flk = Falkon(kernel=kernel, penalty=1e-5, M=Xtrain.shape[0])

flk.fit(Xtrain, Ytrain)
```

Error:

```
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=304 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
  File "falkon_test.py", line 26, in <module>
    flk.fit(Xtrain, Ytrain)
  File "/home/nehap/anaconda3/envs/falkon/lib/python3.7/site-packages/falkon/models/falkon.py", line 197, in fit
    ny_points = ny_points.pin_memory()
RuntimeError: cuda runtime error (304) : OS call failed or operation not supported on this OS at /opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/THC/THCCachingHostAllocator.cpp:278
```

Here is my .yml file:
```
name: falkon
channels:
  - conda-forge
  - pytorch
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2020.10.14=0
  - certifi=2020.6.20=py37_0
  - cmake=3.18.2=ha30ef3c_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - expat=2.2.10=he6710b0_2
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.10.4=h5ab3b9f_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - intel-openmp=2020.2=254
  - joblib=1.0.1=pyhd8ed1ab_0
  - jpeg=9b=h024ee3a_2
  - krb5=1.18.2=h173b8e3_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.11=h396b838_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libblas=3.9.0=1_h6e990d7_netlib
  - libcblas=3.9.0=3_h893e4fe_netlib
  - libcurl=7.71.1=h20c2e04_1
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.5.0=h14aa051_19
  - libgfortran4=7.5.0=h14aa051_19
  - libiconv=1.15=h63c8f33_5
  - libidn2=2.3.0=h27cfd23_0
  - liblapack=3.9.0=3_h893e4fe_netlib
  - libpng=1.6.37=hbc83047_0
  - libssh2=1.9.0=h1ba5d50_1
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.1.0=h2733197_1
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.40.0=h7b6447c_0
  - lz4-c=1.9.2=heb0550a_3
  - mkl=2020.2=256
  - mkl-service=2.3.0=py37he8ac12f_0
  - mkl_fft=1.2.0=py37h23d657b_0
  - mkl_random=1.1.1=py37h0573a6f_0
  - ncurses=6.2=he6710b0_1
  - nettle=3.7.2=hbbd107a_1
  - ninja=1.10.2=py37hff7bd54_0
  - numpy=1.19.2=py37h54aff64_0
  - numpy-base=1.19.2=py37hfa32c7d_0
  - olefile=0.46=py_0
  - openh264=2.1.0=hd408876_0
  - openssl=1.1.1h=h7b6447c_0
  - pillow=8.0.1=py37he98fc37_0
  - pip=20.3.3=py37h06a4308_0
  - python=3.7.9=h7579374_0
  - python_abi=3.7=1_cp37m
  - pytorch=1.8.1=py3.7_cuda10.1_cudnn7.6.3_0
  - readline=8.0=h7b6447c_0
  - rhash=1.4.0=h1ba5d50_0
  - scikit-learn=0.23.2=py37hddcf8d6_3
  - scipy=1.5.3=py37h8911b10_0
  - setuptools=51.0.0=py37h06a4308_2
  - six=1.15.0=py37h06a4308_0
  - sqlite=3.33.0=h62c20be_0
  - threadpoolctl=2.1.0=pyh5ca1d4c_0
  - tk=8.6.10=hbc83047_0
  - torchaudio=0.8.1=py37
  - torchvision=0.9.1=py37_cu101
  - typing_extensions=3.7.4.3=py_0
  - wheel=0.36.2=pyhd3eb1b0_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.5=h9ceee32_0
  - pip:
    - falkon==0.6.3
    - psutil==5.8.0
    - pykeops==1.4.2
prefix: /home/nehap/anaconda3/envs/falkon
```

I'm currently using a single TITAN RTX GPU with 24 GB of memory, and my CPU has 128 GB of memory. The example works if we reduce the number of dimensions from 1536 to 20, but with larger datasets it seems to run into this issue. We would appreciate any help with this issue - thank you!
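
In the meantime, here is a minimal sketch of the reduced configuration we are falling back to (our assumption being that float32 data and far fewer Nyström centers shrink the buffers that fit() tries to pin in host memory; the option names mirror those used elsewhere in this tracker):

```
import numpy as np
import torch
import falkon

# Assumption: float32 data and M << n reduce the pinned host buffers
# allocated during fit(), avoiding the failing pin_memory() call.
Xtrain = torch.from_numpy(np.random.randn(80000, 1536).astype(np.float32))
Ytrain = torch.from_numpy(np.random.randn(80000, 20).astype(np.float32))

options = falkon.FalkonOptions(keops_active="no")
kernel = falkon.kernels.GaussianKernel(sigma=5, opt=options)
flk = falkon.Falkon(kernel=kernel, penalty=1e-5, M=10000, options=options)
flk.fit(Xtrain, Ytrain)
```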

Automatic hyperparameter optimization for regression

Hi there,

I'm trying to use the hopt features for a regression problem, so I'm currently adapting the hopt example to a regression dataset. This includes:

  • Converting the train and test Y data with .to(dtype=torch.float32)
  • Getting rid of the one-hot encoding
  • Replacing the example's global loss function (mclass_loss) with an appropriate one. I'm using the following so far:

```
def mclass_loss(true, pred):
    mae = torch.nn.L1Loss()
    return mae(true, pred)
```

But I'm getting the following error right off the bat (before even calling mclass_loss):

```
Traceback (most recent call last):
  File "/gpfsssd/scratch/rech/tta/uam43iy/tests/falkon_opt_pivOF/opt_hp.py", line 57, in <module>
    loss = model(X_train, Y_train)
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.11.0+py3.9.12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genimp01/uam43iy/.local/lib/python3.9/site-packages/falkon/hopt/objectives/exact_objectives/sgpr.py", line 32, in forward
    L, A, AAT, LB, c = self._calc_intermediate(X, Y)
  File "/linkhome/rech/genimp01/uam43iy/.local/lib/python3.9/site-packages/falkon/hopt/objectives/exact_objectives/sgpr.py", line 82, in _calc_intermediate
    c = torch.triangular_solve(AY, LB, upper=False).solution / sqrt_var
RuntimeError: torch.triangular_solve: Expected b to have at least 2 dimensions, but it has 1 dimensions instead
(pytorch-gpu-1.11.0+py3.9.12) bash-4.4$ python opt_hp.py
/linkhome/rech/genimp01/uam43iy/.local/lib/python3.9/site-packages/falkon/hopt/objectives/exact_objectives/sgpr.py:75: UserWarning: torch.triangular_solve is deprecated in favor of torch.linalg.solve_triangular and will be removed in a future PyTorch release.
torch.linalg.solve_triangular has its arguments reversed and does not return a copy of one of the inputs.
X = torch.triangular_solve(B, A).solution
should be replaced with
X = torch.linalg.solve_triangular(A, B). (Triggered internally at /gpfs7kro/gpfslocalsup/src/pub/anaconda-py3/2021.05/pytorch-1.11.0+py3.9.12/pytorch-1.11.0/aten/src/ATen/native/BatchLinearAlgebra.cpp:1672.)
  A = torch.triangular_solve(kmn, L, upper=False).solution / sqrt_var
Traceback (most recent call last):
  File "/gpfsssd/scratch/rech/tta/uam43iy/tests/falkon_opt_pivOF/opt_hp.py", line 57, in <module>
    loss = model(X_train, Y_train)
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.11.0+py3.9.12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genimp01/uam43iy/.local/lib/python3.9/site-packages/falkon/hopt/objectives/exact_objectives/sgpr.py", line 32, in forward
    L, A, AAT, LB, c = self._calc_intermediate(X, Y)
  File "/linkhome/rech/genimp01/uam43iy/.local/lib/python3.9/site-packages/falkon/hopt/objectives/exact_objectives/sgpr.py", line 82, in _calc_intermediate
    c = torch.triangular_solve(AY, LB, upper=False).solution / sqrt_var
RuntimeError: torch.triangular_solve: Expected b to have at least 2 dimensions, but it has 1 dimensions instead
```

Is this related to the one-hot representation used in the classification problem of the example? Can the hyperparameter optimization methods be used for regression problems here?
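
For what it's worth, here is a minimal sketch of the workaround I am about to try (purely my assumption, based on the "Expected b to have at least 2 dimensions" message): once the one-hot encoding is removed, Y becomes 1-D, so giving it an explicit second dimension may be enough.

```
# Assumption: the SGPR objective expects a 2-D target, so reshape the
# 1-D regression targets to shape (n, 1) before calling the model.
Y_train = Y_train.reshape(-1, 1).to(dtype=torch.float32)
Y_test = Y_test.reshape(-1, 1).to(dtype=torch.float32)

loss = model(X_train, Y_train)  # the call that fails in the traceback above
```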

thanks,
Arthur

Fixing doc-strings

Autogenerated documentation is incomplete. Missing:

  • mmv_ops
  • sparse
  • center_selection
  • gsc_losses (the existing doc-strings need a rewrite)

Related to #1

FixedSelector Center selection in Falkon Model

Hello,

I am trying to pass a FixedSelector instance to center_selection in the Falkon constructor; however, I obtain the same error as with the default "Uniform" selector, which leads me to suspect that the default is being used instead.

This is the code I am using:

```
import torch
from falkon import Falkon, kernels
from falkon.center_selection import FixedSelector

# `indices`, `Xtrain`, `Ytrain` and `options` are defined earlier in my script.
indices_torch = torch.from_numpy(indices).reshape(-1, 1)
X_centers_init = Xtrain[indices].clone()
Y_centers_init = Ytrain[indices].clone()

selector = FixedSelector(X_centers_init, Y_centers_init, indices_torch)

kernel = kernels.GaussianKernel(sigma=1.352)

model = Falkon(
    maxiter=100,
    kernel=kernel,
    penalty=1.07e-06,
    M=20000,
    center_selection=selector,
    options=options,
)
```
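
As a sanity check, I was planning to compare the centers stored on the fitted model with the ones I passed in (my assumption, from reading the source, is that they end up in ny_points_):

```
model.fit(Xtrain, Ytrain)

# Assumption: the fitted model stores its Nystrom centers as `ny_points_`.
print("FixedSelector centers used:",
      torch.equal(model.ny_points_, X_centers_init))
```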

Could you advise on what is going wrong here?

Running with 3 or more GPUs

Hi,
I tried to run FALKON with 3 GPUs but I got the following error:

```
Traceback (most recent call last):
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/utils/threading.py", line 15, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/"user"//.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 138, in mmv_run_starter
    return mmv_run_thread(X1, X2, v, out, kernel, blk_n, blk_m, mem_needed, dev, tid=proc_idx)
  File "/home/"user"//.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 251, in mmv_run_thread
    flat_gpu = torch.empty(size=(mem_needed,), dtype=m1.dtype, device=dev)
RuntimeError: CUDA out of memory. Tried to allocate 21.00 GiB (GPU 0; 31.75 GiB total capacity; 5.57 GiB already allocated; 20.88 GiB free; 9.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/"user"/research/knotty/run/main.py", line 38, in <module>
    alpha, acc_valid_ep3,nystrom_samples,knots_x,acc_ep2_test= run(**args,wandb_run=wandb_run)
  File "/home/"user"/research/knotty/run/run.py", line 225, in run
    Falkon_loss, accu_falkon = falkon_run(dataset, kernel_fn, options, p=num_knots, epochs=20,
  File "/home/"user"/research/knotty/run/run.py", line 34, in falkon_run
    flk.fit(x_train, y_train)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/models/falkon.py", line 264, in fit
    beta = optim.solve(
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/optim/conjgrad.py", line 310, in solve
    B = self.kernel.mmv(M, X, y_over_n, opt=self.params)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/kernels/kernel.py", line 266, in mmv
    return mmv_impl(X1, X2, v, self, out, params)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 734, in fmmv
    return KernelMmvFnFull.apply(kernel, opt, out, X1, X2, v, *kernel.diff_params.values())
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 695, in forward
    KernelMmvFnFull.run_cpu_gpu(X1, X2, v, out, kernel, opt, False)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 641, in run_cpu_gpu
    outputs = _start_wait_processes(mmv_run_starter, args)
  File "/home/"user"/conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/utils.py", line 59, in _start_wait_processes
    outputs.append(p.join())
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/utils/threading.py", line 22, in join
    raise RuntimeError('Exception in thread %s' % (self.name)) from self.exc
RuntimeError: Exception in thread GPU-0
```
It works fine with 1 or 2 GPUs. I was wondering whether using 3 or more GPUs can make FALKON even faster?
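
In case it helps narrow things down, here is the sketch I am testing next (assuming, from my reading of the docs, that FalkonOptions accepts a max_gpu_mem cap on per-GPU allocations; the penalty value here is just a placeholder, while kernel_fn, num_knots, x_train and y_train are from my script above):

```
import falkon

# Assumption: max_gpu_mem (in bytes) caps Falkon's per-GPU allocations,
# forcing smaller blocks instead of one 21 GiB buffer.
options = falkon.FalkonOptions(max_gpu_mem=16 * 2**30)  # ~16 GiB per device

flk = falkon.Falkon(kernel=kernel_fn, penalty=1e-6, M=num_knots, options=options)
flk.fit(x_train, y_train)
```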

Thank you for your help.
