group-lasso's Introduction

⚠️⚠️ Disclaimer ⚠️⚠️: This package is no longer maintained.

If you are looking for efficient and scikit-learn-like models with group structure, such as Group Lasso and Group Logistic Regression, have a look at skglm.

skglm provides efficient and scikit-learn-compatible models with group structure such as Group Lasso and Group Logistic Regression. It extends the features of scikit-learn for Generalized Linear Models by implementing a wealth of missing models. Check out the documentation for the full list of supported models, and the Gallery of examples to see its speed and efficiency when tackling large scale problems.

If you are looking for the group-lasso documentation, view the old version of the README.

group-lasso's People

Contributors

badr-moufad, cruyffturn, mathurinm, solidorange, sroet, testinf200, yngvem

group-lasso's Issues

Get ids of nonzero groups

Currently, the only way to learn which coefficients are nonzero is to look at the sparsity mask. However, it would be useful to have a list of the groups with at least one non-zero element.

Difficulty: Easy
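A hedged workaround sketch, in the spirit of the chosen_groups_ property quoted later in this document: derive the surviving group ids from the per-feature group labels and the sparsity mask. The helper name is illustrative and assumes a fitted estimator gl whose groups_ and sparsity_mask_ attributes have one entry per feature.

import numpy as np

def nonzero_group_ids(gl):
    # Group label and sparsity flag for every coefficient
    groups = np.asarray(gl.groups_).ravel()
    mask = np.asarray(gl.sparsity_mask_).ravel()
    # Ids of groups with at least one non-zero coefficient
    return set(np.unique(groups[mask]))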

Support sparse arrays

Since group lasso is well suited for dummy-encoded covariates, it should support sparse arrays. Currently the code is not tested for that, but there is no reason why this should not be supported.

Comparison with the R package grplasso and imitation of some functions

Hi Yngve

Have you ever examined how this package's functionality relates to similar implementations in R (grplasso, gglasso)?

Below, I'm sharing some code with you that I've created recently. Inspired by the grplasso package in R, I've implemented (simplified versions of) lambdamax and plot.grplasso.

  • lambdamax(): Determines the value of the penalty parameter lambda when the first penalized parameter group enters the model.
  • plot_evolution(): Plots the solution path of a regression solution.

I've restricted myself to the logistic regression case, using the "colon" dataset provided by gglasso. See the attached .csv that contains the demo data.

I don't know what your plans are for your package, but I'm sharing the code so you or others can use it. I've noticed that there are some differences when comparing grplasso with group-lasso, but I did not have a thorough look at everything.

Data: colon.csv

Result: evolution plot. It gives the number of groups selected for decreasing values of $\lambda$.

[Figure: evolution plot]


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from group_lasso import LogisticGroupLasso

np.random.seed(0)
LogisticGroupLasso.LOG_LOSSES = True

def compute_grplr(X, y, param=0.05):
  gl = LogisticGroupLasso(
      groups=groups,
      group_reg=param,
      n_iter=100000,
      l1_reg=0,
      scale_reg="inverse_group_size",
      subsampling_scheme=None,
      supress_warning=True,
  )
  gl.fit(X, y)
  return gl


def plot_grplr(gl):
  # Plot the results
  coef = gl.coef_[:, 1] - gl.coef_[:, 0]
  plt.figure()
  plt.plot(coef / np.linalg.norm(coef), ".", 
           label="Estimated weights")

  plt.figure()
  plt.plot(gl.losses_)

  plt.show()
  
  
def lambdamax(X, y, hi=1, lo=0.01, mode="log", 
              tol=1e-3, iter_max=10, **kwargs):
  """
  Naive imitation of grplasso's lambdamax():
  find the value of the penalty parameter lambda at which
  the first penalized parameter group enters the model.
  Algorithm: bisection.
  """
  def count_nonzero_coefs(X, y, lmbda, **kwargs):
    gl = compute_grplr(X, y, param=lmbda, **kwargs)
    coef = gl.coef_[:, 1] - gl.coef_[:, 0]
    return sum(coef!=0)
  
  def bisect(func, x_min, x_max, x_func=None,
             tol=1e-7, iter_max=None, **kwargs):
    # Problem: we need to solve a "degenerate" root finding problem
    # https://stackoverflow.com/questions/76168787
    y_min = func(x_min, **kwargs)
    if y_min > 0:
      msg = "Warning: no solution as y_min>0, with x_min=%f."
      print(msg % x_min)
      return x_min
    y_max = func(x_max, **kwargs)
    if y_max <= 0:
      msg = "Warning: no solution as y_max<=0, with x_max=%f."
      print(msg % x_max)
      return x_max
    if tol is None and iter_max is None:
        tol = 1e-7
    if x_func is None:
      x_func = lambda x0, x1: (x1+x0)/2
    
    from itertools import count, islice
    x_last = np.inf
    for cnt in islice(count(), iter_max):
      x_new = x_func(x_min, x_max)
      y_new = func(x_new, **kwargs)
      if y_new<=0:
        x_min = x_new
      else:
        x_max = x_new
      if (tol is not None) and abs(x_last - x_new) < tol:
        break
      x_last = x_new
    neg_direction = x_min > x_max
    if neg_direction:
      return x_max if y_new>0 else x_min
    else:
      return x_min if y_new>0 else x_max
    
  if mode=="log":
    # Geometric mean
    x_func = lambda x0, x1: np.sqrt(x0*x1)
  else:
    # Arithmetic mean
    x_func = lambda x0, x1: (x0+x1)/2
    
  func = lambda x, **kwargs: count_nonzero_coefs(X, y, lmbda=x, **kwargs)
  lmbda = bisect(func=func, x_min=hi, x_max=lo, x_func=x_func,
                 tol=tol, iter_max=iter_max, **kwargs)
  return lmbda
  
  
def plot_evolution(X, y):
  coefs = []
  lambda_start = lambdamax(X, y, hi=1, lo=0.01)
  exp_start = np.log10(lambda_start)
  lambdas = np.logspace(exp_start, -2, 20)
  for i, lmbda in enumerate(lambdas):
    print("Step i=%d, lambda=%.3f" % (i+1, lmbda))
    gl = compute_grplr(X, y, param=lmbda)
    coef = gl.coef_[:, 1] - gl.coef_[:, 0]
    coefs.append(coef)
  coefs = np.vstack(coefs)
  fig, ax = plt.subplots()
  ax.plot(lambdas, coefs)
  ax.set_xscale("log")
  ax.invert_xaxis()
  ax.set_xlabel("log(lambda)")
  ax.set_ylabel("coeff")
  plt.show()
  

df = pd.read_csv("colon.csv", index_col=[0])
y = df["y"].copy()
X = df.loc[:,df.columns != "y"].copy()
groups = np.repeat(range(1,21), 5)

gl = compute_grplr(X, y, param=0.1)
#gl = compute_grplr(X, y, param=0.05)
plot_grplr(gl)
plot_evolution(X, y)

Comparison with the R package grplasso. Same data, similar code. The results are not exactly the same, but I have not investigated why that is.

library(pacman)
pacman::p_load("grplasso", "gglasso")  # install if needed, then load

# Load the colon data (provided by gglasso)
data(colon)
# Define group index
group <- rep(1:20, each=5)
# y must take values 0 and 1
colon$y <- (colon$y+1)/2

# Determine the value of the penalty parameter lambda when 
# the first penalized parameter group enters the model.
lambda <- lambdamax(x=colon$x, y=colon$y, model = LogReg(), 
                    index=group, standardize = TRUE)
# Create a sequence of lambda values to sample...
lambda <- lambda * 0.8^seq(0,8,0.1)

# Fit a model using the specified lambda sequence.
# Equation: y ~ . Means: y against all other columns.
fit <- grplasso(x=colon$x, y=colon$y, model = LogReg(), index=group,
                lambda = lambda, standardize = TRUE)

# With some explicit settings... (trace=0 for quiet evaluation)
fit <- grplasso(x=colon$x, y=colon$y, model = LogReg(), index=group, 
                lambda = lambda, standardize = TRUE,
                control = grpl.control(trace = 0, inner.loops = 10,
                                       update.every = 1, 
                                       update.hess = "lambda"))

# Plot the solution path of the group lasso regression.
plot(fit, log = "x")

Evolution plot in R:

[Figure: evolution plot produced with R's grplasso]

Specify Multiple Groups for One Feature

Nice package! Just following up from another thread: in your package, is it possible to specify multiple groups for one feature (e.g. overlapping groups)? Thanks!
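As far as I know, the package expects each feature to belong to a single group. A common workaround for overlapping groups is the "latent group lasso" trick: duplicate the columns that belong to several groups before fitting and assign each copy to one group. The snippet below only illustrates that trick and is not a package feature; the coefficient of an original feature is then the sum of the coefficients of its copies.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))         # features f0, f1, f2
overlapping_groups = [[0, 1], [1, 2]]    # f1 belongs to both groups

cols, group_labels = [], []
for gid, members in enumerate(overlapping_groups):
    for j in members:
        cols.append(X[:, j])             # duplicate the column for this group
        group_labels.append(gid)

X_expanded = np.column_stack(cols)       # shape (10, 4): f0, f1, f1, f2
group_labels = np.array(group_labels)    # [0, 0, 1, 1], one label per column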

Add option for disabling scaling of regularisation

Currently, the group-wise regularisation coefficients are scaled by the square root of the group size, so that the influence of each variable is the same. However, this does not make sense for dummy-encoded variables.

An option should therefore be added at init time to disable the group-wise scaling, so the user doesn't need to feed in a vector of regularisation coefficients.
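For reference, the constructor already exposes a scale_reg argument (it appears with "inverse_group_size" in code snippets later in this document). Below is a hedged sketch of how disabling the scaling might look; whether "none" is an accepted value is an assumption on my part and should be checked against the docstring.

import numpy as np
from group_lasso import GroupLasso

groups = np.repeat(np.arange(4), 3)   # 4 dummy-encoded variables, 3 columns each
# scale_reg="none" is assumed to turn off the group-size scaling entirely
gl = GroupLasso(groups=groups, group_reg=0.05, l1_reg=0, scale_reg="none")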

GroupLasso silently assumes that the labels y are stored in a column vector

Thanks for your work so far. Had a first look at your group-lasso tool. Looks as if it runs super fast, though I haven't tested the performance yet in detail.

I observed a small problem if y is stored in a 1d array:

How to reproduce:

X = ...
y = ...
y = y.flatten()

gl.fit(X,y)

This will yield the following exception.

Traceback (most recent call last):
  File "trainExplore.py", line 73, in <module>
    executor.run()
  File "/Users/norman/workspace/education/phd/projects/geomtk/python/utilities/executor.py", line 811, in run
    args.func(args)
  File "/Users/norman/workspace/education/phd/projects/geomtk/python/utilities/executor.py", line 403, in _runSingle
    self.ret = self._functor(args.file, args.outDir, args, taskInfo)
  File "trainExplore.py", line 48, in runTraining
    returnStd=False)
  File "/Users/norman/workspace/education/phd/projects/geomtk/python/utilities/ml.py", line 338, in testBinaryClassifiers
    clf.fit(XTrain, yTrain)
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_group_lasso.py", line 258, in fit
    self._fista(X, y, lipschitz_coef=lipschitz_coef)
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_group_lasso.py", line 208, in _fista
    prox
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_group_lasso.py", line 155, in _fista_it
    u_ = prox(v - grad(v)/L)
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_group_lasso.py", line 185, in grad
    SSE_grad = _subsampled_l2_grad(X, w, y, self.subsampling_scheme)
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_group_lasso.py", line 32, in _subsampled_l2_grad
    A, b = subsample(subsampling_scheme, A, b)
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_subsampling.py", line 54, in subsample
    return _extract_from_singleton_iterable([X[inds, :] for X in Xs])
  File "/Users/norman/workspace/dev/misc/python/group-lasso/group_lasso/_subsampling.py", line 54, in <listcomp>
    return _extract_from_singleton_iterable([X[inds, :] for X in Xs])
IndexError: too many indices for array

You probably need a check at the beginning of the fit function.
Just a detail; I just wanted to let you know.

Good luck with the further development!

LogisticGroupLasso.chosen_groups_ raises a TypeError

First, a sign of gratitude: I finally had a look at the LogisticGroupLasso() estimator. Thanks so much for extending your library with an LR classifier! So far, it works very well and blends in perfectly with the sklearn infrastructure. Great job! (For my data, I currently get some convergence warnings from FISTA, but I guess I have to play around with the parameters first, and that's not the issue here.)


Problem: Calling LogisticGroupLasso.chosen_groups_ raises the following exception:

Traceback (most recent call last):
  File "example_logistic_group_lasso.py", line 90, in <module>
    print(gl.chosen_groups_)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/group_lasso/_group_lasso.py", line 442, in chosen_groups_
    return set(np.unique(self.groups_[self.sparsity_mask_]))
TypeError: only integer scalar arrays can be converted to a scalar index

One can reproduce the problem in example_group_lasso.py.

Is it correct that chosen_groups_ is supposed to return the group ids that have been selected by group lasso? Isn't this information also reflected by gl.group_reg_vector_?

``LogisticGroupLasso`` is not equivalent to ``scikit-learn`` ``LogisticRegression`` with groups of size 1

Problem

LogisticGroupLasso doesn't give the same regression coefficients as scikit-learn's LogisticRegression when fitted on groups of size 1.

Expected behavior

Group logistic regression with groups of size one should be equivalent to logistic regression.

Script to reproduce

import numpy as np
from sklearn.linear_model import LogisticRegression
from group_lasso import LogisticGroupLasso


n_samples, n_features = 20, 60

# generate dummy data
rng = np.random.RandomState(123)
X = rng.randn(n_samples, n_features)
y = np.sign(rng.randn(n_samples))

# max regularization parameter
alpha_max = np.linalg.norm(X.T @ y, ord=np.inf) / (2 * n_samples)
alpha =  0.1 * alpha_max 

# fit scikit-learn log reg 
sk_model = LogisticRegression(
    penalty='l1',
    C=1/(n_samples * alpha),  # scikit-learn uses an un-normalized loss
    fit_intercept=False, tol=1e-9, solver='liblinear'
).fit(X, y)
sk_coef = sk_model.coef_.flatten()

# group log reg
yn_model = LogisticGroupLasso(
    group_reg=alpha,
    groups=np.arange(n_features),
    fit_intercept=False,
    subsampling_scheme=None,
    tol=1e-9,
    l1_reg=0.,
).fit(X, y)
coef_ = yn_model.coef_
yn_coef = coef_[:, 1]

np.testing.assert_allclose(yn_coef, sk_coef)

output

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 10 / 60 (16.7%)
Max absolute difference: 0.96593495
Max relative difference: 1.
 x: array([ 0.      , -0.      , -0.010529,  0.      ,  0.      ,  0.43878 ,
        0.      ,  0.337345,  0.15179 , -0.      ,  0.066626,  0.      ,
       -0.      ,  0.      , -0.      ,  0.      ,  0.      , -0.      ,...
 y: array([ 0.      ,  0.      , -0.034694,  0.      ,  0.      ,  1.111946,
        0.      ,  0.840564,  0.309066,  0.      ,  0.15613 ,  0.      ,
        0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,...

automatic Ridge regression

Hi, I've been testing your library, and when using GroupLasso with both l1_reg=0 and group_reg=0 I get a solution which I think is not the minimizer of L(beta, X, y), but rather of that loss plus a strong ridge penalty.

Is this the expected behaviour?

I am interested only in adding lasso regularization, without including any kind of ridge regression.

Thank you very much for your repo

Poisson regression

There have been several requests for group-lasso-regularised Poisson regression. This is more difficult than first thought, as the gradient of the Poisson loss is not globally Lipschitz.

The loss gradient is, however, locally Lipschitz. A line search for the Lipschitz constant can therefore converge (and will, under some reasonable assumptions). It is unclear whether the restarting scheme will work with a line search, so that needs to be investigated further.

Unfortunately, the code is currently structured so that major refactoring is needed to support a line search for the Lipschitz constant.
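For illustration, here is a minimal, generic sketch (not the package's code) of a backtracking line search on the local Lipschitz estimate inside a proximal gradient step; loss, grad and prox are assumed callables for the smooth loss, its gradient and the proximal operator of the penalty.

import numpy as np

def prox_grad_step_backtracking(w, loss, grad, prox, L0=1.0, eta=2.0, max_bt=50):
    # One proximal gradient step; L is increased until the quadratic upper
    # bound (majorisation) of the smooth loss holds at the candidate point.
    L = L0
    g = grad(w)
    f_w = loss(w)
    for _ in range(max_bt):
        w_new = prox(w - g / L, 1.0 / L)   # prox of the penalty with step size 1/L
        diff = w_new - w
        if loss(w_new) <= f_w + g @ diff + 0.5 * L * np.sum(diff ** 2):
            return w_new, L                # sufficient decrease: accept the step
        L *= eta                           # step too large: increase the Lipschitz estimate
    return w_new, L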

Solver seems to diverge on Lasso problem

Hello @yngvem, while trying to benchmark your solver I came across this weird behavior:

import time
import numpy as np

from numpy.linalg import norm
import group_lasso
from sklearn.datasets import fetch_openml


dataset = fetch_openml('leukemia')
X = np.asfortranarray(dataset.data)
# make n_features non prime
X = X[:, :7128]
X /= norm(X, axis=0)[None, :]

np.random.seed(0)
y = X @ np.random.randn(X.shape[1])
noise = np.random.randn(X.shape[0])
y += 0.5 * norm(y) * noise / norm(noise)  # SNR of 2
y /= norm(y)

grp_size = 1  # amounts to a Lasso
alpha_max = norm(norm((X.T @ y).reshape(-1, grp_size), axis=1),
                 ord=np.inf) / len(y)

groups_yngvem = np.repeat(np.arange(X.shape[1] // grp_size), grp_size)
alpha = alpha_max / 2


t0 = time.time()
clf2 = group_lasso.GroupLasso(
    groups=groups_yngvem, group_reg=alpha, fit_intercept=False, l1_reg=0)
clf2.fit(X, y[:, None])
t1 = time.time()
print("yngvem: %.3f s" % (t1 - t0))

w = clf2.coef_[:, 0]

print('norm coef: %f, intercept: %f' % (norm(w), clf2.intercept_))

The output seems to indicate that the solver is diverging:

yngvem: 34.497 s
norm coef: 364630147741370583765123205856284522915143760751402823711210582759452252246484410896325964517805437969243038142816346667477476404348394286350336.000000, intercept: 0.000000

I would expect the results of a regular Lasso. Am I using the API wrongly?

of interest @agramfort

Typo in _group_lasso.py raises error when applying GridSearchCV

In `_group_lasso.py`:

    super().__init__(
        groups=groups,
        l1_reg=l1_reg,
        group_reg=group_reg,
        n_iter=n_iter,
        tol=tol,
        scale_reg=scale_reg,
        subsampling_scheme=subsampling_scheme,
        fit_intercept=fit_intercept,
        random_state=random_state,
        warm_start=warm_start,
        old_regularisation=old_regularisation,
        supress_warning=supress_warning,
    )
    self.frobenius_lipchitz = frobenius_lipschitz


I believe self.frobenius_lipchitz is a typo and it should be self.frobenius_lipschitz for consistency.
I was trying to use sklearn.model_selection.GridSearchCV to choose hyperparameters for this estimator, and it raised the error "GroupLasso object has no attribute frobenius_lipschitz", which led me to find this typo.

DOC: what is the objective function for GroupLasso?

I am having a hard time getting the same results as sklearn or other Lasso/GroupLasso solvers.

Running the following snippet, I get all-zero coefficients for clf2:

import numpy as np

from numpy.linalg import norm
import group_lasso
from sklearn.datasets import fetch_openml


dataset = fetch_openml('leukemia')
X = np.asfortranarray(dataset.data)
y = np.array([-1 if lab == "ALL" else 1 for lab in dataset.target])
X = X[:, :500]

X -= X.mean(axis=0)
X /= norm(X, axis=0)

grp_size = 1  # amounts to a Lasso
alpha_max = np.max(np.abs(X.T @ y)) / len(y)

groups_yngvem = np.arange(1, X.shape[1] + 1)
alpha = alpha_max / 2

clf2 = group_lasso.GroupLasso(
    groups=groups_yngvem, group_reg=alpha, fit_intercept=False,
    l1_reg=0, supress_warning=True, tol=1e-10, n_iter=10000)
clf2.fit(X, y[:, None])

w = clf2.coef_[:, 0]
print("obj:", np.sum((y - X @ w) ** 2) / 2 + alpha * norm(w, ord=1))

Is this normal? Since alpha < alpha_max (= norm(X.T @ y, ord="inf") / len(y)), I would expect the results to be non-zero.

Writing the objective in the docstring, as is done e.g. in sklearn.linear_model.Lasso, would help; what do you think?
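For reference, a hedged sketch of the usual sparse group lasso objective; whether group-lasso scales the data-fit term by $1/(2n)$, $1/n$ or not at all is exactly the ambiguity raised above, and $w_g$ denotes the (optional) group-size weights:

$$\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda_{\text{group}} \sum_{g \in \mathcal{G}} w_g \,\lVert \beta_g \rVert_2 \;+\; \lambda_1 \,\lVert \beta \rVert_1$$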

About Group lasso logistic regression using PGD

Thanks for your contribution to the group lasso question! What should I do if I want to solve group-lasso logistic regression using the proximal gradient descent method? Can you give me some advice?
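The package itself uses FISTA, but for intuition here is a minimal sketch of plain proximal gradient descent for group-lasso-penalised logistic regression (names and scalings are illustrative, not the package's implementation). It assumes y takes values in {0, 1} and groups is a list of column-index arrays.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_soft_threshold(w, groups, threshold):
    # Proximal operator of threshold * sum_g ||w_g||_2
    w = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        w[idx] = 0.0 if norm <= threshold else (1.0 - threshold / norm) * w[idx]
    return w

def pgd_group_logreg(X, y, groups, reg=0.1, n_iter=1000):
    n, p = X.shape
    # Step size = 1 / Lipschitz bound of the mean logistic loss gradient
    step = 4.0 * n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / n   # gradient of the mean log-loss
        w = group_soft_threshold(w - step * grad, groups, step * reg)
    return w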

Why are the three numbers in 'yhat' the same?

Hello!
I don't know why the three numbers in 'yhat' are the same and why 'w_hat' is all zeros. I'd like to get a 'w_hat' with a few non-zero entries so that I can tell which groups of 'X' are used. Can you help me with that? Thank you very much!

###############################################################################
# Setup
# -----

import numpy as np
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

from group_lasso import GroupLasso

np.random.seed(0)     
GroupLasso.LOG_LOSSES = True

###############################################################################
# Set dataset parameters
# ----------------------
group_sizes = [3, 3, 2, 2, 2]
active_groups = np.ones(5)
active_groups[:3] = 2
np.random.shuffle(active_groups)
np.random.shuffle(active_groups)
groups = np.concatenate(
    [size * [i] for i, size in enumerate(group_sizes)]
).reshape(-1, 1)
num_coeffs = sum(group_sizes)
num_datapoints = 3

print("group_sizes:", group_sizes)
print("active_groups:", active_groups)
print("groups:", groups.shape)
print("num_coeffs:", num_coeffs)
print("____________________________________________")

###############################################################################
# Generate data matrix
# --------------------
X = np.array([[0.571, 0.095, 0.767, 0.095, 0.105, 0.767, 0.571, 0.767, 0.095, 0.767, 0.105, 0.767],
              [0.584, 0.258, 0.576, 0.258, 0.758, 0.576, 0.584, 0.576, 0.258, 0.576, 0.758, 0.576],
              [0.577, 0.961, 0.284, 0.961, 0.644, 0.284, 0.577, 0.284, 0.961, 0.284, 0.644, 0.284]])

print("X:", X.shape)
print("____________________________________________")

###############################################################################
# Generate coefficients
# ---------------------
w = np.concatenate(
    [
        np.random.standard_normal(group_size) * is_active
        for group_size, is_active in zip(group_sizes, active_groups)
    ]
)
w = w.reshape(-1, 1)
true_coefficient_mask = w != 0
intercept = 2

print("w:", w.shape)
print("true_coefficient_mask:", true_coefficient_mask.sum())
print("____________________________________________")

###############################################################################
# Generate regression targets
# ---------------------------

y_true = X @ w
y = np.array([[-0.17997138],
              [-0.15219182],
              [-0.17062552]])
print("y:", y)
MSE1 = mean_squared_error(y, y_true)
print("MSE_yt_y:", MSE1)
print("____________________________________________")

###############################################################################
# Generate estimator and train it
# -------------------------------
gl = GroupLasso(
    groups=groups,
    group_reg=5,
    l1_reg=2,
    frobenius_lipschitz=True,
    scale_reg="inverse_group_size",
    subsampling_scheme=1,
    supress_warning=True,
    n_iter=1000,
    tol=1e-3,
)
gl.fit(X, y)

###############################################################################
# Extract results and compute performance metrics
# -----------------------------------------------

# Extract info from estimator
yhat = gl.predict(X)
sparsity_mask = gl.sparsity_mask_
w_hat = gl.coef_

print("yhat:", yhat)
print("w_hat:", w_hat.sum())
print("sparsity_mask:", sparsity_mask)
print("____________________________________________")

# Compute performance metrics
R2 = r2_score(y, yhat)
MSE_y_yh = mean_squared_error(y, yhat)
print("MSE_y_yh:", MSE_y_yh)
print("____________________________________________")

And the result of the program after running is as follows.

group_sizes: [3, 3, 2, 2, 2]
active_groups: [2. 2. 2. 1. 1.]
groups: (12, 1)
num_coeffs: 12
____________________________________________
X: (3, 12)
____________________________________________
w: (12, 1)
true_coefficient_mask: 12
____________________________________________
y: [[-0.17997138]
 [-0.15219182]
 [-0.17062552]]
MSE_yt_y: 55.67436677644974
____________________________________________
yhat: [-0.16644355 -0.16644355 -0.16644355]
w_hat: 0.0
sparsity_mask: [False False False False False False False False False False False False]
____________________________________________
MSE_y_yh: 0.0001345342966391801
____________________________________________

How to do stain unmixing with group lasso

Hello!
Do you know how to do stain unmixing for RGB images when the number of stains is more than three, as introduced in the paper below?
Group sparsity model for stain unmixing in brightfield multipleximmunohistochemistry images.pdf
The following code does stain unmixing for 4 colors, but the results are not good.

import numpy as np
import matplotlib.pyplot as plt
from skimage import io
from numpy import linalg
from skimage.color.colorconv import _prepare_colorarray
from sklearn.metrics import mean_squared_error

img = io.imread('E:\\test1.png')   
img1 = _prepare_colorarray(img, force_copy=True)    
np.maximum(img1, 1E-6, out=img1)    
y = np.log(img1)

X1 = np.array([[0.571, 0.584, 0.577], [0.095, 0.258, 0.961], [0.767, 0.576, 0.284]])  # CD8,PanCK,Hema
X1_inv = linalg.inv(X1)       
B1 = y @ X1_inv
y1 = B1 @ X1

X2 = np.array([[0.095, 0.258, 0.961], [0.105, 0.758, 0.644], [0.767, 0.576, 0.284]])  # PanCK,PD-L1,Hema
X2_inv = linalg.inv(X2)       
B2 = y @ X2_inv
y2 = B2 @ X2

X3 = np.array([[0.571, 0.584, 0.577], [0.767, 0.576, 0.284], [-0.48, 0.808, -0.343]])  # CD8,Hema
X3_inv = linalg.inv(X3)       
B3 = y @ X3_inv
y3 = B3 @ X3

X4 = np.array([[0.095, 0.258, 0.961], [0.767, 0.576, 0.284], [-0.553, 0.817, -0.165]])  # PanCK,Hema
X4_inv = linalg.inv(X4)       
B4 = y @ X4_inv
y4 = B4 @ X4

X5 = np.array([[0.105, 0.758, 0.644], [0.767, 0.576, 0.284], [-0.218, 0.649, -0.729]])  # PDL1,Hema
X5_inv = linalg.inv(X5)       
B5 = y @ X5_inv
y5 = B5 @ X5

a = 0
b = 0
rgb_CD8 = np.zeros_like(y) + 1
rgb_PanCK = np.zeros_like(y) + 1
rgb_Hema = np.zeros_like(y) + 1
rgb_PDL1 = np.zeros_like(y) + 1
for i in y:
    for j in i:
        e = y[a, b, :]
        e1 = y1[a, b, :]
        e2 = y2[a, b, :]
        e3 = y3[a, b, :]
        e4 = y4[a, b, :]
        e5 = y5[a, b, :]
        p1 = mean_squared_error(e, e1)
        p2 = mean_squared_error(e, e2)
        p3 = mean_squared_error(e, e3)
        p4 = mean_squared_error(e, e4)
        p5 = mean_squared_error(e, e5)
        if p1 > p2 and p3 > p2 and p4 > p2 and p5 > p2:
            null = np.zeros_like(B2[:, :, 0])
            B2_A = np.stack((B2[:, :, 0], null, null), axis=-1)  
            B2_B = np.stack((null, B2[:, :, 1], null), axis=-1)  
            B2_C = np.stack((null, null, B2[:, :, 2]), axis=-1)  
            conv_matrix = X2
            log_rgb21 = B2_A[a][b] @ conv_matrix
            rgb_PanCK[a][b] = np.exp(log_rgb21)
            log_rgb22 = B2_B[a][b] @ conv_matrix
            rgb_PDL1[a][b] = np.exp(log_rgb22)
            log_rgb23 = B2_C[a][b] @ conv_matrix
            rgb_Hema[a][b] = np.exp(log_rgb23)
        elif p2 > p1 and p3 > p1 and p4 > p1 and p5 > p1:
            null = np.zeros_like(B1[:, :, 0])
            B1_A = np.stack((B1[:, :, 0], null, null), axis=-1)  
            B1_B = np.stack((null, B1[:, :, 1], null), axis=-1)  
            B1_C = np.stack((null, null, B1[:, :, 2]), axis=-1)  
            conv_matrix = X1
            log_rgb11 = B1_A[a][b] @ conv_matrix
            rgb_CD8[a][b] = np.exp(log_rgb11)
            log_rgb12 = B1_B[a][b] @ conv_matrix
            rgb_PanCK[a][b] = np.exp(log_rgb12)
            log_rgb13 = B1_C[a][b] @ conv_matrix
            rgb_Hema[a][b] = np.exp(log_rgb13)
        elif p1 > p3 and p2 > p3 and p4 > p3 and p5 > p3:
            null = np.zeros_like(B3[:, :, 0])
            B3_A = np.stack((B3[:, :, 0], null, null), axis=-1)  
            B3_B = np.stack((null, B3[:, :, 1], null), axis=-1)  
            conv_matrix = X3
            log_rgb31 = B3_A[a][b] @ conv_matrix
            rgb_CD8[a][b] = np.exp(log_rgb31)
            log_rgb32 = B3_B[a][b] @ conv_matrix
            rgb_Hema[a][b] = np.exp(log_rgb32)
        elif p1 > p4 and p2 > p4 and p3 > p4 and p5 > p4:
            null = np.zeros_like(B4[:, :, 0])
            B4_A = np.stack((B4[:, :, 0], null, null), axis=-1)  
            B4_B = np.stack((null, B4[:, :, 1], null), axis=-1)  
            conv_matrix = X4
            log_rgb41 = B4_A[a][b] @ conv_matrix
            rgb_PanCK[a][b] = np.exp(log_rgb41)
            log_rgb42 = B4_B[a][b] @ conv_matrix
            rgb_Hema[a][b] = np.exp(log_rgb42)
        else:
            null = np.zeros_like(B5[:, :, 0])
            B5_A = np.stack((B5[:, :, 0], null, null), axis=-1)  
            B5_B = np.stack((null, B5[:, :, 1], null), axis=-1)  
            conv_matrix = X5
            log_rgb51 = B5_A[a][b] @ conv_matrix
            rgb_PDL1[a][b] = np.exp(log_rgb51)
            log_rgb42 = B5_B[a][b] @ conv_matrix
            rgb_Hema[a][b] = np.exp(log_rgb42)
        b = b + 1
    b = 0
    a = a + 1

fig, axes = plt.subplots(3, 2, figsize=(8, 7), sharex=True, sharey=True)
ax = axes.ravel()

ax[0].imshow(img)
ax[0].set_title("Original image")

rgb_Hema[rgb_Hema < 0] = 0
rgb_Hema[rgb_Hema > 1] = 1
ax[2].imshow(rgb_Hema)
ax[2].set_title("Hematoxylin")

rgb_PanCK[rgb_PanCK < 0] = 0
rgb_PanCK[rgb_PanCK > 1] = 1
ax[3].imshow(rgb_PanCK)
ax[3].set_title("PanCK")

rgb_PDL1[rgb_PDL1 < 0] = 0
rgb_PDL1[rgb_PDL1 > 1] = 1
ax[4].imshow(rgb_PDL1)
ax[4].set_title("PDL1")

rgb_CD8[rgb_CD8 < 0] = 0
rgb_CD8[rgb_CD8 > 1] = 1
ax[5].imshow(rgb_CD8)
ax[5].set_title("CD8")

for a in ax.ravel():
    a.axis('off')

fig.tight_layout()
plt.show()

Using the loss() method for the Logistic Regression example raises an IndexError

Thank you so much for making this module! We're using it for work on a genomics project. I was interested in calculating loss(X, c) (using the same variable names as in the Logistic Regression example) but I am getting this error:

IndexError: boolean index did not match indexed array along dimension 0; dimension is 719 but corresponding boolean dimension is 720

While on the subject of losses, what is the difference between the .loss() method and the .losses_ attribute (I assume the latter is the loss as a function of the number of FISTA iterations until convergence)?

Thanks!

LogisticGroupLasso is not running stably and seems inferior to scikit-learn's LogisticRegression

Hi Yngve

I have been experimenting a bit with LogisticGroupLasso (for binary classification) and found that it does not work super reliably yet. LogisticGroupLasso, from what I have understood, is a recently added extension of the core functionality, sparse group lasso for regression. Is it possible that the logistic regression variant still suffers from some teething problems?

To demonstrate what I mean, I compared LogisticGroupLasso with sklearn's LogisticRegression estimator, on both synthetic and real-world data. To make the two comparable, I zeroed both regularizers l1_reg and group_reg. This converts the sparse group-lasso problem into a standard logistic regression problem.

To reproduce the observations, run the attached script comparison_group_lasso_scikit.py. These are the parameters to adjust:

# Settings.
d = 2           # Number of features
n = 100         # Number of samples per class
p = 100         # Number of difficulty levels
n_iter = 100    # Number of max iterations

In the test, I generate a series of p=100 synthetic classification problems with increasing difficulty. The data consists of d=2 predictors with n=100 samples per class; the difficulty is increased by increasing the noise in the data. Here is how the point clouds look for three different difficulties:

[Figure: point clouds for three difficulty levels]

First of all, I observed that FISTA convergence is almost always a problem unless the problem is very easy, that is, the data is separable with a large enough margin. Is there anything one can do to improve convergence?

Secondly, it appears that LogisticGroupLasso is often inferior to sklearn's LogisticRegression. Note that the average difference in prediction accuracy does not improve with a larger setting for n_iter.

[Figure: prediction accuracy of LogisticGroupLasso vs. LogisticRegression over problem difficulty]

Thirdly, LogisticGroupLasso sometimes yields good results even when FISTA has not converged, but in the majority of cases (about 60%) it "fails". What explains this intermittent behaviour? The "failure rate" does not seem to depend strongly on the problem difficulty.

Fourthly, the failure rate drops from 60% to below 10% for a larger number of features d. In other words, the results from FISTA are closer to those from lbfgs (the solver in LogisticRegression) when more features are present. For this test, I increased the number of samples n from 100 to 1000. Adding more evidence (with every dimension adding the same amount of information) certainly makes the problem easier; note that this is not reflected by my "difficulty" indicator.

[Figure: failure rate for a larger number of features]

Any recommendations on how to improve the performance of LogisticGroupLasso? It's perfectly possible that I'm not using LogisticGroupLasso the right way. Here's how I perform the fitting:

def compute_logistic_group_lasso(X, y, rs, n_iter=100):
    LogisticGroupLasso.LOG_LOSSES = True
    gl = LogisticGroupLasso(
        groups=np.arange(X.shape[1]),
        group_reg=0.0,
        l1_reg=0.0,
        supress_warning=True,
        random_state=rs,
        n_iter=n_iter,
        tol=1e-4,
    )
    gl.fit(X, y)
    y_pred = gl.predict(X).ravel()
    accuracy = (y_pred == y).mean()
    score = gl.score(X, y)
    assert(accuracy==score)
    return score

Other feedback regarding usability:

  • sklearn's estimator.predict() returns 1d arrays with shape (n,), whereas LogisticGroupLasso.predict() returns 2d arrays with shape (n, 1). I think one just has to call y.ravel() where appropriate.

Inherit from logistic regression in sklearn

Currently, the two-class and multi-class logistic regression models are implemented differently. This is because I was able to find a better bound for the Lipschitz coefficient for the two-class problem. However, this artificial difference is not very intuitive, especially considering that the two cases are treated uniformly in sklearn.

A good improvement would be to simply inherit from Logistic regression in sklearn and overload the necessary functions.

Add regularisation selection utility

Selecting the regularisation coefficient can be difficult. It would therefore be useful to have a utility that retrains many models (with warm starts) and logs each regularisation coefficient along with the groups chosen at that coefficient.

There are two ways to accomplish this. One is naive: simply do a linear or logarithmic search over the group regularisation coefficient. The other is to start with a high regularisation and do a bisecting search for the regularisation coefficients at which groups are added to or removed from the chosen variables. The latter approach is much more complex but may be implemented if there is interest (or if I have time to spare).
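A hedged sketch of the naive approach: retrain over a logarithmic grid of group regularisation strengths with warm starts and record the surviving groups at each level. It assumes the GroupLasso estimator with the warm_start, supress_warning and chosen_groups_ attributes that appear elsewhere in this document; the function name and grid are illustrative.

import numpy as np
from group_lasso import GroupLasso

def regularisation_path(X, y, groups, regs=None):
    if regs is None:
        regs = np.logspace(0, -3, 20)            # strong -> weak regularisation
    gl = GroupLasso(groups=groups, warm_start=True, supress_warning=True)
    path = []
    for reg in regs:
        gl.group_reg = reg                       # update the penalty strength in place
        gl.fit(X, y)                             # warm start reuses the previous coefficients
        path.append((reg, sorted(gl.chosen_groups_)))
    return path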

Possible typo in error message

Hello, I just got the following error message while trying to fit a LogisticGroupLasso:

RuntimeWarning: The FISTA iterations did not converge to a sufficient minimum.
You used subsampling then this is expected, otherwise,try to increase the number of iterations or decreasing the tolerance.

Shouldn't this recommend increasing the tolerance to reach convergence sooner?

Positive group lasso ?

Hi, thank you for this nice repo :)
Is there any way to use it to do positive group lasso, that is, minimizing over vectors with non-negative coefficients?
In scikit-learn, there is a positive=True parameter that handles this. Is there such a possibility here, or is there an easy way to solve this? Thank you!
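I don't believe the package exposes a positive=True option. For illustration only, here is a hedged sketch of the proximal operator one would use in a hand-rolled projected proximal gradient loop: project onto the non-negative orthant, then apply group-wise soft-thresholding (this is the exact prox of the group penalty plus the non-negativity constraint).

import numpy as np

def nonneg_group_prox(w, groups, threshold):
    # Prox of threshold * sum_g ||w_g||_2 + indicator(w >= 0):
    # clip to the non-negative orthant, then shrink each group.
    w = np.maximum(w, 0.0)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        w[idx] = 0.0 if norm <= threshold else (1.0 - threshold / norm) * w[idx]
    return w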

Compatibility with Scikit-Learn Feature Selection

BaseGroupLasso is implemented as a scikit-learn transformer. As an intermediate step inside a pipeline it works well, but in combination with SelectFromModel (to further remove infinitesimal coefs), it throws an error:

>>> import sklearn, group_lasso
>>> sklearn.__version__, group_lasso.__version__
('1.0.2', '1.5.0')
>>> from group_lasso import GroupLasso
>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import SelectFromModel
>>> from sklearn.linear_model import Ridge
>>> from sklearn.pipeline import make_pipeline
>>> X, y = make_regression(n_features=5, n_informative=3, random_state=0)
>>> pipe = make_pipeline(SelectFromModel(GroupLasso(supress_warning=True)), Ridge())
>>> pipe.fit(X, y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-3-e63339588fc2>](https://localhost:8080/#) in <module>()
      3 
      4 X, y = make_regression(n_features=5, n_informative=3, random_state=0)
----> 5 pipe.fit(X, y)

6 frames
[/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_base.py](https://localhost:8080/#) in _transform(self, X)
    101             return np.empty(0).reshape((X.shape[0], 0))
    102         if len(mask) != X.shape[1]:
--> 103             raise ValueError("X has a different shape than during fitting.")
    104         return X[:, safe_mask(X, mask)]
    105 

ValueError: X has a different shape than during fitting.

Possible solutions:

  1. Add a new parameter min_coef and zero out all the coefficients s.t. np.abs(coef_) < min_coef. Optionally, reimplement TransformerMixin methods by inheriting from SelectorMixin to support get_support() and get_feature_names_out() methods.
    or
  2. Remove fit_transform and transform methods to enable SelectFromModel(Grouplasso(), threshold=min_coef) inside a pipeline.

Improve sklearn compatibility

Because I need better sklearn support for my needs, I started to look at your implementation and made a few targeted improvements.

You may want to have a look at my fork here: https://github.com/normanius/group-lasso/tree/normanius/sklearn-compliance I'm not opening a pull request yet because it is work in progress. Happy to contribute a bit more if I find the time.

I made use of check_estimator and mixins (BaseEstimator, ClassifierMixin, LinearClassifierMixin) to check how compliant the estimator is with sklearn. (See also the sklearn contribution guidelines.) Currently, GroupLasso is not fully compliant with ClassifierMixin nor with LinearClassifierMixin because it lacks support for multi-class classification and sparse data. The mixins also come with their own tests.

I assembled a list of improvements that I've partially realized already. Feel free to use the contribution. Because I am not very familiar with the algorithmic aspects of group lasso, someone definitely should have a closer look at the changes.

  • [Fixed] predict_proba and decision_function are missing. predict does not return class predictions but scores. (My fix is temporary, functions predict and decision_function are copied from LinearClassifierMixin - they should be removed once GroupLasso inherits LinearClassifierMixin)
  • [Fixed] The results are not reproducible for repeated runs of fit(). This is because GroupLasso uses random coefficient initialization. There are sklearn guidelines for the handling of random states.
  • [Fixed] The labels y are not necessarily numeric. In particular, grad() makes use of the assumption that y takes numerical values. In my fix, I employ LabelEncoder to convert y into a numeric vector, but I am not entirely sure whether this is the canonical way to do it. Furthermore, there is no multi-class support at the moment. Have a look at LogisticRegression to see how multi-class is handled there.
  • [Fixed] I strongly dislike the group assignment format (list of tuples with start/stop indices). Instead, I would simply go for a list of group identifiers. For example, if X has 5 features, a possible group assignment could be groups = [3,1,1,2,3]. If the features must be organized specially (e.g. ordered so that groups are [1,1,2,3,3]), it can be done under the hood. I added _check_and_format_groups() to convert the group assignment into the original list-of-tuples format - though I haven't checked whether this can be optimized/simplified any further.
  • [Open] No support for intercepts, as in LogisticRegression. This should not be a big deal. I introduced self.intercept_ = 0 - but it is not computed at the moment.
  • [Open] Support for multi-class classification (with more than two labels)
  • [Open] Sparse-data support
  • [Open] Is it an option to derive from LogisticRegression and only override fit()?

There is certainly more to do. But this is what I assembled so far.

Saving LogisticGroupLasso models?

Saving (pickle.dump) a fitted LogisticGroupLasso model is not possible:

_pickle.PicklingError: Can't pickle <function LogisticGroupLasso._unregularised_loss at 0x1a22114048>: it's not the same object as group_lasso._group_lasso.LogisticGroupLasso._unregularised_loss

If this is a pickle issue rather than one with this package, please ignore. However, it would be nice to know how to save the model then :)
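A hedged workaround sketch, untested against this package: serialisers that pickle functions by value, such as cloudpickle (or dill), often get around "it's not the same object" errors caused by class-level function reassignment.

import cloudpickle
from group_lasso import LogisticGroupLasso

gl = LogisticGroupLasso(groups=[0, 0, 1, 1], supress_warning=True)
# ... fit gl on your data ...

with open("logistic_group_lasso.pkl", "wb") as f:
    cloudpickle.dump(gl, f)

with open("logistic_group_lasso.pkl", "rb") as f:
    gl_restored = cloudpickle.load(f)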

Group LASSO vs. sparse group LASSO

Hey @yngvem, it wasn't clear to me whether this should implement the group LASSO method or the sparse group LASSO method. Based on the README, I think it is sparse group LASSO. However, lines 207-211 in src/group_lasso/_group_lasso.py state:

regulariser = 0
coef_ = _split_intercept(w)[1]
for group, reg in zip(self.groups_, self.group_reg_vector_):
    regulariser += reg * la.norm(coef_[group])
regulariser += self.l1_reg * la.norm(coef_.ravel(), 1)

and I therefore think this implements the standard group LASSO, since reg * la.norm(coef_[group]) is the L2 norm. It would, of course, also be simple to add a kwarg to let a user choose between group LASSO and sparse group LASSO.

One further detail: in BaseGroupLasso the kwarg l1_reg is set to 0.00 but the docstring states it is 0.05.

Submitting a PR to (1) implement the L1 norm for sparse group LASSO and (2) fix the default l1_reg.

`AttributeError` raised when running the official logistic regression example

After installing group-lasso in my conda env, I wanted to check that it was configured correctly, so I ran the logistic regression example offered on the docs page.

Every step is OK except gl.fit(X, c), which throws an AttributeError as follows.

~/project/NDA_download/workshop_2/scripts/example_logistic_group_lasso.py in 
----> 86 gl.fit(X, c)

~/miniconda3/envs/zxcpy/lib/python3.6/site-packages/group_lasso/_group_lasso.py in fit(self, X, y, lipschitz)
    433         """Fit a group-lasso regularised linear model.
    434         """
--> 435         self._init_fit(X, y, lipschitz=lipschitz)
    436         self._minimise_loss()
    437         self.intercept_ -= (self._X_means_ @ self.coef_).reshape(

~/miniconda3/envs/zxcpy/lib/python3.6/site-packages/group_lasso/_group_lasso.py in _init_fit(self, X, y, lipschitz)
    413 
    414         self.subsampler_ = Subsampler(
--> 415             X.shape[0], self.subsampling_scheme, self.random_state
    416         )
    417 

~/miniconda3/envs/zxcpy/lib/python3.6/site-packages/group_lasso/_subsampling.py in __init__(self, num_indices, subsampling_scheme, random_state)
     88         self.random_state = random_state
     89         self.subsampling_scheme = subsampling_scheme
---> 90         self.set_num_indices(num_indices)
     91 
     92     def set_num_indices(self, num_indices):

~/miniconda3/envs/zxcpy/lib/python3.6/site-packages/group_lasso/_subsampling.py in set_num_indices(self, num_indices)
     92     def set_num_indices(self, num_indices):
     93         self.num_indices_ = num_indices
---> 94         self.update_indices()
     95 
     96     def subsample(self, *Xs):

~/miniconda3/envs/zxcpy/lib/python3.6/site-packages/group_lasso/_subsampling.py in update_indices(self)
    106             self.num_indices_,
    107             self.subsampling_scheme,
--> 108             random_state=self.random_state,
    109         )
    110 

~/miniconda3/envs/zxcpy/lib/python3.6/site-packages/group_lasso/_subsampling.py in _get_random_row_idxes(num_rows, subsampling_scheme, random_state)
     29         raise ValueError("Not valid subsampling scheme")
     30 
---> 31     inds = random_state.choice(num_rows, num_subsampled_rows, replace=False)
     32     inds.sort()
     33     return inds

AttributeError: 'NoneType' object has no attribute 'choice'

I searched for it on Google, but the error message seems ambiguous and I found nothing related.
The versions of dependencies are:

numpy = 1.18.4
scikit-learn = 0.23.0
joblib = 0.14.1
scipy = 1.4.1
threadpoolctl = 2.0.0
group-lasso = 1.4.0

Thank you very much.

How to adjust the regularization parameters?

Hi Yngve

This is the last issue for a while, I promise. I spent the weekend with your toolbox, but now I have to move on... :)

I was wondering whether there are some rules of thumb for adjusting the regularization parameters in general.

For the following very easy problem (see the data in the left subplot), I found that LogisticGroupLasso has quite some issues when l1_reg and group_reg are increased. The subplot on the right-hand side shows the prediction accuracy for different values of l1_reg and group_reg, sampled on a square grid with the parameters in the range [0, 0.02]. One can see that the regression already fails for very small parameter values.

[Figure: data (left) and prediction accuracy over the (l1_reg, group_reg) grid (right)]

Questions:

  1. Is this behavior to be expected for low-dimensional problems?
  2. Any advice on parameter tuning? Are there good strategies for setting the regularization parameters? Using a grid search (see the sketch after this list), one might find the stable region, but the stable region for this problem is relatively small.
  3. Since FISTA fails very frequently (just like lbfgs in sklearn's LogisticRegression, by the way!), it is difficult to distinguish the causes of poor prediction results: is it because the optimization problem was too difficult for the chosen parameters, or because the training data did not provide enough information? Any idea how to tell the former from the latter?

Thanks!
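A hedged sketch of the grid-search idea mentioned in question 2, using scikit-learn's GridSearchCV; the parameter ranges are illustrative, and this assumes the estimator clones cleanly (see the frobenius_lipchitz typo issue above).

import numpy as np
from sklearn.model_selection import GridSearchCV
from group_lasso import LogisticGroupLasso

def tune_regularisation(X, y, groups):
    gl = LogisticGroupLasso(groups=groups, supress_warning=True, n_iter=10000)
    param_grid = {
        "group_reg": np.logspace(-4, -1, 10),    # illustrative grid
        "l1_reg": [0.0, 1e-4, 1e-3],
    }
    search = GridSearchCV(gl, param_grid, cv=5)  # scores with the estimator's accuracy
    search.fit(X, y)
    return search.best_params_, search.best_score_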
