jamesyang007 / adelie

A fast and flexible Python package for solving group elastic net problems.

Home Page: https://jamesyang007.github.io/adelie/

License: MIT License

C++ 62.32% Jupyter Notebook 1.70% Python 35.78% Starlark 0.17% Shell 0.02%
elastic group lasso net python310 python39 cpp17 convex-optimization coordinate-descent python311

adelie's Introduction


Adelie is a fast and flexible Python package for solving group elastic net problems.

It offers a general-purpose group elastic net solver, a wide range of matrix classes that can exploit special structure to allow large-scale inputs, and an assortment of generalized linear model (GLM) classes for fitting various types of data. These matrix and GLM classes can be extended by the user for added flexibility. Many inner routines, such as matrix-vector products and the gradients, Hessians, and losses of the GLM classes, have been heavily optimized and parallelized. Algorithmic optimizations such as the pivot rule for screening variables and the proximal Newton method have been carefully tuned for convergence and numerical stability.

adelie's People

Contributors

jamesyang007

adelie's Issues

Implement `offset`

An offset is useful for nesting model fits, where a separate model predicts the remaining residuals. We want to model eta = offset + Xb + b0. Currently, only the case offset = 0 is implemented.

  • Gaussian can be handled very easily by simply subtracting the offset from y before fitting (see the sketch below).
  • IRLS needs to change, though.

Finish #47 first, then do this.
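
A minimal sketch of the Gaussian case, assuming a hypothetical fit_gaussian solver with no offset support; the names here are illustrative, not the package API:

    import numpy as np

    def fit_gaussian_with_offset(X, y, offset, fit_gaussian):
        # Gaussian case: eta = offset + X b + b0 is equivalent to fitting
        # (y - offset) ~ X b + b0 with a zero offset.
        beta, intercept = fit_gaussian(X, y - offset)
        # Predictions must add the offset back.
        eta = offset + X @ beta + intercept
        return beta, intercept, eta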

`dense` GPU version

Benchmarking shows that we are truly bottlenecked by memory bandwidth. There's no solution for this except moving the computation to the GPU. It might be fine for our main applications, but it is something to keep thinking about.

Application to SNP Data

  • MatrixSNPUnphased: SNP calldata matrix (unphased).
    • outer_indices: (p+1,) array of starting indices into inner_indices for each column's non-zero entries. The last entry should be the length of inner_indices.
    • inner_indices: (I,) array where inner_indices[outer_indices[i]:outer_indices[i+1]] gives the non-zero indices of the ith column.
    • values: (I,) array where values[outer_indices[i]:outer_indices[i+1]] gives the non-zero values of the ith column (see the sketch after this list).
  • MatrixSNPPhased: SNP calldata matrix (phased).
    • outer_indices_1
    • outer_indices_2
    • inner_indices_1
    • inner_indices_2
  • MatrixSNPAncestryUnphased
  • MatrixSNPAncestryPhased
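
A minimal sketch of how the unphased layout would be indexed, using toy NumPy arrays; the field names mirror the list above, but nothing here is the actual C++ implementation:

    import numpy as np

    # Toy example: p = 3 SNP columns, I = 4 non-zero entries total.
    outer_indices = np.array([0, 2, 2, 4])            # (p+1,); last entry == len(inner_indices)
    inner_indices = np.array([1, 5, 0, 7])            # (I,) row indices of the non-zeros
    values = np.array([1, 2, 1, 1], dtype=np.int8)    # (I,) non-zero calldata values

    def column_nonzeros(i):
        # CSC-like access: rows and values of the i-th column.
        start, end = outer_indices[i], outer_indices[i + 1]
        return inner_indices[start:end], values[start:end]

    rows, vals = column_nonzeros(0)   # column 0: non-zeros at rows 1 and 5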

Pivot rule Paper TODOS

  • Real data application:
    • Take the same datasets from strong rule paper and run lasso + group lasso with splines as groups.
  • Strong rule vs Pivot rule benchmark
    • Total time difference (shows that the strong-set fit time is non-negligible relative to the whole algorithm cost)
    • Strong-set fit time difference (shows that this is the main difference)
  • Rewrite BASIL to batch only one lmda at a time.
    • Only needs to return a bool for whether KKT passed or not.
    • Must still save every grad element (just compute X^T resid in one shot).
    • Compute abs_grad in one shot as well.
    • If KKT failed, populate with the failed variables.
  • Show a picture for why the pivot rule is better than the strong rule (correlated variables). The strong rule is a hard cutoff at a threshold that depends only on the lambda sequence, not so much the data. In general, when the active scores saturate towards the top, the strong rule will pull in too many groups. The pivot rule is more sensitive to the structure of the active scores, as it looks at how close the scores are at the top.
  • Elastic net? General loss?
  • LARS
  • Combine the zoomed-in active-score plot for arcene into Figure 2.

Eigen-decomp Optimization

A simple optimization is to check for a group size of 1, in which case set V = [[1]] and D = the squared norm of the column (see the sketch below).
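
A sketch of the special case in NumPy terms, assuming the general path takes an eigen-decomposition of X_g^T X_g:

    import numpy as np

    def group_eigen(X_g):
        # Fast path for single-column groups: V = [[1]], D = squared column norm.
        if X_g.shape[1] == 1:
            V = np.ones((1, 1))
            D = np.array([np.sum(X_g[:, 0] ** 2)])
            return D, V
        # General case: full symmetric eigen-decomposition.
        D, V = np.linalg.eigh(X_g.T @ X_g)
        return D, V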

Naive version

Currently, there is only the covariance method, which works well for n > p. But for large p, and especially for groups of large size, the covariance method is terrible in both memory and time.

  • implement naive
  • benchmark naive vs cov to get a good heuristic for when to choose which

New GLM design

The most general approach is to have each GLM specify the full objective, since not every objective will be of the current form. The most general form is -y^T W^0 eta + A(eta) for some convex function A, with eta = Xb + b0 (see the sketch below).
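
A sketch of what such an interface could look like, with the Gaussian case as one concrete instance; the class and method names are illustrative, not the package's actual API:

    import numpy as np

    class GLMBase:
        # Each GLM specifies the full objective -y^T W^0 eta + A(eta).
        def loss(self, y, eta, weights):
            raise NotImplementedError
        def gradient(self, y, eta, weights):
            raise NotImplementedError

    class GaussianGLM(GLMBase):
        # A(eta) = sum_i w_i eta_i^2 / 2 recovers weighted least squares
        # up to a constant independent of eta.
        def loss(self, y, eta, weights):
            return -np.sum(weights * y * eta) + 0.5 * np.sum(weights * eta ** 2)
        def gradient(self, y, eta, weights):
            return weights * (eta - y)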

Input checking

We need much more stringent input checking throughout the solver functions and state creators.

UKB Analysis TODOS

  • Group lasso is not doing much better than lasso in terms of R^2. Some improvement ideas:
    • Try a different phenotype than height?
    • We may need to pool all populations to get enough ancestry variation. The point is that we're probably not going to do much better on white_british, but pooling can improve performance on the other populations.
      • Check the ancestry information for the white_british population to make sure there are sufficient counts.
      • Include population information as unpenalized features.
  • Try the Mexico biobank.
  • Try subsetting the UK Biobank to those born outside the UK.
  • Try the offset idea by first fitting a lasso.

Change KKT to go backwards

Since the active set and strong set are always increasing, if KKT passes for lambda_i, it also passes for the previous lambdas in the batch during basil.

Is this true?
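
If the claim holds, the KKT loop over a batch can scan from the back and stop at the first lambda that passes; a rough sketch with a hypothetical kkt_passes check:

    def num_kkt_valid(lmdas, kkt_passes):
        # Scan the batch from the last lambda backwards. If KKT passing at
        # lambda_i implies passing at all earlier lambdas in the batch, the
        # first pass found from the back certifies the whole prefix.
        for i in range(len(lmdas) - 1, -1, -1):
            if kkt_passes(lmdas[i]):
                return i + 1   # lambdas 0..i are valid solutions
        return 0               # nothing in the batch passed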

Refactor 11/6

  • Change matrix exports to be inherited. Add Trampolines.
  • Change check for basil to use state.X

Refactor v1.1

  • Rewrite group_lasso to abstract out X. Clearly define the interface for X.
  • Write a custom X class that subsets groups of columns. The idea is that since group_lasso only operates on strong groups anyway, we don't need to store the whole X. We just need to store a list of the column groups and a mapping from original index to index into that list (see the sketch below).
  • Parallelize some linear algebra routines.
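
A rough sketch of the column-subsetting idea in plain NumPy; the class name and layout are assumptions for illustration:

    import numpy as np

    class StrongGroupSubset:
        # Stores only the strong groups' column blocks plus an index mapping.
        def __init__(self, X, groups, group_sizes, strong_groups):
            self.blocks = [
                np.asarray(X[:, groups[g]:groups[g] + group_sizes[g]])
                for g in strong_groups
            ]
            # Map original group index -> position in the stored list.
            self.index_map = {g: k for k, g in enumerate(strong_groups)}

        def block(self, g):
            return self.blocks[self.index_map[g]]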

Flag to skip over no-penalty coefficients

Add a flag to lasso fitting that allows skipping unpenalized coefficients. In basil, the very first iteration requires us to fit all unpenalized variables (penalty.factor[i] == 0), but this solution remains the same across the regularization path. So, subsequent calls down the regularization path can skip coordinate descent on these coefficients (see the sketch below).
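
A sketch of where the skip would sit inside a coordinate descent sweep; the flag and helper names are hypothetical:

    def cd_sweep(groups, penalty_factor, update_group, skip_unpenalized=False):
        # One coordinate descent sweep over groups. The skip is only valid after
        # the first path point, where the unpenalized solution has already been
        # fit and does not change along the regularization path.
        for g in groups:
            if skip_unpenalized and penalty_factor[g] == 0:
                continue   # unpenalized coefficients are already at their solution
            update_group(g)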

Refactor Plan

Refactor:

  • State restructuring: try to move as many quantities as possible outside the state object. We only want to maintain invariants that are needed for passing the state again.
  • state.reset(*, **kwargs) should allow the user to modify any subset of the "modifiable" quantities and return a new object with those changes. This replaces state.update_lmda_path.
  • objective should really just be the naive version; add a glm object to make it general.
  • Redo diagnostics with the GLM extension.

Tests:

  • All glm classes
  • logistic regression as an example

TODO 11/8

  • n_threads_n, n_threads_p to control different parts of the algorithm.
  • Remove the capping of n_threads because there's a significant cost to changing the pool size. Just keep it fixed.
  • Recommendation: all n_threads_* should be either 1 or the same value, including the one used in the matrix classes.
  • Benchmark OMP to be absolutely sure that there is a huge cost to changing the thread pool size.
  • abs_grad needs to change to include the elastic net correction lmda * (1-alpha) * w_g * beta_g. Double-check that we only need to change the definition of abs_grad.
  • Fix the sparse matrix export of beta so that it is faster.
  • Implement matrix class for phased ancestry
  • Write tutorial notebook for using different SNP styles.
  • Write a script to subset data (calldata/ancestry/phenotype) and scp the results to local machine.
  • Write exact script for processing all existing data files to .snpdat format.
  • Try using adelie on real data!
  • Change SNP matrices to include a dense component

Diagnostic tools

  • Implement grpnet(X, y, groups, group_sizes, alpha, penalty) (see the usage sketch after this list)
    • If X is dense, make a naive dense matrix.
    • Otherwise, it should already be a naive matrix.
    • All naive matrices must implement means(out) and group_norms(groups, group_sizes, center, out).
    • Create inf state and call solve_basil.
  • Plot coefficient profile
  • Plot R^2 as a function of -log(lmdas)
  • Plot active-size, strong-size, edpp-safe-size
  • Plot KKT check
  • Predict function (linear interpolation)
  • Change convergence criterion to be based on change in R^2
  • Benchmark tools
    • How much time spent on strong set CD for each lmda?
    • How much time spent on active set CD for each lmda?
    • BASIL fit time at each basil iteration: total strong, total active
    • BASIL screen time at each basil iteration
    • BASIL kkt time at each basil iteration
    • BASIL invariance time at each basil iteration
    • How many valid solutions for each basil iteration?
  • Add a groups and group_sizes check: groups[i] - groups[i-1] should match group_sizes[i].
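
A sketch of how the planned entry point might be called, assuming the signature listed above; the call itself is hypothetical until grpnet exists:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 20
    X = rng.standard_normal((n, p))
    y = X[:, 0] - 2.0 * X[:, 5] + rng.standard_normal(n)
    groups = np.arange(0, p, 2)              # start index of each group
    group_sizes = np.full(groups.size, 2)    # groups of size 2
    penalty = np.ones(groups.size)           # per-group penalty factors

    # Hypothetical call matching the planned signature; the returned state would
    # then feed the diagnostic plots (coefficient profile, R^2 vs -log(lmdas), ...).
    # state = grpnet(X, y, groups, group_sizes, alpha=1.0, penalty=penalty)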

`dense` flexible threading

The dense matrix class implements the multiplication routines using OMP. Benchmarking shows that threading is not very efficient when the data does not sit well in cache. The most sophisticated thing to do is to batch-process these multiplication routines. Specifically, we chunk up the batches as large as possible so that everything sits in cache, parallelize within a batch, and then loop to the next batch.

It would be nice to allow users to specify a number of parameters that go into this implementation. For example, the user could specify the total memory allowed per batch and the total number of threads. Then, we can deduce how many rows to batch (see the sketch below).
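
A sketch of deducing the batch row count from a per-batch memory budget and then looping over batches; this is illustrative only, since real cache behavior depends on the platform:

    import numpy as np

    def rows_per_batch(n_cols, budget_bytes, dtype=np.float64):
        # Rows whose combined storage stays within the per-batch memory budget.
        row_bytes = n_cols * np.dtype(dtype).itemsize
        return max(1, budget_bytes // row_bytes)

    def batched_matvec(X, v, budget_bytes=1 << 20):
        # X @ v computed batch by batch so each chunk stays cache-resident;
        # each batch would be parallelized across threads internally.
        n, p = X.shape
        b = int(rows_per_batch(p, budget_bytes, X.dtype))
        out = np.empty(n)
        for start in range(0, n, b):
            end = min(start + b, n)
            out[start:end] = X[start:end] @ v
        return out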

TODO 12/02

  • New repo snpper
    • Move snputils related things here
    • Package it with C++ backend.
  • gen_to_snpdat
    • Add samples_indices and snps_indices option. Useful for debugging, splitting to training and testing, and removing SNPs based on MAF. None means use all.
    • Write a custom converter from calldatas -> calldata (I x 2*snp) (F storage, int8_t). The same applies to lai.
    • Save the converted calldata and LAI using np.save instead of np.savetxt.
    • Implement calldata_path and lai_path, which supply the paths of the converted calldata and LAI .npy files.

Strong Rule Improvements

  • Eviction: the idea is to evict newly added strong variables that did not turn out to be active. The hope is that after eviction, when we check for the next set of strong variables, there is more evidence to either include or exclude them.
    • This doesn't really change the sets. It's not worth pursuing.
  • Quantile: verify empirically which variables that fail the strong condition are actually active next. Hopefully, it's those with the largest abs_grad.

2023-10-01 TODO

  • pyproject.toml
    • Follow classifiers as in numpy
    • Remove pybind, cppimport from deps and poetry deps
  • README.md better descriptions
  • tests
    • test_grpnet fill in
  • docs
    • dev
      • Simplify versioning and use a better format
      • Write command for git tag
  • adelie
    • bcd.py
      • use h stars in docs
      • say that behavior is undefined when conditions are not satisfied (maybe throw an exception?)
      • more descriptive return name
      • solve description not exactly correct? Well-defined iff ||v|| <= l1 or ||v_S|| <= l1.
      • Root should only take quad, linear, l1 because it solves root_function. Well-defined iff ||v|| >= l1 and ||v_S|| < l1.
    • matrix.py
      • rethink the API. Really, we should only expose the bare-minimum, highest-level routines. Right now, we are enforcing a dense representation of block and col. As a result, we must enforce column-major order for dense.
    • state.py
      • move X_k description to solve_pin
      • Alpha bound [0,1]
      • Penalty >= 0
      • Strong_set is a subset of groups
      • strong_begin: also say how much should be read.
      • strong_A_diag: change the name globally to strong_var and use math k instead.
      • default vals for some configs
      • globally change thr to tol
      • rsq_slope_tol, rsq_curv_tol
      • strong_grad math k
      • Make readonly!
    • grpnet
      • give the same bound information for each param

`sp_btmul` optimization

  • There's no reason why we need to have weights for this one.
  • Some implementations have not been optimized (they are not needed for fitting), but for completeness, it would be nice to have them fast.

Benchmark Diagnostic

It would be nice to also save timings for each region of the code.
For each lambda fitting:

  • Time on coordinate descent on strong set?
  • Time on coordinate descent on active set?
  • etc.

Then, similar timings for each basil iteration.

state API should contain reference to original `y`

Right now, state.check requires the user to pass in y, and similarly for Diagnostic(y=y, ...). This seems a bit awkward. Consider passing in y in the Python-level wrapper when constructing the state, just to keep the reference (see the sketch below). No need to change the C++ code.
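
A sketch of the Python-level wrapper idea; the class name is hypothetical and the C++ state is untouched:

    class StateWithResponse:
        # Thin Python wrapper that keeps a reference to y alongside the core state.
        def __init__(self, core_state, y):
            self._core = core_state   # existing C++ state object
            self.y = y                # stored so diagnostics need not take y again

        def check(self, *args, **kwargs):
            # Forward to the core check, supplying the stored y.
            return self._core.check(self.y, *args, **kwargs)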

General GLM

Implement general GLM fitter.

  • Think about a different convergence metric for the Newton method: ||mu^{k+1} - mu^k||_{W}^2 <= tol, where mu is the mean parameter at the current natural parameter eta = Xb + b0.
  • Verify generalized R^2 metric.

Real Data Analysis Debugging

  • Try glmnet on chromosome 1 for a bit of the path just for correctness.
  • Try linear regression of y ~ covariates to see if we can reproduce the R^2.
  • Make cmul parallelized in snp_unphased
  • Implement concat matrix to concatenate dense + chromosome SNP matrices.
  • Refactor snp matrices to be single file. Use concat to concatenate many snp matrices.
  • Use concat matrix to fit on real data.
  • Double-check that the calldata is pulled correctly when read using the subset feature. Check by reading the full chromosome 1 calldata and then subsetting to the rows and columns.
  • Test grpnet for snp matrices against dense with multiple blocks

TODO 2023-10-06

  • Export covariance state to Python
  • Test covariance state
  • Document covariance state
  • Implement solve_pin_cov
  • Test solve_pin_cov
  • Change states to all own objects
  • Redocument in export file and simplify state.py
  • Change matrices to be split into pin_naive_base pin_naive_dense etc.
  • Cov lazy
  • Change file names (final)
  • Basil idea:
    • Center y if fitting with an intercept (also save the mean to output the intercept value).
    • The strong set only includes the un-l1-penalized variables at first. Always use lambda_max for the fit! Since the fitting part assumes zero coefficients outside the strong set, it must be lambda_max by definition to be well-defined. Use X.to_dense() to grab the respective blocks. Process them to get strong_vars and UD, V (centering first if necessary).
    • Now that the initial solution has been found and some variables are updated, the current state should be in its invariant state.
    • X.init_states(...) can do one pass through the entire matrix to compute many quantities like the gradient, ||X_k||_2^2, and the mean of X_k. This will give us all the initial states related to X.
    • Use the strong/EDPP rule to omit variables.
    • X.to_dense(j, q, out) stores the dense version of the column block into out. From this, we can compute the (possibly centered, if fitting with an intercept) SVD of the block and save the necessary UD and V separately. This will only be done on the strong set.
  • Implement MatrixPinNaiveSubset which views a subset of a matrix on the strong set and integrate into solve_basil_naive.hpp.
  • Figure out interface for MatrixBasilNaiveBase. Export bindings and add to matrix.py.
  • Implement MatrixBasilNaiveDense, export, add to matrix.py, test.
  • Export StateBasilNaive with MatrixBasilNaiveBase.
  • Extend state.py to include basil_base, basil_naive_base, basil_naive, etc. Create tests.
  • Export solve_basil_naive. Add to solver.py. Test.

Parallelism Primer

Add a notebook to the documentation User Guide with tips on parallelism:

  • On local machines like laptops and maybe some desktops, only parallelize if n and/or p are really large (at least a few hundred thousand); a good default is no parallelism.
  • The cost/benefit analysis is tricky because it depends on the OS and hardware.
  • Note some differences between an M1 laptop and a Linux machine on an HPC cluster.
  • Thread management cost is high in general.
  • Make n_threads=1 or the same number for every object that can specify this parameter, e.g., the only allowed values are {1, 8} if you'd like to use 8 threads wherever you want to parallelize. OMP doesn't like it when the thread pool size changes non-trivially.

Add testing framework for checking if group lasso is good

  • Check that strong_gradient matches the current strong_beta
  • Check that rsq matches the formula at the current strong beta and gradient
  • Check that active_g1 and active_g2 are correct w.r.t. active_set
  • Check active_begins
  • Check active_order
  • Check is_active
  • Check that rsqs match with lmdas and betas
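
A sketch of two of these checks in pytest style; the field layout (active_begins as cumulative offsets, is_active as an indicator over groups) is an assumption for illustration:

    import numpy as np

    def check_active_begins(active_set, group_sizes, active_begins):
        # active_begins[k] should be the cumulative size of the first k active groups.
        sizes = [group_sizes[g] for g in active_set]
        expected = np.concatenate(([0], np.cumsum(sizes)))[:-1]
        assert np.array_equal(np.asarray(active_begins), expected)

    def check_is_active(active_set, is_active):
        # is_active must be the indicator version of active_set.
        expected = np.zeros(len(is_active), dtype=bool)
        expected[list(active_set)] = True
        assert np.array_equal(np.asarray(is_active, dtype=bool), expected)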

Experiment with convergence criterion

  • glmnet has 2 metrics on how we define this: https://glmnet.stanford.edu/articles/glmnet.html#appendix-0-convergence-criteria
  • I have a new idea: check how much the prediction mu(Xb + b0) changed (see the sketch after this list).
    • I really like this idea. First, mu has to be tracked because it is needed to compute the gradient, so we have mu^k and mu^{k+1} for free. Second, ||mu^{k+1} - mu^k||_{W0}^2 ~ sum_i w0_i v^k_i (eta^{k+1}_i - eta^k_i)^2 by a Taylor expansion of mu. This is like measuring the average prediction MSE under the IRLS weights.
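
A sketch of that check using quantities IRLS already tracks (mu and the observation weights w0; names are illustrative):

    import numpy as np

    def prediction_change(mu_new, mu_old, w0):
        # ||mu^{k+1} - mu^k||_{W^0}^2: weighted squared change in predictions.
        return np.sum(w0 * (mu_new - mu_old) ** 2)

    def irls_converged(mu_new, mu_old, w0, tol):
        # mu^k and mu^{k+1} are already needed for the gradient, so this
        # convergence check comes essentially for free.
        return prediction_change(mu_new, mu_old, w0) <= tol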

Parallel CD?

It should be possible to parallelize coordinate descent here...

How good is SAFE rule?

When alpha != 1, we must fall back to the heuristic strong rule, which is not safe. A simple alternative approach is to combine the SAFE and strong rules; then KKT only needs to be checked on the SAFE set. Note that when alpha == 1, the EDPP rule ensures that the strong set is already safe. If the SAFE rule is pretty good, we can still get benefits in the elastic net case.

Check lambda max construction

Is this correct? If alpha < 1e-3, this should still return the largest lambda_max that sets all coefficients to 0. I get the clipping, but the threshold seems way too large.

// lambda_max = max_i |grad_i| / penalty_i over the penalized groups,
// divided by alpha clipped below at 1e-3.
return vec_value_t::NullaryExpr(
        grad.size(), [&](auto i) {
            // Unpenalized groups (penalty <= 0) do not constrain lambda_max.
            return (penalty[i] <= 0.0) ? 0.0 : std::abs(grad[i]) / penalty[i];
        }
    ).maxCoeff() / std::max(alpha, 1e-3);
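
For sanity-checking the expression above, a NumPy re-implementation of the same formula (not a proposed fix) might look like:

    import numpy as np

    def lambda_max(grad, penalty, alpha):
        # max_i |grad_i| / penalty_i over penalized groups, then divide by
        # alpha clipped below at 1e-3 (mirrors the C++ expression above).
        safe = np.where(penalty > 0.0, penalty, 1.0)
        ratios = np.where(penalty > 0.0, np.abs(grad) / safe, 0.0)
        return ratios.max() / max(alpha, 1e-3)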

Pivot Rule Plot

  • strong-LS, strong-S, pivot-LS, pivot-S, active, safe
    • Plot for n=100, p=10000, G=1000, rho in {0, 0.5}.
