jamesyang007 / adelie

A fast and flexible Python package for solving group elastic net problems.

Home Page: https://jamesyang007.github.io/adelie/

License: MIT License

C++ 62.32% Jupyter Notebook 1.70% Python 35.78% Starlark 0.17% Shell 0.02%
elastic group lasso net python310 python39 cpp17 convex-optimization coordinate-descent python311

adelie's Introduction


Adelie is a fast and flexible Python package for solving group elastic net problems.

It offers a general-purpose group elastic net solver, a wide range of matrix classes that can exploit special structure to allow large-scale inputs, and an assortment of generalized linear model (GLM) classes for fitting various types of data. These matrix and GLM classes can be extended by the user for added flexibility. Many inner routines, such as matrix-vector products and the gradients, Hessians, and losses of the GLM classes, have been heavily optimized and parallelized. Algorithmic optimizations such as the pivot rule for screening variables and the proximal Newton method have been carefully tuned for convergence and numerical stability.

adelie's People

Contributors

jamesyang007

adelie's Issues

Implement `offset`

An offset is useful for nesting model fits, where a separate model predicts the remaining residuals. We want to model eta = offset + Xb + b0. Currently, only the case offset = 0 is implemented.

  • Gaussian can be handled very easily by simply subtracting the offset from y before fitting (see the sketch below).
  • IRLS needs to change, though.

Finish #47 first, then do this.
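
A minimal sketch of the Gaussian case, assuming a hypothetical fit_gaussian solver with no offset support; the names here are illustrative, not the package API:

    import numpy as np

    def fit_gaussian_with_offset(X, y, offset, fit_gaussian):
        # Gaussian case: eta = offset + X b + b0 is equivalent to fitting
        # (y - offset) ~ X b + b0 with a zero offset.
        beta, intercept = fit_gaussian(X, y - offset)
        # Predictions must add the offset back.
        eta = offset + X @ beta + intercept
        return beta, intercept, eta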

`dense` GPU version

Benchmarking shows that we are truly bottlenecked by memory bandwidth. There's no solution for this except moving the computation to the GPU. It might be fine for our main applications, but it is something to keep thinking about.

Application to SNP Data

  • MatrixSNPUnphased: SNP calldata matrix (unphased).
    • outer_indices: (p+1,) array of starting indices into inner_indices for each column's non-zero entries. The last entry should be the length of inner_indices.
    • inner_indices: (I,) array where inner_indices[outer_indices[i]:outer_indices[i+1]] gives the non-zero indices of the ith column.
    • values: (I,) array where values[outer_indices[i]:outer_indices[i+1]] gives the non-zero values of the ith column (see the sketch after this list).
  • MatrixSNPPhased: SNP calldata matrix (phased).
    • outer_indices_1
    • outer_indices_2
    • inner_indices_1
    • inner_indices_2
  • MatrixSNPAncestryUnphased
  • MatrixSNPAncestryPhased
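
A minimal sketch of how the unphased layout would be indexed, using toy NumPy arrays; the field names mirror the list above, but nothing here is the actual C++ implementation:

    import numpy as np

    # Toy example: p = 3 SNP columns, I = 4 non-zero entries total.
    outer_indices = np.array([0, 2, 2, 4])            # (p+1,); last entry == len(inner_indices)
    inner_indices = np.array([1, 5, 0, 7])            # (I,) row indices of the non-zeros
    values = np.array([1, 2, 1, 1], dtype=np.int8)    # (I,) non-zero calldata values

    def column_nonzeros(i):
        # CSC-like access: rows and values of the i-th column.
        start, end = outer_indices[i], outer_indices[i + 1]
        return inner_indices[start:end], values[start:end]

    rows, vals = column_nonzeros(0)   # column 0: non-zeros at rows 1 and 5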

Pivot rule Paper TODOS

  • Real data application:
    • Take the same datasets from strong rule paper and run lasso + group lasso with splines as groups.
  • Strong rule vs Pivot rule benchmark
    • Total time difference (shows that the strong-set fit time is non-negligible relative to the whole algorithm cost)
    • Strong-set fit time difference (shows that this is the main difference)
  • Rewrite BASIL to batch only one lmda at a time.
    • Only needs to return a bool for whether KKT passed or not.
    • Must still save every grad element (just compute X^T resid in one shot).
    • Compute abs_grad in one shot as well.
    • If KKT failed, populate with the failed variables.
  • Show a picture for why the pivot rule is better than the strong rule (correlated variables). The strong rule is a hard cutoff at a threshold that depends only on the lambda sequence, not so much the data. In general, when the active scores saturate towards the top, the strong rule will pull in too many groups. The pivot rule is more sensitive to the structure of the active scores, as it looks at how close the scores are at the top.
  • Elastic net? General loss?
  • LARS
  • Combine the zoomed-in active-score plot for arcene into Figure 2.

Eigen-decomp Optimization

A simple optimization is to check for a group size of 1, in which case set V = [[1]] and D = the squared norm of the column (see the sketch below).
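
A sketch of the special case in NumPy terms, assuming the general path takes an eigen-decomposition of X_g^T X_g:

    import numpy as np

    def group_eigen(X_g):
        # Fast path for single-column groups: V = [[1]], D = squared column norm.
        if X_g.shape[1] == 1:
            V = np.ones((1, 1))
            D = np.array([np.sum(X_g[:, 0] ** 2)])
            return D, V
        # General case: full symmetric eigen-decomposition.
        D, V = np.linalg.eigh(X_g.T @ X_g)
        return D, V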

Naive version

Currently, there is only the covariance method, which works well for n > p. But for large p, and especially for groups of large size, the covariance method is terrible in both memory and time.

  • implement naive
  • benchmark naive vs cov to get a good heuristic for when to choose which

New GLM design

The most general approach is to have each GLM specify the full objective, since not every objective will be of the current form. The most general form is -y^T W^0 eta + A(eta) for some convex function A, with eta = Xb + b0 (see the sketch below).
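
A sketch of what such an interface could look like, with the Gaussian case as one concrete instance; the class and method names are illustrative, not the package's actual API:

    import numpy as np

    class GLMBase:
        # Each GLM specifies the full objective -y^T W^0 eta + A(eta).
        def loss(self, y, eta, weights):
            raise NotImplementedError
        def gradient(self, y, eta, weights):
            raise NotImplementedError

    class GaussianGLM(GLMBase):
        # A(eta) = sum_i w_i eta_i^2 / 2 recovers weighted least squares
        # up to a constant independent of eta.
        def loss(self, y, eta, weights):
            return -np.sum(weights * y * eta) + 0.5 * np.sum(weights * eta ** 2)
        def gradient(self, y, eta, weights):
            return weights * (eta - y)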

Input checking

We need much more stringent input checking throughout the solver functions and state creators.

UKB Analysis TODOS

  • Group lasso is not doing much better than lasso in terms of R^2. Some improvement ideas:
    • Try a different phenotype than height?
    • We may need to pool all populations to get enough ancestry variation. The point is that we're probably not going to do much better on white_british, but pooling can improve performance on the other populations.
      • Check the ancestry information for the white_british population to make sure there are sufficient counts.
      • Include population information as unpenalized features.
  • Try the Mexico biobank.
  • Try subsetting the UK Biobank to those born outside the UK.
  • Try the offset idea by first fitting a lasso.

Change KKT to go backwards

Since the active set and strong set are always increasing, if KKT passes for lambda_i, it also passes for the previous lambdas in the batch during basil.

Is this true?
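
If the claim holds, the KKT loop over a batch can scan from the back and stop at the first lambda that passes; a rough sketch with a hypothetical kkt_passes check:

    def num_kkt_valid(lmdas, kkt_passes):
        # Scan the batch from the last lambda backwards. If KKT passing at
        # lambda_i implies passing at all earlier lambdas in the batch, the
        # first pass found from the back certifies the whole prefix.
        for i in range(len(lmdas) - 1, -1, -1):
            if kkt_passes(lmdas[i]):
                return i + 1   # lambdas 0..i are valid solutions
        return 0               # nothing in the batch passed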

Refactor 11/6

  • Change matrix exports to be inherited. Add Trampolines.
  • Change check for basil to use state.X

Refactor v1.1

  • Rewrite group_lasso to abstract out X. Clearly define the interface for X.
  • Write a custom X class that subsets groups of columns. The idea is that since group_lasso only operates on strong groups anyway, we don't need to store the whole X. We just need to store a list of the column groups and a mapping from original index to index into that list (see the sketch below).
  • Parallelize some linear algebra routines.
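
A rough sketch of the column-subsetting idea in plain NumPy; the class name and layout are assumptions for illustration:

    import numpy as np

    class StrongGroupSubset:
        # Stores only the strong groups' column blocks plus an index mapping.
        def __init__(self, X, groups, group_sizes, strong_groups):
            self.blocks = [
                np.asarray(X[:, groups[g]:groups[g] + group_sizes[g]])
                for g in strong_groups
            ]
            # Map original group index -> position in the stored list.
            self.index_map = {g: k for k, g in enumerate(strong_groups)}

        def block(self, g):
            return self.blocks[self.index_map[g]]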

Flag to skip over no-penalty coefficients

Add a flag to lasso fitting that allows skipping unpenalized coefficients. In basil, the very first iteration requires us to fit all unpenalized variables (penalty.factor[i] == 0), but this solution remains the same across the regularization path. So, subsequent calls down the regularization path can skip coordinate descent on these coefficients (see the sketch below).
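
A sketch of where the skip would sit inside a coordinate descent sweep; the flag and helper names are hypothetical:

    def cd_sweep(groups, penalty_factor, update_group, skip_unpenalized=False):
        # One coordinate descent sweep over groups. The skip is only valid after
        # the first path point, where the unpenalized solution has already been
        # fit and does not change along the regularization path.
        for g in groups:
            if skip_unpenalized and penalty_factor[g] == 0:
                continue   # unpenalized coefficients are already at their solution
            update_group(g)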

Refactor Plan

Refactor:

  • State restructuring: try to move as many quantities as possible outside the state object. We only want to maintain invariants that are needed for passing the state again.
  • state.reset(*, **kwargs) should allow the user to modify any subset of the "modifiable" quantities and return a new object with those changes. This replaces state.update_lmda_path.
  • objective should really just be the naive version; add a glm object to make it general.
  • Redo diagnostics with the GLM extension.

Tests:

  • All glm classes
  • logistic regression as an example

TODO 11/8

  • n_threads_n, n_threads_p to control different parts of the algorithm.
  • Remove the capping of n_threads because there's a significant cost to changing the pool size. Just keep it fixed.
  • Recommendation: all n_threads_* should be either 1 or the same value, including the one used in the matrix classes.
  • Benchmark OMP to be absolutely sure that there is a huge cost to changing the thread pool size.
  • abs_grad needs to change to include the elastic net correction lmda * (1-alpha) * w_g * beta_g. Double-check that we only need to change the definition of abs_grad.
  • Fix the sparse matrix export of beta so that it is faster.
  • Implement matrix class for phased ancestry
  • Write tutorial notebook for using different SNP styles.
  • Write a script to subset data (calldata/ancestry/phenotype) and scp the results to local machine.
  • Write exact script for processing all existing data files to .snpdat format.
  • Try using adelie on real data!
  • Change SNP matrices to include a dense component

Diagnostic tools

  • Implement grpnet(X, y, groups, group_sizes, alpha, penalty) (see the usage sketch after this list)
    • If X is dense, make a naive dense matrix.
    • Otherwise, it should already be a naive matrix.
    • All naive matrices must implement means(out) and group_norms(groups, group_sizes, center, out).
    • Create inf state and call solve_basil.
  • Plot coefficient profile
  • Plot R^2 as a function of -log(lmdas)
  • Plot active-size, strong-size, edpp-safe-size
  • Plot KKT check
  • Predict function (linear interpolation)
  • Change convergence criterion to be based on change in R^2
  • Benchmark tools
    • How much time spent on strong set CD for each lmda?
    • How much time spent on active set CD for each lmda?
    • BASIL fit time at each basil iteration: total strong, total active
    • BASIL screen time at each basil iteration
    • BASIL kkt time at each basil iteration
    • BASIL invariance time at each basil iteration
    • How many valid solutions for each basil iteration?
  • Add a groups and group_sizes check: groups[i] - groups[i-1] should match group_sizes[i].
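
A sketch of how the planned entry point might be called, assuming the signature listed above; the call itself is hypothetical until grpnet exists:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 20
    X = rng.standard_normal((n, p))
    y = X[:, 0] - 2.0 * X[:, 5] + rng.standard_normal(n)
    groups = np.arange(0, p, 2)              # start index of each group
    group_sizes = np.full(groups.size, 2)    # groups of size 2
    penalty = np.ones(groups.size)           # per-group penalty factors

    # Hypothetical call matching the planned signature; the returned state would
    # then feed the diagnostic plots (coefficient profile, R^2 vs -log(lmdas), ...).
    # state = grpnet(X, y, groups, group_sizes, alpha=1.0, penalty=penalty)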

`dense` flexible threading

The dense matrix class implements the multiplication routines using OMP. Benchmarking shows that threading is not very efficient when the data does not sit well in cache. The most sophisticated thing to do is to batch-process these multiplication routines. Specifically, we chunk up the batches as large as possible so that everything sits in cache, parallelize within a batch, and then loop to the next batch.

It would be nice to allow users to specify a number of parameters that go into this implementation. For example, the user could specify the total memory allowed per batch and the total number of threads. Then, we can deduce how many rows to batch (see the sketch below).
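
A sketch of deducing the batch row count from a per-batch memory budget and then looping over batches; this is illustrative only, since real cache behavior depends on the platform:

    import numpy as np

    def rows_per_batch(n_cols, budget_bytes, dtype=np.float64):
        # Rows whose combined storage stays within the per-batch memory budget.
        row_bytes = n_cols * np.dtype(dtype).itemsize
        return max(1, budget_bytes // row_bytes)

    def batched_matvec(X, v, budget_bytes=1 << 20):
        # X @ v computed batch by batch so each chunk stays cache-resident;
        # each batch would be parallelized across threads internally.
        n, p = X.shape
        b = int(rows_per_batch(p, budget_bytes, X.dtype))
        out = np.empty(n)
        for start in range(0, n, b):
            end = min(start + b, n)
            out[start:end] = X[start:end] @ v
        return out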

TODO 12/02

  • New repo snpper
    • Move snputils related things here
    • Package it with C++ backend.
  • gen_to_snpdat
    • Add samples_indices and snps_indices option. Useful for debugging, splitting to training and testing, and removing SNPs based on MAF. None means use all.
    • Write a custom converter from calldatas -> calldata (I x 2*snp) (F storage, int8_t). The same applies to lai.
    • Save the converted calldata and LAI using np.save instead of np.savetxt.
    • Implement calldata_path and lai_path, which supply the paths of the converted calldata and LAI .npy files.

Strong Rule Improvements

  • Eviction: the idea is to evict newly added strong variables that did not turn out to be active. The hope is that after eviction, when we check for the next set of strong variables, there is more evidence to either include or exclude them.
    • This doesn't really change the sets. It's not worth pursuing.
  • Quantile: verify empirically which variables that fail the strong condition are actually active next. Hopefully, it's those with the largest abs_grad.

2023-10-01 TODO

  • pyproject.toml
    • Follow classifiers as in numpy
    • Remove pybind, cppimport from deps and poetry deps
  • README.md better descriptions
  • tests
    • test_grpnet fill in
  • docs
    • dev
      • Simplify versioning and use a better format
      • Write command for git tag
  • adelie
    • bcd.py
      • use h stars in docs
      • say that behavior is undefined when conditions are not satisfied (maybe throw an exception?)
      • more descriptive return name
      • solve description not exactly correct? Well-defined iff ||v|| <= l1 or ||v_S|| <= l1.
      • Root should only take quad, linear, l1 because it solves root_function. Well-defined iff ||v|| >= l1 and ||v_S|| < l1.
    • matrix.py
      • rethink the API. Really, we should only expose the bare-minimum, highest-level routines. Right now, we are enforcing a dense representation of block and col. As a result, we must enforce column-major order for dense.
    • state.py
      • move X_k description to solve_pin
      • Alpha bound [0,1]
      • Penalty >= 0
      • Strong_set is a subset of groups
      • strong_begin: also say how much should be read.
      • strong_A_diag: change the name globally to strong_var and use math k instead.
      • default vals for some configs
      • globally change thr to tol
      • rsq_slope_tol, rsq_curv_tol
      • strong_grad math k
      • Make readonly!
    • grpnet
      • give the same bound information for each param

`sp_btmul` optimization

  • There's no reason why we need to have weights for this one.
  • Some implementations have not been optimized (they are not needed for fitting), but for completeness, it would be nice to have them fast.

Benchmark Diagnostic

It would be nice to also save timings for each region of the code.
For each lambda fitting:

  • Time on coordinate descent on strong set?
  • Time on coordinate descent on active set?
  • etc.

Then, similar timings for each basil iteration.

state API should contain reference to original `y`

Right now, state.check requires the user to pass in y, and similarly for Diagnostic(y=y, ...). This seems a bit awkward. Consider passing in y in the Python-level wrapper when constructing the state, just to keep the reference (see the sketch below). No need to change the C++ code.
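
A sketch of the Python-level wrapper idea; the class name is hypothetical and the C++ state is untouched:

    class StateWithResponse:
        # Thin Python wrapper that keeps a reference to y alongside the core state.
        def __init__(self, core_state, y):
            self._core = core_state   # existing C++ state object
            self.y = y                # stored so diagnostics need not take y again

        def check(self, *args, **kwargs):
            # Forward to the core check, supplying the stored y.
            return self._core.check(self.y, *args, **kwargs)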

General GLM

Implement general GLM fitter.

  • Think about a different convergence metric for the Newton method: ||mu^{k+1} - mu^k||_{W}^2 <= tol, where mu is the mean parameter at the current natural parameter eta = Xb + b0.
  • Verify generalized R^2 metric.

Real Data Analysis Debugging

  • Try glmnet on chromosome 1 for a bit of the path just for correctness.
  • Try linear regression of y ~ covariates to see if we can reproduce the R^2.
  • Make cmul parallelized in snp_unphased
  • Implement concat matrix to concatenate dense + chromosome SNP matrices.
  • Refactor snp matrices to be single file. Use concat to concatenate many snp matrices.
  • Use concat matrix to fit on real data.
  • Double-check that the calldata is pulled correctly when read using the subset feature. Check by reading the full chromosome 1 calldata and then subsetting to the rows and columns.
  • Test grpnet for snp matrices against dense with multiple blocks

TODO 2023-10-06

  • Export covariance state to Python
  • Test covariance state
  • Document covariance state
  • Implement solve_pin_cov
  • Test solve_pin_cov
  • Change states to all own objects
  • Redocument in export file and simplify state.py
  • Change matrices to be split into pin_naive_base pin_naive_dense etc.
  • Cov lazy
  • Change file names (final)
  • Basil idea:
    • Center y if fitting with an intercept (also save the mean to output the intercept value).
    • The strong set only includes the un-l1-penalized variables at first. Always use lambda_max for the fit! Since the fitting part assumes zero coefficients outside the strong set, it must be lambda_max by definition to be well-defined. Use X.to_dense() to grab the respective blocks. Process them to get strong_vars and UD, V (centering first if necessary).
    • Now that the initial solution has been found and some variables are updated, the current state should be in its invariant state.
    • X.init_states(...) can do one pass through the entire matrix to compute many quantities like the gradient, ||X_k||_2^2, and the mean of X_k. This will give us all the initial states related to X.
    • Use the strong/EDPP rule to omit variables.
    • X.to_dense(j, q, out) stores the dense version of the column block into out. From this, we can compute the (possibly centered, if fitting with an intercept) SVD of the block and save the necessary UD and V separately. This will only be done on the strong set.
  • Implement MatrixPinNaiveSubset which views a subset of a matrix on the strong set and integrate into solve_basil_naive.hpp.
  • Figure out interface for MatrixBasilNaiveBase. Export bindings and add to matrix.py.
  • Implement MatrixBasilNaiveDense, export, add to matrix.py, test.
  • Export StateBasilNaive with MatrixBasilNaiveBase.
  • Extend state.py to include basil_base, basil_naive_base, basil_naive, etc. Create tests.
  • Export solve_basil_naive. Add to solver.py. Test.

Parallelism Primer

Add a notebook to the documentation User Guide with tips on parallelism:

  • On local machines like laptops and maybe some desktops, only parallelize if n and/or p are really large (at least a few hundred thousand); a good default is no parallelism.
  • The cost/benefit analysis is tricky because it depends on the OS and hardware.
  • Note some differences between an M1 laptop and a Linux machine on an HPC cluster.
  • Thread management cost is high in general.
  • Make n_threads=1 or the same number for every object that can specify this parameter, e.g., the only allowed values are {1, 8} if you'd like to use 8 threads wherever you want to parallelize. OMP doesn't like it when the thread pool size changes non-trivially.

Add testing framework for checking if group lasso is good

  • Check that strong_gradient matches the current strong_beta
  • Check that rsq matches the formula at the current strong beta and gradient
  • Check that active_g1 and active_g2 are correct w.r.t. active_set
  • Check active_begins
  • Check active_order
  • Check is_active
  • Check that rsqs match with lmdas and betas
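
A sketch of two of these checks in pytest style; the field layout (active_begins as cumulative offsets, is_active as an indicator over groups) is an assumption for illustration:

    import numpy as np

    def check_active_begins(active_set, group_sizes, active_begins):
        # active_begins[k] should be the cumulative size of the first k active groups.
        sizes = [group_sizes[g] for g in active_set]
        expected = np.concatenate(([0], np.cumsum(sizes)))[:-1]
        assert np.array_equal(np.asarray(active_begins), expected)

    def check_is_active(active_set, is_active):
        # is_active must be the indicator version of active_set.
        expected = np.zeros(len(is_active), dtype=bool)
        expected[list(active_set)] = True
        assert np.array_equal(np.asarray(is_active, dtype=bool), expected)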

Experiment with convergence criterion

  • glmnet has 2 metrics on how we define this: https://glmnet.stanford.edu/articles/glmnet.html#appendix-0-convergence-criteria
  • I have a new idea: check how much the prediction mu(Xb + b0) changed (see the sketch after this list).
    • I really like this idea. First, mu has to be tracked because it is needed to compute the gradient, so we have mu^k and mu^{k+1} for free. Second, ||mu^{k+1} - mu^k||_{W0}^2 ~ sum_i w0_i v^k_i (eta^{k+1}_i - eta^k_i)^2 by a Taylor expansion of mu. This is like measuring the average prediction MSE under the IRLS weights.
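
A sketch of that check using quantities IRLS already tracks (mu and the observation weights w0; names are illustrative):

    import numpy as np

    def prediction_change(mu_new, mu_old, w0):
        # ||mu^{k+1} - mu^k||_{W^0}^2: weighted squared change in predictions.
        return np.sum(w0 * (mu_new - mu_old) ** 2)

    def irls_converged(mu_new, mu_old, w0, tol):
        # mu^k and mu^{k+1} are already needed for the gradient, so this
        # convergence check comes essentially for free.
        return prediction_change(mu_new, mu_old, w0) <= tol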

Parallel CD?

It should be possible to parallelize coordinate descent here...

How good is SAFE rule?

When alpha != 1, we must fall back to the heuristic strong rule, which is not safe. A simple alternative approach is to combine the SAFE and strong rules; then KKT only needs to be checked on the SAFE set. Note that when alpha == 1, the EDPP rule ensures that the strong set is already safe. If the SAFE rule is pretty good, we can still get benefits in the elastic net case.

Check lambda max construction

Is this correct? If alpha < 1e-3, this should still return the largest lambda_max that sets all coefficients to 0. I get the clipping, but the threshold seems way too large.

// lambda_max = max_i |grad_i| / penalty_i over the penalized groups,
// divided by alpha clipped below at 1e-3.
return vec_value_t::NullaryExpr(
        grad.size(), [&](auto i) {
            // Unpenalized groups (penalty <= 0) do not constrain lambda_max.
            return (penalty[i] <= 0.0) ? 0.0 : std::abs(grad[i]) / penalty[i];
        }
    ).maxCoeff() / std::max(alpha, 1e-3);
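
For sanity-checking the expression above, a NumPy re-implementation of the same formula (not a proposed fix) might look like:

    import numpy as np

    def lambda_max(grad, penalty, alpha):
        # max_i |grad_i| / penalty_i over penalized groups, then divide by
        # alpha clipped below at 1e-3 (mirrors the C++ expression above).
        safe = np.where(penalty > 0.0, penalty, 1.0)
        ratios = np.where(penalty > 0.0, np.abs(grad) / safe, 0.0)
        return ratios.max() / max(alpha, 1e-3)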

Pivot Rule Plot

  • strong-LS, strong-S, pivot-LS, pivot-S, active, safe
    • Plot for n=100, p=10000, G=1000, rho in {0, 0.5}.
