felix-clark / ndarray-glm
Rust library for linear, logistic, and generalized linear model regression
License: MIT License
This could mitigate numerical issues for bad fits and would put tolerances on a more similar scale across different families. It could be computed once and stored somewhere, perhaps in the IRLS struct itself.
The fit statistics will need to be re-evaluated for this change.
I note that there is no `residuals` field or method on the `Fit` object. I could call `fit.expectation(x)` (which, by the way, is not documented due to #18) and subtract `y`, but it would be more user-friendly to provide a method for this. In particular, it could offer the option to output standardized residuals, which is what I'm actually trying to calculate here.
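As a stopgap, the manual computation is straightforward. A minimal sketch of a standardized-residual helper in plain Rust (names hypothetical; this divides only by the square root of the dispersion, ignoring the variance function and leverage corrections that fully studentized residuals would need):

```rust
// Hypothetical helper: standardized residuals from observed and fitted values
// (e.g. the output of fit.expectation(x)), assuming a constant dispersion.
fn standardized_residuals(y: &[f64], fitted: &[f64], dispersion: f64) -> Vec<f64> {
    let scale = dispersion.sqrt();
    y.iter()
        .zip(fitted)
        .map(|(yi, mui)| (yi - mui) / scale)
        .collect()
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let fitted = [1.1, 1.9, 3.2];
    let r = standardized_residuals(&y, &fitted, 0.04);
    println!("{:?}", r);
}
```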
Consider exporting common functions in a `prelude` submodule rather than directly in the `ndarray-glm` namespace.
One issue to work out is whether re-exporting `ndarray::Array1` and the like would be a problem; some users may want to `use ndarray_glm::prelude::*;`, and this would probably lead to conflicts.
See `ndarray-linalg` for an example.
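A minimal sketch of what the facade could look like (module layout hypothetical; the ndarray re-export is left commented out pending the conflict question):

```rust
// Hypothetical crate layout: internal modules stay private, and the prelude
// re-exports the items most users need in one glob import.
mod fit {
    pub struct Fit;
}
mod model {
    pub struct ModelBuilder;
}
pub mod prelude {
    pub use crate::fit::Fit;
    pub use crate::model::ModelBuilder;
    // Possibly also: pub use ndarray::{Array1, Array2};
    // (omitted here to avoid clashing with `use ndarray::prelude::*`)
}

fn main() {
    use crate::prelude::*;
    let _fit = Fit;
    let _builder = ModelBuilder;
    println!("prelude imports resolve");
}
```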
Convergence should be shown over a wide range of potentially difficult datasets.
Ideally some canonical datasets with known results could be used.
A dataset should be found that exercises the step-halving logic.
Due to a bug in `ndarray-linalg`, the calculations of several fit statistics, including the covariance matrix, Fisher matrix, and Wald and score tests, have produced incorrect off-diagonal entries. The bug appears to affect the Hermitian inversion functions `invh()` and `invh_into()`, the latter of which was used frequently.
As a workaround, these calls should be changed to `inv_into()`.
Would the standard `log` interface be sufficient? `slog` at least should be investigated.
Any logging should be gated behind a feature flag so that it can be compiled out entirely in performance-critical applications. I'm unsure whether it should be enabled or disabled by default.
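A sketch of how the gating could work, assuming a hypothetical "logging" cargo feature; with the feature disabled, the cfg-attributed statement is removed at compile time, so there is zero runtime cost:

```rust
// Hypothetical logging macro: when the "logging" feature is off, the
// cfg-attributed statement (and its arguments) disappear entirely.
macro_rules! irls_trace {
    ($($arg:tt)*) => {{
        #[cfg(feature = "logging")]
        log::trace!($($arg)*);
    }};
}

// cfg! evaluates the feature flag to a compile-time boolean.
fn logging_enabled() -> bool {
    cfg!(feature = "logging")
}

fn main() {
    irls_trace!("step {}: regularized likelihood = {}", 3, -12.5);
    println!("logging enabled: {}", logging_enabled());
}
```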
Right now, L1 regularization is implemented with a smoothed V-curve potential. It would be better to use a true LASSO approach, which should reduce over-stepping and actually encourage parameters to stay exactly at zero instead of bouncing back and forth.
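For reference, the proximal (soft-thresholding) operator used by coordinate-descent and proximal-gradient LASSO solvers maps any coefficient within lambda of zero exactly to zero, unlike a smoothed penalty. A minimal sketch:

```rust
// Soft-thresholding operator: prox of lambda * |beta|.
// Shrinks beta toward zero by lambda, clamping to exactly zero inside
// the interval [-lambda, lambda].
fn soft_threshold(beta: f64, lambda: f64) -> f64 {
    beta.signum() * (beta.abs() - lambda).max(0.0)
}

fn main() {
    for &b in &[-1.5, -0.2, 0.0, 0.3, 2.0] {
        println!("prox({:5.2}) = {}", b, soft_threshold(b, 0.5));
    }
}
```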
I encountered an annoying case where I wanted to write a function that was essentially `fn(data: Array2<T>) -> Fit<Linear, T>`. Unfortunately, the `Fit` struct holds a reference to the `Model` used to create it, so it's basically impossible to return the `Fit` at all. The best one could do is use `owning_ref` to return both together, but that is certainly awkward. I could of course return just the model, but then I would have to call `fit()`, and therefore re-run the substantial processing involved, each time I want access to the `Fit`. Can you think of a way to make this use case a bit nicer? Maybe removing the `Model` reference from `Fit` and instead making it an argument to the functions that require it?
The guess at each step of IRLS, as well as the regularized likelihood, is already held in the `IrlsStep` objects returned from the iteration. These are unused now, but a vector of them could be saved and stored in the `Fit`. This should probably be opt-in behavior.
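A sketch of what opt-in collection could look like (names schematic):

```rust
// Schematic: the per-iteration guesses and regularized likelihoods are only
// collected when the caller asks for them; the default cost is an unused Option.
struct IrlsStep {
    guess: Vec<f64>,
    reg_likelihood: f64,
}

struct Fit {
    result: Vec<f64>,
    history: Option<Vec<IrlsStep>>,
}

fn run_irls(save_history: bool) -> Fit {
    let mut history = save_history.then(Vec::new);
    // Stand-in for the real iteration loop:
    for i in 0..2 {
        let step = IrlsStep {
            guess: vec![i as f64],
            reg_likelihood: -(10.0 - i as f64),
        };
        if let Some(h) = history.as_mut() {
            h.push(step);
        }
    }
    Fit { result: vec![1.0], history }
}

fn main() {
    let fit = run_irls(true);
    println!(
        "{} result entries, {} steps recorded",
        fit.result.len(),
        fit.history.map_or(0, |h| h.len())
    );
}
```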
It should be relatively straightforward to modify the IRLS algorithm to account for varying weights between data points or even correlations between data points. There may be other ramifications to consider, e.g. for convergence and termination conditions.
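For uncorrelated per-observation weights, the prior weights simply multiply the working weights in the normal equations; in the single-coefficient case the update reduces to a ratio of weighted sums. A standalone sketch of that reduced case:

```rust
// Weighted least-squares slope for one predictor with no intercept:
// beta = sum(w * x * z) / sum(w * x^2), where z is the working response.
fn weighted_slope(x: &[f64], z: &[f64], w: &[f64]) -> f64 {
    let num: f64 = x.iter().zip(z).zip(w).map(|((xi, zi), wi)| wi * xi * zi).sum();
    let den: f64 = x.iter().zip(w).map(|(xi, wi)| wi * xi * xi).sum();
    num / den
}

fn main() {
    // The third point is up-weighted, pulling the slope toward it.
    let x = [1.0, 2.0, 3.0];
    let z = [1.0, 2.5, 2.0];
    let w = [1.0, 1.0, 5.0];
    println!("slope = {}", weighted_slope(&x, &z, &w));
}
```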
The `Fit` struct returned by the `fit()` method is not documented, for example. Indeed, the `fit` module it belongs to is also missing from the list of modules in the documentation. Is there any way to export either the struct in the prelude or the module itself so that it gets documented? It may not be possible to use the struct in type definitions either, for the same reason; at the least, my autocomplete doesn't find it.
This is apparent in the logistic integration test `log_termination_0`, which now compares several fit statistics to those given by R. For the standardized/studentized residual equality checks to pass, the tolerance has to be increased to ~0.05, so they are approximately equal but not exact up to floating-point error.
I'm including these residual functions in 0.0.10 since they are close and should still be useful, but the source of the difference should be investigated further.
Polars seems to be emerging as the preeminent dataframe library with Rust bindings. Abstracting the `Dataset` interface could provide a convenient way to let users supply a dataframe and column names instead of manually constructing arrays. This would mesh well with other features like optional variable names.
This should probably be feature-gated. Polars itself has an `ndarray` feature that would need to be activated.
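One possible shape for the abstraction (trait and names hypothetical), which an array-backed implementation and a feature-gated Polars adapter could both satisfy:

```rust
// Hypothetical data-source trait: the model only needs a response column and
// named covariate columns, regardless of how they are stored.
trait DataSource {
    fn response(&self) -> Vec<f64>;
    fn covariate(&self, name: &str) -> Option<Vec<f64>>;
}

// Array-backed implementation standing in for the current interface.
struct ColumnData {
    y: Vec<f64>,
    columns: Vec<(String, Vec<f64>)>,
}

impl DataSource for ColumnData {
    fn response(&self) -> Vec<f64> {
        self.y.clone()
    }
    fn covariate(&self, name: &str) -> Option<Vec<f64>> {
        self.columns
            .iter()
            .find(|(n, _)| n == name)
            .map(|(_, col)| col.clone())
    }
}

fn main() {
    let data = ColumnData {
        y: vec![0.0, 1.0],
        columns: vec![("x1".to_string(), vec![1.0, 2.0])],
    };
    println!("y = {:?}, x1 = {:?}", data.response(), data.covariate("x1"));
}
```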
I haven't found much in the way of documentation or examples of this, but it should be possible to, for instance, use separate simultaneous predictors for mu and log(sigma^2) in a Gaussian response.
It would also be nice for this to be implemented.
Currently the docs advise using `utility::one_pad` if the data doesn't have an intercept column. Unfortunately, this module is also private.
This would be a useful diagnostic to determine how much influence each observation has over the fit.
It seems that OLS regression is what has been implemented, so it makes the assumption of a fixed sigma squared: the scalar variance of the errors and also of y. How can this be extracted from the fit object? Is it the same as the `dispersion()` function? Calling this parameter "dispersion" is a bit unfamiliar to me.
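For the Gaussian family, the dispersion parameter is exactly sigma^2, and its usual unbiased estimate divides the residual sum of squares by n - p. A standalone sketch of that estimate (helper name hypothetical):

```rust
// Unbiased estimate of sigma^2 in OLS: RSS / (n - p),
// where p counts all fitted parameters including the intercept.
fn dispersion_estimate(y: &[f64], fitted: &[f64], n_params: usize) -> f64 {
    let rss: f64 = y.iter().zip(fitted).map(|(yi, fi)| (yi - fi).powi(2)).sum();
    rss / (y.len() - n_params) as f64
}

fn main() {
    let y = [1.0, 2.0, 3.0, 4.0];
    let fitted = [1.5, 1.5, 3.5, 3.5];
    println!("sigma^2 estimate = {}", dispersion_estimate(&y, &fitted, 2));
}
```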
Some operations are performed repeatedly in the implementation of the `Fit` result. The covariance matrix is already cached manually, which is most important since it is probably an `O(n^3)` calculation. Other functions may be of only linear complexity but are called repeatedly and potentially involve relatively slow function calls like logarithms.
Some potential caching targets:
- `Fit::predict()` on external/test/validation data

The `standardize` utility function should probably be moved internally, for at least two reasons.

Basic use cases like ordinary least squares should be more ergonomic. Macros should be defined so that the user can get results in one short line and doesn't have to go through "building" a model and fitting it. Optional keyword parameters could also be used for configuration.
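A sketch of what a one-liner could look like (the macro and everything it calls are hypothetical; a closed-form simple regression stands in for the real build-and-fit chain):

```rust
// Hypothetical convenience macro: would expand to the model-building and
// fitting boilerplate behind the scenes.
macro_rules! ols {
    ($y:expr, $x:expr) => {
        fit_ols($y, $x)
    };
}

// Stand-in for the real fit: closed-form simple linear regression,
// returning (intercept, slope).
fn fit_ols(y: &[f64], x: &[f64]) -> (f64, f64) {
    let n = y.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let sxy: f64 = x.iter().zip(y).map(|(xi, yi)| (xi - mx) * (yi - my)).sum();
    let sxx: f64 = x.iter().map(|xi| (xi - mx).powi(2)).sum();
    let slope = sxy / sxx;
    (my - slope * mx, slope)
}

fn main() {
    let x = [0.0, 1.0, 2.0];
    let y = [1.0, 3.0, 5.0];
    let (intercept, slope) = ols!(&y, &x);
    println!("y = {} + {} x", intercept, slope);
}
```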
One could imagine fitting a single model (i.e. data + GLM type) with different fitting options (e.g. regularization and maximum number of iterations). This will require changing the general interface, and some thought should go into making it as ergonomic as possible.
A crate like `special-fun` would enable some calculations like CDFs and quantile residuals. Perhaps this should be gated behind an optional feature.