ndarray-glm's Issues

IRLS should subtract the saturated likelihood

This could mitigate numerical issues for poor fits and would put tolerances on a more comparable scale across different families. The saturated likelihood can be computed once and stored somewhere, perhaps in the IRLS struct itself.

The fit statistics will need to be re-evaluated for this change.
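
For reference, subtracting the saturated log-likelihood turns the objective into (half) the deviance, which is non-negative and vanishes for a perfect fit; this is what makes tolerances comparable across families:

```math
D(\beta) = 2\left[\ell_{\mathrm{sat}} - \ell(\beta)\right] \ge 0,
\qquad \ell_{\mathrm{sat}} = \ell\big|_{\mu_i = y_i}
```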

Simple way to extract residuals?

I note that there is no residuals field or method on the Fit object. I guess I could call fit.expectation(x) (which, by the way, is not documented due to #18) and subtract y, but it might be a bit more user-friendly to provide a method for this. In particular, you could offer the option to output standardised residuals, which is what I'm actually trying to calculate here.
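
A minimal sketch of that manual workaround, assuming `Fit` can be named in user code (the very thing #18 is about) and that `expectation` takes the design matrix; lifetime parameters are elided:

```rust
use ndarray::{Array1, Array2};
use ndarray_glm::{Fit, Linear}; // this import may itself need the fix in #18

// Hypothetical helper: response residuals y_i - mu_hat_i, built from
// the (assumed) `expectation` method on the fit result.
fn response_residuals(
    fit: &Fit<Linear, f64>,
    x: &Array2<f64>,
    y: &Array1<f64>,
) -> Array1<f64> {
    let mu = fit.expectation(x); // fitted means on the training data
    y - &mu
}
```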

Prelude module

Consider exporting common functions in a prelude submodule rather than directly in the ndarray-glm namespace.

One issue to work out is whether re-exporting ndarray::Array1 and the like would be a problem; some users may write `use ndarray_glm::prelude::*;`, which would probably lead to conflicts.
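
A sketch of the layout in question; all names are illustrative:

```rust
// In lib.rs: a prelude that gathers the common items in one place.
pub mod prelude {
    pub use crate::{Linear, Logistic, ModelBuilder};
    // The open question: re-exporting ndarray types is convenient, but
    // this line would collide with a user's own `use ndarray::prelude::*;`.
    pub use ndarray::{Array1, Array2};
}
```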

Expand test cases

Convergence should be shown over a wide range of potentially difficult datasets.

Ideally some canonical datasets with known results could be used.

A dataset should be found that engages the step halving.

Covariance, Fisher matrix, Wald and score tests are incorrect

Due to a bug in ndarray-linalg, the calculations of several fit statistics, including the covariance matrix, Fisher matrix, and the Wald and score tests, have produced incorrect off-diagonal entries. The bug appears to affect the Hermitian inversion functions invh() and invh_into(), the latter of which was used frequently.

As a workaround, uses of these functions should be changed to inv_into().
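
A sketch of the swap, using the general LU-based inverse from ndarray-linalg's `InverseInto` trait in place of the Hermitian-specialized one:

```rust
use ndarray::Array2;
use ndarray_linalg::InverseInto;

fn invert_fisher(fisher: Array2<f64>) -> Array2<f64> {
    // was: fisher.invh_into().unwrap()  // Hermitian inverse (buggy upstream)
    fisher.inv_into().expect("Fisher matrix should be invertible")
}
```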

Logging system

Would the standard log interface be sufficient? At the least, slog should be investigated.

Any logging should be gated behind a feature flag so that it can be compiled out entirely in performance-critical applications. I'm unsure whether it should be enabled or disabled by default.
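
A sketch of how the gating could look with the standard `log` facade; the feature name is hypothetical:

```rust
// With `log` declared as an optional dependency behind a "logging"
// feature, calls compile away entirely when the feature is off.
#[cfg(feature = "logging")]
macro_rules! irls_trace {
    ($($arg:tt)*) => { log::trace!($($arg)*) };
}
#[cfg(not(feature = "logging"))]
macro_rules! irls_trace {
    ($($arg:tt)*) => {};
}
```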

Proper LASSO regression

Right now L1 regularization is implemented with a smoothed V-curve potential. It would be better to use a LASSO approach, which should reduce over-stepping and actually encourage parameters to stay near zero instead of bouncing back and forth.
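
For reference, coordinate-descent and proximal-gradient LASSO solvers apply the soft-thresholding operator, which maps small coefficients exactly to zero instead of letting them oscillate around it:

```math
S_\lambda(z) = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0)
```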

Return a `Fit` from a function

I encountered an annoying case where I wanted to write a function that was basically fn(data: Array2<T>) -> Fit<Linear, T>. Unfortunately the Fit struct holds a reference to the Model used to create it, so it's basically impossible to return the Fit at all. The best you could do is use owning_ref to return both together, but this is certainly awkward. I could of course return just the model, but then I have to call fit() and therefore re-run the substantial processing involved each time I want access to the Fit. Can you think of a way to make this use case a bit nicer? Maybe remove the Model reference from Fit and instead make it an argument to the functions that require it?
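
A sketch of what this issue asks to make possible; it does not compile today precisely because `Fit` borrows the model:

```rust
use ndarray::{Array1, Array2};
use ndarray_glm::{Fit, Linear, ModelBuilder};

// Desired: build, fit, and return an owned Fit in one function.
fn fit_linear(y: Array1<f64>, x: Array2<f64>) -> Fit<Linear, f64> {
    let model = ModelBuilder::<Linear>::data(&y, &x)
        .build()
        .expect("model construction failed");
    // Currently this fails: the returned Fit would outlive `model`.
    model.fit().expect("fit failed")
}
```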

Track history of each iteration

The guess at each step of IRLS, as well as the regularized likelihood, are already held in the IrlsStep objects returned from the iteration. These are unused now, but a vector of them could be saved and stored in the Fit. This should probably be opt-in behavior.
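
A minimal sketch of the opt-in storage; field and option names are hypothetical:

```rust
use ndarray::Array1;

// The crate's IrlsStep already carries the guess and the regularized
// likelihood per the text above; this stub just stands in for it.
struct IrlsStep {
    guess: Array1<f64>,
    reg_likelihood: f64,
}

struct Fit {
    // ...existing fields...
    history: Option<Vec<IrlsStep>>, // Some(...) only when opted in
}
```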

Weighted/correlated observations

It should be relatively straightforward to modify the IRLS algorithm to account for varying weights between data points or even correlations between data points. There may be other ramifications to consider, e.g. for convergence and termination conditions.
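
For reference, the IRLS update is a weighted least-squares solve, so per-observation weights ω_i fold directly into the working weights, and correlated observations would replace the diagonal W with a full matrix:

```math
\beta^{(t+1)} = \left(X^\top W X\right)^{-1} X^\top W z,
\qquad
W_{ii} = \omega_i \,\frac{(\partial \mu_i / \partial \eta_i)^2}{\operatorname{Var}(y_i)}
```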

Fit module is not documented?

It seems that the Fit struct returned by the fit() method is not documented, for example. Indeed, the fit module it belongs to is not included in the list of modules in the documentation either. Is there any way you can export either the struct in the prelude or the module itself so that it's documented? It may not be possible to use the struct in type definitions either, for the same reason; certainly my autocomplete doesn't find it.
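
Either of these at the crate root would be a fix, assuming the module is currently private:

```rust
// In lib.rs:
pub mod fit;             // expose the whole module in the docs, or
pub use crate::fit::Fit; // re-export just the struct (e.g. in a prelude)
```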

Logistic regression standardized and studentized residuals are slightly different from those given in R

This is apparent in the logistic integration test log_termination_0, which now compares several fit statistics to those given by R. For the standardized/studentized residual equality checks to pass, the tolerance has to be increased to ~0.05, so they are approximately equal but not exact to within floating-point error.

I'm including these residual functions in 0.0.10 since they are close and should still be useful, but the source of the difference should be investigated further.
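
One place to start the investigation: R's rstandard() divides the residual by its estimated standard deviation using the hat-matrix leverage h_ii, so a small mismatch in either the dispersion estimate or the leverage computation would produce exactly this kind of near-but-not-exact agreement:

```math
r^{\text{std}}_i = \frac{y_i - \hat\mu_i}{\sqrt{\hat\phi\, V(\hat\mu_i)\,(1 - h_{ii})}}
```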

polars integration

polars seems to be emerging as the preeminent dataframe library with rust bindings. Abstracting the Dataset interface could allow a convenient way to let users provide a dataframe and column names instead of manually constructing arrays. This would mesh with other features like optional variable names.

This should probably be feature-gated. Polars itself has an ndarray feature that would need to be activated.
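
A rough sketch of the conversion boundary, with the caveat that the polars calls shown (`to_ndarray` behind its `ndarray` feature) are assumptions and have changed signatures across polars versions:

```rust
use ndarray::{Array1, Array2};
use polars::prelude::*;

// Hypothetical adapter: pull a response column and named predictor
// columns out of a DataFrame as the arrays the crate expects.
fn design_from_df(
    df: &DataFrame,
    y_col: &str,
    x_cols: &[&str],
) -> PolarsResult<(Array1<f64>, Array2<f64>)> {
    let y = df.column(y_col)?.f64()?.to_ndarray()?.to_owned();
    let x = df.select(x_cols.iter().copied())?.to_ndarray::<Float64Type>()?;
    Ok((y, x))
}
```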

Multi-dimensional sufficient statistics

I haven't found much in the way of documentation or examples of this, but it should be possible to, for instance, use separate simultaneous predictors for mu and log(sigma^2) in a Gaussian response.
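
Concretely, for a Gaussian response this would mean two simultaneous linear predictors, one per natural parameter:

```math
\mu_i = x_i^\top \beta, \qquad \log \sigma_i^2 = z_i^\top \gamma
```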

Make `utility` module public

Currently the docs advise using utility::one_pad if the data doesn't have an intercept column. Unfortunately this module is also private.
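
For context, the function in question just prepends an intercept column; a standalone equivalent, in case the module stays private for now (a sketch, not the crate's actual implementation):

```rust
use ndarray::{concatenate, Array2, Axis};

// Prepend a column of ones to the design matrix.
fn one_pad(x: &Array2<f64>) -> Array2<f64> {
    let ones = Array2::<f64>::ones((x.nrows(), 1));
    concatenate(Axis(1), &[ones.view(), x.view()]).unwrap()
}
```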

Cook's distance

This would be a useful diagnostic to determine how much influence each observation has over the fit.
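
For reference, the usual one-step approximation for GLMs, in terms of the Pearson residual r_i, leverage h_ii, parameter count p, and dispersion φ̂:

```math
D_i = \frac{r_i^2}{p\,\hat\phi} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
```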

Estimate the variance sigma2

It seems that OLS regression is what has been implemented, which makes the assumption of a fixed sigma squared: the scalar variance of the errors and also of y. How can this be extracted from the fit object? Is it the same as the dispersion() function? Calling this parameter "dispersion" is a bit unfamiliar to me.
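
If dispersion() follows the usual GLM convention, it is exactly this quantity: for the Gaussian family the dispersion parameter equals σ², estimated in the OLS case as

```math
\hat\sigma^2 = \hat\phi = \frac{1}{n - p}\sum_{i=1}^{n} (y_i - \hat\mu_i)^2
```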

Caching intermediate results in `Fit` interface

Some operations are performed repeatedly in the implementation of the Fit result. The covariance matrix is already cached manually, which matters most because it is probably an O(n^3) calculation. Other functions may be only linear in complexity but are called repeatedly and can involve relatively slow operations like logarithms.

Some potential caching targets:

  • the prediction of the fitted data points given the model
  • the response residuals
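
A caching sketch for targets like these, using std's OnceCell (stable since Rust 1.70; the once_cell crate offers the same on older toolchains); field names are illustrative:

```rust
use std::cell::OnceCell;
use ndarray::Array1;

struct FitCache {
    fitted: OnceCell<Array1<f64>>, // computed on first access, then reused
}

impl FitCache {
    fn fitted(&self, compute: impl FnOnce() -> Array1<f64>) -> &Array1<f64> {
        self.fitted.get_or_init(compute)
    }
}
```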

Data standardization could be internalized

The standardize utility function should probably be moved internally for at least two reasons.

  1. interface uniformity
  2. the ability to persist the transformation to be used downstream in Fit::predict() on external/test/validation data
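
A sketch of point 2: learn the column means and standard deviations once at build time and re-apply them to external data inside Fit::predict(); names are hypothetical:

```rust
use ndarray::{Array1, Array2, Axis};

// Hypothetical internal record of the standardization.
struct Standardization {
    means: Array1<f64>,
    std_devs: Array1<f64>,
}

impl Standardization {
    fn learn(x: &Array2<f64>) -> Self {
        Self {
            means: x.mean_axis(Axis(0)).expect("empty data"),
            std_devs: x.std_axis(Axis(0), 0.0),
        }
    }
    // Applied both at fit time and to external data in Fit::predict().
    fn apply(&self, x: &Array2<f64>) -> Array2<f64> {
        (x - &self.means) / &self.std_devs
    }
}
```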

Define macros for common/simple use cases

Basic use cases like ordinary least squares should be more ergonomic. Macros could be defined so that the user can get results in one short line without going through "building" a model and fitting it. Optional keyword parameters could also be used for configuration.
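
A sketch of the shape such a macro could take; the macro name and the builder calls it wraps are illustrative:

```rust
// Hypothetical one-liner for OLS; expands to the usual build-then-fit.
macro_rules! linear_fit {
    ($y:expr, $x:expr) => {
        ModelBuilder::<Linear>::data($y, $x).build()?.fit()?
    };
}

// Usage sketch (inside a function returning a Result):
// let fit = linear_fit!(&y, &x);
```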

De-couple fit-specific options from Model/ModelBuilder

One could imagine fitting a single model (i.e. data + GLM type) with different fitting options (e.g. regularization and max number of iterations). This will require changing the general interface and some thought should go into making this as ergonomic as possible.
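
One possible shape for the split, with all names hypothetical: the Model keeps only the data and GLM family, and a separate options struct is handed to the fit call:

```rust
// Hypothetical fit-time configuration, decoupled from the model.
struct FitOptions {
    max_iter: usize,
    l2_reg: f64,
}

// Usage sketch: one model, several fits with different options.
// let model = ModelBuilder::<Linear>::data(&y, &x).build()?;
// let plain = model.fit_with(FitOptions { max_iter: 25, l2_reg: 0.0 })?;
// let ridge = model.fit_with(FitOptions { max_iter: 25, l2_reg: 0.1 })?;
```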
