felix-clark / ndarray-glm
Rust library for linear, logistic, and generalized linear model regression
License: MIT License
This could mitigate numerical issues for bad fits and would put tolerances on a more similar scale across different families. It could be computed once and stored somewhere, perhaps in the IRLS struct itself.
The fit statistics will need to be re-evaluated for this change.
I note that there is no `residuals` field or method on the `Fit` object. I could call `fit.expectation(x)` (which, by the way, is not documented due to #18) and subtract `y`, but it would be more user-friendly to provide a method for this. In particular, it could offer the option to output standardized residuals, which is what I'm actually trying to calculate here.
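As a stopgap, the manual computation is straightforward. A minimal sketch of a standardized-residual helper in plain Rust (names hypothetical; this divides only by the square root of the dispersion, ignoring the variance function and leverage corrections that fully studentized residuals would need):

```rust
// Hypothetical helper: standardized residuals from observed and fitted values
// (e.g. the output of fit.expectation(x)), assuming a constant dispersion.
fn standardized_residuals(y: &[f64], fitted: &[f64], dispersion: f64) -> Vec<f64> {
    let scale = dispersion.sqrt();
    y.iter()
        .zip(fitted)
        .map(|(yi, mui)| (yi - mui) / scale)
        .collect()
}

fn main() {
    let y = [1.0, 2.0, 3.0];
    let fitted = [1.1, 1.9, 3.2];
    let r = standardized_residuals(&y, &fitted, 0.04);
    println!("{:?}", r);
}
```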
Consider exporting common functions in a `prelude` submodule rather than directly in the `ndarray-glm` namespace.
One issue to work out is whether re-exporting `ndarray::Array1` and the like would be a problem; some users may want to `use ndarray_glm::prelude::*;`, and this would probably lead to conflicts.
See `ndarray-linalg` for an example.
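A minimal sketch of what the facade could look like (module layout hypothetical; the ndarray re-export is left commented out pending the conflict question):

```rust
// Hypothetical crate layout: internal modules stay private, and the prelude
// re-exports the items most users need in one glob import.
mod fit {
    pub struct Fit;
}
mod model {
    pub struct ModelBuilder;
}
pub mod prelude {
    pub use crate::fit::Fit;
    pub use crate::model::ModelBuilder;
    // Possibly also: pub use ndarray::{Array1, Array2};
    // (omitted here to avoid clashing with `use ndarray::prelude::*`)
}

fn main() {
    use crate::prelude::*;
    let _fit = Fit;
    let _builder = ModelBuilder;
    println!("prelude imports resolve");
}
```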
Convergence should be shown over a wide range of potentially difficult datasets.
Ideally some canonical datasets with known results could be used.
A dataset should be found that exercises the step-halving logic.
Due to a bug in `ndarray-linalg`, the calculations of several fit statistics, including the covariance matrix, Fisher matrix, and Wald and score tests, have produced incorrect off-diagonal entries. The bug appears to affect the Hermitian inversion functions `invh()` and `invh_into()`, the latter of which was used frequently.
As a workaround, these calls should be changed to `inv_into()`.
Would the standard `log` interface be sufficient? `slog` at least should be investigated.
Any logging should be gated behind a feature flag so that it can be compiled out entirely in performance-critical applications. I'm unsure whether it should be enabled or disabled by default.
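A sketch of how the gating could work, assuming a hypothetical "logging" cargo feature; with the feature disabled, the cfg-attributed statement is removed at compile time, so there is zero runtime cost:

```rust
// Hypothetical logging macro: when the "logging" feature is off, the
// cfg-attributed statement (and its arguments) disappear entirely.
macro_rules! irls_trace {
    ($($arg:tt)*) => {{
        #[cfg(feature = "logging")]
        log::trace!($($arg)*);
    }};
}

// cfg! evaluates the feature flag to a compile-time boolean.
fn logging_enabled() -> bool {
    cfg!(feature = "logging")
}

fn main() {
    irls_trace!("step {}: regularized likelihood = {}", 3, -12.5);
    println!("logging enabled: {}", logging_enabled());
}
```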
Right now, L1 regularization is implemented with a smoothed V-curve potential. It would be better to use a true LASSO approach, which should reduce over-stepping and actually encourage parameters to stay exactly at zero instead of bouncing back and forth.
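For reference, the proximal (soft-thresholding) operator used by coordinate-descent and proximal-gradient LASSO solvers maps any coefficient within lambda of zero exactly to zero, unlike a smoothed penalty. A minimal sketch:

```rust
// Soft-thresholding operator: prox of lambda * |beta|.
// Shrinks beta toward zero by lambda, clamping to exactly zero inside
// the interval [-lambda, lambda].
fn soft_threshold(beta: f64, lambda: f64) -> f64 {
    beta.signum() * (beta.abs() - lambda).max(0.0)
}

fn main() {
    for &b in &[-1.5, -0.2, 0.0, 0.3, 2.0] {
        println!("prox({:5.2}) = {}", b, soft_threshold(b, 0.5));
    }
}
```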
I encountered an annoying case where I wanted to write a function that was essentially `fn(data: Array2<T>) -> Fit<Linear, T>`. Unfortunately, the `Fit` struct holds a reference to the `Model` used to create it, so it's basically impossible to return the `Fit` at all. The best one could do is use `owning_ref` to return both together, but that is certainly awkward. I could of course return just the model, but then I would have to call `fit()`, and therefore re-run the substantial processing involved, each time I want access to the `Fit`. Can you think of a way to make this use case a bit nicer? Maybe removing the `Model` reference from `Fit` and instead making it an argument to the functions that require it?
The guess at each step of IRLS, as well as the regularized likelihood, is already held in the `IrlsStep` objects returned from the iteration. These are unused now, but a vector of them could be saved and stored in the `Fit`. This should probably be opt-in behavior.
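A sketch of what opt-in collection could look like (names schematic):

```rust
// Schematic: the per-iteration guesses and regularized likelihoods are only
// collected when the caller asks for them; the default cost is an unused Option.
struct IrlsStep {
    guess: Vec<f64>,
    reg_likelihood: f64,
}

struct Fit {
    result: Vec<f64>,
    history: Option<Vec<IrlsStep>>,
}

fn run_irls(save_history: bool) -> Fit {
    let mut history = save_history.then(Vec::new);
    // Stand-in for the real iteration loop:
    for i in 0..2 {
        let step = IrlsStep {
            guess: vec![i as f64],
            reg_likelihood: -(10.0 - i as f64),
        };
        if let Some(h) = history.as_mut() {
            h.push(step);
        }
    }
    Fit { result: vec![1.0], history }
}

fn main() {
    let fit = run_irls(true);
    println!(
        "{} result entries, {} steps recorded",
        fit.result.len(),
        fit.history.map_or(0, |h| h.len())
    );
}
```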
It should be relatively straightforward to modify the IRLS algorithm to account for varying weights between data points or even correlations between data points. There may be other ramifications to consider, e.g. for convergence and termination conditions.
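For uncorrelated per-observation weights, the prior weights simply multiply the working weights in the normal equations; in the single-coefficient case the update reduces to a ratio of weighted sums. A standalone sketch of that reduced case:

```rust
// Weighted least-squares slope for one predictor with no intercept:
// beta = sum(w * x * z) / sum(w * x^2), where z is the working response.
fn weighted_slope(x: &[f64], z: &[f64], w: &[f64]) -> f64 {
    let num: f64 = x.iter().zip(z).zip(w).map(|((xi, zi), wi)| wi * xi * zi).sum();
    let den: f64 = x.iter().zip(w).map(|(xi, wi)| wi * xi * xi).sum();
    num / den
}

fn main() {
    // The third point is up-weighted, pulling the slope toward it.
    let x = [1.0, 2.0, 3.0];
    let z = [1.0, 2.5, 2.0];
    let w = [1.0, 1.0, 5.0];
    println!("slope = {}", weighted_slope(&x, &z, &w));
}
```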
The `Fit` struct returned by the `fit()` method is not documented, for example. Indeed, the `fit` module it belongs to is also missing from the list of modules in the documentation. Is there any way to export either the struct in the prelude or the module itself so that it gets documented? It may not be possible to use the struct in type definitions either, for the same reason; at the least, my autocomplete doesn't find it.
This is apparent in the logistic integration test `log_termination_0`, which now compares several fit statistics to those given by R. For the standardized/studentized residual equality checks to pass, the tolerance has to be increased to ~0.05, so they are approximately equal but not exact up to floating-point error.
I'm including these residual functions in 0.0.10 since they are close and should still be useful, but the source of the difference should be investigated further.
Polars seems to be emerging as the preeminent dataframe library with Rust bindings. Abstracting the `Dataset` interface could provide a convenient way to let users supply a dataframe and column names instead of manually constructing arrays. This would mesh well with other features like optional variable names.
This should probably be feature-gated. Polars itself has an `ndarray` feature that would need to be activated.
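One possible shape for the abstraction (trait and names hypothetical), which an array-backed implementation and a feature-gated Polars adapter could both satisfy:

```rust
// Hypothetical data-source trait: the model only needs a response column and
// named covariate columns, regardless of how they are stored.
trait DataSource {
    fn response(&self) -> Vec<f64>;
    fn covariate(&self, name: &str) -> Option<Vec<f64>>;
}

// Array-backed implementation standing in for the current interface.
struct ColumnData {
    y: Vec<f64>,
    columns: Vec<(String, Vec<f64>)>,
}

impl DataSource for ColumnData {
    fn response(&self) -> Vec<f64> {
        self.y.clone()
    }
    fn covariate(&self, name: &str) -> Option<Vec<f64>> {
        self.columns
            .iter()
            .find(|(n, _)| n == name)
            .map(|(_, col)| col.clone())
    }
}

fn main() {
    let data = ColumnData {
        y: vec![0.0, 1.0],
        columns: vec![("x1".to_string(), vec![1.0, 2.0])],
    };
    println!("y = {:?}, x1 = {:?}", data.response(), data.covariate("x1"));
}
```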
I haven't found much in the way of documentation or examples of this, but it should be possible to, for instance, use separate simultaneous predictors for mu and log(sigma^2) in a Gaussian response.
It would also be nice for this to be implemented.
Currently the docs advise using `utility::one_pad` if the data doesn't have an intercept column. Unfortunately, this module is also private.
This would be a useful diagnostic to determine how much influence each observation has over the fit.
It seems that OLS regression is what has been implemented, so it makes the assumption of a fixed sigma squared: the scalar variance of the errors and also of y. How can this be extracted from the fit object? Is it the same as the `dispersion()` function? Calling this parameter "dispersion" is a bit unfamiliar to me.
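For the Gaussian family, the dispersion parameter is exactly sigma^2, and its usual unbiased estimate divides the residual sum of squares by n - p. A standalone sketch of that estimate (helper name hypothetical):

```rust
// Unbiased estimate of sigma^2 in OLS: RSS / (n - p),
// where p counts all fitted parameters including the intercept.
fn dispersion_estimate(y: &[f64], fitted: &[f64], n_params: usize) -> f64 {
    let rss: f64 = y.iter().zip(fitted).map(|(yi, fi)| (yi - fi).powi(2)).sum();
    rss / (y.len() - n_params) as f64
}

fn main() {
    let y = [1.0, 2.0, 3.0, 4.0];
    let fitted = [1.5, 1.5, 3.5, 3.5];
    println!("sigma^2 estimate = {}", dispersion_estimate(&y, &fitted, 2));
}
```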
Some operations are performed repeatedly in the implementation of the `Fit` result. The covariance matrix is already cached manually, which is most important since it is probably an `O(n^3)` calculation. Other functions may be of only linear complexity but are called repeatedly and potentially involve relatively slow function calls like logarithms.
Some potential caching targets:
- `Fit::predict()` on external/test/validation data

The `standardize` utility function should probably be moved internally, for at least two reasons.

Basic use cases like ordinary least squares should be more ergonomic. Macros should be defined so that the user can get results in one short line and doesn't have to go through "building" a model and fitting it. Optional keyword parameters could also be used for configuration.
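A sketch of what a one-liner could look like (the macro and everything it calls are hypothetical; a closed-form simple regression stands in for the real build-and-fit chain):

```rust
// Hypothetical convenience macro: would expand to the model-building and
// fitting boilerplate behind the scenes.
macro_rules! ols {
    ($y:expr, $x:expr) => {
        fit_ols($y, $x)
    };
}

// Stand-in for the real fit: closed-form simple linear regression,
// returning (intercept, slope).
fn fit_ols(y: &[f64], x: &[f64]) -> (f64, f64) {
    let n = y.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let sxy: f64 = x.iter().zip(y).map(|(xi, yi)| (xi - mx) * (yi - my)).sum();
    let sxx: f64 = x.iter().map(|xi| (xi - mx).powi(2)).sum();
    let slope = sxy / sxx;
    (my - slope * mx, slope)
}

fn main() {
    let x = [0.0, 1.0, 2.0];
    let y = [1.0, 3.0, 5.0];
    let (intercept, slope) = ols!(&y, &x);
    println!("y = {} + {} x", intercept, slope);
}
```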
One could imagine fitting a single model (i.e. data + GLM type) with different fitting options (e.g. regularization and maximum number of iterations). This will require changing the general interface, and some thought should go into making it as ergonomic as possible.
A crate like `special-fun` would enable some calculations like CDFs and quantile residuals. Perhaps this should be gated behind an optional feature.