quantco / glum

High performance Python GLMs with all the features!

Home Page: https://glum.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 95.00% Shell 0.82% Cython 4.18%
glm logit poisson gamma tweedie lasso ridge elastic-net

glum's Introduction

glum


Generalized linear models (GLMs) are a core statistical tool that includes many common methods, such as least-squares regression, Poisson regression, and logistic regression, as special cases. At QuantCo, we have used GLMs in e-commerce pricing, insurance claims prediction, and more. We have developed glum, a fast, Python-first GLM library. The development was based on a fork of scikit-learn, so glum has a scikit-learn-like API. We are thankful for the starting point provided by Christian Lorentzen in that PR!

The goal of glum is to be at least as feature-complete as existing GLM libraries like glmnet or h2o. It supports

  • Built-in cross validation for optimal regularization, efficiently exploiting a “regularization path”
  • L1 regularization, which produces sparse and easily interpretable solutions
  • L2 regularization, including variable matrix-valued (Tikhonov) penalties, which are useful in modeling correlated effects
  • Elastic net regularization
  • Normal, Poisson, logistic, gamma, and Tweedie distributions, plus varied and customizable link functions
  • Box constraints, linear inequality constraints, sample weights, offsets

This repo also includes tools for benchmarking GLM implementations in the glum_benchmarks module. For details on the benchmarking, see here. Although the performance of glum relative to glmnet and h2o depends on the specific problem, we find that it is consistently much faster across a wide range of problems when N >> K (many more observations than predictors).

Performance benchmarks

For more information on glum, including tutorials and API reference, please see the documentation.

Why did we choose the name glum? We wanted a name that had the letters GLM and wasn't easily confused with any existing implementation. And we thought glum sounded like a funny name (and not glum at all!). If you need a more professional sounding name, feel free to pronounce it as G-L-um. Or maybe it stands for "Generalized linear... ummm... modeling?"

A classic example predicting housing prices

>>> from sklearn.datasets import fetch_openml
>>> from glum import GeneralizedLinearRegressor
>>>
>>> # This dataset contains house sale prices for King County, which includes
>>> # Seattle. It includes homes sold between May 2014 and May 2015.
>>> house_data = fetch_openml(name="house_sales", version=3, as_frame=True)
>>>
>>> # Use only select features
>>> X = house_data.data[
...     [
...         "bedrooms",
...         "bathrooms",
...         "sqft_living",
...         "floors",
...         "waterfront",
...         "view",
...         "condition",
...         "grade",
...         "yr_built",
...         "yr_renovated",
...     ]
... ].copy()
>>>
>>>
>>> # Model whether a house had an above or below median price via a Binomial
>>> # distribution. We'll be doing L1-regularized logistic regression.
>>> price = house_data.target
>>> y = (price < price.median()).values.astype(int)
>>> model = GeneralizedLinearRegressor(
...     family='binomial',
...     l1_ratio=1.0,
...     alpha=0.001
... )
>>>
>>> _ = model.fit(X=X, y=y)
>>>
>>> # get_formatted_diagnostics shows details about the steps taken by the iterative solver.
>>> diags = model.get_formatted_diagnostics(full_report=True)
>>> diags[['objective_fct']]
        objective_fct
n_iter               
0            0.693091
1            0.489500
2            0.449585
3            0.443681
4            0.443498
5            0.443497
>>>
>>> # Models can also be built with formulas from formulaic.
>>> model_formula = GeneralizedLinearRegressor(
...     family='binomial',
...     l1_ratio=1.0,
...     alpha=0.001,
...     formula="bedrooms + np.log(bathrooms + 1) + bs(sqft_living, 3) + C(waterfront)"
... )
>>> _ = model_formula.fit(X=house_data.data, y=y)

Installation

Please install the package through conda-forge:

conda install glum -c conda-forge

Performance

For optimal performance on an x86_64 architecture, we recommend using the MKL library (conda install mkl). By default, conda usually installs the OpenBLAS version, which is slower but supported on all major architectures and operating systems.

glum's People

Contributors

adityagoel4512, benbarrettqc, cbourjau, christianroerigqc, davidmeesterqc, dawidkopczyk, dependabot[bot], esantorella, jonashaag, jtilly, lbittarello, lorentzenchr, marcantoineschmidtqc, margueritebastaqc, martinstancsicsqc, matthiasschmidtblaicherqc, michaelyingh, mlondschien, nicholashoernleqc, oliverhinesqc, pavelzw, philippruchser, pwojtasz, quant-ranger[bot], stanmart, tbenthompson, xhochy


glum's Issues

Add real world data example

This basically entails increasing the number of rows and number of variables of the current practice data set so that it looks like the types of data sets we're working with.

I will create a function that takes as arguments:

  • the number of rows
  • the number of numeric features
  • the number of categorical features

We could also think about adding an option to add features with little to no informational value, though this might be the kind of thing we do in a separate toy example where we control the full DGP.
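A hypothetical sketch of such a generator (the function name, arguments, and the simple Poisson outcome are all invented here for illustration):

```python
import numpy as np
import pandas as pd

def make_practice_data(n_rows, n_numeric, n_categorical, n_levels=10, seed=0):
    """Hypothetical generator matching the proposed arguments:
    number of rows, numeric features, and categorical features."""
    rng = np.random.default_rng(seed)
    cols = {f"x{i}": rng.normal(size=n_rows) for i in range(n_numeric)}
    for j in range(n_categorical):
        levels = rng.integers(0, n_levels, size=n_rows).astype(str)
        cols[f"c{j}"] = pd.Categorical(levels)
    X = pd.DataFrame(cols)
    # Poisson-style outcome driven by the first numeric feature only.
    if n_numeric:
        y = rng.poisson(np.exp(0.1 * cols["x0"]))
    else:
        y = rng.poisson(1.0, size=n_rows)
    return X, y
```

The informational-value option mentioned above would just add columns that do not enter the outcome, which this sketch already does for every feature except x0.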

[Critical] Benchmarks on various data sets

Based on 091fe02

Machine: r5.4xlarge (16 vCPUs | 128 GB RAM)

I ran the sklearn_fork using

  • all available data sets
  • for 1M observations and 18M observations
  • dense vs sparse
  • on different numbers of threads
  • lasso and elnet

Results

                                                                    n_iter    runtime  intercept obj_val rel_obj_val
problem                      num_rows storage threads library                                                       
narrow_insurance_l2_poisson  1000000  dense   1       sklearn_fork       5     2.8577    -3.8547  0.3192           0
                                              2       sklearn_fork       5     2.1641    -3.8547  0.3192           0
                                              4       sklearn_fork       5     1.8084    -3.8547  0.3192           0
                                              8       sklearn_fork       5     1.6458    -3.8547  0.3192           0
                                              16      sklearn_fork       5     1.6295    -3.8547  0.3192           0
                                      sparse  1       sklearn_fork       5     2.4790    -3.8547  0.3192           0
                                              2       sklearn_fork       5     1.9025    -3.8547  0.3192           0
                                              4       sklearn_fork       5     1.5203    -3.8547  0.3192           0
                                              8       sklearn_fork       5     2.4176    -3.8547  0.3192           0
                                              16      sklearn_fork       5     2.4066    -3.8547  0.3192           0
                             18000000 dense   1       sklearn_fork       5    52.9698    -3.8305  0.3189           0
                                              2       sklearn_fork       5    40.7352    -3.8305  0.3189           0
                                              4       sklearn_fork       5    33.9417    -3.8305  0.3189           0
                                              8       sklearn_fork       5    30.6861    -3.8305  0.3189           0
                                              16      sklearn_fork       5    30.5071    -3.8305  0.3189           0
                                      sparse  1       sklearn_fork       5    53.0467    -3.8305  0.3189           0
                                              2       sklearn_fork       5    42.5917    -3.8305  0.3189           0
                                              4       sklearn_fork       5    36.1416    -3.8305  0.3189           0
                                              8       sklearn_fork       5    39.9059    -3.8305  0.3189           0
                                              16      sklearn_fork       5    46.8101    -3.8305  0.3189           0
narrow_insurance_net_poisson 1000000  dense   1       sklearn_fork       9     5.2903    -3.7820  0.3199           0
                                              2       sklearn_fork       9     4.0050    -3.7820  0.3199           0
                                              4       sklearn_fork       9     3.3740    -3.7820  0.3199           0
                                              8       sklearn_fork       9     3.0434    -3.7820  0.3199           0
                                              16      sklearn_fork       9     3.0169    -3.7820  0.3199           0
                                      sparse  1       sklearn_fork       9     4.4610    -3.7820  0.3199           0
                                              2       sklearn_fork       9     3.4356    -3.7820  0.3199           0
                                              4       sklearn_fork       9     2.6984    -3.7820  0.3199           0
                                              8       sklearn_fork       9     3.6768    -3.7820  0.3199           0
                                              16      sklearn_fork       9     4.3575    -3.7820  0.3199           0
                             18000000 dense   1       sklearn_fork       9    98.1811    -3.7678  0.3196           0
                                              2       sklearn_fork       9    74.8176    -3.7678  0.3196           0
                                              4       sklearn_fork       9    63.2101    -3.7678  0.3196           0
                                              8       sklearn_fork       9    57.3406    -3.7678  0.3196           0
                                              16      sklearn_fork       9    56.7188    -3.7678  0.3196           0
                                      sparse  1       sklearn_fork       9    98.1400    -3.7678  0.3196           0
                                              2       sklearn_fork       9    78.0241    -3.7678  0.3196           0
                                              4       sklearn_fork       9    65.7398    -3.7678  0.3196           0
                                              8       sklearn_fork       9    83.0182    -3.7678  0.3196           0
                                              16      sklearn_fork       9    85.0394    -3.7678  0.3196           0
real_insurance_l2_poisson    1000000  dense   1       sklearn_fork       5     5.1556    -3.3377  0.1601           0
                                              2       sklearn_fork       5     3.7674    -3.3377  0.1616           0
                                              4       sklearn_fork       5     3.0280    -3.3377  0.1612           0
                                              8       sklearn_fork       5     2.6941    -3.3377  0.1623           0
                                              16      sklearn_fork       5     2.8711    -3.3377  0.1612           0
                                      sparse  1       sklearn_fork       5    13.8171    -3.3377  0.1618           0
                                              2       sklearn_fork       5     8.8360    -3.3377  0.1619           0
                                              4       sklearn_fork       5     6.4723    -3.3377  0.1608           0
                                              8       sklearn_fork       5     5.5610    -3.3377    0.16           0
                                              16      sklearn_fork       5    10.6910    -3.3377  0.1617           0
                             18000000 dense   1       sklearn_fork       5    88.1266    -3.3665  0.1609           0
                                              2       sklearn_fork       5    61.3444    -3.3665  0.1608           0
                                              4       sklearn_fork       5    48.0059    -3.3665  0.1608           0
                                              8       sklearn_fork       5    41.3488    -3.3665  0.1608           0
                                              16      sklearn_fork       5    41.2265    -3.3665  0.1611           0
                                      sparse  1       sklearn_fork       5   265.7090    -3.3665  0.1608           0
                                              2       sklearn_fork       5   173.5621    -3.3665  0.1607           0
                                              4       sklearn_fork       5   124.4030    -3.3665  0.1607           0
                                              8       sklearn_fork       5   106.1407    -3.3665  0.1606           0
                                              16      sklearn_fork       5   192.5957    -3.3665   0.161           0
real_insurance_net_poisson   1000000  dense   1       sklearn_fork      10     9.7671    -3.3532  0.1612           0
                                              2       sklearn_fork      10     6.9492    -3.3532  0.1609           0
                                              4       sklearn_fork      10     5.5663    -3.3532  0.1614           0
                                              8       sklearn_fork      10     4.7798    -3.3532  0.1606           0
                                              16      sklearn_fork      10     4.8909    -3.3532  0.1603           0
                                      sparse  1       sklearn_fork      10    26.5831    -3.3532  0.1635           0
                                              2       sklearn_fork      10    16.5875    -3.3532  0.1629           0
                                              4       sklearn_fork      10    11.3910    -3.3532   0.162           0
                                              8       sklearn_fork      10     9.4460    -3.3532  0.1612           0
                                              16      sklearn_fork      10    19.7150    -3.3532  0.1608           0
                             18000000 dense   1       sklearn_fork      10   175.4268    -3.3576  0.1618           0
                                              2       sklearn_fork      10   121.9683    -3.3576  0.1618           0
                                              4       sklearn_fork      10    95.6292    -3.3576  0.1617           0
                                              8       sklearn_fork      10    82.3669    -3.3576  0.1617           0
                                              16      sklearn_fork      10    82.6210    -3.3576  0.1618           0
                                      sparse  1       sklearn_fork      10   511.6008    -3.3576  0.1618           0
                                              2       sklearn_fork      10   324.6799    -3.3576  0.1616           0
                                              4       sklearn_fork      10   227.5025    -3.3576  0.1615           0
                                              8       sklearn_fork      10   190.4679    -3.3576  0.1616           0
                                              16      sklearn_fork      10   359.1789    -3.3576  0.1617           0
wide_insurance_l2_poisson    1000000  dense   1       sklearn_fork      10    49.0238    -2.0280  0.1422           0
                                              2       sklearn_fork      10    30.6427    -2.0280  0.1422           0
                                              4       sklearn_fork      10    21.8896    -2.0280  0.1422           0
                                              8       sklearn_fork      10    17.2667    -2.0280  0.1422           0
                                              16      sklearn_fork      10    17.4586    -2.0280  0.1422           0
                                      sparse  1       sklearn_fork      10    11.1084    -2.0280  0.1422           0
                                              2       sklearn_fork      10     8.6160    -2.0280  0.1422           0
                                              4       sklearn_fork      10     7.4407    -2.0280  0.1422           0
                                              8       sklearn_fork      10     7.2905    -2.0280  0.1422           0
                                              16      sklearn_fork      10     7.5111    -2.0280  0.1422           0
                             18000000 dense   1       sklearn_fork      13  1171.3334    -2.1096  0.1403           0
                                              2       sklearn_fork      29  1546.3203    -2.1096  0.1403           0
                                              4       sklearn_fork      10   405.3979    -2.1096  0.1403           0
                                              8       sklearn_fork      10   322.3633    -2.1096  0.1403           0
                                              16      sklearn_fork      10   320.1045    -2.1096  0.1403           0
                                      sparse  1       sklearn_fork      10   241.5324    -2.1096  0.1403           0
                                              2       sklearn_fork      20   352.8254    -2.1096  0.1403           0
                                              4       sklearn_fork      20   307.4050    -2.1096  0.1403           0
                                              8       sklearn_fork      16   248.4476    -2.1096  0.1403           0
                                              16      sklearn_fork      16   218.3485    -2.1096  0.1403           0
wide_insurance_net_poisson   1000000  dense   1       sklearn_fork      13    66.1757    -2.2057  0.1426           0
                                              2       sklearn_fork      13    42.3533    -2.2057  0.1426           0
                                              4       sklearn_fork      13    30.3914    -2.2057  0.1426           0
                                              8       sklearn_fork      13    24.3199    -2.2057  0.1426           0
                                              16      sklearn_fork      13    24.6798    -2.2057  0.1426           0
                                      sparse  1       sklearn_fork      13    16.0367    -2.2057  0.1426           0
                                              2       sklearn_fork      13    12.3721    -2.2057  0.1426           0
                                              4       sklearn_fork      13    10.9711    -2.2057  0.1426           0
                                              8       sklearn_fork      13    10.9589    -2.2057  0.1426           0
                                              16      sklearn_fork      13    11.0781    -2.2057  0.1426           0
                             18000000 dense   1       sklearn_fork      15  1416.9511    -2.3355  0.1402           0
                                              2       sklearn_fork      15   895.1590    -2.3355  0.1402           0
                                              4       sklearn_fork      15   622.3026    -2.3355  0.1402           0
                                              8       sklearn_fork      15   501.4581    -2.3355  0.1402           0
                                              16      sklearn_fork      15   491.9906    -2.3355  0.1402           0
                                      sparse  1       sklearn_fork      15   378.4437    -2.3355  0.1402           0
                                              2       sklearn_fork      15   297.5474    -2.3355  0.1402           0
                                              4       sklearn_fork      15   261.0684    -2.3355  0.1402           0
                                              8       sklearn_fork      15   256.5559    -2.3355  0.1402           0
                                              16      sklearn_fork      15   225.0999    -2.3355  0.1402           0

Results as CSV (zipped)

Running benchmark suite CLI

I ran the benchmark suite as follows:

  • docker-compose run work
  • (In the Docker container) pip install -e .
  • glm_benchmarks_run

Is there any reason that pip install -e . is not part of the Dockerfile? It would be nice to run this in one command instead of 3. @tbenthompson

Docker container sometimes fails to build due to a Cython/gcc error.

But when I just run docker-compose build work, I get this:
"Installing collected packages: glm-benchmarks
Running setup.py develop for glm-benchmarks
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/app/setup.py'"'"'; file='"'"'/app/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps
cwd: /app/
Complete output (33 lines):
running develop
running egg_info
writing src/glm_benchmarks.egg-info/PKG-INFO
writing dependency_links to src/glm_benchmarks.egg-info/dependency_links.txt
writing entry points to src/glm_benchmarks.egg-info/entry_points.txt
writing top-level names to src/glm_benchmarks.egg-info/top_level.txt
writing manifest file 'src/glm_benchmarks.egg-info/SOURCES.txt'
running build_ext
building 'mkl_spblas' extension
creating build
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/src
creating build/temp.linux-x86_64-3.7/src/glm_benchmarks
creating build/temp.linux-x86_64-3.7/src/glm_benchmarks/spblas
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -I/opt/conda/include -I/opt/conda/include/python3.7m -c src/glm_benchmarks/spblas/mkl_spblas.c -o build/temp.linux-x86_64-3.7/src/glm_benchmarks/spblas/mkl_spblas.o
In file included from /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1832:0,
from /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from src/glm_benchmarks/spblas/mkl_spblas.c:628:
/opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it with "
^~~~~~~
In file included from /opt/conda/include/mkl.h:34:0,
from src/glm_benchmarks/spblas/mkl_spblas.c:630:
/opt/conda/include/mkl_lapacke.h:16957:1: warning: function declaration isn't a prototype [-Wstrict-prototypes]
int LAPACKE_get_nancheck( );
^~~
creating build/lib.linux-x86_64-3.7
gcc -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/src/glm_benchmarks/spblas/mkl_spblas.o -L -L/opt/conda/include -lmkl_rt -o build/lib.linux-x86_64-3.7/mkl_spblas.cpython-37m-x86_64-linux-gnu.so
/opt/conda/compiler_compat/ld: cannot find /lib/libpthread.so.0
/opt/conda/compiler_compat/ld: cannot find /usr/lib/libpthread_nonshared.a
collect2: error: ld returned 1 exit status
error: command 'gcc' failed with exit status 1

ERROR: Command errored out with exit status 1: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/app/setup.py'"'"'; file='"'"'/app/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
ERROR: Service 'work' failed to build: The command '/bin/sh -c pip install --no-use-pep517 --disable-pip-version-check -e .' returned a non-zero code: 1"
the main issue seems to be
"gcc -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/src/glm_benchmarks/spblas/mkl_spblas.o -L -L/opt/conda/include -lmkl_rt -o build/lib.linux-x86_64-3.7/mkl_spblas.cpython-37m-x86_64-linux-gnu.so
/opt/conda/compiler_compat/ld: cannot find /lib/libpthread.so.0
/opt/conda/compiler_compat/ld: cannot find /usr/lib/libpthread_nonshared.a"

[Critical] Improve performance for sparse matrices

Reported by @jtilly
"I'm having some difficulties getting good results using real data (on 1 million rows for now). The script and corresponding log file that I'm using in our infrastructure are here:
https://gist.github.com/jtilly/d2ff9b7bd6c690a35db052d1730e0a06
I'm comparing the sklearn_fork vs glmnet_python. Performance doesn't look that great for the sklearn_fork implementations once we add l1 penalties or make things sparse. I'm also having difficulties aligning coefficients (and predictions) between the sklearn fork and glmnet.
I'm not sure what the best way to debug the problem is. I also integrated the real data set into the glm_benchmarks package (see #47).
If you get the chance, could you take a quick look at what I implemented to make sure I didn't screw things up anywhere? Also, any insights why glmnet and sklearn results don't align?
I'm also not an expert user of the glm_benchmarks package (yet), so if you have any ideas how to pull debug information out of it, please let me know"

[Major] Include offsets

In yesterday's call I said that offsets are unnecessary because we can achieve the same effect by re-weighting. It turns out that's not quite true when the algorithm standardizes the data internally. For the standardization, we only want exposure to show up in the means, not the offset.

Background: We have y, exposure, and offset. We then set
weight = exposure * exp(offset) and y_adj = y / (exposure * exp(offset)) and fit the model with y_adj and weight. For the purpose of standardization, we would only want exposure to be taken into account when computing means and standard deviations. This is easy to accomplish when we standardize in a pre-processing step.

Note that one trivial fix for this is to allow for different weights for optimization and standardization :)
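The re-weighting trick and the exposure-only standardization described above can be sketched in numpy (toy synthetic data; this is not glum's internal implementation):

```python
import numpy as np

# Toy Poisson data with an exposure and a known offset (all values synthetic).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
exposure = rng.uniform(0.5, 2.0, size=n)
offset = rng.normal(scale=0.1, size=n)
y = rng.poisson(exposure * np.exp(offset + 0.2 * X[:, 0]))

# The re-weighting described above: fold exposure and offset into a weight
# and an adjusted response, then fit the model with (y_adj, weight).
weight = exposure * np.exp(offset)
y_adj = y / weight

# For standardization, only exposure should enter the means and standard
# deviations, so use exposure-only weights in a pre-processing step.
w_std = exposure / exposure.sum()
mean = X.T @ w_std
std = np.sqrt((X ** 2).T @ w_std - mean ** 2)
X_std = (X - mean) / std
```

This is exactly the "different weights for optimization and standardization" fix: weight drives the objective, w_std drives the scaling.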

Is sklearn_fork handling tweedie correctly?

if isinstance(self._family_instance, TweedieDistribution):
    if self._family_instance.power <= 0:
        self._link_instance = IdentityLink()
    if self._family_instance.power >= 1:
        self._link_instance = LogLink()

This looks wrong to me, based on this: https://stats.stackexchange.com/questions/137227/what-is-the-canonical-link-function-for-a-tweedie-glm
power == 1 should be the special case of the log link, and otherwise there should be something like a TweedieLink equal to mu ** (1 - p) / (1 - p), which doesn't currently exist.
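For illustration, a hypothetical TweedieLink matching that formula (this class does not exist in the code base; the name and API are invented here):

```python
import numpy as np

class TweedieLink:
    """Hypothetical canonical Tweedie link: eta = mu**(1 - p) / (1 - p).

    p = 1 is excluded because it should be the log-link special case.
    """

    def __init__(self, p):
        assert p != 1, "p = 1 is the log-link special case"
        self.p = p

    def link(self, mu):
        return np.power(mu, 1 - self.p) / (1 - self.p)

    def inverse(self, eta):
        # Invert eta = mu**(1 - p) / (1 - p) for mu.
        return np.power((1 - self.p) * eta, 1 / (1 - self.p))
```

For p = 1.5 (the benchmark case), link and inverse round-trip any positive mu.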

@MarcAntoineSchmidtQC : Could you add a test for whether we're getting the right answer in the tweedie-1.5 benchmark? (Regarding Issue #43 )

Inability to run all benchmarks with a small number of rows

I would like to be able to run through benchmarks on a very small number of rows in order to quickly make sure that everything runs. For example, after merging, or after making a change to a data creation script, I would like to be able to run the whole benchmark suite.

Currently, everything works with 800 rows, and the whole benchmark suite takes 17s to run on my laptop with 800 rows, so I think this is okay for now, but let's keep an eye on it.

glmnet_python

With <=650 rows, glmnet_python gives the mysterious error ValueError: max() arg is an empty sequence. It works with 800 rows.

data creation

data.create.generate_simple_insurance_dataset attempts to drop categories that don't exist with n_rows <= 11, raising a ValueError. This does not seem like a serious problem since 12 is a small number.

Steps to reproduce: generate_simple_insurance_dataset(11)

Suggested fixes:

  • Raise an error if the user inputs fewer than 12 rows, and ask them to put in at least 12.
  • Or fix this so it is possible to generate a very small test set.

Different libraries may deal with constants in the log-likelihood differently

Log-likelihoods sometimes have some ugly constants, like pi for the normal distribution or a factorial for Poisson. Libraries may reasonably choose to omit these constants. This presents a problem if they make different decisions about how to treat such constants, since omitting a constant will change the strength of the regularization, leading to different optimal solutions.

Two possible approaches:

  • Suggested by @tbenthompson: Fit cross-validated models along a regularization path, and report the one with the lowest cross-validated error. This will be slow, but sidesteps the issue.
  • Figure out how the libraries are dealing with constants and correct for it.
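A toy numpy illustration of the scaling side of this problem: whether a library sums or averages the log-likelihood changes the effective penalty strength unless alpha is rescaled by N. Everything here is synthetic; the 1-D L1-penalized Poisson problem is solved by brute-force grid search.

```python
import numpy as np

# 1-D Poisson regression with an L1 penalty, solved by grid search.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.poisson(np.exp(0.3 * x))
n = len(y)

betas = np.linspace(-1, 1, 2001)
eta = np.outer(x, betas)                     # shape (n, n_betas)
# Poisson NLL with the log(y!) constant dropped; an additive constant
# would not move the argmin, but a multiplicative rescaling does.
nll_sum = (np.exp(eta) - y[:, None] * eta).sum(axis=0)

alpha = 5.0
b_sum = betas[np.argmin(nll_sum + alpha * np.abs(betas))]
# Averaging the NLL instead of summing makes the same alpha much stronger:
b_mean = betas[np.argmin(nll_sum / n + alpha * np.abs(betas))]
# ... unless alpha is rescaled by N, which recovers the summed solution:
b_mean_rescaled = betas[np.argmin(nll_sum / n + (alpha / n) * np.abs(betas))]
```

In this setup, b_mean collapses to zero while b_sum does not, and rescaling alpha by N makes the two formulations agree, which is why aligning objectives across libraries matters before comparing solutions.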

[Medium] Implement active-set iteration within CD.

This could be valuable in any cases where the CD inner loop is expensive. See #81 for an example of a problem where this is the case.

This could happen in tandem with a Cython-ization of the _cd_cycle CD inner loop function.

[Minor] sklearn-fork Newton-CG optimizer calls deprecated sklearn.utils.optimize.newton_cg

Our tests raise deprecation warnings because the Newton-CG optimizer calls the deprecated function sklearn.utils.optimize.newton_cg.

/opt/conda/lib/python3.7/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function newton_cg is deprecated; newton_cg is deprecated in version 0.22 and will be removed in version 0.24.

I specified scikit-learn==0.22.1 in the requirements, which should take care of this for now, but this will be an issue if we ever aspire to merge this back into the main sklearn. We could also just remove the newton-cg solver (as of right now, I believe it will only be used if the user specifies it).

[Major] Fix up the dense sandwich code.

This is for the code in src/glm_benchmarks/sandwich/dense-tmpl.c, currently in PR #84

  • Use C to fill in lower triangle, not Numpy
  • Better error handling for matrices passed as row-major rather than column-major
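For reference, the "sandwich" product this code computes is X.T @ diag(d) @ X. A numpy sketch of the computation and of mirroring the upper triangle into the lower one (the real kernel is the C code in dense-tmpl.c; this is only a reference version):

```python
import numpy as np

def sandwich_ref(X, d):
    # Straightforward reference: X.T @ diag(d) @ X without forming diag(d).
    return (X * d[:, None]).T @ X

def sandwich_mirror(X, d):
    # Compute only the upper triangle, then fill the lower triangle by
    # symmetry, as a C kernel might (here done in Python for clarity).
    k = X.shape[1]
    out = np.zeros((k, k))
    Xd = X * d[:, None]
    for i in range(k):
        for j in range(i, k):
            out[i, j] = Xd[:, i] @ X[:, j]
            out[j, i] = out[i, j]   # mirror into the lower triangle
    return out
```

The result is symmetric, so computing one triangle and mirroring halves the dot products; doing the fill in C rather than numpy avoids an extra pass over the output.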

Objective function consistency

Copying from a merged PR, Liz says:
@tbenthompson making sure you saw this: If you multiply alpha by N, does h2o reproduce the results from the other libraries?

And can we set up the objective function in the benchmarks so that all of the libraries are solving the same problem?

[Medium] Poor convergence with Tweedie

from glm_benchmarks_run --library_names sklearn_fork --num_rows 10000
From d4e28bef4e19893d28d68c2fe4f632e3f007c023
Here are some of the worse-converging benchmarks; the others converge OK:



running problem=wide_insurance_net_tweedie_p=1.5 library=sklearn_fork
/app/src/glm_benchmarks/sklearn_fork/_glm.py:893: RuntimeWarning: divide by zero encountered in power
  + np.power(mu, 2 - p) / (2 - p)
/app/src/glm_benchmarks/sklearn_fork/_glm.py:893: RuntimeWarning: invalid value encountered in multiply
  + np.power(mu, 2 - p) / (2 - p)
/app/src/glm_benchmarks/sklearn_fork/_glm.py:1611: ConvergenceWarning: Coordinate descent failed to converge. Increase the maximum number of iterations max_iter (currently 5)
  ConvergenceWarning,
diagnostics:
            inner_tol  n_cycles   runtime  intercept
n_iter                                              
1       257572.645999         1  0.040235  -0.778006
2        21563.398660       505  1.195232  -0.369897
3        12514.327561       506  0.017142  -0.314297
4         9289.710857       507  0.017300  -0.310320
5         5260.738162       508  0.016049  -0.330114
ran
running problem=wide_insurance_no_weights_net_tweedie_p=1.5 library=sklearn_fork
/app/src/glm_benchmarks/sklearn_fork/_glm.py:893: RuntimeWarning: divide by zero encountered in power
  + np.power(mu, 2 - p) / (2 - p)
/app/src/glm_benchmarks/sklearn_fork/_glm.py:893: RuntimeWarning: invalid value encountered in multiply
  + np.power(mu, 2 - p) / (2 - p)
/app/src/glm_benchmarks/sklearn_fork/_glm.py:1611: ConvergenceWarning: Coordinate descent failed to converge. Increase the maximum number of iterations max_iter (currently 5)
  ConvergenceWarning,
diagnostics:
            inner_tol  n_cycles   runtime  intercept
n_iter                                              
1       355586.522993         1  0.042016  -0.799277
2       404672.542442      1001  2.329688  -0.197018
3       443857.467067      1002  0.034167  -0.207599
4       343211.298633      1003  0.031101  -0.226024
5       265101.096082      1004  0.018249  -0.244082


running problem=wide_insurance_no_weights_lasso_tweedie_p=1.5 library=sklearn_fork
/app/src/glm_benchmarks/sklearn_fork/_glm.py:893: RuntimeWarning: divide by zero encountered in power
  + np.power(mu, 2 - p) / (2 - p)
/app/src/glm_benchmarks/sklearn_fork/_glm.py:893: RuntimeWarning: invalid value encountered in multiply
  + np.power(mu, 2 - p) / (2 - p)
/app/src/glm_benchmarks/sklearn_fork/_glm.py:1611: ConvergenceWarning: Coordinate descent failed to converge. Increase the maximum number of iterations max_iter (currently 5)
  ConvergenceWarning,
diagnostics:
            inner_tol  n_cycles   runtime  intercept
n_iter                                              
1       355498.458047         1  0.043208  -0.799212
2       397118.520436       138  0.367230   0.113342
3       409637.535767       139  0.025405   0.091688
4       322470.142868       140  0.022713   0.100142
5       252181.825569       141  0.021358   0.099256

Standardization breaks on small data sets

With real data, things work fine with, say, 5000 rows.

~/code/glm_benchmarks (real-data) $ glm_benchmarks_run --problem_names real_dense_insurance_net_poisson --library_names sklearn_fork --num_rows 5000
running problem=real_dense_insurance_net_poisson library=sklearn_fork
0.2835308954545516 0 0
0.09067153652489054 1 3
0.026592483094850235 2 6
0.006140473537118396 3 10
0.0019797265695442007 4 16
0.0003182693342492612 5 24
3.1550463187432e-05 6 38
3.0357792371484787e-06 7 54
ran

With 2000 rows, I get the following error in both the dense and the sparse case:

dense

(sd-pricing-r) root@83bd37895502
~/code/glm_benchmarks (real-data) $ glm_benchmarks_run --problem_names real_dense_insurance_net_poisson --library_names sklearn_fork --num_rows 2000
running problem=real_dense_insurance_net_poisson library=sklearn_fork
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:201: RuntimeWarning: invalid value encountered in true_divide
  X /= col_stds
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:205: RuntimeWarning: divide by zero encountered in true_divide
  P1 /= col_stds
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:210: RuntimeWarning: divide by zero encountered in true_divide
  P2 = P2 / (col_stds ** 2)
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:177: RuntimeWarning: invalid value encountered in multiply
  grad_wP2 += coef[idx:] * P2
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:183: RuntimeWarning: invalid value encountered in multiply
  grad_wP2 + np.sign(coef[idx:]) * P1,
Traceback (most recent call last):
  File "/opt/conda/envs/sd-pricing-r/bin/glm_benchmarks_run", line 11, in <module>
    load_entry_point('glm-benchmarks', 'console_scripts', 'glm_benchmarks_run')()
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/main.py", line 80, in cli_run
    result = execute_problem_library(P, L, num_rows)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/main.py", line 91, in execute_problem_library
    result = L(dat, P.distribution, P.regularization_strength, P.l1_ratio)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/bench_sklearn_fork.py", line 49, in sklearn_fork_bench
    result["runtime"], m = runtime(build_and_fit, model_args, fit_args)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/util.py", line 10, in runtime
    out = f(*args, **kwargs)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/bench_sklearn_fork.py", line 14, in build_and_fit
    return GeneralizedLinearRegressor(**model_args).fit(**fit_args)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py", line 2295, in fit
    inner_tol = 4 * linalg.norm(inner_tol, ord=1)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/scipy/linalg/misc.py", line 142, in norm
    a = np.asarray_chkfinite(a)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/numpy/lib/function_base.py", line 498, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

sparse

(sd-pricing-r) root@83bd37895502
~/code/glm_benchmarks (real-data) $ glm_benchmarks_run --problem_names real_sparse_insurance_net_poisson --library_names sklearn_fork --num_rows 2000
running problem=real_sparse_insurance_net_poisson library=sklearn_fork
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/scaled_spmat/standardize.py:54: RuntimeWarning: divide by zero encountered in true_divide
  return centered_mat @ sps.diags(1 / st_devs), means, st_devs
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:205: RuntimeWarning: divide by zero encountered in true_divide
  P1 /= col_stds
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:207: RuntimeWarning: divide by zero encountered in true_divide
  inv_col_stds_mat = sparse.diags(1.0 / col_stds)
/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py:183: RuntimeWarning: invalid value encountered in multiply
  grad_wP2 + np.sign(coef[idx:]) * P1,
Traceback (most recent call last):
  File "/opt/conda/envs/sd-pricing-r/bin/glm_benchmarks_run", line 11, in <module>
    load_entry_point('glm-benchmarks', 'console_scripts', 'glm_benchmarks_run')()
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/main.py", line 80, in cli_run
    result = execute_problem_library(P, L, num_rows)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/main.py", line 91, in execute_problem_library
    result = L(dat, P.distribution, P.regularization_strength, P.l1_ratio)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/bench_sklearn_fork.py", line 49, in sklearn_fork_bench
    result["runtime"], m = runtime(build_and_fit, model_args, fit_args)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/util.py", line 10, in runtime
    out = f(*args, **kwargs)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/bench_sklearn_fork.py", line 14, in build_and_fit
    return GeneralizedLinearRegressor(**model_args).fit(**fit_args)
  File "/home/X916841/code/glm_benchmarks/src/glm_benchmarks/sklearn_fork/_glm.py", line 2295, in fit
    inner_tol = 4 * linalg.norm(inner_tol, ord=1)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/scipy/linalg/misc.py", line 142, in norm
    a = np.asarray_chkfinite(a)
  File "/opt/conda/envs/sd-pricing-r/lib/python3.8/site-packages/numpy/lib/function_base.py", line 498, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

I couldn't reproduce the issue with our other data sets.

I haven't dug deeper than this. When I do, I'll update this issue.

[minor] Implement better line search algorithm.

Copying a Slack message from me to Liz:

Ben Thompson 4:05 PM
as a general topic for algorithmic thought if you're interested...
4:05
can we do more work per IRLS iteration?
4:06
90% of the cost is in the quadratic approximation even after our optimizations.
4:06
if there's some way that we can do a bit more work per quadratic appx, that could be interesting/useful
4:07
maybe not possible... but interesting to think about
4:07
better line search? maybe exploring just a bit more than just a line?

[Medium] Bound constraints in sklearn

  • Put bound constraints into API and optimizer
  • Modify gradient-based stopping criteria for products at the boundary (probably by hacking subgradients)
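One way to read "hacking subgradients": at an active bound, the KKT conditions only forbid gradient components that point into the feasible box, so the stopping criterion should measure the norm of a projected gradient. A hedged sketch (names are made up, not the fork's API):

```python
import numpy as np

def projected_gradient_norm(grad, coef, lower, upper, tol=1e-12):
    """Stopping criterion under box constraints: zero out gradient
    components that would push an active-bound coefficient out of
    the box; the remaining norm measures true non-optimality."""
    g = np.asarray(grad, dtype=float).copy()
    at_lower = coef <= lower + tol
    at_upper = coef >= upper - tol
    # At a lower bound, only a negative gradient signals non-optimality;
    # at an upper bound, only a positive one does.
    g[at_lower & (g > 0)] = 0.0
    g[at_upper & (g < 0)] = 0.0
    return np.linalg.norm(g, ord=1)
```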

[Major] Replicate sparse performance penalty in public data

Using the real data, the sparse case is 5 times slower than the dense case (same data, just different storage). Let's make sure we can replicate that using the public data sets in this repo. If we can't, we need to come up with a different data set.

MKL from conda-forge

It seems that in general we get better matrix performance when using MKL as the BLAS library. We therefore currently prefer to pull numpy from defaults, but from a packaging perspective it would be much nicer to pull all packages from conda-forge.

To use MKL in conda-forge packages the manual https://conda-forge.org/docs/maintainer/knowledge_base.html#blas tells us to put libblas=*=*mkl into our requirements.

I made some measurements with the Intel-provided benchmark on my quad core Macbook. Results on Xeon CPUs like in AWS are probably even better for MKL. I've also included user time and wall time to show that more than a single CPU core is used.

import numpy as np
import time

N = 6000
M = 10000

k_list = [64, 80, 96, 104, 112, 120, 128, 144, 160, 176, 192, 200, 208, 224, 240, 256, 384]

def get_gflops(M, N, K):
    return M * N * (2.0 * K - 1.0) / 1000 ** 3

%%time

for K in k_list:
    A = np.random.random((M, N))  # float64, C-order by default
    B = np.random.random((N, K))

    C = A @ B  # warm-up run, not timed

    start = time.time()
    for _ in range(5):
        C = A @ B
    end = time.time()

    tm = (end - start) / 5.0
    print('{0:4}, {1:9.7}, {2:9.7}'.format(K, tm, get_gflops(M, N, K) / tm))

Output for OpenBLAS:

  64,  0.105501,  72.22678
  80, 0.1245156,  76.61692
  96, 0.1234688,  92.81699
 104, 0.1445188,  85.94037
 112, 0.1439478,  92.95037
 120, 0.1685714,  85.06782
 128, 0.1491372,  102.5901
 144, 0.1855208,  92.81976
 160, 0.1702018,  112.4548
 176, 0.1998742,  105.3663
 192, 0.1956518,  117.4535
 200,  0.225059,  106.3721
 208, 0.2254568,  110.4424
 224, 0.2224828,  120.5486
 240, 0.2488988,  115.4686
 256, 0.2492487,  123.0096
 384, 0.3535572,  130.1628
CPU times: user 2min 57s, sys: 4.8 s, total: 3min 2s
Wall time: 32.4 s

Output for MKL:

  64, 0.06001639,  126.9653
  80, 0.07444038,  128.1562
  96, 0.08233962,  139.1797
 104, 0.09056726,  137.1357
 112, 0.09994278,  133.8766
 120, 0.1016876,  141.0201
 128, 0.1139066,  134.3205
 144, 0.1162876,  148.0811
 160, 0.1306622,  146.4846
 176, 0.1408976,  149.4703
 192, 0.1543494,   148.883
 200, 0.1684494,  142.1198
 208, 0.1721786,  144.6172
 224, 0.1878748,  142.7546
 240, 0.1929878,  148.9213
 256, 0.2067366,  148.3046
 384, 0.3119772,  147.5108
CPU times: user 1min 15s, sys: 2.66 s, total: 1min 18s
Wall time: 25.5 s

Calculate and store objective function value for analysis

We should have a function that takes parameters 'dat', 'distribution', 'alpha', and 'l1_ratio' (as the benchmarking functions like glmnet_python_bench do), and also an 'intercept' and 'coefs', and returns the objective function value. This could then be used to figure out whether different libraries are converging similarly well. I'm not sure whether we can take this function from one of the libraries or need to write it ourselves.
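A minimal sketch of what such a function could look like for the Poisson case with a log link (the 'dat' dict layout and the function name are assumptions here, and distribution-specific constants are dropped):

```python
import numpy as np

def poisson_objective(dat, alpha, l1_ratio, intercept, coefs):
    """Elastic-net penalized mean Poisson negative log-likelihood
    (log link, constants dropped). 'dat' is assumed to be a dict
    holding the design matrix 'X' and response 'y'."""
    X, y = dat["X"], dat["y"]
    eta = intercept + X @ coefs
    mu = np.exp(eta)
    # Per-observation negative Poisson log-likelihood, up to constants.
    loss = np.mean(mu - y * eta)
    penalty = alpha * (
        l1_ratio * np.sum(np.abs(coefs))
        + 0.5 * (1 - l1_ratio) * np.sum(coefs ** 2)
    )
    return loss + penalty
```

Evaluating this at each library's fitted (intercept, coefs) would make "which library converged better" a direct comparison, provided all libraries are put on the same alpha scaling first.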

[Minor] Fixing compiler flags

  1. Check if -march=haswell vs -march=skylake makes a difference.
  2. Add a check in setup.py: when -march is already set and isn't nocona, don't additionally add -march=native. That way setup.py defaults to -march=native (fast by default) but remains configurable.
  3. Investigate whether there are any global changes to the shared library when we compile with -fno-math-errno (via -ffast-math). More generally, are there any problems caused by using -ffast-math? Maybe if -fno-math-errno is safe enough for RedHat glibc, it's safe enough for this? Haha. https://bugzilla.redhat.com/show_bug.cgi?id=1664408

[Major] Comprehensive tests -- golden master and correctness

  • Every feature in sklearn should be used in some golden master test, so we can make sure we aren't breaking anything, especially for less-used features like warm starts
  • We should also confirm the correctness of sklearn solutions, perhaps by perturbing parameter values at the estimated optimum. This could be slow so we might not want to build it into the test suite. This is especially needed for gamma and Tweedie with regularization. Jan is confident in these since they have been tested against other software, but not with regularization.

[Minor] glmnet_python gives a mysterious error

I ran glm_benchmarks_run --num_rows 700 --library_names glmnet_python --problem_names sparse_insurance_lasso_poisson and got an error from within glmnet_python:

  File "/opt/conda/lib/python3.7/site-packages/glmnet_python/glmnet.py", line 471, in glmnet
    thresh, isd, intr, maxit, family)
  File "/opt/conda/lib/python3.7/site-packages/glmnet_python/fishnet.py", line 176, in fishnet
    ninmax = max(nin)
ValueError: max() arg is an empty sequence

Tensorflow diverges

Tensorflow's fit_sparse diverges. Since we are solving a convex problem, and since other optimizers converge on the same problem, this is not a great sign. Some approaches to fix this might be:

  • Use a lower learning_rate parameter.
  • Use fit instead of fit_sparse where applicable (only for dense inputs and L2 penalty). This seems to work much better.
  • Use fit_sparse_one_step and stop when the objective function value starts increasing.

Number of rows of test problem should be stored in benchmark results

The user has the option of specifying the number of rows of test data to use, but this is not stored, making it easy to accidentally try to compare benchmarks run on different sizes of data. 'n_rows' should be treated the same way as the 'library' and 'problem' parameters when storing results, as its own subfolder in the benchmark_results directory.
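One possible layout, sketched here with illustrative folder names (not the repo's current ones): make the row count its own directory level, so results from runs on different data sizes can never collide or be compared by accident.

```python
from pathlib import Path

def result_path(root, problem, library, num_rows):
    """Benchmark results keyed by problem, library, and row count.
    The folder naming scheme is an assumption for illustration."""
    n = "full" if num_rows is None else str(num_rows)
    return Path(root) / problem / library / f"n_rows_{n}" / "result.pkl"
```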

Standardization

We should think about whether we want to standardize features for our benchmarks (or not). For any serious use case, we should standardize features, but this is less trivial than just adding a StandardScaler to our data processing, because scaling one-hot encoded features would destroy all sparsity in the original data. That's why glmnet takes care of standardization internally.

Options:

  1. do not standardize
  2. standardize everything (with its drawbacks for sparse matrices)
  3. only standardize numerical features, leave one-hot encoded features untouched (this is what h2o does)
  4. only standardize numerical features, leave one-hot encoded features untouched, but adjust lasso weights (if possible) to give us the same behavior as in 2 (this should work in the sklearn fork implementation).

I would lean towards 2. or 4.
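Option 4 rests on the identity that dividing each column of X by its standard deviation is equivalent to multiplying that feature's L1 penalty weight by the same standard deviation. A sketch checking that the two objectives agree (centering is ignored here for simplicity):

```python
import numpy as np

def obj_standardized(b, Xs, y, alpha):
    """Lasso objective on X with columns divided by their stds."""
    return 0.5 * np.sum((y - Xs @ b) ** 2) + alpha * np.sum(np.abs(b))

def obj_weighted(c, X, y, alpha, stds):
    """Same objective on the raw X, with per-feature penalty weights
    equal to the column stds (option 4). Identical at b = stds * c,
    since Xs @ (stds * c) == X @ c and |stds * c| == stds * |c|."""
    return 0.5 * np.sum((y - X @ c) ** 2) + alpha * np.sum(stds * np.abs(c))
```

This is the mechanism by which the sklearn fork could standardize "virtually", leaving sparse one-hot columns untouched while still solving the standardized problem.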

[Medium] Bug: DenseGLMDataMatrix does not support integer-valued data

In [1]: import numpy as np                                                                                           

In [2]: from glm_benchmarks.sklearn_fork.dense_glm_matrix import DenseGLMDataMatrix                                  

In [3]: arr = DenseGLMDataMatrix(np.eye(2, dtype=int))                                                               

In [4]: arr.standardize(weights=[1, 1], scale_predictors=True)                                                       
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-4-5dbea128068a> in <module>
----> 1 arr.standardize(weights=[1, 1], scale_predictors=True)

/app/src/glm_benchmarks/sklearn_fork/dense_glm_matrix.py in standardize(self, weights, scale_predictors)
     45             # TODO: avoid copying X -- the X ** 2 makes a copy
     46             col_stds = np.sqrt((self ** 2).T.dot(weights))
---> 47             self *= one_over_var_inf_to_zero(col_stds)
     48         else:
     49             col_stds = np.ones(self.shape[1], dtype=self.dtype)

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'

Tests cannot pass without fixing this, in particular the test that calls sklearn's check_estimator.
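The likely fix is to promote integer input to a floating dtype before the in-place scaling. A hedged sketch of the idea (not the actual standardize() code):

```python
import numpy as np

def safe_inplace_scale(X, scale):
    """Column scaling that first promotes integer arrays to float64
    (making a copy), then scales in place. This avoids the
    'same_kind' casting UFuncTypeError from X *= scale on int data."""
    if not np.issubdtype(X.dtype, np.floating):
        X = X.astype(np.float64)
    X *= scale
    return X
```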

[Major] Add a class for efficient operations on categorical features (sparse/dense split done)

One-hot encoding categorical variables generates matrices where all nonzero elements are 1, and there is only one nonzero element per row. It is possible to store these matrices with much less memory than a general sparse matrix and to operate on them more efficiently. We could improve performance a lot by adding a class that represents our data as a partitioned matrix composed of several one-hot encoded matrices (and perhaps also a dense block).
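A minimal sketch of such a class (illustrative, not a proposed API): since each row has exactly one nonzero entry equal to 1, the matrix is fully described by one column index per row, and matvec, transpose-matvec, and the X.T @ diag(d) @ X "sandwich" all become O(n) operations.

```python
import numpy as np

class OneHotMatrix:
    """A one-hot encoded matrix stored as one column index per row.
    Equivalent dense shape: (len(indices), n_cols). Uses O(n) memory
    versus O(n * k) for a dense representation."""

    def __init__(self, indices, n_cols):
        self.indices = np.asarray(indices)
        self.n_cols = n_cols

    def dot(self, v):
        # (X @ v)[i] is just v at row i's active column.
        return v[self.indices]

    def transpose_dot(self, w):
        # (X.T @ w)[j] sums w over rows whose active column is j.
        return np.bincount(self.indices, weights=w, minlength=self.n_cols)

    def sandwich(self, d):
        # X.T @ diag(d) @ X is diagonal for a one-hot matrix.
        return np.diag(self.transpose_dot(d))
```

A partitioned matrix of several such blocks (plus a dense block for numerical features) would cover the typical insurance design matrix.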
