online-ml / river

Online machine learning in Python

Home Page: https://riverml.xyz
License: BSD 3-Clause "New" or "Revised" License
This should go in the `optim/losses.py` file.
Latent Dirichlet Allocation is a cool model and there is a nice online algorithm for learning its parameters.
PA (Passive-Aggressive) classifiers and regressors are simple yet quite powerful. Although scikit-learn implements them as separate classes, I think we should write a new optimizer and add it to the `optim` module.
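For reference, a rough sketch of what the PA-I update rule could look like with dict-based features (the function name and signature are hypothetical, not an existing API):

```python
def pa_update(w, x, y, C=1.0):
    """PA-I update for one sample; x is a dict of features, y is -1 or 1."""
    margin = sum(w.get(i, 0.0) * xi for i, xi in x.items())
    loss = max(0.0, 1.0 - y * margin)    # hinge loss on the signed margin
    norm = sum(xi ** 2 for xi in x.values())
    if norm == 0.0 or loss == 0.0:
        return w                         # passive: no update when margin >= 1
    tau = min(C, loss / norm)            # aggressiveness, capped by C (PA-I)
    for i, xi in x.items():
        w[i] = w.get(i, 0.0) + tau * y * xi
    return w
```

The appeal of folding this into the `optim` module is that the update is just a step size rule, so it can share the optimizer interface.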
By implementing this we can run perceptrons by using the loss in `LogisticRegression`. This should go in the `optim/losses.py` file.
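A minimal sketch of what such a loss could look like (the class name and API are illustrative, not the existing implementation); with `threshold=0` the sub-gradient reproduces the classic perceptron update:

```python
class Hinge:
    """Hinge loss; threshold=1 is the SVM/PA hinge, threshold=0 the perceptron."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold

    def __call__(self, y_true, y_pred):
        # y_true is expected to be -1 or 1
        return max(0.0, self.threshold - y_true * y_pred)

    def gradient(self, y_true, y_pred):
        # Sub-gradient with respect to y_pred
        if y_true * y_pred < self.threshold:
            return -y_true
        return 0.0
```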
I've looked into computing running quantiles and it is not trivial.
At the moment we're using Sphinx's autosummary to generate tables that contain the classes of each module. The problem is that we have to manually add new classes to each module's `.rst` file. This is error-prone and it would be better if all the classes of each module were added automatically. I tried to do this yesterday, but to no avail :(
This paper just came out. This is obviously a big topic and I'm not exactly sure what the API should look like, but it's worth investigating at some point.
Mondrian trees seem like a good method for building online decision trees. Here is some literature:
Incremental PCA is already implemented in scikit-learn, and it seems straightforward to implement here. See this paper too.
We need to write a simple benchmark for multi-class learning. I'm not set on a particular dataset yet so feel free to propose one!
This should be a running statistic in the `stats` module. The only way to know the true cardinality is to maintain a set of all the values seen so far, but this isn't very efficient memory-wise. A better way would be to use sketches or a Bloom filter. The count will be approximate, but good enough for machine learning purposes. Maybe the class could be called `Cardinality` or `NUnique`.
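An exact, set-based version is trivial and could serve as a reference baseline before swapping in a sketch (class name taken from the suggestion above, everything else hypothetical):

```python
class Cardinality:
    """Exact running count of distinct values.

    A HyperLogLog or Bloom filter variant would trade exactness
    for bounded memory, which is the end goal here.
    """

    def __init__(self):
        self._seen = set()

    def update(self, x):
        self._seen.add(x)
        return self

    def get(self):
        return len(self._seen)
```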
patsy is a library for formulating statistical models. I just watched this great talk where, at the 3rd minute, the `X` and `y` are obtained via patsy's `dmatrices` function. I think this is very elegant and reminiscent of R. I think we could implement something like this in the `stream` module or the `feature_extraction` module. Needs some thinking though.
It would be nice if we could add a simple benchmark for regression models. I was thinking of using the data from the Rossmann Store Sales Kaggle competition.
We have a `HedgeClassifier` but not a `HedgeRegressor`.
It would be nice to implement micro/macro/weighted precision, recall and f1-score.
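As a starting point, here is a sketch of micro/macro averaging from pooled counts. Note one assumption in this sketch: the macro F1 is computed from the averaged precision and recall, whereas averaging per-class F1 scores is another common convention; the final implementation should pick one deliberately.

```python
from collections import Counter

def precision_recall_f1(y_true, y_pred, average='macro'):
    """Micro or macro-averaged precision, recall and F1 (illustrative sketch)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but the truth was t
            fn[t] += 1   # missed an instance of t
    if average == 'micro':
        # Micro-averaging pools all decisions before computing the ratios
        TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
        prec = TP / (TP + FP) if TP + FP else 0.0
        rec = TP / (TP + FN) if TP + FN else 0.0
    else:
        # Macro-averaging gives each class equal weight
        precs = [tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels]
        recs = [tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0 for l in labels]
        prec, rec = sum(precs) / len(labels), sum(recs) / len(labels)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Weighted averaging would be the same as macro, except each class is weighted by its support.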
I think it would be good if we had a suite of benchmarks to run. Ideally we would have a global benchmark per task (i.e. one for regression, one for binary classification, one for multi-class classification). In each benchmark we want to run a set of estimators against a dataset and compare their scores, running time, memory consumption, etc. I'm not sure whether we should compare our models to those of scikit-learn in these benchmarks or do this separately.
It would be nice to be able to do online feature imputation. We can easily compute running statistics, so this seems feasible. We could even condition the feature imputation on another variable: instead of simply replacing a missing value with the mean, we could compute a mean per value of another feature.
The trick is that missing features are usually dropped from the input features because we're using `dict`s. Either we impute all the features that have been seen but are not in the current `x`, or we impute the values for which a `None` value is provided. Feel free to propose a new implementation! I think this should be part of a new module called `impute`, because this is what scikit-learn does.
At the moment running mean and running variance are implemented. Skew and kurtosis are easy to implement, details can be found here.
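For reference, the standard one-pass update formulas for the third and fourth central moments look roughly like this (a sketch, not the library's implementation):

```python
import math

class Moments:
    """Online mean, variance, skew and kurtosis via running central moments."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = self.M3 = self.M4 = 0.0

    def update(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before M2, which they depend on
        self.M4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.M2 - 4 * delta_n * self.M3)
        self.M3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.M2
        self.M2 += term1
        return self

    @property
    def skew(self):
        return math.sqrt(self.n) * self.M3 / self.M2 ** 1.5

    @property
    def kurtosis(self):
        # Excess kurtosis (0 for a normal distribution)
        return self.n * self.M4 / (self.M2 * self.M2) - 3.0
```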
I'm not yet satisfied with the implementation of `FMRegressor`. I think it needs to be compared with some other implementation (many are available) as a sanity check.
I'm currently working on an implementation of Bayesian linear regression for streaming data. I don't need any help (yet!) but I'm just putting this issue here for information.
Implement running min and max statistics.
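These are about the simplest possible running statistics; a minimal sketch:

```python
class Min:
    """Running minimum of a stream of values."""

    def __init__(self):
        self.value = float('inf')

    def update(self, x):
        if x < self.value:
            self.value = x
        return self

class Max:
    """Running maximum of a stream of values."""

    def __init__(self):
        self.value = float('-inf')

    def update(self, x):
        if x > self.value:
            self.value = x
        return self
```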
Bayesian networks (graphical models in general) are a cool way to reason about knowledge that updates over time. I think it would be nice to start with a Bayesian network implementation where the structure of the graph is set by the user. The user can then show observations to the network and ask queries. I don't know yet how flexible we should be about what kinds of probability distributions to allow.
There are many feature selection methods, but I'm not sure if any of them are applicable to online learning. It would be nice if someone did some research in this direction and maybe made a first simple implementation.
At the moment `LinearRegression` and `LogisticRegression` are completely online: they learn with each incoming sample. Maybe storing the samples and computing the gradient on a mini-batch could be useful. Basically we would sum the incoming gradients until `batch_size` of them have been accumulated. Once we have enough samples we can call the optimizer and update the weights before finally resetting the batch counter. In theory this should speed up the training process, but I'm not clear on how this will affect the accuracy of the model.
I want to make sure that the implementation of `ensemble.HedgeClassifier` is rock solid. There are different variants of hedging ensembles, so I want to make sure that ours is correct.
We've implemented a decent amount of gradient-based optimizers, and I would like to make sure that their implementations are correct. I think it would be nice to have a little set of benchmarks that compare their outputs with either another library or some worked examples.
We definitely need some easy way to extract arbitrary features. For example you might want to extract the day of the week from a `datetime`, or divide/multiply two features. You might also want to be able to return multiple outputs. I think the good way to do this is to let the user provide a function that takes the features `x` as input and outputs a new set of features (so, also a `dict`). The `FeatureUnion` class will take care of merging everything together. I'm not sure about the naming of the class though. By the way, this is the equivalent of sklearn's `FunctionTransformer`.
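A sketch of the idea with hypothetical names (the class and example function below are not existing API); the wrapped function maps one `dict` of features to a new `dict`:

```python
import datetime as dt

class FuncTransformer:
    """Wraps a user-provided function mapping a dict of features to a new dict,
    akin to sklearn's FunctionTransformer."""

    def __init__(self, func):
        self.func = func

    def transform_one(self, x):
        return self.func(x)

def day_of_week_and_ratio(x):
    # Extract the weekday from a datetime feature and divide two features
    return {'weekday': x['date'].weekday(), 'ratio': x['a'] / x['b']}
```

Returning a fresh `dict` makes it natural for `FeatureUnion` to merge the outputs of several such transformers.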
At the moment we compute running statistics and that's fine. It would be nice to also have rolling (windowed) statistics. For example we might want to compute the mean inside a window of values. I'm not sure if any smart way to do this exists, but I think this is important if we want our models to capture potential drift.
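For the mean at least, a fixed-size window can be maintained in constant time per update with a `collections.deque`; a sketch with an illustrative class name:

```python
from collections import deque

class RollingMean:
    """Mean over the last `window_size` values, updated in O(1)."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0.0

    def update(self, x):
        if len(self.window) == self.window.maxlen:
            # Subtract the value the deque is about to evict
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x
        return self

    def get(self):
        return self.total / len(self.window) if self.window else 0.0
```

Rolling variance and other moments are trickier because naive subtraction accumulates floating-point error, which may be part of why no obviously smart general scheme exists.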
We recently added a `CrossEntropy` loss function. It would be nice to also have a `CrossEntropy` in the `metrics` module. This would inherit from `MultiClassificationMetric` and reuse the `CrossEntropy` from the `optim` module, much like `LogLoss` from `metrics` uses `LogLoss` from `optim`.
I'm not sure what already exists, but I'm pretty sure that stacking should be easy to implement. The thing is that we get validation predictions for free, so we should be able to simply plug these into a meta-estimator. Actually it shouldn't be too hard to stack with multiple layers, though I'm not sure this is a wise thing to propose. Also I'm not sure what the API should look like: should there be a `StackingClassifier` and a `StackingMultiClassifier`? My gut tells me yes, because I believe that binary and multi-class classification should be explicitly separated, but I'm scared that this is going to generate too many classes, especially when doing multi-label learning...
Food for the mind!
There seem to be a lot of possible implementations, as can be seen by a quick Google search.
It would be nice to see if we could implement something simple to handle imbalanced learning. This could be part of a new sub-module called `imblearn`, in reference to the `imbalanced-learn` library.
I would like to benchmark algorithms in the `reco` module. However I don't want to simply predict the rating of a movie; I actually want to recommend movies by using predicted ratings. In the end we want to measure something like the DCG. This will require overhauling the `reco` module, but it is well worth it if we want to go beyond what Surprise does. We also need a dataset with ratings and queries ordered by time. This is a lot of work, but it should be very rewarding. We will probably break this issue into smaller ones in the near future.
sklearn's `Pipeline` processes estimators one by one, whereas we process observations one by one. In this sense we act like a "true" pipeline, closer to how people imagine a pipeline works. At the moment each observation goes through the whole pipeline, and only when it has finished does the next observation go through. Ideally the observations should be able to go through the pipeline in a FIFO manner, and the estimators should be able to run in parallel. Naturally some estimators might take longer than others, so this might cause bottlenecks. A good implementation should be able to limit the size of the queue.
Most of the time `creme` is slower than `sklearn`. This is mostly because `sklearn` can vectorize code, since all the data is available in memory at once. There might be some low-hanging fruit that we can optimize using Cython instead of pure Python. A good start might be to "cythonize" some of the optimizers in the `optim` module and see if there is a noticeable gain.
When you think about it, a `compose.Pipeline` is nothing more than a `collections.OrderedDict` from the standard library. I'm not sure, but this might be a nice thing to have.
Sliding Fourier transforms would be a nice thing to have for feature extraction purposes. See "Understanding and Implementing the Sliding DFT" for a nice explanation.
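For reference, a single DFT bin can be slid across the stream in O(1) per sample. A sketch (not production-ready: the recursion can accumulate floating-point error over long streams, which stabilized variants address):

```python
import cmath

def sliding_dft(stream, N, k):
    """Yield the k-th DFT bin of the last N samples after each new sample."""
    buf = [0.0] * N                              # circular buffer of samples
    X = 0.0 + 0.0j                               # current value of bin k
    twiddle = cmath.exp(2j * cmath.pi * k / N)   # per-step phase rotation
    pos = 0
    for x in stream:
        # Remove the oldest sample, add the new one, rotate the phase
        X = (X - buf[pos] + x) * twiddle
        buf[pos] = x
        pos = (pos + 1) % N
        yield X
```

Once the window is full, each yielded value matches the corresponding bin of a direct DFT over the last `N` samples.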
See here.
I've been toying with the idea in my head and I think this shouldn't be too hard. There are some nice libraries out there that can build nearest neighbors indexes online, such as annoy. It shouldn't be too hard to build something on top of this that searches for the `k` nearest neighbors online and then makes a prediction. Naturally the past observations have to be stored on disk rather than in memory, but annoy takes care of this. I'm not sure about this one, but it's definitely worth trying out.
AKA logistic regression for more than two classes.
I believe target encoding shouldn't be too difficult to implement, given that we can reuse the `stats` module. We should be doing Bayesian target encoding by using a prior, and we should reuse the `SmoothMean` class.
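A sketch of the smoothing scheme with a hypothetical class (not the actual `SmoothMean`): each category's mean is shrunk towards a global prior, weighted by `prior_weight`, so rare categories fall back to the prior:

```python
class SmoothTargetEncoder:
    """Bayesian-flavoured target encoding with shrinkage towards a prior."""

    def __init__(self, prior, prior_weight=10):
        self.prior = prior               # e.g. the global mean of the target
        self.prior_weight = prior_weight # pseudo-count given to the prior
        self.sums = {}
        self.counts = {}

    def fit_one(self, category, y):
        self.sums[category] = self.sums.get(category, 0.0) + y
        self.counts[category] = self.counts.get(category, 0) + 1
        return self

    def transform_one(self, category):
        s = self.sums.get(category, 0.0)
        n = self.counts.get(category, 0)
        # Weighted blend of the prior and the category's running mean
        return (self.prior * self.prior_weight + s) / (self.prior_weight + n)
```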
`Pipeline` inherits from `collections.OrderedDict` and `TransformerUnion` inherits from `collections.UserDict`. The reason is that we get `dict` methods such as `len` and `.items()` for free. Also, users can access parts of a pipeline using the name of a step instead of its position; likewise for a union of transformers. Finally, this makes a lot of sense conceptually and slightly simplifies the code under the hood. The only downside is that this adds a lot of methods to the documentation page of each class, but maybe we can find some way of reorganizing those.
I believe it would be a good idea if ensemble models also inherited from `collections.UserDict`, for the same reasons as `TransformerUnion` and `pipeline.Pipeline`. There should probably also be a `BaseEnsemble` class in the `base` module, much like what sklearn does. This is a good issue to tackle if you're interested in working on your Python standard library skills.