
online-ml / river

4.8K stars · 86 watchers · 526 forks · 282.53 MB

🌊 Online machine learning in Python

Home Page: https://riverml.xyz

License: BSD 3-Clause "New" or "Revised" License

Python 97.48% Makefile 0.05% Cython 1.84% Rust 0.41% C++ 0.23%
incremental-learning machine-learning python online-learning online-statistics data-science streaming online-machine-learning streaming-data concept-drift

river's People

Contributors

3outeille, adilzouitine, albandecrevoisier, andrefcruz, brcharron, coldteapot273k, darkmyter, dependabot[bot], etiennekintzler, foxriver76, garawalid, gbolmier, gilbertoolimpio, greatsharma, guimatsumoto, hoanganhngo610, jacobmontiel, jmread, krifimedamine, kulbachcedric, lbowenwest, maxhalford, mertozer94, nakamasato, pgijsbers, raphaelsty, smastelini, styren, vaysserobin, yupbank


river's Issues

Sphinx autosummary automation

At the moment we're using Sphinx's autosummary extension to generate tables that contain the classes of each module. The problem is that we have to manually add new classes to each module's .rst file. This is error-prone, and it would be better if all the classes of each module were added automatically. I tried to do this yesterday, but to no avail :(
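
As a starting point, here is a rough sketch of what the automation could look like: introspect a module for its public classes and emit the corresponding autosummary entries. The toctree option and file layout are assumptions, not the project's actual Sphinx setup.

```python
# Hypothetical sketch: build a module's autosummary entries by introspection
# instead of maintaining them by hand. Paths and options are assumptions.
import inspect

def public_classes(module):
    """Yield the qualified names of a module's public classes, sorted."""
    for name, obj in sorted(vars(module).items()):
        if inspect.isclass(obj) and not name.startswith("_"):
            yield f"{module.__name__}.{name}"

def write_autosummary(module, rst_path):
    with open(rst_path, "w") as f:
        f.write(".. autosummary::\n   :toctree: generated\n\n")
        for entry in public_classes(module):
            f.write(f"   {entry}\n")
```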

Online outlier detection

This paper just came out. This is obviously a big topic and I'm not exactly sure what the API should look like, but it's worth investigating at some point.

Multiclass benchmark

We need to write a simple benchmark for multi-class learning. I'm not set on a particular dataset yet so feel free to propose one!

Number of unique values

This should be a running statistic in the stats module. The only way to know the true cardinality is to maintain a set of all the values seen so far, but this isn't very memory-efficient. A better way would be to use a sketch (e.g. HyperLogLog) or a Bloom filter. The count would be approximate, but good enough for machine learning purposes. Maybe the class could be called Cardinality or NUnique.
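
For the exact baseline, here is a minimal sketch following the stats module's update/get convention; Cardinality is the proposed name, and a production version would swap the set for an approximate sketch:

```python
# A minimal sketch of an exact running cardinality statistic. The class name
# and the update/get interface are the proposal, not an existing API.
class Cardinality:
    def __init__(self):
        self.seen = set()  # exact, but memory grows with the number of uniques

    def update(self, x):
        self.seen.add(x)
        return self

    def get(self):
        return len(self.seen)
```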

Patsy formulation

patsy is a library for formulating statistical models. I just watched this great talk where, around the third minute, the X and y are obtained via patsy's dmatrices function. I think this is very elegant and reminiscent of R. We could implement something like this in the stream module or the feature_extraction module. It needs some thought, though.
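
For context, this is the patsy pattern in question; the column names and data below are made up for illustration:

```python
# patsy builds the y vector and X design matrix from an R-style formula.
import patsy

data = {"y": [1.0, 2.0, 3.0], "x1": [0.5, 1.5, 2.5], "x2": [3.0, 2.0, 1.0]}
y, X = patsy.dmatrices("y ~ x1 + x2", data)
```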

First shot at benchmarks

I think it would be good if we had a suite of benchmarks to run. Ideally we would have one global benchmark per task (i.e. one for regression, one for binary classification, one for multi-class classification). In each benchmark we want to run a set of estimators against a dataset and compare their scores, running time, memory consumption, etc. I'm not sure whether we should compare our models to those of scikit-learn in these benchmarks or do that separately.
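
A rough sketch of the core loop, using progressive validation (predict, score, then learn). The learn_one/predict_one and metric update/get conventions are assumed; datasets, models, and metrics are placeholders.

```python
# A sketch of a per-task benchmark loop: each model is scored by progressive
# validation and timed. Interfaces are assumed, not the project's final API.
import time

def benchmark(models, make_stream, make_metric):
    results = {}
    for name, model in models.items():
        metric = make_metric()
        start = time.perf_counter()
        for x, y in make_stream():
            y_pred = model.predict_one(x)  # test first...
            if y_pred is not None:
                metric.update(y, y_pred)   # ...then score...
            model.learn_one(x, y)          # ...then train
        results[name] = {"score": metric.get(),
                         "seconds": time.perf_counter() - start}
    return results
```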

Feature imputation

It would be nice to be able to do online feature imputation. We can easily compute running statistics, so this doesn't seem unreasonable. We could even condition the imputation on another variable: instead of simply replacing a missing value with the overall mean, we could compute a mean per value of another feature.

The trick is that missing features are usually simply absent from the input, because we're using dicts. Either we impute all the features that have been seen before but are not in the current x, or we only impute the values for which an explicit None is provided.

Feel free to propose an implementation! I think this should be part of a new module called impute, because this is what scikit-learn does.
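
A minimal sketch of the conditional variant, imputing a feature's missing values with a running mean computed per value of another feature. The class and method names are proposals only.

```python
# A sketch of a conditional mean imputer: missing values of `on` are replaced
# by the running mean of `on` within the current value of `by`. Names and the
# learn_one/transform_one convention are assumptions.
import collections

class RunningMean:
    def __init__(self):
        self.n, self.total = 0, 0.0

    def update(self, x):
        self.n += 1
        self.total += x
        return self

    def get(self):
        return self.total / self.n if self.n else 0.0

class ConditionalMeanImputer:
    def __init__(self, on, by):
        self.on, self.by = on, by
        self.means = collections.defaultdict(RunningMean)

    def learn_one(self, x):
        if x.get(self.on) is not None:
            self.means[x.get(self.by)].update(x[self.on])
        return self

    def transform_one(self, x):
        if x.get(self.on) is None:
            x = {**x, self.on: self.means[x.get(self.by)].get()}
        return x
```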

Running skew and kurtosis

At the moment, running mean and running variance are implemented. Skew and kurtosis are easy to add; details can be found here.
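
For reference, a sketch of the standard one-pass updates for the first four central moments (as in the "Algorithms for calculating variance" literature); skew and kurtosis then follow from M2, M3, and M4. This is a sketch, not the project's actual code.

```python
# One-pass update of the first four central moments, from which skew and
# (excess) kurtosis are derived.
import math

class Moments:
    def __init__(self):
        self.n = 0
        self.mean = self.M2 = self.M3 = self.M4 = 0.0

    def update(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.M4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.M2 - 4 * delta_n * self.M3)
        self.M3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.M2
        self.M2 += term1
        return self

    @property
    def skew(self):
        return math.sqrt(self.n) * self.M3 / self.M2 ** 1.5

    @property
    def kurtosis(self):
        return self.n * self.M4 / (self.M2 * self.M2) - 3  # excess kurtosis
```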

Check FMRegressor

I'm not yet satisfied with the implementation of FMRegressor. I think it needs to be compared with some other implementation (many are available) as a sanity check.

Bayesian linear regression

I'm currently working on an implementation of Bayesian linear regression for streaming data. I don't need any help (yet!) but I'm just putting this issue here for information.

Bayesian networks

Bayesian networks (graphical models in general) are a cool way to reason about knowledge that updates over time. I think it would be nice to start with a Bayesian network implementation where the structure of the graph is set by the user. The user can then show observations to the network and ask queries. I don't know yet how flexible we should be about which kinds of probability distributions we want to allow.

First go at feature selection

There are many feature selection methods, but I'm not sure whether any of them are applicable to online learning. It would be nice if someone did some research in this direction and maybe made a first simple implementation.

Linear and logistic regression batch size

At the moment LinearRegression and LogisticRegression are completely online: they learn from each incoming sample. Maybe storing the samples and computing the gradient on a mini-batch could be useful. Basically, we would sum the incoming gradients until batch_size of them have been accumulated. Once we have enough samples we can call the optimizer, update the weights, and finally reset the batch counter. In theory this should speed up training, but I'm not sure how it will affect the accuracy of the model.
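
A sketch of the accumulation logic, assuming a hypothetical optimizer with a step(weights, gradient) method; it is not tied to the actual optim interface.

```python
# A sketch of mini-batch accumulation: gradients are summed until batch_size
# of them have arrived, then averaged and applied in a single optimizer step.
import collections

class GradientAccumulator:
    def __init__(self, optimizer, batch_size):
        self.optimizer = optimizer      # assumed to expose step(weights, grad)
        self.batch_size = batch_size
        self.acc = collections.defaultdict(float)
        self.count = 0

    def receive(self, weights, gradient):
        for feature, g in gradient.items():
            self.acc[feature] += g
        self.count += 1
        if self.count == self.batch_size:
            avg = {f: g / self.batch_size for f, g in self.acc.items()}
            weights = self.optimizer.step(weights, avg)
            self.acc.clear()
            self.count = 0
        return weights
```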

Check HedgeClassifier

I want to make sure that the ensemble.HedgeClassifier implementation is rock solid. There are different variants of ensemble hedging, so I want to make sure that ours is correct.

Optimizer unit tests

We've implemented a decent amount of gradient-based optimizers, and I would like to make sure that their implementations are correct. I think it would be nice to have a small set of tests that compare their outputs with either another library or some hand-worked examples.
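
For example, a test along these lines could pin down plain SGD against a hand-computed update; sgd_step here is a stand-in, not the actual optim.SGD interface.

```python
# A sketch of an optimizer unit test: compare one SGD step against values
# computed by hand. The function is illustrative, not the library's API.
import math

def sgd_step(weights, gradient, lr=0.1):
    return {k: w - lr * gradient.get(k, 0.0) for k, w in weights.items()}

def test_sgd_step():
    updated = sgd_step({"a": 1.0, "b": -2.0}, {"a": 0.5, "b": 1.0})
    assert math.isclose(updated["a"], 0.95)   # 1.0 - 0.1 * 0.5
    assert math.isclose(updated["b"], -2.1)   # -2.0 - 0.1 * 1.0
```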

Func/Lambda transformer

We definitely need an easy way to extract arbitrary features. For example, you might want to extract the day of the week from a datetime, or divide/multiply two features. You might also want to return multiple outputs. I think a good way to do this is to let the user provide a function that takes the features x as input and outputs a new set of features (so, also a dict). The FeatureUnion class will take care of merging everything together. I'm not sure about the naming of the class, though. By the way, this is the equivalent of sklearn's FunctionTransformer.
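
A minimal sketch, with FuncTransformer as a placeholder name and transform_one as the assumed convention:

```python
# A sketch of a function-based transformer: the user's function maps a dict
# of features to a new dict of features.
import datetime as dt

class FuncTransformer:
    def __init__(self, func):
        self.func = func  # dict -> dict

    def transform_one(self, x):
        return self.func(x)

# Hypothetical usage: extract the day of the week from a datetime feature.
get_weekday = FuncTransformer(lambda x: {"weekday": x["date"].weekday()})
get_weekday.transform_one({"date": dt.datetime(2019, 1, 1)})  # {'weekday': 1}
```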

Rolling statistics

At the moment we compute running statistics, and that's fine. It would also be nice to have rolling statistics. For example, we might want to compute the mean inside a window of values. I'm not sure whether a smart way to do this exists, but I think it's important if we want our models to capture potential drift.
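
For the mean at least there is a simple scheme: keep the window in a deque and maintain a running sum. A sketch, with naming only assumed to follow the stats module:

```python
# A rolling mean over a fixed-size window: the deque evicts the oldest value
# in O(1) and the running sum is adjusted accordingly.
import collections

class RollingMean:
    def __init__(self, window_size):
        self.window = collections.deque(maxlen=window_size)
        self.total = 0.0

    def update(self, x):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # about to be evicted by the append
        self.window.append(x)
        self.total += x
        return self

    def get(self):
        return self.total / len(self.window) if self.window else 0.0
```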

Cross-entropy metric

We recently added a CrossEntropy loss function. It would be nice to also have a CrossEntropy metric in the metrics module. It would inherit from MultiClassificationMetric and reuse the CrossEntropy loss from the optim module, much like LogLoss in metrics reuses LogLoss from optim.
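
The metric would essentially be a running mean of the loss. A standalone sketch, assuming predictions come in as a dict of class probabilities:

```python
# A sketch of a cross-entropy metric as a running mean of per-sample losses.
# The update(y_true, y_pred) signature is assumed.
import math

class CrossEntropy:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        # y_pred is a dict mapping each class to its predicted probability
        p = max(y_pred.get(y_true, 0.0), 1e-15)  # clamp to avoid log(0)
        self.total += -math.log(p)
        self.n += 1
        return self

    def get(self):
        return self.total / self.n if self.n else 0.0
```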

Online stacking

I'm not sure about existing work on this, but I'm pretty sure that stacking should be easy to implement. The thing is that we get validation predictions for free, so we should be able to simply plug these into a meta-estimator. Actually, it shouldn't be too hard to stack with multiple layers, though I'm not sure that's a wise thing to propose. I'm also not sure what the API should look like: should there be a StackingClassifier and a StackingMultiClassifier? My gut tells me yes, because I believe that binary and multi-class classification should be explicitly separated, but I'm afraid this will generate too many classes, especially when doing multi-label learning...

Food for thought!
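
A sketch of the core idea for the binary case: because each base model predicts before it learns, the meta-model is always trained on out-of-sample predictions. The learn_one/predict_proba_one conventions are assumed.

```python
# A sketch of online stacking: base model probabilities, produced before the
# bases see each sample, become the meta-model's features.
class StackingClassifier:
    def __init__(self, models, meta):
        self.models = models  # dict of name -> base classifier
        self.meta = meta

    def _features(self, x):
        return {name: m.predict_proba_one(x).get(True, 0.5)
                for name, m in self.models.items()}

    def learn_one(self, x, y):
        z = self._features(x)  # out-of-sample: bases haven't seen (x, y) yet
        for m in self.models.values():
            m.learn_one(x, y)
        self.meta.learn_one(z, y)
        return self

    def predict_one(self, x):
        return self.meta.predict_one(self._features(x))
```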

Recommendation benchmark

I would like to benchmark the algorithms in the reco module. However, I don't want to simply predict the rating of a movie; I actually want to recommend movies by using predicted ratings. In the end we want to measure something like the DCG. This will require overhauling the reco module, but it is well worth it if we want to go beyond what Surprise does. We also need a dataset with ratings and queries ordered by time. This is a lot of work, but it should be very rewarding. We will probably break this issue into smaller ones in the near future.

Producer/consumer pattern for the pipeline

sklearn's Pipeline processes estimators one by one, whereas we process observations one by one. In this sense we act like a "true" pipeline, closer to how people imagine a pipeline works. At the moment each observation goes through the whole pipeline, and only once it has finished does the next observation go through. Ideally, observations should be able to go through the pipeline in a FIFO manner, with the estimators running in parallel. Naturally, some estimators might take longer than others, which could cause bottlenecks, so a good implementation should be able to limit the size of the queue.
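
A toy sketch of the idea using threads and bounded queues; this is purely illustrative and ignores the learning side of the estimators.

```python
# Each pipeline step runs in its own thread; observations flow through
# bounded FIFO queues, and a None sentinel shuts the pipeline down.
import queue
import threading

def stage(func, inbox, outbox):
    while True:
        x = inbox.get()
        if x is None:          # sentinel: propagate and stop
            outbox.put(None)
            return
        outbox.put(func(x))

def run_pipeline(funcs, xs, maxsize=8):
    queues = [queue.Queue(maxsize=maxsize) for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, qin, qout))
               for f, qin, qout in zip(funcs, queues, queues[1:])]

    def feed():
        for x in xs:
            queues[0].put(x)   # blocks when the queue is full: backpressure
        queues[0].put(None)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()
    while (out := queues[-1].get()) is not None:
        yield out
    for t in threads:
        t.join()

# Hypothetical usage:
# list(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(100)))
```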

First try at Cython

Most of the time creme is slower than sklearn. This is mostly due to the fact that sklearn can vectorize code because all the data is available in memory at once. There might be some low-hanging fruit that we can optimize using Cython instead of pure Python. A good start might be to "cythonize" some of the optimizers in the optim module and see if there is a noticeable gain.

Online KNN

I've been toying with this idea and I think it shouldn't be too hard. There are some nice libraries out there that can build nearest neighbor indexes incrementally, such as annoy. It shouldn't be too hard to build something on top of one of these that searches for the k nearest neighbors online and then makes a prediction. Naturally, past observations have to be stored on disk rather than in memory, but annoy takes care of this. I'm not sure about this one, but it's definitely worth trying out.
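
To pin down the idea, a brute-force sketch over a bounded window of past observations; a real version would replace the linear scan with an index such as annoy. All names are placeholders.

```python
# A brute-force online KNN classifier over a sliding window. Purely a sketch:
# O(window) per prediction, no index, everything in memory.
import collections
import math

def euclidean(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

class KNNClassifier:
    def __init__(self, k=5, window_size=1000):
        self.k = k
        self.window = collections.deque(maxlen=window_size)  # (x, y) pairs

    def learn_one(self, x, y):
        self.window.append((x, y))
        return self

    def predict_one(self, x):
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda xy: euclidean(x, xy[0]))[:self.k]
        return collections.Counter(y for _, y in nearest).most_common(1)[0][0]
```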

Target encoding

I believe target encoding shouldn't be too difficult to implement, given that we can reuse the stats module. We should do Bayesian target encoding by using a prior, and we should reuse the SmoothMean class.
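
The arithmetic is simply a per-category mean shrunk towards the global mean, which is presumably what SmoothMean captures. A standalone sketch of the encoder, with all names assumed:

```python
# A sketch of smoothed (Bayesian) target encoding: each category's encoding
# is its running target mean, shrunk towards the global mean by prior_weight.
import collections

class TargetEncoder:
    def __init__(self, prior_weight=10.0):
        self.prior_weight = prior_weight
        self.global_n, self.global_sum = 0, 0.0
        self.counts = collections.Counter()
        self.sums = collections.defaultdict(float)

    def learn_one(self, category, y):
        self.global_n += 1
        self.global_sum += y
        self.counts[category] += 1
        self.sums[category] += y
        return self

    def transform_one(self, category):
        prior = self.global_sum / self.global_n if self.global_n else 0.0
        n = self.counts[category]
        return (self.sums[category] + self.prior_weight * prior) / (n + self.prior_weight)
```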

Ensemble models as dicts

Pipeline inherits from collections.OrderedDict and TransformerUnion inherits from collections.UserDict. The reason is that we get dict methods such as len() and .items() for free. Users can also access parts of a pipeline by the name of a step instead of its position, and likewise for a union of transformers. Finally, this makes a lot of sense conceptually and slightly simplifies the code under the hood. The only downside is that it adds a lot of methods to the documentation page of each class, but maybe we can find some way of reorganizing those.

I believe it would be a good idea if ensemble models also inherited from collections.UserDict, for the same reasons as TransformerUnion and pipeline.Pipeline. There should probably also be a BaseEnsemble class in the base module, much like what sklearn does. This is a good issue to tackle if you're interested in sharpening your Python standard library skills.
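
A sketch of what that could look like; BaseEnsemble and its learn_one loop are hypothetical, not existing code.

```python
# A sketch of a dict-backed ensemble base class: members are accessible by
# name, and len()/.items() come for free from UserDict.
import collections

class BaseEnsemble(collections.UserDict):
    """An ensemble is a dict of named models that are trained together."""

    def learn_one(self, x, y):
        for model in self.values():
            model.learn_one(x, y)
        return self

# Hypothetical usage:
# ensemble = BaseEnsemble({"tree": tree_model, "log_reg": linear_model})
# ensemble["tree"]  # access a member by its name
```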
