online-ml / river

Online machine learning in Python

Home Page: https://riverml.xyz
License: BSD 3-Clause "New" or "Revised" License
This should go in the `optim/losses.py` file.
Latent Dirichlet Allocation is a cool model and there is a nice online algorithm for learning its parameters.
PA (Passive-Aggressive) classifiers and regressors are simple yet quite powerful. Although scikit-learn implements them as separate classes, I think we should write a new optimizer and add it to the `optim` module.
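For reference, a rough sketch of what the PA-I update rule could look like with dict-based features (the function name and signature are hypothetical, not an existing API):

```python
def pa_update(w, x, y, C=1.0):
    """PA-I update for one sample; x is a dict of features, y is -1 or 1."""
    margin = sum(w.get(i, 0.0) * xi for i, xi in x.items())
    loss = max(0.0, 1.0 - y * margin)    # hinge loss on the signed margin
    norm = sum(xi ** 2 for xi in x.values())
    if norm == 0.0 or loss == 0.0:
        return w                         # passive: no update when margin >= 1
    tau = min(C, loss / norm)            # aggressiveness, capped by C (PA-I)
    for i, xi in x.items():
        w[i] = w.get(i, 0.0) + tau * y * xi
    return w
```

The appeal of folding this into the `optim` module is that the update is just a step size rule, so it can share the optimizer interface.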
By implementing this we can run perceptrons by using the loss in `LogisticRegression`. This should go in the `optim/losses.py` file.
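A minimal sketch of what such a loss could look like (the class name and API are illustrative, not the existing implementation); with `threshold=0` the sub-gradient reproduces the classic perceptron update:

```python
class Hinge:
    """Hinge loss; threshold=1 is the SVM/PA hinge, threshold=0 the perceptron."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold

    def __call__(self, y_true, y_pred):
        # y_true is expected to be -1 or 1
        return max(0.0, self.threshold - y_true * y_pred)

    def gradient(self, y_true, y_pred):
        # Sub-gradient with respect to y_pred
        if y_true * y_pred < self.threshold:
            return -y_true
        return 0.0
```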
I've looked into computing running quantiles and it is not trivial.
At the moment we're using Sphinx's autosummary to generate tables that contain the classes of each module. The problem is that we have to manually add new classes to each module's `.rst` file. This is error-prone and it would be better if all the classes of each module were added automatically. I tried to do this yesterday, but to no avail :(
This paper just came out. This is obviously a big topic and I'm not exactly sure what the API should look like, but it's worth investigating at some point.
Mondrian trees seem like a good method for building online decision trees. Here is some literature:
Incremental PCA is already implemented in scikit-learn, and it seems straightforward to implement here. See this paper too.
We need to write a simple benchmark for multi-class learning. I'm not set on a particular dataset yet so feel free to propose one!
This should be a running statistic in the `stats` module. The only way to know the true cardinality is to maintain a set of all the values seen so far, but this isn't very efficient memory-wise. A better way would be to use sketches or a Bloom filter. The count will be approximate, but good enough for machine learning purposes. Maybe the class could be called `Cardinality` or `NUnique`.
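An exact, set-based version is trivial and could serve as a reference baseline before swapping in a sketch (class name taken from the suggestion above, everything else hypothetical):

```python
class Cardinality:
    """Exact running count of distinct values.

    A HyperLogLog or Bloom filter variant would trade exactness
    for bounded memory, which is the end goal here.
    """

    def __init__(self):
        self._seen = set()

    def update(self, x):
        self._seen.add(x)
        return self

    def get(self):
        return len(self._seen)
```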
patsy is a library for formulating statistical models. I just watched this great talk where, at the 3rd minute, the `X` and `y` are obtained via patsy's `dmatrices` function. I think this is very elegant and reminiscent of R. I think we could implement something like this in the `stream` module or the `feature_extraction` module. Needs some thinking though.
It would be nice if we could add a simple benchmark for regression models. I was thinking of using the data from the Rossmann Store Sales Kaggle competition.
We have a `HedgeClassifier` but not a `HedgeRegressor`.
It would be nice to implement micro/macro/weighted precision, recall and f1-score.
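As a starting point, here is a sketch of micro/macro averaging from pooled counts. Note one assumption in this sketch: the macro F1 is computed from the averaged precision and recall, whereas averaging per-class F1 scores is another common convention; the final implementation should pick one deliberately.

```python
from collections import Counter

def precision_recall_f1(y_true, y_pred, average='macro'):
    """Micro or macro-averaged precision, recall and F1 (illustrative sketch)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but the truth was t
            fn[t] += 1   # missed an instance of t
    if average == 'micro':
        # Micro-averaging pools all decisions before computing the ratios
        TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
        prec = TP / (TP + FP) if TP + FP else 0.0
        rec = TP / (TP + FN) if TP + FN else 0.0
    else:
        # Macro-averaging gives each class equal weight
        precs = [tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels]
        recs = [tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0 for l in labels]
        prec, rec = sum(precs) / len(labels), sum(recs) / len(labels)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Weighted averaging would be the same as macro, except each class is weighted by its support.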
I think it would be good if we had a suite of benchmarks to run. Ideally we would have a global benchmark per task (i.e. one for regression, one for binary classification, one for multi-class classification). In each benchmark we want to run a set of estimators against a dataset and compare their scores, running time, memory consumption, etc. I'm not sure whether we should compare our models to those of scikit-learn in these benchmarks or do this separately.
It would be nice to be able to do online feature imputation. We can easily compute running statistics, so this seems feasible. We could even condition the feature imputation on another variable: instead of simply replacing a missing value with the mean, we could compute a mean per value of another feature.
The trick is that missing features are usually dropped from the input features because we're using `dict`s. Either we impute all the features that have been seen but are not in the current `x`, or we impute the values for which a `None` value is provided. Feel free to propose a new implementation! I think this should be part of a new module called `impute`, because this is what scikit-learn does.
At the moment running mean and running variance are implemented. Skew and kurtosis are easy to implement, details can be found here.
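For reference, the standard one-pass update formulas for the third and fourth central moments look roughly like this (a sketch, not the library's implementation):

```python
import math

class Moments:
    """Online mean, variance, skew and kurtosis via running central moments."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = self.M3 = self.M4 = 0.0

    def update(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before M2, which they depend on
        self.M4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.M2 - 4 * delta_n * self.M3)
        self.M3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.M2
        self.M2 += term1
        return self

    @property
    def skew(self):
        return math.sqrt(self.n) * self.M3 / self.M2 ** 1.5

    @property
    def kurtosis(self):
        # Excess kurtosis (0 for a normal distribution)
        return self.n * self.M4 / (self.M2 * self.M2) - 3.0
```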
I'm not yet satisfied with the implementation of `FMRegressor`. I think it needs to be compared with some other implementation (many are available) as a sanity check.
I'm currently working on an implementation of Bayesian linear regression for streaming data. I don't need any help (yet!) but I'm just putting this issue here for information.
Implement running min and max statistics.
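These are about the simplest possible running statistics; a minimal sketch:

```python
class Min:
    """Running minimum of a stream of values."""

    def __init__(self):
        self.value = float('inf')

    def update(self, x):
        if x < self.value:
            self.value = x
        return self

class Max:
    """Running maximum of a stream of values."""

    def __init__(self):
        self.value = float('-inf')

    def update(self, x):
        if x > self.value:
            self.value = x
        return self
```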
Bayesian networks (graphical models in general) are a cool way to reason about knowledge that updates over time. I think it would be nice to start with a Bayesian network implementation where the structure of the graph is set by the user. The user can then show observations to the network and ask queries. I don't know yet how flexible we should be about what kinds of probability distributions to allow.
There are many feature selection methods, but I'm not sure if any of them are applicable to online learning. It would be nice if someone did some research in this direction and maybe made a first simple implementation.
At the moment `LinearRegression` and `LogisticRegression` are completely online: they learn with each incoming sample. Maybe storing the samples and computing the gradient on a mini-batch could be useful. Basically we would sum the incoming gradients until `batch_size` of them have been accumulated. Once we have enough samples we can call the optimizer and update the weights before finally resetting the batch counter. In theory this should speed up the training process, but I'm not clear on how this will affect the accuracy of the model.
I want to make sure that the implementation of `ensemble.HedgeClassifier` is rock solid. There are different variants of hedging ensembles, so I want to make sure that ours is correct.
We've implemented a decent amount of gradient-based optimizers, and I would like to make sure that their implementations are correct. I think it would be nice to have a little set of benchmarks that compare their outputs with either another library or some worked examples.
We definitely need some easy way to extract arbitrary features. For example you might want to extract the day of the week from a `datetime`, or divide/multiply two features. You might also want to be able to return multiple outputs. I think the good way to do this is to let the user provide a function that takes the features `x` as input and outputs a new set of features (so, also a `dict`). The `FeatureUnion` class will take care of merging everything together. I'm not sure about the naming of the class though. By the way, this is the equivalent of sklearn's `FunctionTransformer`.
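A sketch of the idea with hypothetical names (the class and example function below are not existing API); the wrapped function maps one `dict` of features to a new `dict`:

```python
import datetime as dt

class FuncTransformer:
    """Wraps a user-provided function mapping a dict of features to a new dict,
    akin to sklearn's FunctionTransformer."""

    def __init__(self, func):
        self.func = func

    def transform_one(self, x):
        return self.func(x)

def day_of_week_and_ratio(x):
    # Extract the weekday from a datetime feature and divide two features
    return {'weekday': x['date'].weekday(), 'ratio': x['a'] / x['b']}
```

Returning a fresh `dict` makes it natural for `FeatureUnion` to merge the outputs of several such transformers.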
At the moment we compute running statistics and that's fine. It would be nice to also have rolling (windowed) statistics. For example we might want to compute the mean inside a window of values. I'm not sure if any smart way to do this exists, but I think this is important if we want our models to capture potential drift.
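For the mean at least, a fixed-size window can be maintained in constant time per update with a `collections.deque`; a sketch with an illustrative class name:

```python
from collections import deque

class RollingMean:
    """Mean over the last `window_size` values, updated in O(1)."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0.0

    def update(self, x):
        if len(self.window) == self.window.maxlen:
            # Subtract the value the deque is about to evict
            self.total -= self.window[0]
        self.window.append(x)
        self.total += x
        return self

    def get(self):
        return self.total / len(self.window) if self.window else 0.0
```

Rolling variance and other moments are trickier because naive subtraction accumulates floating-point error, which may be part of why no obviously smart general scheme exists.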
We recently added a `CrossEntropy` loss function. It would be nice to also have a `CrossEntropy` in the `metrics` module. This would inherit from `MultiClassificationMetric` and reuse the `CrossEntropy` from the `optim` module, much like `LogLoss` from `metrics` uses `LogLoss` from `optim`.
I'm not sure what already exists, but I'm pretty sure that stacking should be easy to implement. The thing is that we get validation predictions for free, so we should be able to simply plug these into a meta-estimator. Actually it shouldn't be too hard to stack with multiple layers, though I'm not sure this is a wise thing to propose. Also I'm not sure what the API should look like: should there be a `StackingClassifier` and a `StackingMultiClassifier`? My gut tells me yes, because I believe that binary and multi-class classification should be explicitly separated, but I'm scared that this is going to generate too many classes, especially when doing multi-label learning...
Food for the mind!
There seem to be a lot of possible implementations, as can be seen by a quick Google search.
It would be nice to see if we could implement something simple to handle imbalanced learning. This could be part of a new sub-module called `imblearn`, in reference to the `imbalanced-learn` library.
I would like to benchmark algorithms in the `reco` module. However I don't want to simply predict the rating of a movie; I actually want to recommend movies by using predicted ratings. In the end we want to measure something like the DCG. This will require overhauling the `reco` module, but it is well worth it if we want to go beyond what Surprise does. We also need a dataset with ratings and queries ordered by time. This is a lot of work, but it should be very rewarding. We will probably break this issue into smaller ones in the near future.
sklearn's `Pipeline` processes estimators one by one, whereas we process observations one by one. In this sense we act like a "true" pipeline, closer to how people imagine a pipeline works. At the moment each observation goes through the whole pipeline, and only when it has finished does the next observation go through. Ideally the observations should be able to go through the pipeline in a FIFO manner, and the estimators should be able to run in parallel. Naturally some estimators might take longer than others, so this might cause bottlenecks. A good implementation should be able to limit the size of the queue.
Most of the time `creme` is slower than `sklearn`. This is mostly because `sklearn` can vectorize code, since all the data is available in memory at once. There might be some low-hanging fruit that we can optimize using Cython instead of pure Python. A good start might be to "cythonize" some of the optimizers in the `optim` module and see if there is a noticeable gain.
When you think about it, a `compose.Pipeline` is nothing more than a `collections.OrderedDict` from the standard library. I'm not sure, but this might be a nice thing to have.
Sliding Fourier transforms would be a nice thing to have for feature extraction purposes. See "Understanding and Implementing the Sliding DFT" for a nice explanation.
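For reference, a single DFT bin can be slid across the stream in O(1) per sample. A sketch (not production-ready: the recursion can accumulate floating-point error over long streams, which stabilized variants address):

```python
import cmath

def sliding_dft(stream, N, k):
    """Yield the k-th DFT bin of the last N samples after each new sample."""
    buf = [0.0] * N                              # circular buffer of samples
    X = 0.0 + 0.0j                               # current value of bin k
    twiddle = cmath.exp(2j * cmath.pi * k / N)   # per-step phase rotation
    pos = 0
    for x in stream:
        # Remove the oldest sample, add the new one, rotate the phase
        X = (X - buf[pos] + x) * twiddle
        buf[pos] = x
        pos = (pos + 1) % N
        yield X
```

Once the window is full, each yielded value matches the corresponding bin of a direct DFT over the last `N` samples.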
See here.
I've been toying with the idea in my head and I think this shouldn't be too hard. There are some nice libraries out there that can build nearest neighbors indexes online, such as annoy. It shouldn't be too hard to build something on top of this that searches for the `k` nearest neighbors online and then makes a prediction. Naturally the past observations have to be stored on disk rather than in memory, but annoy takes care of this. I'm not sure about this one, but it's definitely worth trying out.
AKA logistic regression for more than two classes.
I believe target encoding shouldn't be too difficult to implement, given that we can reuse the `stats` module. We should be doing Bayesian target encoding by using a prior, and we should reuse the `SmoothMean` class.
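A sketch of the smoothing scheme with a hypothetical class (not the actual `SmoothMean`): each category's mean is shrunk towards a global prior, weighted by `prior_weight`, so rare categories fall back to the prior:

```python
class SmoothTargetEncoder:
    """Bayesian-flavoured target encoding with shrinkage towards a prior."""

    def __init__(self, prior, prior_weight=10):
        self.prior = prior               # e.g. the global mean of the target
        self.prior_weight = prior_weight # pseudo-count given to the prior
        self.sums = {}
        self.counts = {}

    def fit_one(self, category, y):
        self.sums[category] = self.sums.get(category, 0.0) + y
        self.counts[category] = self.counts.get(category, 0) + 1
        return self

    def transform_one(self, category):
        s = self.sums.get(category, 0.0)
        n = self.counts.get(category, 0)
        # Weighted blend of the prior and the category's running mean
        return (self.prior * self.prior_weight + s) / (self.prior_weight + n)
```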
`Pipeline` inherits from `collections.OrderedDict` and `TransformerUnion` inherits from `collections.UserDict`. The reason is that we get `dict` methods such as `len` and `.items()` for free. Also, users can access parts of a pipeline using the name of a step instead of its position; likewise for a union of transformers. Finally, this makes a lot of sense conceptually and slightly simplifies the code under the hood. The only downside is that this adds a lot of methods to the documentation page of each class, but maybe we can find some way of reorganizing those.
I believe it would be a good idea if ensemble models also inherited from `collections.UserDict`, for the same reasons as `TransformerUnion` and `pipeline.Pipeline`. There should probably also be a `BaseEnsemble` class in the `base` module, much like what sklearn does. This is a good issue to tackle if you're interested in working on your Python standard library skills.