
A high-level machine learning and deep learning library for the PHP language.

Home Page: https://rubixml.com

License: MIT License

PHP 100.00%
machine-learning data-science neural-network analytics php classification clustering regression anomaly-detection deep-learning

ml's Introduction

Rubix ML


A high-level machine learning and deep learning library for the PHP language.

  • Developer-friendly API is delightful to use
  • 40+ supervised and unsupervised learning algorithms
  • Support for ETL, preprocessing, and cross-validation
  • Open source and free to use commercially

Installation

Install Rubix ML into your project using Composer:

$ composer require rubix/ml

Requirements

  • PHP 7.4 or above

Documentation

Read the latest docs here.

What is Rubix ML?

Rubix ML is a free open-source machine learning (ML) library that allows you to build programs that learn from your data using the PHP language. We provide tools for the entire machine learning life cycle from ETL to training, cross-validation, and production with over 40 supervised and unsupervised learning algorithms. In addition, we provide tutorials and other educational content to help you get started using ML in your projects.

Getting Started

If you are new to machine learning, we recommend taking a look at the What is Machine Learning? section to get started. If you are already familiar with basic ML concepts, you can browse the basic introduction for a brief look at a typical Rubix ML project. From there, you can browse the official tutorials below which range from beginner to advanced skill level.

Tutorials & Example Projects

Check out these example projects using the Rubix ML library. Many come with instructions and a pre-cleaned dataset.

Interact With The Community

Contributing

See CONTRIBUTING.md for guidelines.

License

The code is licensed MIT and the documentation is licensed CC BY-NC 4.0.

ml's People

Contributors

0xflotus, absolutic, allanmcarvalho, andrewdalpino, bahmanmd, basvanh, biostaz, bogkonstantin, boorinio, chrisdimas, christophwurst, divineomega, drdub, elgigi, github-actions[bot], jakim, javiereguiluz, jerodev, kroky, marclaporte, markuspoerschke, maximecolin, michaelbutler, nterray, programarivm, simplechris, st3iny, sublimeraiser, torchello, wintersilence

ml's Issues

Not able to train GridSearch with KDNeighbors as base

Fatal error: Uncaught Error: Call to a member function bare() on null in [..]/composer/vendor/rubix/ml/src/Classifiers/KDNeighbors.php on line 116 Error

Line 116 returns "!$this->tree->bare()", but GridSearch never sets the tree on KDNeighbors, so the tree is still null at that point.

Trace:
17. Rubix\ML\GridSearch->train() MLModelManagementClass.inc:309
18. Rubix\ML\Other\Specifications\SamplesAreCompatibleWithEstimator::check() /composer/vendor/rubix/ml/src/GridSearch.php:251
19. Rubix\ML\GridSearch->compatibility() /composer/vendor/rubix/ml/src/Other/Specifications/SamplesAreCompatibleWithEstimator.php:23
20. Rubix\ML\GridSearch->trained() /composer/vendor/rubix/ml/src/GridSearch.php:200
21. Rubix\ML\Classifiers\KDNeighbors->trained() /composer/vendor/rubix/ml/src/GridSearch.php:214

Code:
$grid = [
    [5, 6, 7],
    [true],
    [new KDTree()],
];

$oModel = new GridSearch(KDNeighbors::class, $grid);

$oModel->train($oDataset);
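
One way to avoid the fatal error is a null guard in the trained() check. The sketch below uses hypothetical stand-in classes (StubTree and StubNeighbors are not part of Rubix ML) and only illustrates the guard; it is not the library's actual fix.

```php
<?php

// Hypothetical stand-in for a spatial tree, not the Rubix ML KDTree.
class StubTree
{
    /** @var array[] */
    private $samples = [];

    public function grow(array $samples): void
    {
        $this->samples = $samples;
    }

    // A tree is "bare" until it has been grown from samples.
    public function bare(): bool
    {
        return empty($this->samples);
    }
}

// Hypothetical stand-in for a learner that may be created without a tree.
class StubNeighbors
{
    /** @var StubTree|null */
    private $tree;

    public function __construct(?StubTree $tree = null)
    {
        $this->tree = $tree;
    }

    // Check that a tree exists before asking whether it is bare.
    public function trained(): bool
    {
        return $this->tree !== null && !$this->tree->bare();
    }
}
```

With this guard, a learner that has no tree yet simply reports that it is untrained instead of raising an error.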

Determine the range of expected values

Hi Andrew, I'm really impressed by this project. It's what the PHP ecosystem needs right now.

I would like to use this project in my application to determine the range of expected values in time-series data. My application traces how long it takes an HTTP request to complete (its duration), which I represent in my app with a simple line chart.

Durations tend to have pronounced peaks and valleys, depending on the time of day or the day of the week. Those fluctuations make it very hard to set a simple threshold for alerting purposes.

Based on historical data, I would like to calculate an area range that shows the range within which the data could be considered normal, like the example below:
[Image: area-range]

In this way I could have a visual feedback about the probability that current values are normal or not by analyzing a metric’s historical behavior.

While studying the documentation, Logistic Regression caught my attention because it can also be partially trained, so I can evolve its model as new data is acquired.

My Question
Using this kind of algorithm, for any point in my data I would get a numeric value that represents the prediction of whether the point is an anomaly or not.

Is there a way ML can help me to find a range of correctness for any point based on historical behavior as shown in the image above?

Are there some resources to put me in the right direction?

Error in t-SNE momentum gain calculation

The computation of the gains for t-SNE Gradient Descent with momentum updates is backwards. Instead of speeding up when the gradient is sloping down, it slows down and vice versa. The effect is slower embedding and clusters that are too close together. Just flip the comparison to less than instead of greater than and boom.

 $gain = $direction[$i][$j] < 0.
     ? $gain + self::INC_GAIN
     : $gain * self::DEC_GAIN;

Vantage Tree for nearest neighbor and range search

Rubix ML currently offers k-D Tree and Ball Tree for the purpose of greatly accelerating nearest neighbor and range searches for Learners such as k-D Neighbors, Radius Neighbors, and DBSCAN. K-D tree is best suited for low-dimensional samples, while Ball Tree is well suited for high-dimensional samples. Ball Tree, however, requires many distance calculations in the process of building the tree, which makes the building process slower than k-D tree.

I propose we implement a generalized Vantage Tree for the purposes of accelerating nearest neighbor and range searches with high-dimensional samples, but without the overhead of spatially partitioning the samples into left and right clusters during tree building like with Ball Tree. Vantage Tree would sit somewhere in between k-D tree and Ball Tree in terms of its performance and ability to handle high-dimensional data. It can also be used to estimate the pairwise high-dimensional affinities of t-SNE in O(n log n) time as opposed to the exact method which requires O(n^2) distance computations.

[Image: spatial-tree-knn]

Old obsolete info

As a first exercise in machine learning with PHP using this library, I want to create a simple project that uses past lotto numbers to predict whether a randomly generated number sequence contains some winning numbers, along with the probability that the sequence will be drawn. For now my dataset consists of last year's winning number series, but I plan to expand it to cover the last three years of draws. My question is how to apply the library's features to this problem: which part of the library is the best fit, and how do I train the model correctly? The dataset is unlabeled, because I only have the winning number series and there is no label that classifies them. I'm reading the documentation and have read some of the tutorials, but some help on how to get started with this problem would be appreciated.

Library is missing Stacked Ensemble

Out of the 4 major ensemble methods (boosting, bagging, voting, and stacking), Rubix only implements 3. Stacking is similar to a voting committee, however the influences are learned by a separate learner such as Linear or Logistic Regression. The Rubix architecture allows for a Meta Estimator implementation such that the notion of a Model Orchestra can be generalized to classification and regression. As stacking has been shown in data competitions to be crucial for the best possible accuracy scores, the benefit to Rubix engineers will be high.

The general idea is we have an orchestra of estimators that each become the input features to a final estimator (conductor) that makes the final prediction. Since the Rubix architecture allows for the concept of a Meta Estimator, the implementation would be general to both classifiers and regressors.

Auto ML

Are you working on auto ML features, or is it even in the scope of this API?

Some auto tuning features would be a great benefit for the lib.

Local Outlier Factor Can Be Improved

The current implementation of Local Outlier Factor does not follow the convention in the original paper. Instead, the author has opted for a simpler implementation that is quicker to compute but does not achieve the same level of accuracy. As LOF is currently the most accurate detector in the suite, I recommend that the full LOF algorithm be implemented in favor of fulfilling that role to the highest degree.

Feature Request: Add Mish Activation Function

Mish is a novel activation function proposed in this paper.
It has shown promising results so far and has been adopted in several packages including:

  • TensorFlow-Addons
  • SpaCy (Tok2Vec Layer)
  • Thinc - SpaCy's official NLP based ML library
  • Eclipse's deeplearning4j
  • Hasktorch
  • Echo AI
  • CNTKX - Extension of Microsoft's CNTK
  • FastAI-Dev
  • Darknet
  • Yolov3
  • BeeDNN - Library in C++
  • Gen-EfficientNet-PyTorch
  • dnet
  • ruby-dnn
  • blackcat-tensors
  • DL4S
  • HuggingFace Transformers
  • PAGI
  • OpenCV
  • Odin-AI
  • Mini DNN
  • Efficient Segmentation Networks
  • TF Semantic Segmentation
  • Dynastes
  • DLib
  • Copernicus
  • AllenNLP
  • PyWick

All benchmarks, analysis, and links to official package implementations can be found in this repository.

Mish was also recently used for a submission to the Stanford DAWN CIFAR-10 Training Time Benchmark, where it obtained 94% accuracy in just 10.7 seconds, which is the current best score on 4 GPUs and second fastest overall. Additionally, Mish has been shown to improve the convergence rate by requiring fewer epochs. Reference:

[Image]

Mish has also shown consistently improved ImageNet scores and is more robust. Reference:

[Image]

Additional ImageNet benchmarks along with network architectures and weights are available on my repository.

Summary of vision-related results:

[Image: Capture]

It would be nice to have Mish as an option within the activation function group.
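
For reference, Mish is defined as x * tanh(softplus(x)). A minimal plain-PHP sketch of the function and its derivative follows; the function names are illustrative and do not follow the Rubix ML activation function interface.

```php
<?php

/** Numerically stable softplus: log(1 + exp(x)). */
function softplus(float $x): float
{
    return max($x, 0.0) + log1p(exp(-abs($x)));
}

/** Mish activation: x * tanh(softplus(x)). */
function mish(float $x): float
{
    return $x * tanh(softplus($x));
}

/** Derivative of Mish with respect to x, as needed for backpropagation. */
function mishDerivative(float $x): float
{
    $tsp = tanh(softplus($x));
    $sigmoid = 1.0 / (1.0 + exp(-$x));

    // d/dx [x * tanh(sp(x))] = tanh(sp) + x * sigmoid(x) * (1 - tanh(sp)^2)
    return $tsp + $x * $sigmoid * (1.0 - $tsp ** 2);
}
```

The derivative uses the fact that the derivative of softplus is the sigmoid function.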

This is the comparison of Mish with other conventional activation functions in a SEResNet-50 for CIFAR-10:
[Image: se50_1]

Undefined offset in metric class while training Multilayer Perceptron Classifier

Describe the bug

When attempting to train a Multilayer Perceptron classifier, I occasionally get the following type of exception. I have been able to replicate this with both the MCC and FBeta metrics. Unfortunately, this exception does not occur consistently, even with the same dataset.

[2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
[stacktrace]
#0 /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php(107): Illuminate\\Foundation\\Bootstrap\\HandleExceptions->handleError()
#1 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(414): Rubix\\ML\\CrossValidation\\Metrics\\MCC->score()
#2 /[REDACTED]/vendor/rubix/ml/src/Classifiers/MultilayerPerceptron.php(360): Rubix\\ML\\Classifiers\\MultilayerPerceptron->partial()
#3 /[REDACTED]/vendor/rubix/ml/src/Pipeline.php(189): Rubix\\ML\\Classifiers\\MultilayerPerceptron->train()
#4 /[REDACTED]/vendor/rubix/ml/src/PersistentModel.php(191): Rubix\\ML\\Pipeline->train()
#5 /[REDACTED]/app/Console/Commands/TrainModel.php(89): Rubix\\ML\\PersistentModel->train()
#6 [internal function]: App\\Console\\Commands\\TrainModel->handle()
#7 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array()
#8 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()
#9 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\\Container\\Util::unwrapIfClosure()
#10 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\\Container\\BoundMethod::callBoundMethod()
#11 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\\Container\\BoundMethod::call()
#12 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\\Container\\Container->call()
#13 /[REDACTED]/vendor/symfony/console/Command/Command.php(255): Illuminate\\Console\\Command->execute()
#14 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Symfony\\Component\\Console\\Command\\Command->run()
#15 /[REDACTED]/vendor/symfony/console/Application.php(912): Illuminate\\Console\\Command->run()
#16 /[REDACTED]/vendor/symfony/console/Application.php(264): Symfony\\Component\\Console\\Application->doRunCommand()
#17 /[REDACTED]/vendor/symfony/console/Application.php(140): Symfony\\Component\\Console\\Application->doRun()
#18 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\\Component\\Console\\Application->run()
#19 /[REDACTED]/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(129): Illuminate\\Console\\Application->run()
#20 /[REDACTED]/artisan(37): Illuminate\\Foundation\\Console\\Kernel->handle()
#21 {main}
"}

To Reproduce

The following code occasionally reproduces this error.

$estimator = new PersistentModel(
    new Pipeline(
        [
            new TextNormalizer(),
            new WordCountVectorizer(10000, 3, new NGram(1, 3)),
            new TfIdfTransformer(),
            new ZScaleStandardizer()
        ],
        new MultilayerPerceptron([
            new Dense(100),
            new PReLU(),
            new Dense(100),
            new PReLU(),
            new Dense(100),
            new PReLU(),
            new Dense(50),
            new PReLU(),
            new Dense(50),
            new PReLU(),
        ], 100, null, 1e-4, 1000, 1e-4, 10, 0.1, null, new MCC())
    ),
    new Filesystem($modelPath.'classifier.model')
);

$estimator->setLogger(new Screen('train-model'));

$estimator->train($dataset);

The labelled dataset used is a series of text files split into different directories that indicate their class names. This dataset is built using the following function.

    public static function buildLabeled(): Labeled
    {
        $samples = $labels = [];

        $directories = glob(storage_path('app/dataset/*'));

        foreach($directories as $directory) {
            foreach (glob($directory.'/*.txt') as $file) {
                $text = file_get_contents($file);
                $samples[] = [$text];
                $labels[] = basename($directory);
            }
        }

        return Labeled::build($samples, $labels);
    }

Expected behavior

Training should complete without any errors within the metric class.

Logger and persistence problem

I have extended the Rubix Logger to forward to the Laravel Log facade. This works very well except when you start to use persistence.

$estimator->setLogger(new class extends Logger {
  public function log($level, $message, array $context = array())
   {
     Log::$level($message, $context);
   }
});

When I try to save the estimator, I get this error:
Serialization of 'class@anonymous' is not allowed
vendor/rubix/ml/src/Persisters/Serializers/Native.php:26

I can't seem to figure out why this is happening. I think the logger should not be anywhere near persistence.
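
The error message itself comes from PHP: instances of anonymous classes cannot be serialized, and serializing an object also serializes everything it references, including an attached logger. The sketch below shows this with plain PHP. BaseLogger, FacadeLogger, and StubEstimator are hypothetical stand-ins, not Rubix ML or Laravel classes.

```php
<?php

// Hypothetical base class standing in for a PSR-3 style logger.
abstract class BaseLogger
{
    abstract public function log($level, $message, array $context = []): void;
}

// A named subclass can be serialized, unlike an anonymous class.
class FacadeLogger extends BaseLogger
{
    public function log($level, $message, array $context = []): void
    {
        // Forward to the framework logger here.
    }
}

// Hypothetical estimator that holds a reference to its logger.
class StubEstimator
{
    public $logger;
}

/** Report whether PHP's native serialize() accepts the given object. */
function canSerialize(object $value): bool
{
    try {
        serialize($value);

        return true;
    } catch (\Throwable $e) {
        return false;
    }
}

$anonymous = new StubEstimator();
$anonymous->logger = new class extends BaseLogger {
    public function log($level, $message, array $context = []): void
    {
    }
};

$named = new StubEstimator();
$named->logger = new FacadeLogger();
```

Here canSerialize($anonymous) returns false while canSerialize($named) returns true, so declaring the logger as a named class is one possible workaround.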

k-D LOF bug in LOF score calculation

When neighbors() returns the computed distances in standard LOF, the array is indexed by the original key from the training set. In k-D LOF, however, neighbors() returns a sequential array. Because of this, the k-D version of the LOF score calculation always uses the first k local reachability densities instead of the LRDs of the k nearest neighbors.

Here is the code at fault ...

list($distances) = $this->neighbors($sample, $this->k);
...
$lrds = array_intersect_key($this->lrds, $distances);
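
The effect can be reproduced with plain arrays. In the sketch below all values are made up, and the parallel $indices array is an assumption about what a tree search could return. It shows how intersecting on sequential keys selects the wrong LRDs, and one possible remedy.

```php
<?php

// Local reachability densities keyed by training-set index (made-up values).
$lrds = [0 => 0.9, 1 => 0.8, 2 => 0.1, 3 => 0.7, 4 => 0.2];

// Brute-force style result: distances keyed by the neighbor's original index.
$keyed = [2 => 0.05, 4 => 0.10];

// Tree style result: sequential keys plus a parallel list of neighbor indices.
$distances = [0.05, 0.10];
$indices = [2, 4];

// Intersecting on the keyed array selects the true neighbors' LRDs.
$correct = array_intersect_key($lrds, $keyed);

// Intersecting on the sequential array silently selects the first k LRDs.
$buggy = array_intersect_key($lrds, $distances);

// One possible remedy: turn the neighbor indices into keys before intersecting.
$fixed = array_intersect_key($lrds, array_flip($indices));
```

In this example $buggy holds the LRDs at keys 0 and 1, while $correct and $fixed both hold the LRDs at keys 2 and 4.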

Forecast the time series data

Do you have any solution for forecasting time series data? I have a huge amount of time series data, and I would like to peek into the future.

Possible Numerical Instability in Gaussian Mixture

The current implementation of the Gaussian Mixture clusterer uses probabilities directly instead of taking the log, which is the convention used in other Gaussian/probability-based estimators in Rubix. When computing probabilities, numerical instability arises when such probabilities are very high or (more commonly) very low. By working in log space, extremely small or large probabilities become numbers of moderate magnitude, which avoids numerical underflow and overflow.
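
As an illustration of the log-space idea, the sketch below normalizes per-component log likelihoods with the log-sum-exp trick. It is plain PHP with made-up numbers and is not the Gaussian Mixture implementation itself.

```php
<?php

/**
 * Compute log(sum(exp(x_i))) without overflow or underflow
 * by factoring out the maximum term.
 */
function logSumExp(array $logValues): float
{
    $max = max($logValues);

    if (is_infinite($max)) {
        return $max;
    }

    $sum = 0.0;

    foreach ($logValues as $value) {
        $sum += exp($value - $max);
    }

    return $max + log($sum);
}

// Log joint likelihoods of one sample under three components (made-up values).
$logJoint = [-1050.0, -1052.0, -1060.0];

// Naive approach: every exp() underflows to 0.0, so the total is 0.0.
$naiveTotal = array_sum(array_map('exp', $logJoint));

// Log-space approach: the responsibilities stay finite and sum to 1.
$logTotal = logSumExp($logJoint);

$responsibilities = array_map(function (float $value) use ($logTotal): float {
    return exp($value - $logTotal);
}, $logJoint);
```

Normalizing by $naiveTotal would divide by zero, whereas the log-space responsibilities remain well defined.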

OneHotEncoder feature mismatch on training/test samples

Unsure if this is just a problem with how I handle things, but let's say that I train my estimator and have a OneHotEncoder transformer.

If one of my categorical features has 4 distinct values, (let's use colors for sake of simplicity),
I will get

Red n_1
Green n_2
Yellow n_3
Blue n_4

So, my training data will have sample values which match these additional columns because all of those colors exist in some place in my training data.

However, if I have test samples which do not have those columns, there is no way to "fit" the columns to the training data. Additionally, the methods I would need to get the categories from the transformers within my pipeline are protected, so I can't actually know how the columns were built.

Is this normal, and if so, am I supposed to do some pre-processing/detection before I use the OneHot on my learner so that I can also fit manually myself?
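
To illustrate the general idea of fitting the categories once on the training data and reusing them for test samples, here is a minimal sketch in plain PHP. SimpleOneHot is a hypothetical class, not the Rubix ML OneHotEncoder, and encoding unseen categories as all zeros is an assumption.

```php
<?php

// Minimal one-hot encoder for a single categorical column (illustrative only).
class SimpleOneHot
{
    /** @var array<string,int> Map of category to column offset. */
    private $categories = [];

    // Learn the column layout from the training values.
    public function fit(array $values): void
    {
        $this->categories = array_flip(array_values(array_unique($values)));
    }

    // Expose the fitted categories so callers can inspect the layout.
    public function categories(): array
    {
        return array_keys($this->categories);
    }

    public function encode(string $value): array
    {
        $row = array_fill(0, count($this->categories), 0);

        // Categories unseen during fitting leave the row all zeros.
        if (isset($this->categories[$value])) {
            $row[$this->categories[$value]] = 1;
        }

        return $row;
    }
}

$encoder = new SimpleOneHot();
$encoder->fit(['red', 'green', 'yellow', 'blue', 'red']);

$seen = $encoder->encode('yellow');   // [0, 0, 1, 0]
$unseen = $encoder->encode('purple'); // [0, 0, 0, 0]
```

Because the layout is fixed at fit time, a test sample always produces a row with the same number of columns as the training data.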

Stemming handlers (start with porter?)

It would be great to have stemmers added to the "other" section to reduce the dimensionality of NLP features. This would cut down wasted memory and processing time from things like plurals and generate stronger links for the TfIdf transformer. Stemming is usually applied after basic normalisation and stop-word removal.

I imagine it becoming a 4th option of the WordCountVectorizer. For processing, though, it would make sense for it to kick in during the tokenize method, e.g. in NGram before it stitches the split word tokens back together.

Examples that'd be easy to drop in found at https://tartarus.org/martin/PorterStemmer/php.txt and https://github.com/angeloskath/php-nlp-tools/blob/master/src/NlpTools/Stemmers/PorterStemmer.php

^ tartarus.org/martin being the home of the author of the Porter algorithm.

If feeling more adventurous, there are a bunch of multi-language examples at https://github.com/wamania/php-stemmer (could be added as a Composer dependency?)

Ecommerce recommender with multi events

Great project!
I would like to use it for e-commerce recommendations. I have events like View Product, Buy Product, Add to Cart, View Newsletter, etc.
How would you build something that considers all of this information to recommend a product to a visitor based on their other actions?
Could you create a sample project like the one that predicts a house sale price?

Add Skip Gram Tokenizer

Is your feature request related to a problem? Please describe.
In Natural Language Processing, the bag of words technique is sometimes used to encode text data as word (or token) counts. Using n-grams as tokens, some sentence structure is maintained, yielding more informative features. Another technique closely related to n-grams is the skip-gram method, which skips over words to produce features with broader contexts. Skip-grams have been shown to achieve higher performance in comparison to word counts and n-grams at the cost of a sparser language model.

Describe the solution you'd like
Implement skip-gram (SkipGram) tokenizer

Describe alternatives you've considered
None

Additional context
None
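
As a rough illustration of the request, the sketch below generates k-skip-bigrams in plain PHP. The function name, the simplistic word splitting, and the restriction to pairs of words are assumptions for the example and are not a proposed Rubix ML interface.

```php
<?php

/**
 * Generate k-skip-bigrams: ordered word pairs separated by at most
 * $skip intervening words. With $skip = 0 this reduces to plain bigrams.
 */
function skipBigrams(string $text, int $skip): array
{
    // Split on runs of non-word characters (a simplistic word tokenizer).
    $words = preg_split('/\W+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    $n = count($words);

    for ($i = 0; $i < $n; ++$i) {
        // The furthest partner is $skip + 1 positions ahead, clipped to the end.
        $last = min($i + $skip + 1, $n - 1);

        for ($j = $i + 1; $j <= $last; ++$j) {
            $tokens[] = $words[$i] . ' ' . $words[$j];
        }
    }

    return $tokens;
}

$tokens = skipBigrams('the quick brown fox', 1);
```

For the sample sentence with a skip of 1, the tokens are "the quick", "the brown", "quick brown", "quick fox", and "brown fox".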

CART does not consider all features when outputting importances

Describe the bug
The CART implementation of feature importances does not consider completely unused feature columns in its output.

To Reproduce
Return feature importances with a substantially high dimensional dataset such that not every column is considered when determining a split.

Expected behavior
A more appropriate behavior would be to output a feature importance of 0 thus indicating that the column was never used in a split node.
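
A minimal sketch of that expected behavior in plain PHP follows. The function name is hypothetical, the input values are made up, and the normalization step is an assumption rather than a description of the CART implementation.

```php
<?php

/**
 * Expand sparse importances (only columns used in splits) into a dense
 * vector with an explicit 0.0 for every unused column.
 */
function denseImportances(array $sparse, int $numColumns): array
{
    $dense = array_replace(array_fill(0, $numColumns, 0.0), $sparse);

    $total = array_sum($dense);

    // Normalize so the importances sum to 1 when any split was made.
    if ($total > 0.0) {
        foreach ($dense as &$value) {
            $value /= $total;
        }

        unset($value);
    }

    return $dense;
}

// Columns 1 and 4 were used in splits; the impurity decreases are made up.
$importances = denseImportances([1 => 3.0, 4 => 1.0], 6);
```

The result has one entry per column, with 0.0 for the four columns that never appeared in a split.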

Make K Means an Online Learner with inertia

K Means is a well-known clusterer used throughout industry due to its speed and accuracy. The current implementation of K Means uses mini batch gradient descent and stops when the number of training samples that change cluster assignment over an epoch falls below a threshold. Mini batch gradient descent is often used in online learners as a way to iteratively train a model over time. However, the current stopping criterion becomes non-intuitive in the context of partial training. Specifically, we'd like the stop to be based on some minimum change in the loss function.

Thus, I propose to use the inertia cost function, a.k.a. the sum of distances between each training sample and its nearest cluster centroid, as the basis for a new stopping criterion that would enable online learning.
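
The proposed criterion could be sketched in plain PHP as follows. The function names are hypothetical, and the use of squared Euclidean distance (the usual definition of inertia) is an assumption, since the proposal only says "sum of distances".

```php
<?php

/** Squared Euclidean distance between two equal-length vectors. */
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return $sum;
}

/** Inertia: sum over samples of the distance to the nearest centroid. */
function inertia(array $samples, array $centroids): float
{
    $total = 0.0;

    foreach ($samples as $sample) {
        $best = INF;

        foreach ($centroids as $centroid) {
            $best = min($best, squaredDistance($sample, $centroid));
        }

        $total += $best;
    }

    return $total;
}

/** Stop when the loss changed by less than $minChange since the last epoch. */
function hasConverged(float $previous, float $current, float $minChange): bool
{
    return abs($previous - $current) < $minChange;
}

$samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]];
$centroids = [[0.0, 1.0], [10.0, 10.0]];

$loss = inertia($samples, $centroids); // 1.0 + 1.0 + 0.0
```

Because the loss is computed from the centroids alone, the same check works after a call to partial training as well as after a full training run.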

Native Time Series and Forecasting Support (Sequence Learning)

Time series analysis is a popular machine learning technique for forecasting trends of time-dependent variables such as stock price, GDP, and quarterly sales. Given the popularity (#35, #38, #40) and current lack of tooling within the PHP ecosystem, I propose adding native time series support as well as a new type of estimator class for forecasting time series datasets. This includes the following ...

  1. A data structure extending Dataset for time series datasets that includes an additional index for timestamps
  2. An additional estimator type "Forecaster" to predict the next k values in a series

There should be no need to modify any of the public interfaces to integrate these features into the current architecture.

Proposed initial Forecaster implementations:

  • ARIMA - AutoRegressive Integrated Moving Average (univariate)
  • VARMAX - Vector AutoRegressive Moving Average with eXogenous regressors (multivariate)

Open to comments

Persistable SVC Model

Hi,

I'm currently trying to classify some text using the SVC classifier. I'm able to predict using the original object that I trained, but the problem occurs when I try to load the persisted model. I get an error saying:

PHP Fatal error: Uncaught svmexception: No model available to classify with in /var/www/html/rubix/vendor/rubix/ml/src/Classifiers/SVC.php:194
Stack trace:
#0 /var/www/html/rubix/vendor/rubix/ml/src/Classifiers/SVC.php(194): svmmodel->predict(Array)
#1 /var/www/html/rubix/vendor/rubix/ml/src/Pipeline.php(178): Rubix\ML\Classifiers\SVC->predict(Object(Rubix\ML\Datasets\Unlabeled))
#2 /var/www/html/rubix/vendor/rubix/ml/src/PersistentModel.php(129): Rubix\ML\Pipeline->predict(Object(Rubix\ML\Datasets\Unlabeled))
#3 /var/www/html/rubix/teste2.php(26): Rubix\ML\PersistentModel->predict(Object(Rubix\ML\Datasets\Unlabeled))
#4 {main}
thrown in /var/www/html/rubix/vendor/rubix/ml/src/Classifiers/SVC.php on line 194

Is there something I missed?

Is it possible to generate a deterministic model from MLP Regressor?

Hi Andrew,

It is really an excellent work to build a machine learning library for PHP. A big thank you for that.

Recently I have been trying to use MLP Regressor in Rubix ML on a specific set of data. However, every time I run it I get a different trained model, even though I have fixed the weight initializer and the bias initializer.

So my two questions:

  1. I suspect it is because the holdout ratio (0.01 in the following code) makes the learner use a different portion of the data for training and validation on every run. Is there a way to disable the holdout and use 100% of the data for training?

  2. I understand constant initialization is not good practice, but what I am after is something similar to the random.seed() function in Python so that the random numbers are deterministic. Is it possible to have this feature added?

$estimator = new MLPRegressor([
    new Dense(50, new Constant(1.0), new Constant(0.0)), 
    new Activation(new Relu()),
], 256, new Adam(0.001), 1e-4, 500, 1e-4, 10, 0.01, new LeastSquares(), new RMSE());

Thanks.
J.
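
For comparison with Python's random.seed(), PHP's own Mersenne Twister generator can be seeded with mt_srand(), which also makes shuffle() repeatable. The sketch below shows this in plain PHP. Whether the library draws all of its randomness from this generator is an assumption that would need to be checked.

```php
<?php

/** Shuffle a copy of $items after seeding PHP's Mersenne Twister generator. */
function seededShuffle(array $items, int $seed): array
{
    mt_srand($seed);
    shuffle($items);

    return $items;
}

$first = seededShuffle(range(1, 10), 42);
$second = seededShuffle(range(1, 10), 42);
```

Both calls return the same ordering because the generator is reset to the same seed before each shuffle.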

K-d Tree Accelerated Local Outlier Factor (LOF)

Local Outlier Factor (LOF) is a popular distance-based anomaly detector. K-d trees are spatial binary search trees that offer fast O(log n) search queries as well as pruning for nearest neighbors search. The current implementation of Local Outlier Factor uses the brute force method, which has the advantage of being an online learner but requires neighbor searches that are quadratic, O(n^2), in the number of samples. By implementing a version of LOF that uses k-d tree accelerated neighbors search under the hood, inference speed would be drastically improved (roughly O(n log n)). The implementation would use the already existing KDTree base class.

Load from Binary Serializer causes "Object Not Found" error on partial training

Describe the bug
When loading a model using the binary serializer and doing a partial train on the model, an error is thrown. "UnexpectedValueException: Object not found". The exact line that throws the error is line 119 of Optimizers\Adam.php.

[$velocity, $g2] = $this->cache[$param];

This does not occur when using the native serializer. My guess is that the binary serializer is not restoring the protected properties in Adam.php.

To Reproduce
Steps to reproduce the behavior:

  1. Create MLP estimator using Adam optimizer
  2. Train MLP
  3. Save model to Filesystem or Redis using Binary serializer
  4. Load model
  5. Do a partial train on the model.
  6. See error

Expected behavior
No error. The model should train successfully after being reloaded.

Additional context
I'm basing the MLP on the one used in the Text Sentiment repo, which uses a Pipeline to initialize the estimator. Below is a stack trace. On a side note, seriously great project. Thanks for your effort. It motivated me to learn more about ML.

Stacktrace

[2019-03-26 09:45:59] local.ERROR: Object not found {"userId":1,"exception":"[object] (UnexpectedValueException(code: 0): Object not found at C:\\[path]\\vendor\\rubix\\ml\\src\\NeuralNet\\Optimizers\\Adam.php:119):
[stacktrace]
0. C:\\[path]\\vendor\\rubix\\ml\\src\\NeuralNet\\Optimizers\\Adam.php(119): SplObjectStorage->offsetGet(Object(Rubix\\ML\\NeuralNet\\Parameter))
1. C:\\[path]\\vendor\\rubix\\ml\\src\\NeuralNet\\Layers\\Multiclass.php(266): Rubix\\ML\\NeuralNet\\Optimizers\\Adam->step(Object(Rubix\\ML\\NeuralNet\\Parameter), Object(Rubix\\Tensor\\Matrix))
2. C:\\[path]\\vendor\\rubix\\ml\\src\\NeuralNet\\FeedForward.php(212): Rubix\\ML\\NeuralNet\\Layers\\Multiclass->back(Array, Object(Rubix\\ML\\NeuralNet\\Optimizers\\Adam))
3. C:\\[path]\\vendor\\rubix\\ml\\src\\NeuralNet\\FeedForward.php(169): Rubix\\ML\\NeuralNet\\FeedForward->backpropagate(Array)
4. C:\\[path]\\vendor\\rubix\\ml\\src\\Classifiers\\MultiLayerPerceptron.php(373): Rubix\\ML\\NeuralNet\\FeedForward->roundtrip(Object(Rubix\\ML\\Datasets\\Labeled))
5. C:\\[path]\\vendor\\rubix\\ml\\src\\Pipeline.php(164): Rubix\\ML\\Classifiers\\MultiLayerPerceptron->partial(Object(Rubix\\ML\\Datasets\\Labeled))
6. C:\\[path]\\app\\Http\\Controllers\\TestController.php(286): Rubix\\ML\\Pipeline->partial(Object(Rubix\\ML\\Datasets\\Labeled))
7. C:\\[path]\\app\\Http\\Controllers\\TestController.php(98): App\\Http\\Controllers\\TestController->tweetSentiment()
8. [internal function]: App\\Http\\Controllers\\TestController->test()

Memory ballooning in certain transformers

Some Transformers that call the Dataset method columnsByType() suffer from a memory ballooning issue when dealing with large datasets. The issue is caused by the fact that the columnsByType() method returns a copy of the data in the dataset. The issue can be resolved by determining the column datatype inside the main loop, thereby keeping only one column in memory at a time.

The Transformers affected are Z Scale Standardizer, Robust Standardizer, One Hot Encoder, Interval Discretizer, and Variance Threshold Filter.
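
The proposed resolution could look roughly like the following plain-PHP sketch, which checks each column's type inside the loop and extracts a single column at a time. The function name is hypothetical, and inferring the type from the first row only is a simplifying assumption.

```php
<?php

/**
 * Compute per-column means for the numeric columns of a row-major
 * sample matrix, materializing only one column at a time.
 */
function columnMeans(array $samples): array
{
    $means = [];
    $numColumns = count($samples[0] ?? []);

    for ($column = 0; $column < $numColumns; ++$column) {
        $first = $samples[0][$column];

        // Decide the column type inside the loop from its first value.
        if (!is_int($first) && !is_float($first)) {
            continue;
        }

        // Only this one column is copied out of the dataset.
        $values = array_column($samples, $column);

        $means[$column] = array_sum($values) / count($values);
    }

    return $means;
}

$means = columnMeans([[1.0, 'a', 10], [3.0, 'b', 20]]);
```

Each extracted column goes out of scope on the next iteration, so peak memory stays close to one column rather than a second copy of the whole dataset.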

Interoperability of persisted models

Hey!
Thanks for what seems to be a huge amount of work. I appreciate it :)

I was wondering if it is possible to use models trained with a different framework? E.g. there is a large amount of pre-trained tensorflow models available and it would be amazing to be able to use those natively in PHP.

Classification/Extra Trees sometimes fail to output all probabilities

The Classification Tree implementation intermittently fails to produce a probability for every class label. This is due to falsely assuming that the underlying CART will return all class labels. CART intentionally does not guarantee this, to allow flexibility in the future. However, the extending estimator must provide a full class list with probabilities to fulfill the Probabilistic API, and therefore must default to a probability of 0 for each class regardless of whether CART returned a probability for it.

$template = array_fill_keys($this->classes, 0.);

$probs = array_replace($template, $node->probabilities());

should do the trick.
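
As a standalone illustration of the two calls above, the sketch below merges a partial set of leaf probabilities into a zero-filled template. The function name `fullProbabilities()` and the sample labels are made up for the example; only `array_fill_keys()` and `array_replace()` come from the snippet above.

```php
<?php

/**
 * Merge the probabilities found at a leaf into a template that contains
 * every known class, so that absent classes default to 0.0.
 *
 * @param string[] $classes all class labels seen during training
 * @param array<string, float> $leafProbabilities probabilities at the leaf
 * @return array<string, float>
 */
function fullProbabilities(array $classes, array $leafProbabilities): array
{
    $template = array_fill_keys($classes, 0.0);

    // Keys keep the order of the template; values come from the leaf when present.
    return array_replace($template, $leafProbabilities);
}

$probabilities = fullProbabilities(
    ['cat', 'dog', 'frog'],
    ['dog' => 0.75, 'cat' => 0.25]
);
```

The class `frog` never reached the leaf, yet it still appears in the output with a probability of 0.0, which is what the Probabilistic API needs.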

Logger and persistence problem

I have extended the Rubix Logger to forward messages to the Laravel Log facade. This works very well until you start to use persistence.

$estimator->setLogger(new class extends Logger {
    public function log($level, $message, array $context = array())
    {
        Log::$level($message, $context);
    }
});

When I try to save the estimator, I get this error:
Serialization of 'class@anonymous' is not allowed
vendor/rubix/ml/src/Persisters/Serializers/Native.php:26

I can't figure out why this is happening. I would not expect the logger to be involved in persistence.
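
A likely explanation, assuming the estimator keeps the logger as an object property: PHP's native serialization walks the whole object graph, so the logger is serialized along with the estimator, and PHP does not allow instances of anonymous classes to be serialized. The sketch below reproduces this in plain PHP. `EstimatorStub` and `NamedLogger` are hypothetical stand-ins, not Rubix ML classes.

```php
<?php

// Hypothetical stand-in for an estimator that stores its logger as a property.
final class EstimatorStub
{
    /** @var object */
    private $logger;

    public function __construct(object $logger)
    {
        $this->logger = $logger;
    }
}

// A named logger class can be serialized along with the object that holds it.
final class NamedLogger
{
    public function log(string $level, string $message, array $context = []): void
    {
        // Forward to the application's logger here.
    }
}

function canSerialize(object $object): bool
{
    try {
        serialize($object);

        return true;
    } catch (Throwable $exception) {
        // PHP throws "Serialization of 'class@anonymous' is not allowed".
        return false;
    }
}

$withAnonymous = new EstimatorStub(new class {
    public function log(string $level, string $message, array $context = []): void
    {
    }
});

$anonymousWorks = canSerialize($withAnonymous);
$namedWorks = canSerialize(new EstimatorStub(new NamedLogger()));
```

If that assumption holds, one workaround to try is declaring the logger as a regular named class instead of an anonymous one.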

Predict market trend up or down, unlabeled?

Hello,

I'm starting with ML and trying to predict stock trend up or down based on stock history. There are two challenges which I cannot seem to solve at the moment.

I have my stock history: data containing the price, volume, and number of trades at a given point in time. I think I need to treat this as unlabeled data, since I have not labeled which trend a given data point belongs to. Is that correct? When training on the historical data, I get a warning that labels are missing, so I'm unsure how to handle and train on unlabeled data.

Secondly, a timeline is also in play. I do not know how to handle this in the library.

Any help is much appreciated.

Thanks,
Bastiaan

KDLOF - Neighborhood cannot be empty

When training the k-d tree-based Local Outlier Factor (KDLOF), if the highest variance column has a heavy tail such that the median is the smallest value in the column, then the partition will result in one group that has no members - throwing the Neighborhood cannot be empty exception in the Neighborhood factory method.

The fix is simple: make the partition operation inclusive of the column split value.
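
The failure mode and the fix can be shown with a toy partition function. This is a sketch under the assumption that the failing split sends values strictly less than the split value to the left group. `partition()` and its `$inclusive` flag are hypothetical and do not mirror the library's internals.

```php
<?php

/**
 * Split the values into two groups around a split value.
 *
 * @param float[] $values
 * @return array{0: float[], 1: float[]} the left and right groups
 */
function partition(array $values, float $split, bool $inclusive): array
{
    $left = $right = [];

    foreach ($values as $value) {
        // Inclusive mode also sends values equal to the split to the left.
        $goesLeft = $inclusive ? $value <= $split : $value < $split;

        if ($goesLeft) {
            $left[] = $value;
        } else {
            $right[] = $value;
        }
    }

    return [$left, $right];
}

// Heavy-tailed column: the median (1.0) is also the minimum.
$column = [1.0, 1.0, 1.0, 5.0, 90.0];
$median = 1.0;

[$strictLeft, $strictRight] = partition($column, $median, false);
[$inclusiveLeft, $inclusiveRight] = partition($column, $median, true);
```

With the strict comparison the left group is empty, which corresponds to the reported exception. With the inclusive comparison both groups have members.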

Undefined Offset, when making an image prediction

ErrorException : Undefined offset: 4677996

at /home/ubuntu/environment/vendor/rubix/ml/src/Kernels/Distance/Euclidean.php:43
39| {
40| $distance = 0.;
41|
42| foreach ($a as $i => $value) {
43| $distance += ($value - $b[$i]) ** 2;
44| }
45|
46| return sqrt($distance);
47| }

Exception trace:

1 Illuminate\Foundation\Bootstrap\HandleExceptions::handleError("Undefined offset: 4677996", "/home/ubuntu/environment/vendor/rubix/ml/src/Kernels/Distance/Euclidean.php", [])
/home/ubuntu/environment/vendor/rubix/ml/src/Kernels/Distance/Euclidean.php:43

2 Rubix\ML\Kernels\Distance\Euclidean::compute()
/home/ubuntu/environment/vendor/rubix/ml/src/Classifiers/KNearestNeighbors.php:273
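
One plausible cause, given that the error is raised while reading `$b[$i]`: the two samples do not have the same number of features, which can happen when images of different dimensions are flattened into vectors. The sketch below is a standalone version of the distance computation shown in the trace with a length guard added. The function name and the exception are assumptions made for the example, not the library's code.

```php
<?php

/**
 * Euclidean distance with a guard that rejects vectors of unequal length.
 *
 * @param (int|float)[] $a
 * @param (int|float)[] $b
 */
function euclidean(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException(
            'Samples must have the same number of features, '
            . count($a) . ' and ' . count($b) . ' given.'
        );
    }

    $distance = 0.0;

    foreach ($a as $i => $value) {
        $distance += ($value - $b[$i]) ** 2;
    }

    return sqrt($distance);
}

$distance = euclidean([0, 0], [3, 4]);

try {
    // The second vector is shorter, as with a smaller image.
    euclidean([1, 2, 3], [1, 2]);
    $rejected = false;
} catch (InvalidArgumentException $exception) {
    $rejected = true;
}
```

If this is the cause, resizing every image to the same dimensions before training and prediction should make the offsets line up.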

Regarding isolation forest

Greetings, this is a great tool. I have a question: I run Isolation Forest on a dataset multiple times with the same hyperparameters, and I never get similar results. I'm not sure I'm implementing it right, because I think I should get similar results when I run it on the same dataset. This is the code I run multiple times:

$IsolationForest = new IsolationForest(500,0.2,0.05);
$IsolationForest->train($dataset);
$resultado = $IsolationForest->predict($dataset);

I train with the dataset and then predict on it. Is that the way to do it? I appreciate any help. Greetings, Martin.

Implement n-gram tokenizer

Is your feature request related to a problem? Please describe.
For natural language processing tasks, Rubix engineers often use the bag-of-words method to represent documents as fixed-length feature vectors. Currently, the two tokenizers (Word and Whitespace) are only capable of tokenizing a single word at a time. This means that word sequence information is thrown away. We can recover some amount of sentence structure by using n-grams; however, Rubix currently does not offer n-gram support.

Describe the solution you'd like
Implement an n-gram tokenizer with the ability to choose the number of words per token and the separator.

Describe alternatives you've considered
N/A

Additional context
N/A
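
To make the request concrete, here is a minimal sketch of a word n-gram tokenizer in plain PHP. The function name `nGrams()`, its parameters, and the whitespace-based word splitting are assumptions for illustration. It does not follow the library's Tokenizer interface, and it lets n-grams run across sentence boundaries.

```php
<?php

/**
 * Tokenize text into word n-grams of length $min to $max.
 * Hypothetical helper, not the library's API.
 *
 * @return string[]
 */
function nGrams(string $text, int $min = 1, int $max = 2, string $separator = ' '): array
{
    if ($min < 1 || $max < $min) {
        throw new InvalidArgumentException('Require 1 <= min <= max.');
    }

    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    $count = count($words);

    foreach ($words as $i => $word) {
        for ($n = $min; $n <= $max; ++$n) {
            // Stop when the n-gram would run past the last word.
            if ($i + $n > $count) {
                break;
            }

            $tokens[] = implode($separator, array_slice($words, $i, $n));
        }
    }

    return $tokens;
}

$tokens = nGrams('the quick brown fox', 1, 2);
```

With `$min = 1` and `$max = 2` the output contains both the single words and every adjacent pair, so the bag-of-words vocabulary keeps some word order information.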

Post-pruning optimization to CART

The current implementation of CART (Classification and Regression Tree) only utilizes pre-pruning tactics to control overfitting. It has been shown that combining pre-pruning and post-pruning can achieve better regularization effects while speeding up inference. I propose a lightweight heuristic that will work with the impurity system already in place such that nodes with minimal impurity decrease can be pruned.
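
One way such a heuristic could look, as a sketch only: walk the tree bottom-up and collapse any split whose children are both leaves when the weighted impurity decrease of that split falls below a threshold. `TreeNode`, `prune()`, and the `$minDecrease` parameter are hypothetical and are not the library's node classes or API.

```php
<?php

// Minimal binary tree node for illustration (requires PHP 7.4 typed properties).
final class TreeNode
{
    public float $impurity; // impurity of the samples that reached this node
    public int $n;          // number of samples that reached this node
    public ?TreeNode $left;
    public ?TreeNode $right;

    public function __construct(float $impurity, int $n, ?TreeNode $left = null, ?TreeNode $right = null)
    {
        $this->impurity = $impurity;
        $this->n = $n;
        $this->left = $left;
        $this->right = $right;
    }

    public function isLeaf(): bool
    {
        return $this->left === null && $this->right === null;
    }
}

/**
 * Collapse splits whose weighted impurity decrease is below $minDecrease.
 * Assumes every internal node has both a left and a right child.
 */
function prune(TreeNode $node, float $minDecrease): void
{
    if ($node->isLeaf()) {
        return;
    }

    // Prune the subtrees first so that collapses can propagate upward.
    prune($node->left, $minDecrease);
    prune($node->right, $minDecrease);

    // Only splits whose children are both leaves are candidates.
    if (!$node->left->isLeaf() || !$node->right->isLeaf()) {
        return;
    }

    $childImpurity = ($node->left->n * $node->left->impurity
        + $node->right->n * $node->right->impurity) / $node->n;

    if ($node->impurity - $childImpurity < $minDecrease) {
        $node->left = $node->right = null;
    }
}

$root = new TreeNode(
    0.5,
    100,
    new TreeNode(0.4, 60, new TreeNode(0.39, 30), new TreeNode(0.40, 30)),
    new TreeNode(0.1, 40)
);

prune($root, 0.05);
```

In this example the left split only lowers impurity by about 0.005, so it is collapsed into a leaf, while the root split lowers it by about 0.22 and is kept.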

I am open to other suggestions.

Radius Neighbors with Ball Tree Implementation

A less common yet still widely used algorithm in the neighbors family is Radius Neighbors, which makes predictions using the vote of all neighbors within a fixed radius. Ball Trees are efficient spatial data structures that allow pruning of nodes for radius searches, making them a good base implementation for radius neighbors queries. The marriage of the two would provide performance similar to K-d Neighbors using its k-d tree base implementation with pruning. Both a Classifier and a Regressor can be made using a base tree implementation.

The proposal would add two new tree-based learners, i.e. RadiusNeighbors (classifier) and RadiusNeighborsRegressor, as well as a base BallTree implementation with Centroid and Cluster (leaf) nodes. Ball Tree construction will use the top-down method, similar to k-d tree construction.
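
For reference, the prediction rule itself (a vote among all neighbors within a fixed radius) can be sketched without any tree. The code below deliberately uses a brute-force scan over every training sample instead of the proposed ball tree, so it only illustrates the voting logic. `radiusVote()` is a hypothetical function, and returning null when no neighbor is in range is an assumption made for the example.

```php
<?php

/**
 * Classify a sample by majority vote of all training samples within $radius.
 * Brute-force scan for illustration; the proposal uses a ball tree instead.
 *
 * @param array<int, float[]> $samples
 * @param string[] $labels
 * @param float[] $query
 * @return string|null the winning label, or null when no neighbor is in range
 */
function radiusVote(array $samples, array $labels, array $query, float $radius): ?string
{
    $votes = [];

    foreach ($samples as $i => $sample) {
        $distance = 0.0;

        foreach ($sample as $j => $value) {
            $distance += ($value - $query[$j]) ** 2;
        }

        if (sqrt($distance) <= $radius) {
            $votes[$labels[$i]] = ($votes[$labels[$i]] ?? 0) + 1;
        }
    }

    if (empty($votes)) {
        return null;
    }

    // Sort by vote count, highest first, keeping the labels as keys.
    arsort($votes);

    return (string) array_key_first($votes);
}

$samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]];
$labels = ['a', 'a', 'b', 'b'];

$near = radiusVote($samples, $labels, [0.1, 0.1], 1.5);
$far = radiusVote($samples, $labels, [10.0, 10.0], 1.0);
```

A ball tree would return the same neighbors while skipping whole clusters whose bounding ball lies entirely outside the query radius.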

Is it possible to save transformers?

Greetings. I see it is possible to save trained models for later use. Is there a way to also save transformers? For example, if I train a model after applying MaxAbsoluteScaler, can I save the transformation so I can apply it to the samples I will be predicting on with the already trained model? Thanks in advance for any help! Greetings, Martín.
