Hi Andrew, I'm really impressed by this project, that's what PHP ecosystem needs right


Determine the range of expected values about ml HOT 3 CLOSED

rubixml commented on May 12, 2024

Determine the range of expected values


Comments (3)

andrewdalpino commented on May 12, 2024 1

@ilvalerione We're glad you've chosen Rubix ML to learn with. Feel free to ask questions, and welcome to our community.

I don't quite understand your objective, so help me understand.

What is the target variable that you are trying to predict? Duration and Memory Peak?

If so, your labels will be either the duration or the memory peak (not both, since we don't yet support multi-label regressors). Since those variables are continuous in nature, you'll need a regressor to predict the value of duration or memory peak given some input features, such as the hour of the day and the day of the week (using your example). See the section of the docs on inference for more info. Note that despite having regression in the name, Logistic Regression is a classifier.

Since you have a categorical feature 'day of week' in your dataset, you'll need a regressor that is compatible with both categorical and continuous features. For your case, I would recommend a Regression Tree because it is simple, fast, and explainable. Another option is Gradient Boost, which has a tutorial but may be overkill for your dataset.

Unfortunately, neither of those learners can be partially trained. However, you can transform your categorical features to continuous ones using One Hot Encoder, and then you could use Adaline.
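To illustrate what one-hot encoding does to a categorical column such as day of week, here is a minimal sketch in plain PHP. The `oneHotEncodeColumn` helper is hypothetical and written only for this example; it is not the Rubix ML One Hot Encoder, which should be preferred in a real pipeline.

```php
<?php

declare(strict_types=1);

/**
 * One-hot encode a single categorical column of a sample matrix.
 * Hypothetical helper for illustration only, not part of Rubix ML.
 *
 * @param array<int, array<int, int|float|string>> $samples
 * @return array<int, array<int, int|float>>
 */
function oneHotEncodeColumn(array $samples, int $column): array
{
    // Distinct categories, in order of first appearance.
    $categories = array_values(array_unique(array_column($samples, $column)));

    $encoded = [];

    foreach ($samples as $sample) {
        $category = $sample[$column];

        // Drop the categorical column and keep the remaining features.
        unset($sample[$column]);
        $row = array_values($sample);

        // Append one binary feature per known category.
        foreach ($categories as $candidate) {
            $row[] = $candidate === $category ? 1 : 0;
        }

        $encoded[] = $row;
    }

    return $encoded;
}

// [hour_of_day, day_of_week], taken from the example in this thread.
$samples = [
    [10, 'Saturday'],
    [11, 'Saturday'],
    [11, 'Thursday'],
];

$encoded = oneHotEncodeColumn($samples, 1);

// $encoded is [[10, 1, 0], [11, 1, 0], [11, 0, 1]]
print_r($encoded);
```

After encoding, every feature is numeric, which is what a learner that only accepts continuous features needs.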

One last option is to use KNN Regressor with a Gower distance kernel (since it is compatible with both categorical and continuous data types). KNN has the added benefit of implementing the Online interface; however, it can become computationally intractable with large training sets.

You can obtain a 'confidence interval', or 'range of expected values' to use your words, through cross-validation, in which the model is tested on unseen data. A report such as Residual Analysis will give you error metrics such as MAE (mean absolute error). An MAE of 10 means that predictions are off by 10 on average, so you can treat each prediction as roughly +/- 10.
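As a rough sketch of how an MAE turns into a band around a prediction, the plain PHP below computes the metric by hand. The `meanAbsoluteError` function and all numbers are made up for illustration; in practice the value would come from a cross-validation report on held-out data. Note that this band describes the typical error, not a formal statistical confidence interval.

```php
<?php

declare(strict_types=1);

/**
 * Mean absolute error between predictions and ground-truth labels.
 * Hypothetical helper for illustration only.
 *
 * @param array<int, float> $predictions
 * @param array<int, float> $labels
 */
function meanAbsoluteError(array $predictions, array $labels): float
{
    if (count($labels) === 0 || count($predictions) !== count($labels)) {
        throw new InvalidArgumentException('Need equal, non-empty arrays.');
    }

    $total = 0.0;

    foreach ($predictions as $i => $prediction) {
        $total += abs($prediction - $labels[$i]);
    }

    return $total / count($labels);
}

// Made-up durations from a held-out test set and the model's predictions.
$labels = [12.0, 20.0, 68.0, 40.0];
$predictions = [15.0, 18.0, 60.0, 47.0];

$mae = meanAbsoluteError($predictions, $labels); // (3 + 2 + 8 + 7) / 4 = 5.0

// Use the MAE as a rough band around a new prediction.
$nextPrediction = 30.0;
$band = [$nextPrediction - $mae, $nextPrediction + $mae]; // [25.0, 35.0]

printf("MAE: %.1f, expected range: %.1f to %.1f\n", $mae, $band[0], $band[1]);
```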

Could it be that what you are really looking for is a way to forecast this time series so that you can predict the next k time steps starting from an initial timestamp? If so, you'll have to wait for time series support.


andrewdalpino commented on May 12, 2024

Hi @ilvalerione, thanks for the interesting question.

Let me start by making sure that our understandings are consistent.

Logistic Regression is a type of Online classifier whose prediction is a class label such as 'cat', 'dog', etc. It can also output a probability distribution over these classes as it implements the Probabilistic interface. Is this what you mean by 'range of correctness?'

As a side effect, Logistic Regression can also be used as a supervised anomaly detector where the class labels are 'anomaly', 'not anomaly.' Is this how you plan to use the estimator? As opposed to an unsupervised online anomaly detector such as Gaussian MLE?

As with the recent questions in issues #38 and #35, your problem involves non-stationary time series data, which Rubix ML does not currently support. There are models, for example ARIMA, that can handle non-stationary time series natively and, given the recent interest, I am currently looking into how models like these would fit into the Rubix ML architecture. As such, we may end up implementing time series support in the near future.


ilvalerione commented on May 12, 2024

Hi @andrewdalpino, thank you for your message.
After reading your documentation I understand my problem better, and I appreciate your learning-oriented content.

I should point out that I am approaching this from a developer's point of view, so many details may be beyond my skills.

I was thinking of using Logistic Regression to classify data by "hour of the day" and "day of the week":

$transactions = [
    // [duration, memory_peak, hour_of_day, day_of_week],

    [12.1, 4.2, 10, 'Saturday'],
    [20.0, 6.7, 11, 'Saturday'],
    [68.35, 12.0, 11, 'Thursday'],
];

In this way I'm trying to correlate duration and memory_peak, but linking this classification to the hour and the day of the week amounts to assuming that the data is weekly seasonal. I thought that using an online detector could mitigate the seasonal assumption by changing the model over time.

It can also output a probability distribution over these classes as it implements the Probabilistic interface. Is this what you mean by 'range of correctness?'

Yes, I planned to use this information to build the "dynamic grey band" in the chart.

I'm not sure that classifiers are the right choice for this scenario because, in the end, I think I'm dealing with an unsupervised dataset. I can't know which past samples are anomalies, so I can't train the model accordingly. My thinking is that the ability to tell whether a sample is an anomaly should be acquired by the algorithm itself, based on the historical dataset.

Thanks to your advice I understand Gaussian MLE better; it could be another reasonable approach.
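One way to picture the unsupervised, online idea is a running Gaussian fit over a single metric, where the band is the mean plus or minus a few standard deviations. The plain PHP below is a simplified stand-in under that assumption; the `RunningGaussian` class is hypothetical and is not the Rubix ML Gaussian MLE implementation. It also ignores seasonality entirely.

```php
<?php

declare(strict_types=1);

/**
 * Online estimate of mean and variance (Welford's algorithm).
 * Hypothetical class for illustration only.
 */
final class RunningGaussian
{
    private int $n = 0;
    private float $mean = 0.0;
    private float $m2 = 0.0; // Sum of squared deviations from the mean.

    /** Fold one new observation into the running statistics. */
    public function update(float $value): void
    {
        $this->n++;
        $delta = $value - $this->mean;
        $this->mean += $delta / $this->n;
        $this->m2 += $delta * ($value - $this->mean);
    }

    /**
     * Expected range as mean +/- $width standard deviations.
     *
     * @return array{0: float, 1: float}
     */
    public function band(float $width = 3.0): array
    {
        if ($this->n < 2) {
            throw new LogicException('Need at least two observations.');
        }

        // Maximum likelihood estimate of the standard deviation (divides by n).
        $std = sqrt($this->m2 / $this->n);

        return [$this->mean - $width * $std, $this->mean + $width * $std];
    }

    /** A value outside the band is flagged as an anomaly. */
    public function isAnomaly(float $value, float $width = 3.0): bool
    {
        [$low, $high] = $this->band($width);

        return $value < $low || $value > $high;
    }
}

$model = new RunningGaussian();

// Made-up historical durations; no anomaly labels are needed.
foreach ([10.0, 12.0, 11.0, 13.0, 9.0] as $duration) {
    $model->update($duration);
}

[$low, $high] = $model->band();

printf("Expected range: %.2f to %.2f\n", $low, $high);
var_dump($model->isAnomaly(68.35)); // bool(true)
var_dump($model->isAnomaly(12.1));  // bool(false)
```

Because `update` can be called on every new transaction, the band shifts as the data changes, which is the behaviour the "dynamic grey band" asks for.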

I hear more and more often about algorithms like ARIMA or SARIMA (S - seasonal).

I'm a developer who is trying to implement better solutions to problems. This is a completely new world for me, so thank you for the information.
