
Comments (3)

andrewdalpino commented on May 12, 2024

@ilvalerione We're glad you've chosen Rubix ML to learn. Feel free to ask questions, and welcome to our community.

I don't quite understand your objective, so help me understand.

What is the target variable that you are trying to predict? Duration and Memory Peak?

If so, your labels will be either the duration or the memory peak (not both at once, since we don't yet support multi-label regressors). Since those variables are continuous, you'll need a regressor to predict the value of duration or memory peak given some input features, such as the hour of the day and the day of the week (using your example). See the section of the docs on inference for more info. Note that despite having regression in the name, Logistic Regression is a classifier.
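As a rough sketch of what "labels" and "input features" mean here, the rows from the question can be split into samples (hour and day) and a single label column (duration). This is plain PHP with no Rubix ML calls; the variable names are illustrative, and the second row's memory peak is assumed to be 6.7.

```php
<?php
// Rows shaped like the example in the question:
// [duration, memory_peak, hour_of_day, day_of_week]
// Assumption: the second row's memory peak is read as 6.7.
$transactions = [
    [12.1, 4.2, 10, 'Saturday'],
    [20.0, 6.7, 11, 'Saturday'],
    [68.35, 12.0, 11, 'Thursday'],
];

$samples = [];
$labels = [];

foreach ($transactions as [$duration, $memoryPeak, $hour, $day]) {
    // Input features: what is known before the transaction runs.
    $samples[] = [$hour, $day];

    // Target: one continuous value per sample (duration here).
    // To predict memory peak instead, use $memoryPeak as the label.
    $labels[] = $duration;
}
```

A second model trained on the same samples with `$memoryPeak` as the label would cover the other target.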

Since you have a categorical feature, 'day of week', in your dataset, you'll need a regressor that is compatible with both categorical and continuous features. For your case, I would recommend a Regression Tree because it is simple, fast, and explainable. Another option is Gradient Boost, which has a tutorial but may be overkill for your dataset.

Unfortunately, neither of those learners can be partially trained. However, you can transform your categorical features into continuous ones using One Hot Encoder, and then you could use Adaline.
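To illustrate what one-hot encoding does to the 'day of week' column, here is a minimal plain-PHP sketch. The helper `oneHotEncodeColumn` is hypothetical and is not Rubix ML's One Hot Encoder; it only shows the idea of replacing one categorical column with one 0/1 column per category.

```php
<?php
// Hypothetical helper for illustration only (not the Rubix ML transformer).
// Replaces the categorical column at index $column with 0/1 indicator columns.
function oneHotEncodeColumn(array $samples, int $column): array
{
    // Distinct categories, in order of first appearance.
    $categories = array_values(array_unique(array_column($samples, $column)));

    $encoded = [];

    foreach ($samples as $sample) {
        $value = $sample[$column];

        // Drop the categorical column and keep the remaining features.
        unset($sample[$column]);
        $row = array_values($sample);

        // Append one indicator per category.
        foreach ($categories as $category) {
            $row[] = $value === $category ? 1 : 0;
        }

        $encoded[] = $row;
    }

    return $encoded;
}

$samples = [
    [10, 'Saturday'],
    [11, 'Saturday'],
    [11, 'Thursday'],
];

// Each row becomes [hour_of_day, is_Saturday, is_Thursday].
$encoded = oneHotEncodeColumn($samples, 1);
```

After this step every feature is numeric, which is what a learner that only accepts continuous features needs.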

One last option is to use KNN Regressor with a Gower distance kernel (since it is compatible with both categorical and continuous data types). KNN has the added benefit of implementing the Online interface, although it can become computationally intractable with large training sets.
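The reason Gower distance works on mixed data can be sketched in a few lines of plain PHP. The function `gowerDistance` and the `$ranges` argument are assumptions made for this illustration, not the Rubix ML kernel's interface: categorical features contribute 0 when equal and 1 otherwise, continuous features contribute their absolute difference scaled by the feature's range, and the result is the average over all features.

```php
<?php
// Illustrative sketch of Gower distance between two mixed-type samples.
// $ranges maps the index of each continuous feature to its value range.
function gowerDistance(array $a, array $b, array $ranges): float
{
    $total = 0.0;

    foreach ($a as $i => $x) {
        $y = $b[$i];

        if (is_string($x)) {
            // Categorical feature: simple mismatch.
            $total += $x === $y ? 0.0 : 1.0;
        } else {
            // Continuous feature: range-normalized absolute difference.
            $total += abs($x - $y) / $ranges[$i];
        }
    }

    return $total / count($a);
}

// Assumption: hour_of_day spans 0 to 23, so its range is 23.
$ranges = [0 => 23.0];

$sameDay = gowerDistance([10, 'Saturday'], [11, 'Saturday'], $ranges);
$otherDay = gowerDistance([10, 'Saturday'], [11, 'Thursday'], $ranges);
```

Here `$otherDay` is larger than `$sameDay` because the day mismatch adds a full unit before averaging, while the one-hour gap adds only 1/23.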

You can obtain a 'confidence interval', or perhaps a 'range of expected values' to use your words, through cross-validation, in which the model is tested on unseen data. A report such as Residual Analysis will give you error metrics such as MAE (mean absolute error). An MAE of 10 means that predictions are off by 10 on average, in the units of the target, so a typical prediction can be read as roughly +/- 10.
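For reference, MAE is just the average of the absolute differences between predictions and true values. The sketch below is plain PHP and does not use the Residual Analysis report; the labels are the durations from the question, and the predictions are made-up numbers used only to show the arithmetic.

```php
<?php
// Mean absolute error: average of |prediction - label|.
function meanAbsoluteError(array $predictions, array $labels): float
{
    if (count($predictions) !== count($labels) || count($labels) === 0) {
        throw new InvalidArgumentException('Need equally sized, non-empty arrays.');
    }

    $sum = 0.0;

    foreach ($predictions as $i => $prediction) {
        $sum += abs($prediction - $labels[$i]);
    }

    return $sum / count($labels);
}

// Durations from the example rows in the question.
$labels = [12.1, 20.0, 68.35];

// Hypothetical model outputs, invented for this illustration.
$predictions = [15.0, 18.0, 60.0];

// Absolute errors are 2.9, 2.0 and 8.35, so the MAE is 13.25 / 3.
$mae = meanAbsoluteError($predictions, $labels);
```

Computed on held-out data, this number gives a rough width for a band of expected values around each prediction.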

Could it be that what you are really looking for is a way to forecast this time series so that you can predict the next k time steps starting from an initial timestamp? If so, you'll have to wait for time series support.


andrewdalpino commented on May 12, 2024

Hi @ilvalerione, thanks for the interesting question.

Let me start by making sure that our understandings are consistent.

Logistic Regression is a type of Online classifier whose prediction is a class label such as 'cat', 'dog', etc. It can also output a probability distribution over these classes as it implements the Probabilistic interface. Is this what you mean by 'range of correctness?'

As a side effect, Logistic Regression can also be used as a supervised anomaly detector where the class labels are 'anomaly', 'not anomaly.' Is this how you plan to use the estimator? As opposed to an unsupervised online anomaly detector such as Gaussian MLE?

As with the recent inquiries in issues #38 and #35, your problem involves non-stationary time series data, which Rubix ML does not currently support. There are models, for example ARIMA, that can handle non-stationary time series natively and, given the recent interest, I am currently looking into how models like these will fit into the Rubix ML architecture. As such, we may end up implementing time-series support in the near future.


ilvalerione commented on May 12, 2024

Hi @andrewdalpino, thank you for your message.
After reading your documentation I understand my problem better, and I appreciated its learning-oriented content.

I should emphasize that I am approaching this from a developer's point of view, so many details may be beyond my skills.

I was thinking of using Logistic Regression to classify data by "hour of the day" and "day of the week":

$transactions = [
    // [duration, memory_peak, hour_of_day, day_of_week],
    [12.1, 4.2, 10, 'Saturday'],
    [20.0, 6.7, 11, 'Saturday'],
    [68.35, 12.0, 11, 'Thursday'],
];

In this way I'm trying to correlate duration and memory_peak, but linking this classification to the hour and day of the week amounts to assuming that the data is weekly seasonal. I thought that using an online detector could mitigate the seasonal assumption by changing the model over time.

> It can also output a probability distribution over these classes as it implements the Probabilistic interface. Is this what you mean by 'range of correctness?'

Yes, I thought I could use this information to build the "dynamic grey band" in the chart.

I'm not sure that classifiers are the right choice for this scenario because, in the end, I think I'm dealing with an unlabeled dataset. I have no way of knowing which past samples are anomalies, so I cannot train the model accordingly. My thinking is that the ability to decide whether a sample is an anomaly should be acquired by the algorithm itself, based on the historical dataset.

Thanks to your advice I now understand Gaussian MLE better; it could be another reasonable approach.
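To make the idea concrete for myself, here is a rough plain-PHP sketch of a Gaussian-style band over one metric. It is not Rubix ML's Gaussian MLE detector, which may decide anomalies differently; the functions `fitGaussian` and `isAnomaly`, the history values, and the 3-standard-deviation threshold are all assumptions for illustration.

```php
<?php
// Estimate the mean and standard deviation of past values
// (maximum likelihood estimates, so the variance divides by n).
function fitGaussian(array $values): array
{
    $n = count($values);
    $mean = array_sum($values) / $n;

    $variance = 0.0;

    foreach ($values as $value) {
        $variance += ($value - $mean) ** 2;
    }

    $variance /= $n;

    return [$mean, sqrt($variance)];
}

// Flag a value that falls more than $k standard deviations from the mean.
function isAnomaly(float $value, float $mean, float $std, float $k = 3.0): bool
{
    return abs($value - $mean) > $k * $std;
}

// Hypothetical past durations for one hour/day slot (no labels needed).
$history = [12.0, 13.0, 11.0, 12.5, 11.5];

[$mean, $std] = fitGaussian($history);

// The band that could be drawn in grey around the expected value.
$band = [$mean - 3 * $std, $mean + 3 * $std];

$spike = isAnomaly(68.35, $mean, $std);
$normal = isAnomaly(12.1, $mean, $std);
```

Refitting on a sliding window of recent values would let the band move over time, which is the behavior I was hoping to get from an online detector.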

I hear more and more often about algorithms like ARIMA or SARIMA (S - seasonal).

I'm a developer who is trying to implement better solutions to problems. This is a completely new world for me, so thank you for the information.

