
Accelerate learning (closed · 9 comments)

rubixml commented on May 12, 2024
Accelerate learning


Comments (9)

andrewdalpino commented on May 12, 2024

Hi @YuK1Game, thanks for the question

As of 0.0.17-beta we do not support CPU or GPU multithreading

However, this is a feature that we are currently working on

What learner are you using?

How many samples do you have?

How many features do you have?

Have you seen the section of the FAQ entitled Training is slower than usual?


YuK1Game commented on May 12, 2024

I'm using ...

        // Uses Rubix\ML\PersistentModel, Rubix\ML\Regressors\GradientBoost,
        // Rubix\ML\Regressors\RegressionTree, and Rubix\ML\Persisters\Filesystem.
        $estimator = new PersistentModel(
            new GradientBoost(new RegressionTree(4), 0.1), // max depth 4, learning rate 0.1
            new Filesystem($this->modelFilepath, true)     // keep a history of saved models
        );

with 100,000 records and 11 columns.

I've set the memory limit to unlimited:

    ini_set('memory_limit', '-1');

but it's only using about 140 MB.


andrewdalpino commented on May 12, 2024

Ok everything seems pretty reasonable so far ...

How long is training taking?

What are you comparing the training time to? XGBoost? ScikitLearn?

The implementation of Gradient Boost is similar to the ScikitLearn one with a few exceptions ...

  • Rubix ML GBM supports both categorical and continuous data; sklearn does not
  • Sklearn does the gradient computation and gradient descent step over multiple threads using NumPy under the hood; Rubix ML is single-threaded for the time being
  • Sklearn offers GBMs with either a regular decision tree or a light (histogram-splitting) decision tree that is quicker; the Rubix ML decision tree already implements an optimization that is somewhere in between the two - instead of using histograms, it uses the percentile method (see the sketch below)
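
As a rough illustration of the percentile idea - a hypothetical sketch, not Rubix ML's actual CART internals - instead of evaluating every unique value of a continuous column as a split threshold, only a small fixed number of quantiles are evaluated:

    // Hypothetical helper, not part of the Rubix ML API: pick k - 1 quantiles
    // of a continuous feature column to use as candidate split thresholds.
    function percentileCandidates(array $values, int $k = 10) : array
    {
        sort($values);

        $n = count($values);

        $candidates = [];

        for ($i = 1; $i < $k; $i++) {
            $candidates[] = $values[(int) round($i * ($n - 1) / $k)]; // nearest-rank quantile
        }

        return array_values(array_unique($candidates));
    }

    // Only k - 1 thresholds are evaluated per continuous column, instead of
    // one threshold per unique value as in an exhaustive search.
    $candidates = percentileCandidates([5.1, 3.3, 7.8, 2.4, 9.0, 4.2], 4);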


YuK1Game commented on May 12, 2024

Training started:

[2019-11-29 11:44:52] netkeiba.INFO: Training base learner

and it's currently at:

[2019-12-03 05:54:20] netkeiba.INFO: Epoch 328 score=0.39623370712645 loss=13.036411570958
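
That works out to roughly 328 epochs in about 90 hours, i.e. on the order of 16 minutes per epoch on average.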

I can't compare the training time to anything else - this is my first attempt.

thanks


andrewdalpino commented on May 12, 2024

Are your features categorical or continuous or a mix of both?

Where are you extracting the data from?

How long does the learner take between epochs?

What version of PHP are you using?
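
For reference on the epoch-timing question: GradientBoost implements the Verbose interface, so attaching any PSR-3 logger will timestamp each epoch. A minimal sketch, assuming the 0.x-era namespace for the bundled Screen logger:

    use Rubix\ML\Other\Loggers\Screen;

    // Verbose learners write per-epoch progress (with timestamps) to any
    // attached PSR-3 logger.
    $gbm = new GradientBoost(new RegressionTree(4), 0.1);

    $gbm->setLogger(new Screen('netkeiba'));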


andrewdalpino commented on May 12, 2024

Hey @YuK1Game let me know if you can answer those questions above

I'm thinking there may be an issue with how the data is being imported (perhaps as categorical features instead of continuous). If that's the case, each Regression Tree has to search a much larger space to find the best split, which could also help explain the low R Squared score.

Another possibility is an issue with garbage collection.
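
If the import does turn out to be the culprit, one quick thing to try is the NumericStringConverter transformer, which casts numeric strings (what PDO often returns from MySQL by default) to integers and floats so those columns are treated as continuous. A minimal sketch, assuming the samples are already loaded into a dataset object:

    use Rubix\ML\Transformers\NumericStringConverter;

    // Cast numeric strings such as "428" to int/float so the Regression
    // Tree treats those columns as continuous rather than categorical.
    $dataset->apply(new NumericStringConverter());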

Any additional context will help me diagnose the issue

Thanks


YuK1Game commented on May 12, 2024

Hi.

A sample row:

array(11) {
    [0]=>    
    string(6) "中山"
    [1]=>
    string(3) "晴"
    [2]=>
    string(3) "重"
    [3]=>
    int(1200)
    [4]=>
    string(15) "サンクララ"
    [5]=>
    int(2)
    [6]=>
    int(3)
    [7]=>
    int(428)
    [8]=>
    int(54)
    [9]=>
    int(9)
    [10]=>
    int(510)
  }

and I want to train on about 120,000 records in total.

Are your features categorical or continuous or a mix of both?

The features are a mix of both, as the dump above shows. The label is a number (a score).

Where are you extracting the data from?

The data comes from a MySQL database and is extracted on the fly.
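
Roughly like this (a simplified sketch - the table and column names here are placeholders):

    // Simplified sketch; table and column names are placeholders.
    $pdo = new PDO('mysql:host=localhost;dbname=netkeiba;charset=utf8mb4', $user, $password);

    $stmt = $pdo->query('SELECT * FROM races LIMIT 120000');

    $samples = $labels = [];

    foreach ($stmt->fetchAll(PDO::FETCH_NUM) as $row) {
        $labels[] = array_pop($row); // last column is the score label
        $samples[] = $row;
    }

    // Note: depending on driver options, PDO may return numeric columns as
    // strings, which would make Rubix ML treat them as categorical.
    $dataset = new \Rubix\ML\Datasets\Labeled($samples, $labels);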

How long does the learner take between epochs?

Epochs were fast at first, but they're getting slower.

At first:

[2019-12-09 01:23:18] test.INFO: Learner init booster=RegressionTree rate=0.1 ratio=0.5 estimators=1000 min_change=0.0001 window=10 hold_out=0.1 metric=RSquared base=DummyRegressor
[2019-12-09 01:23:18] netkeiba.INFO: Training base learner
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 1 score=-0.3767914110961 loss=1824212.958192
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 2 score=-0.35629949818136 loss=1669563.4770574
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 3 score=-0.2977777404172 loss=1381185.1642557
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 4 score=-0.24026440379874 loss=1145989.5650973
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 5 score=-0.21870186348087 loss=1145573.0300535
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 6 score=-0.18248081881129 loss=1142564.511372

and now (running with 8,000 records):

[2019-12-09 01:28:07] netkeiba.INFO: Epoch 21 score=0.010869674438744 loss=217648.58429299
[2019-12-09 01:28:09] netkeiba.INFO: Epoch 22 score=0.010815738916415 loss=216145.40970688
[2019-12-09 01:28:10] netkeiba.INFO: Epoch 23 score=0.038142362305688 loss=212999.1373364
[2019-12-09 01:28:12] netkeiba.INFO: Epoch 24 score=0.037980473891026 loss=209966.29716697
[2019-12-09 01:28:14] netkeiba.INFO: Epoch 25 score=0.037948646127267 loss=206445.56595115
[2019-12-09 01:28:15] netkeiba.INFO: Epoch 26 score=0.038896182805228 loss=200410.77220361
[2019-12-09 01:28:17] netkeiba.INFO: Epoch 27 score=0.039153822056527 loss=200318.92461847

What version of PHP are you using?

$ php -v
PHP 7.2.14 (cli) (built: Jan  9 2019 22:23:26) ( ZTS MSVC15 (Visual C++ 2017) x64 )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies

Let me know if you still need more information.

thanks


andrewdalpino commented on May 12, 2024

Thanks for the info @YuK1Game

It looks like you are getting sub-second performance per epoch

I see that duration starts to rise to about 1 - 2 seconds per epoch as training progresses ... it's hard to say whether that indicates a problem because of the way Regression Trees work under the hood.

I see you have both categorical and continuous features in your dataset. Searching for the best split of a Regression Tree is handled differently for categorical and continuous feature columns - and one can be much faster than the other. For example, if a categorical feature column has 10 possible choices, then the tree only needs to search a space of 10 discrete values. However, if it is a continuous column, then a set of k percentiles (a linear operation, in expectation, in the number of samples at that node) along with as many as 200 comparisons will need to be computed. The disparity shown in the excerpt of your training log could be explained by this. However, I would need to see the full training log to be certain.

To clarify, this is with an 8,000 sample dataset? If so, performance seems to be good

What is the duration between epochs using the full dataset (100,000 samples)?

Is the learner able to converge to a good solution with a small dataset? (say, greater than a 0.7 R Squared score)

Also, it would probably be best for you to post the whole training log - more information is always better than less when debugging issues with many factors, such as performance.

Thanks


andrewdalpino commented on May 12, 2024

Hi @YuK1Game

The CART implementation has been optimized in the latest commit 89f6991

We're seeing up to an order-of-magnitude speed improvement with Gradient Boost as a result, particularly on large datasets. Give the latest dev-master a try, or wait until the next release.
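
For example, to pull in the development branch with Composer (the package name is rubix/ml; the exact version constraint may vary):

    composer require rubix/ml:dev-master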

