Comments (9)
Hi @YuK1Game, thanks for the question.
As of 0.0.17-beta we do not support CPU or GPU multithreading; however, this is a feature we are currently working on.
What learner are you using?
How many samples do you have?
How many features do you have?
Have you seen the section of the FAQ entitled Training is slower than usual?
from ml.
I'm using the following:

use Rubix\ML\PersistentModel;
use Rubix\ML\Regressors\GradientBoost;
use Rubix\ML\Regressors\RegressionTree;
use Rubix\ML\Persisters\Filesystem;

$estimator = new PersistentModel(
    new GradientBoost(new RegressionTree(4), 0.1),
    new Filesystem($this->modelFilepath, true)
);

with 100,000 records and 11 columns, and an unlimited memory setting:

ini_set('memory_limit', -1);
Ok everything seems pretty reasonable so far ...
How long is training taking?
What are you comparing the training time to? XGBoost? ScikitLearn?
The implementation of Gradient Boost is similar to the ScikitLearn one, with a few exceptions ...
- Rubix ML GBM supports both categorical and continuous data; sklearn does not
- Sklearn does the gradient computation and gradient descent step over multiple threads using NumPy under the hood; Rubix ML is single-threaded for the time being
- Sklearn offers GBMs with either a regular decision tree or a lighter (histogram-splitting) decision tree that is quicker; the Rubix ML decision tree already implements an optimization that is somewhere in between the two - instead of using histograms, it uses the percentile method
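As a rough illustration of the percentile method mentioned above (a simplified sketch, not the actual Rubix ML implementation), split candidates for a continuous column can be limited to k evenly spaced percentiles rather than every unique value:

```php
<?php

/**
 * Simplified sketch of percentile-based split candidate selection.
 * Instead of evaluating every unique value in a continuous column,
 * only $k evenly spaced percentiles are considered as split points.
 */
function splitCandidates(array $values, int $k = 3): array
{
    sort($values);

    $n = count($values);

    $candidates = [];

    for ($i = 1; $i <= $k; ++$i) {
        // Index of the i-th of k evenly spaced percentiles
        $index = (int) round($i / ($k + 1) * ($n - 1));

        $candidates[] = $values[$index];
    }

    return array_values(array_unique($candidates));
}
```

This keeps the number of candidate splits constant in k instead of growing with the number of unique values at the node, which is where the speedup over exhaustive search comes from.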
Training started with:
[2019-11-29 11:44:52] netkeiba.INFO: Training base learner
and is currently at:
[2019-12-03 05:54:20] netkeiba.INFO: Epoch 328 score=0.39623370712645 loss=13.036411570958
I can't compare it to anything since this is my first attempt.
Thanks
Are your features categorical or continuous or a mix of both?
Where are you extracting the data from?
How long does the learner take between epochs?
What version of PHP are you using?
Hey @YuK1Game let me know if you can answer those questions above
I'm thinking there may be an issue with how the data is being imported (perhaps as categorical features instead of continuous). If that is the case, then each Regression Tree will have to search a much larger space to find the best split, which could also help explain the low R Squared score.
It could also be an issue with garbage collection.
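If numeric values are being imported as strings, one way to check and correct it (a hypothetical helper, not part of Rubix ML) is to cast numeric strings back to their native types before building the dataset:

```php
<?php

/**
 * Hypothetical helper: cast numeric strings to int/float so that they
 * are treated as continuous features rather than categorical ones.
 */
function castNumericStrings(array $row): array
{
    return array_map(function ($value) {
        if (is_string($value) && is_numeric($value)) {
            // Whole numbers become ints, other numeric strings floats
            return ctype_digit($value) ? (int) $value : (float) $value;
        }

        return $value; // Leave genuine categorical values untouched
    }, $row);
}
```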
Any additional context will help me diagnose the issue
Thanks
Hi.
Here is a sample row:
array(11) {
[0]=>
string(6) "中山"
[1]=>
string(3) "晴"
[2]=>
string(3) "重"
[3]=>
int(1200)
[4]=>
string(15) "サンクララ"
[5]=>
int(2)
[6]=>
int(3)
[7]=>
int(428)
[8]=>
int(54)
[9]=>
int(9)
[10]=>
int(510)
}
I want to train on 120,000 records.
Are your features categorical or continuous or a mix of both?
A mix of both, as shown in the sample row above; the label is a number (the score).
Where are you extracting the data from?
The data is extracted from a MySQL database, and extraction itself is immediate.
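For reference, a sketch of how rows fetched from MySQL (e.g. via PDO's fetchAll) could be split into samples and labels; the example row below and the trailing score column are placeholders, not taken from the actual table:

```php
<?php

// Placeholder rows as they might come back from PDO::fetchAll(PDO::FETCH_NUM),
// with a hypothetical label (the score, 87 here) appended as a 12th column.
$rows = [
    ['中山', '晴', '重', 1200, 'サンクララ', 2, 3, 428, 54, 9, 510, 87],
];

$samples = [];
$labels = [];

foreach ($rows as $row) {
    $labels[] = array_pop($row); // last column becomes the label
    $samples[] = $row;           // the remaining 11 columns are the features
}
```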
How long does the learner take between epochs?
Epochs are fast early on but slow down as training progresses.
At first:
[2019-12-09 01:23:18] test.INFO: Learner init booster=RegressionTree rate=0.1 ratio=0.5 estimators=1000 min_change=0.0001 window=10 hold_out=0.1 metric=RSquared base=DummyRegressor
[2019-12-09 01:23:18] netkeiba.INFO: Training base learner
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 1 score=-0.3767914110961 loss=1824212.958192
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 2 score=-0.35629949818136 loss=1669563.4770574
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 3 score=-0.2977777404172 loss=1381185.1642557
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 4 score=-0.24026440379874 loss=1145989.5650973
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 5 score=-0.21870186348087 loss=1145573.0300535
[2019-12-09 01:23:18] netkeiba.INFO: Epoch 6 score=-0.18248081881129 loss=1142564.511372
and now (training with 8,000 records):
[2019-12-09 01:28:07] netkeiba.INFO: Epoch 21 score=0.010869674438744 loss=217648.58429299
[2019-12-09 01:28:09] netkeiba.INFO: Epoch 22 score=0.010815738916415 loss=216145.40970688
[2019-12-09 01:28:10] netkeiba.INFO: Epoch 23 score=0.038142362305688 loss=212999.1373364
[2019-12-09 01:28:12] netkeiba.INFO: Epoch 24 score=0.037980473891026 loss=209966.29716697
[2019-12-09 01:28:14] netkeiba.INFO: Epoch 25 score=0.037948646127267 loss=206445.56595115
[2019-12-09 01:28:15] netkeiba.INFO: Epoch 26 score=0.038896182805228 loss=200410.77220361
[2019-12-09 01:28:17] netkeiba.INFO: Epoch 27 score=0.039153822056527 loss=200318.92461847
What version of PHP are you using?
$ php -v
PHP 7.2.14 (cli) (built: Jan 9 2019 22:23:26) ( ZTS MSVC15 (Visual C++ 2017) x64 )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
Let me know if you still need more information.
thanks
Thanks for the info @YuK1Game
It looks like you are getting sub-second performance per epoch
I see that the duration starts to rise to about 1 - 2 seconds per epoch as training progresses ... it's hard to say whether that is an indicator of an issue because of the way that Regression Trees work under the hood. I see you have both categorical and continuous features in your dataset. Searching for the best split of a Regression Tree is handled differently for categorical and continuous feature columns - and one can be much faster than the other. For example, if a categorical feature column has 10 possible choices, then the tree only needs to search a space of 10 discrete values. However, if it is a continuous column, then a set of k percentiles (a linear operation in the number of samples at that node split, in expectation) along with as many as 200 comparisons will need to be computed. The disparity shown in the excerpt of your training log could be explained by this. However, I would need to see the full training log in order to be certain.
To clarify, this is with an 8,000 sample dataset? If so, performance seems to be good
What is the duration between epochs using the full dataset (100,000 samples)?
Is the learner able to converge to a good solution with a small dataset? (say, greater than a 0.7 R Squared score)
Also, it would probably be best for you to post the whole training log - more information is always better than less when it comes to debugging issues with many factors, such as performance
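For reference, the R Squared score mentioned above works out to 1 minus the ratio of the residual to the total sum of squares; a minimal sketch in plain PHP (not the Rubix ML RSquared class):

```php
<?php

/**
 * Minimal R Squared sketch: 1 - (residual sum of squares / total sum of squares).
 * A score of 1 is a perfect fit; 0 is no better than predicting the mean label.
 */
function rSquared(array $labels, array $predictions): float
{
    $mean = array_sum($labels) / count($labels);

    $ssRes = $ssTot = 0.0;

    foreach ($labels as $i => $label) {
        $ssRes += ($label - $predictions[$i]) ** 2;
        $ssTot += ($label - $mean) ** 2;
    }

    return 1.0 - $ssRes / $ssTot;
}
```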
Thanks
Hi @YuK1Game
The CART implementation has been optimized in the latest commit 89f6991
We're seeing up to an order of magnitude speed improvement with Gradient Boost as a result, particularly on large datasets. Give the latest dev-master a try, or wait until the next release.