Comments (6)
Hi! Yes, you're right: those two parameters control the size of the model.
The size of a gradient boosted decision tree model is determined by the total number of nodes across all of its trees. You can control the model size by limiting the number of nodes in each tree and the total number of trees trained.
The most straightforward way to control the model size is with the `max_depth`, `max_rounds`, and `max_leaf_nodes` parameters.
These parameters control the size of the model:

- `max_rounds`: controls the total number of trees in the model.
- `max_depth`: limits the number of nodes in each individual tree, because trees are only allowed to grow to that depth.
- `max_leaf_nodes`: decreasing this also limits the total number of nodes in your trees, because once this many leaf nodes have been added, the tree stops growing.
- `min_examples_per_node`: increasing this limits the number of nodes in a tree as well, because during training a node with fewer than `min_examples_per_node` examples is not added to the tree.
- `min_sum_hessians_per_node`: increasing this helps prevent overfitting and also limits the number of nodes added to an individual tree.
- `min_gain_to_split`: increasing this also helps prevent overfitting and limits the number of nodes added to an individual tree.
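To make the interplay of these parameters concrete, here is an illustrative sketch of how such stopping conditions typically gate node splitting during tree growth. This is a hypothetical gate for illustration only, not tangram's actual implementation; the struct and function names are made up.

```rust
// Hypothetical training limits mirroring the parameters described above.
// This is a sketch of the typical gating logic, not tangram's real code.
struct Limits {
    max_depth: usize,
    max_leaf_nodes: usize,
    min_examples_per_node: usize,
    min_sum_hessians_per_node: f32,
    min_gain_to_split: f32,
}

/// Decide whether a candidate node may be split further.
/// All conditions must hold, so tightening any one parameter
/// reduces the number of nodes in the tree.
fn should_split(
    limits: &Limits,
    depth: usize,
    leaf_count: usize,
    n_examples: usize,
    sum_hessians: f32,
    gain: f32,
) -> bool {
    depth < limits.max_depth
        && leaf_count < limits.max_leaf_nodes
        && n_examples >= limits.min_examples_per_node
        && sum_hessians >= limits.min_sum_hessians_per_node
        && gain >= limits.min_gain_to_split
}

fn main() {
    let limits = Limits {
        max_depth: 6,
        max_leaf_nodes: 255,
        min_examples_per_node: 20,
        min_sum_hessians_per_node: 1e-3,
        min_gain_to_split: 0.0,
    };
    // A node at depth 3 with plenty of examples may split...
    println!("{}", should_split(&limits, 3, 10, 100, 1.0, 0.5)); // true
    // ...but a node already at max_depth may not.
    println!("{}", should_split(&limits, 6, 10, 100, 1.0, 0.5)); // false
}
```

Because every condition is an AND, each parameter independently caps tree growth, which is why any one of them can be used to bound model size.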
If model size is a major concern, you can also train linear models, which are smaller.
Currently, the `.tangram` file contains both the report that we display in the tangram app and the model that you use to make predictions. We plan to add the ability to produce an optimized model used just for predictions that strips all of the reporting information. Here is the issue I just created to track that: tangramdotdev/tangram#49.
I'm going to keep this issue open until we add documentation to our website explaining this!
Also, does your dataset contain text columns?
from modelfox.
Thanks for the very detailed answer; it helps clarify the effect of the hyperparameters! It would be great to have this information added to the docs.
The dataset I'm looking at contains 30 float columns and a binary enum column as the target.
Great! I'll make sure to add it to the docs :)
The reason I asked about text columns is that, by default, we create a large number of features for them, which can greatly increase model size; we are adding support to customize that now.
It seems like a tree `Node` has a size of 72 bytes (as determined by the patch below). So the binary classifier should have a size of approximately `72 bytes * <average number of nodes per tree> * <number of trees>`, which should be less than `72 bytes * max_leaf_nodes * max_rounds`, right? (This neglects branch nodes, but as far as I can see their number is not directly limited?)
Patch

```diff
diff --git a/crates/tree/lib.rs b/crates/tree/lib.rs
index fe030f8..1691bbe 100644
--- a/crates/tree/lib.rs
+++ b/crates/tree/lib.rs
@@ -124,6 +124,12 @@ pub struct Tree {
 	pub nodes: Vec<Node>,
 }
 
+// This assertion intentionally fails; the panic message reports the
+// actual size of `Node` (72 bytes here).
+#[test]
+fn node_size() {
+	assert_eq!(std::mem::size_of::<Node>(), 0);
+}
+
 impl Tree {
 	/// Make a prediction.
 	pub fn predict(&self, example: &[tangram_table::TableValue]) -> f32 {
```
Every tree is a binary tree, so `<number of branch nodes> = <number of leaf nodes> - 1`. The total number of nodes in any given tree is therefore `2 * <number of leaf nodes> - 1 <= 2 * max_leaf_nodes - 1`, which means the total number of nodes (leaf nodes and branch nodes) in all of the trees is less than `2 * max_leaf_nodes * max_rounds`.
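Under the assumption of roughly 72 bytes per node (the in-memory size measured earlier; the serialized size may differ), this bound can be turned into a quick back-of-the-envelope estimate:

```rust
// Rough upper bound on in-memory model size for a GBDT, assuming a
// fixed per-node size. 72 bytes is the measured in-memory size of
// `Node` from the patch above; serialized size may differ.
fn max_model_bytes(max_leaf_nodes: u64, max_rounds: u64, node_bytes: u64) -> u64 {
    // A binary tree with L leaves has L - 1 branch nodes,
    // so at most 2 * L - 1 nodes per tree.
    let max_nodes_per_tree = 2 * max_leaf_nodes - 1;
    node_bytes * max_nodes_per_tree * max_rounds
}

fn main() {
    // Hypothetical settings, for illustration only.
    let bytes = max_model_bytes(255, 100, 72);
    println!("{bytes} bytes"); // 3664800 bytes, about 3.5 MiB
}
```

This is only an upper bound on the tree nodes themselves; the `.tangram` file also carries the reporting data mentioned above, so the file on disk can be larger.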
The serialized size of the `Branch` and `Leaf` nodes is different from the in-memory size. I can look into this and get back to you on the exact sizes of each of those nodes.
So `max_leaf_nodes` also limits the branch nodes directly. Thanks for clarifying!
> The serialized size of the `Branch` and `Leaf` nodes is different from the in-memory size. I can look into this and get back to you on the exact sizes of each of those nodes.
Tangram seems to use a binary serialization format, so I would expect the serialized size to be similar to the in-memory size (maybe minus the padding, plus the data for the report). I was just trying to estimate what model sizes I should expect, so the exact sizes aren't necessary. Thank you!