yubin-park / bonsai-dt

Programmable Decision Tree Framework

Home Page: https://yubin-park.github.io/bonsai-dt/

Python 73.27% Jupyter Notebook 23.05% R 3.68%
machine-learning machine-learning-algorithms boosting-algorithms boosting-tree gradient-boosting-machine decision-tree gbdt

bonsai-dt's Introduction

Bonsai-DT (or Bonsai in short form) is a programmable decision tree framework written in Python (and a little bit of Cython). Using Bonsai, you can quickly design/build/customize new decision tree algorithms simply by writing these two functions:

  • find_split()
  • is_leaf()

The intent of this project is to provide a quick test bed for various decision tree ideas. Although Bonsai's speed is comparable to other implementations, speed and scalability are not the primary focus of the framework.

Many decision trees, such as C4.5 and CART, are already implemented and available in the Bonsai templates. Even ensemble models, such as Gradient Boosting, Random Forests and PaloBoost, are readily available. For the full list of the Bonsai templates, please see the Model Templates section.

Contents

  1. Installing
  2. Background
  3. Design Functions
  4. Code Examples
  5. Model Templates
  6. Glossaries
  7. Authors
  8. License

Installing

  • Required: numpy and cython
  • Run python setup.py develop
  • NOTE: Bonsai is still under rapid development. Please use this at your own discretion.
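For example, a typical install might look like this (assuming you clone the GitHub repository named above):

$ git clone https://github.com/yubin-park/bonsai-dt.git
$ cd bonsai-dt
$ python setup.py develop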

Background

A decision tree is a recursive data-partitioning algorithm. Well-known algorithms include CART (Classification and Regression Trees), C4.5, ID3, and CHAID. Although these decision trees may look very different from the outside, without loss of generality, the inner workings of most trees can be written as follows:

import numpy as np

def decision_tree(X, y):

    # Stopping criteria
    if is_leaf(X, y):
        return np.mean(y)

    split_variable, split_value = find_split(X, y)

    # Partition data into two sets
    X_right, X_left, y_right, y_left = partition_data(X, y, 
        split_variable, split_value)  

    # Recursive call of decision_tree()
    return {"split_variable": split_variable,
        "split_value": split_value,
        "right": decision_tree(X_right, y_right), 
        "left": decision_tree(X_left, y_left)}   

In the code above, X and y represent the feature matrix and the target vector, respectively. The algorithm first checks whether it should stop the recursion - is_leaf(). If not, it searches for a splitting variable and value pair to branch out further - find_split(). Based on the best pair of splitting variable and value, the dataset (X, y) is partitioned into two disjoint sets - one with values smaller than the splitting value, and the other with values greater than or equal to it. Finally, the algorithm recursively repeats the above process on each partitioned dataset.
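As a minimal sketch (not Bonsai's actual implementation), the partition_data helper used above could look like this:

def partition_data(X, y, split_variable, split_value):
    # Samples with values below the split value go to the left node;
    # samples at or above it go to the right node.
    mask = X[:, split_variable] < split_value
    return X[~mask], X[mask], y[~mask], y[mask]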

As can be seen, the behavior of a decision tree is primarily governed by the two functions:

  • find_split() decides which variable/value to split on. For example, C4.5 and ID3 use Information Gain using Shannon Entropy, and CART uses the Gini impurity measure (for classification) and the minimum variance criterion (for regression) to choose a splitting variable.
  • is_leaf() controls when to stop growing the tree. For example, a tree can stop growing if its depth is greater than 4 or its node size is smaller than 30.

Many decision tree algorithms implement these functions internally and expose only a set of parameters through their interfaces. This approach works well for users who want off-the-shelf decision trees. However, to design a new tree that does not exist anywhere, we would need to edit the underlying source code, which can be quite complex and time-consuming. Is there a better way? That is why we developed Bonsai.

Design Functions

Rather than a set of parameters, Bonsai takes two functions as its arguments. In this section, we illustrate the details of these functions, such as arguments to pass on, output formats, and some example implementations.

find_split(avc)

The find_split function has only one argument, avc. The avc variable is a numpy array with 10 columns and a variable number of rows. The name avc stands for Attribute-Value Classlabel Group (or AVC-group), a concept first introduced in the RainForest framework.

Here is how Bonsai works in a nutshell. Whenever find_split is called, Bonsai scans the data and updates a set of online statistics, such as the mean, the second moment, and the count. We say that Bonsai "sketches" the stats on a canvas, where the canvas is just an empty numpy array. When the sketch is done, the canvas is filled with all the statistics relevant to defining splitting criteria. To be clear, we call the canvas after the sketch avc. The column order of avc is as follows:

  • avc[:,0]: AVC indices
  • avc[:,1]: variable indices i.e. column indices starting from 0
  • avc[:,2]: split values i.e. samples below this value go to the left node, and samples at or above it go to the right node
  • avc[:,3]: number of samples at (hypothetical) left node
  • avc[:,4]: sum of y at left node i.e. \sum_{left} y
  • avc[:,5]: sum of y^2 at left node i.e. \sum_{left} y^2
  • avc[:,6]: number of samples at (hypothetical) right node
  • avc[:,7]: sum of y at right node i.e. \sum_{right} y
  • avc[:,8]: sum of y^2 at right node i.e. \sum_{right} y^2
  • avc[:,9]: missing value inclusion (0: left node, 1: right node)

Essentially, the find_split function is given the joint distribution information in the form of an AVC-group. With this information, the function needs to define which properties of the distributions should play a role in finding the best splitting variable and value.

The function returns a Python dictionary with the following required key-value pairs:

  • selected: the selected row of the avc array
  • <additional var>: any additional variables you want to pass on

For example, if you want to design a splitting criterion that minimizes the weighted sum of variances after the split (e.g. the regression tree in CART), you can write find_split as follows:

def find_split(avc):
    # Var(y) = E[y^2] - E[y]^2 for the hypothetical left/right nodes
    var_left = avc[:,5]/avc[:,3] - np.square(avc[:,4]/avc[:,3])
    var_right = avc[:,8]/avc[:,6] - np.square(avc[:,7]/avc[:,6])
    # weighted sum of variances after the split
    varsum = avc[:,3] * var_left + avc[:,6] * var_right
    min_idx = np.argmin(varsum)
    return {"selected": avc[min_idx,:]}

is_leaf(branch, branch_parent)

The is_leaf function takes two arguments: one for the current branch, and the other for its parent branch. Both are Python dictionaries. By default, these branch variables have the following key-value pairs, but they can carry additional key-value pairs if the user defined <additional var> in find_split:

  • depth: the depth of the branch in the tree
  • n_samples: the number of samples in the branch

This function returns a boolean: True if the branch is a leaf, and False otherwise.

For example, if you want to design a stopping criterion that stops at depth 3, you can write is_leaf as follows:

def is_leaf(branch, branch_parent):
    # branches at depth 3 become leaves
    return branch["depth"] > 2
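If find_split passed an <additional var> (say, a hypothetical gain value stored alongside selected), it is available on the branch as well, so richer stopping rules are possible. A sketch:

def is_leaf(branch, branch_parent):
    # stop on small nodes, or when the (hypothetical) split gain
    # recorded by find_split becomes negligible
    return (branch["n_samples"] < 30 or
            branch.get("gain", 0.0) < 1e-6)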

Putting Things Together

With the find_split and is_leaf written, now you can put these together to make your own decision tree as follows:

from bonsai.core.bonsai import Bonsai
import numpy as np

class MyTree(Bonsai):

    def __init__(self):

        def find_split(avc):
            # weighted sum of variances after the split (CART regression)
            var_left = avc[:,5]/avc[:,3] - np.square(avc[:,4]/avc[:,3])
            var_right = avc[:,8]/avc[:,6] - np.square(avc[:,7]/avc[:,6])
            varsum = avc[:,3] * var_left + avc[:,6] * var_right
            min_idx = np.argmin(varsum)
            return {"selected": avc[min_idx,:]}

        def is_leaf(branch, branch_parent):
            # branches at depth 3 become leaves
            return branch["depth"] > 2

        Bonsai.__init__(self, find_split, is_leaf)

Now, you can use MyTree just by importing the class in your project.
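For example (assuming MyTree lives in a module you can import from, here a hypothetical mytree.py):

from mytree import MyTree  # hypothetical module name
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=1000)
model = MyTree()
model.fit(X, y)
y_hat = model.predict(X)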

Code Examples

We provide several examples of using Bonsai.

Using a Model Template

Probably the easiest and fastest way to use Bonsai is to use the model templates in Bonsai. The model templates are decision trees that are already built with Bonsai, including a regression tree, C4.5, Alpha-Tree, etc. For example, if you want to use a regression tree:

# Pre-built regression tree using Bonsai
from bonsai.base.regtree import RegTree

from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_friedman1(n_samples=100000) 
n, m = X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize
model = RegTree(max_depth=5)

# Fit
model.fit(X_train, y_train)

# Predict
y_hat = model.predict(X_test)

rmse = np.sqrt(np.mean((y_test - y_hat)**2))

Available model templates are listed in the Model Templates section. Also, take a look at the test scripts under the tests folder to see how to use these templates.

Running a Test Script

The best way to get used to Bonsai is to try running the test scripts. To do so, just go to the tests folder and run one of the Python scripts there.

$ cd tests
$ python paloboost.py 

# Test Regression
-----------------------------------------------------
 model_name     train_time     predict_time   rmse   
-----------------------------------------------------
 baseline       -              -              6.99054
 gbm            3.24681 sec    0.32504 sec    5.76867
 palobst        4.15213 sec    0.28674 sec    5.14846
 sklearn        6.36740 sec    0.07248 sec    5.68470
-----------------------------------------------------

This test script tests the PaloBoost algorithm on Friedman's simulated data (a regression task). As shown above, the script prints the training time, prediction time, and Root Mean Squared Error (RMSE) measured on hold-out data.

Interpreting a Trained Bonsai Tree

In Bonsai, you can easily check the details of a trained tree:

from __future__ import print_function
from bonsai.base.regtree import RegTree
from sklearn.datasets import make_friedman1
import json
X, y = make_friedman1(n_samples=10000) 
model = RegTree(max_depth=1)
model.fit(X, y)
print(json.dumps(model.dump(), indent=2))

This script will output the trained tree in the form of an array of leaf nodes. Here is an example of the output:

[
  {
    "is_leaf": true, 
    "eqs": [
      {
        "svar": 3, 
        "missing": 1, 
        "sval": 0.49608099474494854, 
        "op": "<"
      }
    ], 
    "depth": 1, 
    "n_samples": 5031.0, 
    "y": 11.798433612486887, 
  }, 
  {
    "is_leaf": true, 
    "eqs": [
      {
        "svar": 3, 
        "missing": 0, 
        "sval": 0.49608099474494854, 
        "op": ">="
      }
    ], 
    "depth": 1, 
    "n_samples": 4969.0, 
    "y": 16.969889465743922, 
  }
]

As this is a depth-1 tree, we see only two leaves in the array. In each leaf, eqs stores the logical rules that lead to the leaf, and y indicates the node value, i.e. the predicted value for the leaf. All Bonsai-derived trees produce this form of output.
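Since the dump is plain JSON, the rules are easy to consume programmatically. As a hypothetical sketch (ignoring the missing flag), the dumped leaves could be applied to a sample by hand like this:

def match_leaf(leaf, x):
    # check whether sample x satisfies every rule of the leaf
    for eq in leaf["eqs"]:
        v = x[eq["svar"]]
        if eq["op"] == "<" and not v < eq["sval"]:
            return False
        if eq["op"] == ">=" and not v >= eq["sval"]:
            return False
    return True

def predict_one(leaves, x):
    # the leaves partition the feature space, so the first match wins
    for leaf in leaves:
        if match_leaf(leaf, x):
            return leaf["y"]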

Model Templates

The Bonsai templates mentioned in this document include the regression tree (RegTree), C4.5, Alpha-Tree, PaloBoost, Gradient Boosting, and Random Forests. For the full, up-to-date list, see the home page: https://yubin-park.github.io/bonsai-dt/

Glossaries

Variable/function names that are used in the source code:

  • avc: AVC stands for Attribute-Value Classlabel Group.
  • canvas: Canvas is a numpy array that stores AVC.
  • canvas_dim: The dimension of Canvas. #rows x #columns
  • setup_canvas: Setting up a canvas for AVC.
  • sketch: Sketching refers to the process of filling AVC values into a Canvas. We scan a dataset, construct the statistics relevant to AVC, and fill those values into a canvas one by one. We thought the process resembles sketching in painting - a rapid drawing that sets up a blueprint.
  • erase_canvas: Erase the values in a canvas.

Authors

  • Yubin Park, PhD
  • If you want to contribute, please contact Yubin

License

Apache License 2.0


bonsai-dt's Issues

mondrian tree

Hi Yubin,

Very interesting work! I was wondering if you had any ideas on implementing a Mondrian tree. I have some script from nel215 that looks to be in its most basic form, using just numpy. Do you think there is a way we could implement it in the find_split/is_leaf form that you present?

import numpy as np

class MondrianForestRegressor(object):
    def __init__(self, n_tree):
        self.n_tree = n_tree
        self.trees = []
        for i in range(self.n_tree):
            self.trees.append(MondrianTreeRegressor())

    def fit(self, X, y):
        for tree in self.trees:
            tree.fit(X, y)

    def partial_fit(self, X, y):
        for tree in self.trees:
            tree.partial_fit(X, y)

    def get_params(self, deep):
        return {'n_tree': self.n_tree}

    def predict(self, X):
        res = np.array([tree.predict(X) for tree in self.trees])
        return res.mean(axis=0)

class Node(object):
    def __init__(self, min_list, max_list, tau, is_leaf, stat,
                 parent=None, delta=None, xi=None):
        self.parent = parent
        self.tau = tau
        self.is_leaf = is_leaf
        self.min_list = min_list
        self.max_list = max_list
        self.delta = delta
        self.xi = xi
        self.left = None
        self.right = None
        self.stat = stat

    def update_leaf(self, x, label):
        self.stat.add(x, label)

    def update_internal(self):
        self.stat = self.left.stat.merge(self.right.stat)

    def get_parent_tau(self):
        if self.parent is None:
            return 0.0
        return self.parent.tau

    def __repr__(self):
        return "<mondrianforest.Node tau={} min_list={} max_list={} is_leaf={}>".format(
            self.tau,
            self.min_list,
            self.max_list,
            self.is_leaf,
        )

class RegressorFactory(object):
    def create(self):
        return Regressor()

class MondrianTree(object):
    def __init__(self):
        self.root = None
        self.classes = set()

    def create_leaf(self, x, label, parent):
        leaf = Node(
            min_list=x.copy(),
            max_list=x.copy(),
            is_leaf=True,
            stat=self.stat_factory.create(),
            tau=1e9,
            parent=parent,
        )
        leaf.update_leaf(x, label)
        return leaf

    def extend_mondrian_block(self, node, x, label):
        '''return root of sub-tree'''
        e_min = np.maximum(node.min_list - x, 0)
        e_max = np.maximum(x - node.max_list, 0)
        e_sum = e_min + e_max
        rate = np.sum(e_sum) + 1e-9
        E = np.random.exponential(1.0/rate)
        if node.get_parent_tau() + E < node.tau:
            e_sample = np.random.rand() * np.sum(e_sum)
            delta = (e_sum.cumsum() > e_sample).argmax()
            if x[delta] > node.min_list[delta]:
                xi = np.random.uniform(node.min_list[delta], x[delta])
            else:
                xi = np.random.uniform(x[delta], node.max_list[delta])
            parent = Node(
                min_list=np.minimum(node.min_list, x),
                max_list=np.maximum(node.max_list, x),
                is_leaf=False,
                stat=self.stat_factory.create(),
                tau=node.get_parent_tau() + E,
                parent=node.parent,
                delta=delta,
                xi=xi,
            )
            sibling = self.create_leaf(x, label, parent=parent)
            if x[parent.delta] <= parent.xi:
                parent.left = sibling
                parent.right = node
            else:
                parent.left = node
                parent.right = sibling
            node.parent = parent
            parent.update_internal()
            return parent
        else:
            node.min_list = np.minimum(x, node.min_list)
            node.max_list = np.maximum(x, node.max_list)
            if not node.is_leaf:
                if x[node.delta] <= node.xi:
                    node.left = self.extend_mondrian_block(node.left, x, label)
                else:
                    node.right = self.extend_mondrian_block(node.right, x, label)
                node.update_internal()
            else:
                node.update_leaf(x, label)
            return node

    def partial_fit(self, X, y):
        for x, label in zip(X, y):
            self.classes |= {label}
            if self.root is None:
                self.root = self.create_leaf(x, label, parent=None)
            else:
                self.root = self.extend_mondrian_block(self.root, x, label)

    def fit(self, X, y):
        self.root = None
        self.partial_fit(X, y)

    def _predict(self, x, node, p_not_separated_yet):
        d = node.tau - node.get_parent_tau()
        eta = np.sum(np.maximum(x - node.max_list, 0) +
                     np.maximum(node.min_list - x, 0))
        p = 1.0 - np.exp(-d*eta)
        result = node.stat.create_result(x, p_not_separated_yet * p)
        if node.is_leaf:
            w = p_not_separated_yet * (1.0 - p)
            return result.merge(node.stat.create_result(x, w))
        if x[node.delta] <= node.xi:
            child_result = self._predict(x, node.left, p_not_separated_yet*(1.0-p))
        else:
            child_result = self._predict(x, node.right, p_not_separated_yet*(1.0-p))
        return result.merge(child_result)

    def get_params(self, deep):
        return {}

class MondrianTreeRegressor(MondrianTree):
    def __init__(self):
        MondrianTree.__init__(self)
        self.stat_factory = RegressorFactory()

    def predict(self, X):
        res = []
        for x in X:
            predicted = self._predict(x, self.root, 1.0).get()
            res.append(predicted)
        return np.array(res)

class RegressorResult(object):
    def __init__(self, avg):
        self.avg = avg

    def merge(self, r):
        return RegressorResult(self.avg + r.avg)

    def get(self):
        return self.avg

class Regressor(object):
    def __init__(self):
        self.sum = 0
        self.count = 0

    def add(self, x, y):
        self.sum += y
        self.count += 1

    def merge(self, r):
        res = Regressor()
        res.sum = self.sum + r.sum
        res.count = self.count + r.count
        return res

    def predict(self, x):
        if self.count == 0:
            return 0
        else:
            return self.sum / self.count

    def create_result(self, x, w):
        return RegressorResult(self.predict(x)*w)

AlphaTree risk pyramid

Hi Yubin,

I would like to know how I can create a risk pyramid similar to the one provided in your work in Figure 5. The steps necessary to do so after fitting an AlphaTree aren't very clear to me. I know it will probably be calculated using the current ratio of classes, but I'm not sure whether the predicted y_label also comes into play somehow.

Looking forward to hearing from you!

multiclass handling

Currently, bonsai handles only regression and binary classification tasks. It needs to handle multiclass classification tasks.

Pre-compiled version/release

Hi,

thanks for the time and effort you've put into this project. I wanted to ask: is there any possibility we could get pre-compiled versions/releases so that we don't need to compile anything on our computers? I'm working in an environment where compiling software is problematic.

Thanks for your answer.
