
nanxstats / stackgbm

🌳 Stacked Gradient Boosting Machines

Home Page: https://nanx.me/stackgbm/

License: Other

R 86.69% CSS 9.22% TeX 4.09%
machine-learning decision-trees gbdt gbm gradient-boosting xgboost lightgbm catboost model-stacking ensemble-learning

stackgbm's Introduction

stackgbm


stackgbm offers a minimalist, research-oriented implementation of model stacking (Wolpert, 1992) for gradient boosted tree models built by xgboost (Chen and Guestrin, 2016), lightgbm (Ke et al., 2017), and catboost (Prokhorenkova et al., 2018).

Installation

The easiest way to get stackgbm is to install from CRAN:

install.packages("stackgbm")

Alternatively, to use a new feature or get a bug fix, you can install the development version of stackgbm from GitHub:

# install.packages("remotes")
remotes::install_github("nanxstats/stackgbm")

To install all potential dependencies, check out the instructions from manage dependencies.
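As a minimal sketch (the full steps are in the linked instructions): xgboost and lightgbm are available from CRAN, while catboost is not on CRAN and needs to be installed following its own instructions.

```r
# xgboost and lightgbm can be installed from CRAN; catboost is not
# on CRAN and must be installed per its own install instructions.
install.packages(c("xgboost", "lightgbm"))
```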

Model

stackgbm implements a classic two-layer stacking model: the first layer generates "features" produced by gradient boosting trees. The second layer is a logistic regression that uses these features as inputs.
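A hypothetical usage sketch of the two-layer model (argument names and the shape of `params` here are assumptions, not the confirmed API; see the package reference for details):

```r
library(stackgbm)

# Layer 1: fit xgboost, lightgbm, and catboost learners; their
# predictions become the "features" for the second layer.
# `params_xgb`, `params_lgb`, and `params_cat` are assumed to be
# tuned parameter sets, e.g. from cross-validation.
model <- stackgbm(
  x_train, y_train,
  params = list(params_xgb, params_lgb, params_cat)
)

# Layer 2 (the logistic regression over those features) is fitted
# internally; predict() then returns predictions for new data.
prob <- predict(model, x_test)
```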

Related projects

For a more comprehensive and flexible implementation of model stacking, see stacks in tidymodels, mlr3pipelines in mlr3, and StackingClassifier in scikit-learn.

Code of Conduct

Please note that the stackgbm project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

stackgbm's People

Contributors

nanxstats

Forkers

zheng7ai310

stackgbm's Issues

Add GitHub Actions workflow for pkgdown

Need to customize the r-lib/actions pkgdown workflow (to install catboost) so the vignette and code examples can render with results. This would avoid running pkgdown manually and reduce the repo size.

Revisit license

Perhaps MIT would be a more permissive choice.

lightgbm uses MIT; xgboost and catboost both use Apache 2.0.

Release stackgbm 0.1.0

First release

  • Proofread Title: and Description: and ensure they are informative
  • Check that all exported functions have @returns and @examples
  • Check that Authors@R: includes a copyright holder (role 'cph')
  • Review extrachecks
  • usethis::use_cran_comments() (optional)

Prepare for release

  • Check current CRAN check results
  • Check licensing of included files
  • Review pkgdown reference index for, e.g., missing topics
  • Bump version
  • Update cran-comments.md (optional)
  • Update NEWS.md
  • Review pkgdown website
  • urlchecker::url_check()
  • Check with local machine
  • Check with GitHub Actions
  • Check with win-builder

Submit to CRAN

  • Draft GitHub release
  • Submit to CRAN via web form
  • Approve emails

Wait for CRAN

  • Accepted 🎉
  • Blog post

Improve example in vignette

The current example in the vignette gives better results on macOS than on Linux.

It could use an example that performs more consistently across platforms.
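One possible direction, sketched below: simulate the classification data with a fixed seed so the example's difficulty does not depend on platform-specific datasets or numerical libraries (the data-generating model here is illustrative):

```r
set.seed(42)
n <- 1000
p <- 10
x <- matrix(rnorm(n * p), nrow = n)
# Binary outcome driven by a few informative features, so all
# three boosting libraries can recover the signal consistently.
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2] + 0.5 * x[, 3]))
```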

Future-based parallel over parameter grid

  • The goal is to maximize CPU utilization; increasing n_threads alone doesn't seem to achieve that.
  • The idea is to refactor the for-loop with foreach + %dofuture% to parallelize over the parameter grid.
  • Remove the seed argument so users can call set.seed() outside.
  • Let users configure the future parallel backend outside (or leave it unset to run sequentially).
  • Use progressr (%dofuture% compatible) to display a progress bar.
  • Probably recommend setting n_threads to 1 when using this parallelism.
  • A vignette explaining these would be nice.
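The refactor described above could look roughly like this sketch, where fit_one_model() is a hypothetical placeholder for fitting one boosted model and computing its cross-validation metric:

```r
library(foreach)
library(doFuture)
library(progressr)

# Users choose the backend outside the function, e.g.:
# future::plan(future::multisession)

# param_grid is a data frame of parameter combinations,
# e.g. built with expand.grid().
with_progress({
  p <- progressor(steps = nrow(param_grid))
  results <- foreach(i = seq_len(nrow(param_grid)), .combine = rbind) %dofuture% {
    # n_threads = 1 inside each worker to avoid oversubscription.
    metric <- fit_one_model(param_grid[i, ], n_threads = 1)
    p()
    cbind(param_grid[i, ], metric = metric)
  }
})
```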

Optimize default parameter grid

Previous grid:

  • n_iterations = c(10, 50, 100, 200, 500, 1000)
  • max_depth = c(2, 3, 4, 5)
  • learning_rate = c(0.001, 0.01, 0.02, 0.05, 0.1)

Optimized grid:

  • n_iterations = c(100, 200, 500, 1000)
  • max_depth = c(3, 5, 7, 9)
  • learning_rate = c(0.01, 0.05, 0.1, 0.2)

These values follow the common ranges of the three parameters and balance model complexity, risk of overfitting, and training time.
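For reference, the optimized grid built with expand.grid() has 64 combinations, down from 120 in the previous grid:

```r
grid <- expand.grid(
  n_iterations = c(100, 200, 500, 1000),
  max_depth = c(3, 5, 7, 9),
  learning_rate = c(0.01, 0.05, 0.1, 0.2)
)
nrow(grid)
#> [1] 64
```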

Rethink the parallel parameter

For example, consider having this parameter as a global option instead of a function argument.

  • To save repeated typing: set it once and for all.
  • This behavior is mostly a side effect controlled not by R itself but by the underlying boosting libraries.

Need to do some research on best practices and decide.
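A sketch of the global-option approach (the option name stackgbm.n_threads is hypothetical):

```r
# Set once per session:
options(stackgbm.n_threads = 4)

# Each fitting function would then read it with a fallback default:
n_threads <- getOption("stackgbm.n_threads", default = 1L)
```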
