
lightgbm-benchmark's Introduction

LightGBM benchmarking suite

Badges: AzureML Pipelines Validation · Benchmark scripts gated build

The LightGBM benchmark aims to provide tools and automation to compare implementations of LightGBM and other boosting-tree-based algorithms for both training and inferencing. The focus is on production use cases, evaluating both model quality (validation metrics) and computing performance (training speed, compute hours, inferencing latency, etc.).

The goal is to support the community of developers of LightGBM by providing tools and a methodology for evaluating new releases of LightGBM on a standard and reproducible benchmark.

Documentation

Please find the full documentation of this project at microsoft.github.io/lightgbm-benchmark

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

lightgbm-benchmark's People

Contributors

alimahmoudzadeh, dependabot[bot], dkmiller, jfomhover, marsupialtail, microsoft-github-operations[bot], microsoftopensource, perezbecker, piyushmadan, ruizhuanguw, ulfk-ms


lightgbm-benchmark's Issues

Excessive memory usage when creating synthetic data

When running the script to generate synthetic data, the WSL2 instance on my laptop (which is capped at 12GB of RAM) runs out of memory and the process is killed. As there are users who will presumably be running this script on laptops with 8GB of RAM or less, I suggest reducing the number of train, test, and inferencing samples by a factor of 3. With the reduced settings, the script uses about 4GB of RAM, which I believe is a reasonable compromise.

Partitioning: implement multiple partitioning methods

Let's discuss the assumptions behind partitioning, and which options we should cover in the partitioning data module.

The current module assumes records are independent, which will be wrong in cases where they are grouped.

Implement an LGBM->ONNX model conversion + inferencing

The goal of this task is to add another variant to the inferencing benchmark for LightGBM. We are already comparing lightgbm python, the lightgbm C API, and treelite. We'd like to try onnxruntime, as it seems applicable.

In particular, we'd like to reproduce the results in this post on hummingbird and onnxruntime for classical ML models.

Feel free to reach out to the posters of the blog for collaboration.

The expected impact of this task:

  • increase the value of the benchmark for the lightgbm community, in particular for production scenarios
  • identify better production inferencing technologies

⚠️ It is unknown at this point whether hummingbird allows the conversion of lightgbm>=v3 models to onnx. If that turns out to be impossible, it's still a good thing to know, and to report in the hummingbird issues.
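For illustration, here's a minimal sketch of what the conversion + inferencing pair could look like, assuming hummingbird's convert() API and a scikit-learn-style LightGBM model; whether this works for lightgbm>=v3 models (or raw Boosters) is exactly what this task should verify, and all shapes and names below are placeholders.

import numpy as np
import onnxruntime as ort
from hummingbird.ml import convert
from lightgbm import LGBMRegressor

# train a tiny model just for the sake of the sketch
X = np.random.rand(1000, 40).astype(np.float32)
y = np.random.rand(1000)
model = LGBMRegressor(n_estimators=10).fit(X, y)

# convert to ONNX; the "onnx" backend needs sample input to infer shapes
onnx_container = convert(model, "onnx", X[:10])

# run the converted model through onnxruntime
# (onnx_container.model is assumed here to expose the converted ONNX model)
session = ort.InferenceSession(onnx_container.model.SerializeToString())
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: X[:10]})[0]
print(predictions)

In the actual deliverable, the conversion and the onnxruntime inferencing would live in two separate scripts, with the converted model saved to and loaded from a directory output/input.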

Learning Goals

By working on this project you'll be able to learn:

  • how to use onnxruntime for classical ML models
  • how to compare inferencing technologies in a benchmark
  • how to write components and pipelines for AzureML (component sdk + shrike)

Expected Deliverable:

To complete this task, you need to deliver:

  • 2 working python scripts: one to convert lightgbm models into onnx (using hummingbird?), one to use onnxruntime for inferencing
  • their corresponding working AzureML components
  • a successful run of the lightgbm inferencing benchmark pipeline

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/onnxruntime (or something) for your own work (commit often!).
  3. In src/scripts/model_transformation create a folder lightgbm_to_onnx/ and copy the content of src/scripts/samples/ in it.

Local development

Let's start locally first.

To iterate on your python script, you need to consider a couple of constraints:

  • Follow the instructions in the sample script to modify and make your own.
  • Please consider using inputs and outputs that are provided as directories, not single files. There's a helper function that lets you automatically select the unique file contained in a directory (see function input_file_path in src/common/io.py); a usage sketch follows this list.
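For illustration, a hypothetical usage sketch (check src/common/io.py for the exact signature and how the existing scripts wire it into argparse):

import argparse
from common.io import input_file_path  # assumed import path

parser = argparse.ArgumentParser()
# the input is provided as a directory (as mounted by AzureML), not a single file
parser.add_argument("--data", required=True, type=str,
                    help="directory containing a unique data file")
args = parser.parse_args()

# resolve the unique file contained in the input directory
data_file = input_file_path(args.data)
print(f"resolved input file: {data_file}")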

Here are a couple of pointers to get you started:

Feel free to check out the current treelite modules (model_conversion/treelite_compile and inferencing/treelite_python). They behave similarly. You can also adapt some unit tests from tests/scripts/test_treelite_python.py.

Develop for AzureML

Component specification

  1. First, unit tests. Edit tests/aml/test_components.py and look for the list of components. Add the relative path to your component spec to this list.
    You can test your component by running

    pytest tests/aml/test_components.py -v -k name_of_component
  2. Edit the file spec.yaml in the directory of your component (copied from sample) and align its arguments with the expected arguments of your component until you pass the unit tests.

Integration in the inferencing pipeline

WORK IN PROGRESS

Implement Ray's lightgbm distributed training to compare against regular lightgbm distributed

Ray is a new parallelization framework: "Ray is an open source project that makes it simple to scale any compute-intensive Python workload — from deep learning to production model serving.".

RayML provides a LightGBM integration. We'd like to compare this version against vanilla distributed LightGBM (MPI).
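As a starting point, here's a rough sketch of what the core of the training script might look like, assuming the lightgbm_ray package (Ray's LightGBM integration); the exact API, return types, and parameter names should be confirmed against the lightgbm_ray documentation.

from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import make_regression

# small synthetic dataset just for the sketch
X, y = make_regression(n_samples=100_000, n_features=40)
train_set = RayDMatrix(X, y)

result = train(
    {"objective": "regression", "num_leaves": 31, "learning_rate": 0.1},
    train_set,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),  # 2 distributed workers
)
# the returned object is assumed to wrap a regular lightgbm booster
result.booster_.save_model("model.txt")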

The expected impact of this task:

  • increase the value of the benchmark for the lightgbm community
  • assess Ray's viability as a library for supporting scalable ML workloads

Learning Goals

By working on this project you'll be able to learn:

  • how to use RayML
  • how to compare training in a consistent benchmark

Expected Deliverable:

To complete this task, you need to deliver:

  • 1 working python script to train LightGBM on Ray

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/lightgbmonray (or something) for your own work (commit often!).
  3. In src/scripts/training create a folder lightgbm_on_ray/ and copy the content of src/scripts/samples/ in it.

Note: it might be worth copying heavily from the lightgbm python training script.

Local development

Let's start locally first.

To iterate on your python script, you need to consider a couple of constraints:

  • Follow the instructions in the sample script to modify and make your own.
  • Please consider using inputs and outputs that are provided as directories, not single files. There's a helper function that lets you automatically select the unique file contained in a directory (see function input_file_path in src/common/io.py).

Develop for AzureML

Component specification

  1. First, unit tests. Edit tests/aml/test_components.py and look for the list of components. Add the relative path to your component spec to this list.
    You can test your component by running

    pytest tests/aml/test_components.py -v -k name_of_component
  2. Edit the file spec.yaml in the directory of your component (copied from sample) and align its arguments with the expected arguments of your component until you pass the unit tests.

Integration in the training pipeline

WORK IN PROGRESS

Implement pre-processing and training on LETOR dataset

The goal of this task is to reproduce the results observed in the paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree and test LightGBM on a publicly available, well-known benchmark dataset, to ensure our benchmark's reproducibility. We already have a generic training script for LightGBM, so this task consists of writing a pre-processor for this particular dataset, identifying the right parameters for running LightGBM on this sample dataset, and running it in AzureML.
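As a hedged sketch of what the pre-processor could do, assuming the raw LETOR files are in svmlight format with qid annotations (the exact file layout must be checked against the downloaded dataset, and the path below is a placeholder):

import numpy as np
import lightgbm
from sklearn.datasets import load_svmlight_file

# load the ranking data; qid identifies the query each row belongs to
X, y, qid = load_svmlight_file("data/letor/train.txt", query_id=True)

# LightGBM ranking objectives need group sizes (documents per query);
# this assumes rows of the same query are contiguous in the file
boundaries = np.flatnonzero(np.diff(qid)) + 1
group_sizes = np.diff(np.concatenate(([0], boundaries, [len(qid)])))

train_set = lightgbm.Dataset(X, label=y, group=group_sizes)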

The expected impact of this task is to:

  • establish trust in our benchmark by obtaining comparable results with existing reference benchmarks
  • increase the value of this benchmark for the community by providing reproducible results on standard data

Learning Goals

By working on this project you'll be able to learn:

  • how to write components and pipelines for AzureML (component sdk + shrike)
  • how to use lightgbm in practice on a sample dataset
  • how to use mlflow and AzureML run history to report metrics

Expected Deliverable:

To complete this task, you need to deliver:

  • a working python script to parse the original LETOR dataset to feed into LightGBM
  • a working AzureML component
  • [stretch] a working pipeline with pre-processing and training, reporting training metrics

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/letor (or something) for your own work (commit often!).
  3. In src/scripts/ create a folder preprocess_letor/ and copy the content of src/scripts/samples/ in it.
  4. Download the LETOR dataset from the original source, unzip it if necessary and put it in a subfolder under data/ at the root of the repo (git ignored).

Local development

Let's start locally first...

WORK IN PROGRESS

Develop for AzureML

WORK IN PROGRESS

Create lightgbm cpu versus gpu pipeline

Per discussion with the LightGBM team, create a pipeline to compare CPU and GPU executions.

Currently LightGBM has two different implementations for GPU training. The first is specified by device_type=gpu and is implemented with OpenCL. The second is specified by device_type=cuda and is implemented with CUDA. To use a GPU with LightGBM today, we need to build LightGBM from source by following the guidelines in the LightGBM GPU Tutorial (LightGBM 3.2.1.99 documentation), and then enable GPU training by setting the parameter device_type as described above.

The LightGBM team is implementing a new CUDA version to replace the old one, but the build process and parameter settings for this new CUDA version will be the same as for the old one. They'll be merging this new CUDA version soon. You can follow the progress in the PR:
[CUDA] New CUDA version by shiyu1994 · Pull Request #4528 · microsoft/LightGBM (github.com)
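For reference, the parameter difference between the two variants boils down to a single LightGBM parameter (values below are illustrative, and a GPU-enabled build is required):

import lightgbm

# base parameters shared by both variants
base_params = {"objective": "regression", "num_leaves": 31}

# OpenCL-based GPU training
params_gpu = dict(base_params, device_type="gpu")

# CUDA-based GPU training
params_cuda = dict(base_params, device_type="cuda")

# booster = lightgbm.train(params_gpu, train_data)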

Implement common data loader to support csv/libsvm in all inferencing

Lightgbm_python and treelite_python both consume inference data.

That data could be in either csv or libsvm format, but because we're using numpy in treelite_python, that script doesn't support libsvm.

To avoid writing the same data loading code twice, we want to implement a common class that handles loading through numpy, libsvm, or lightgbm itself (see the sketch below).
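A possible shape for that common loader, given as a sketch only (the class name, method, and the exact set of supported formats are open for discussion):

import numpy as np
from sklearn.datasets import load_svmlight_file

class InferencingDataLoader:
    """Loads inferencing data from either csv or libsvm files."""
    def load(self, file_path: str) -> np.ndarray:
        if file_path.endswith(".csv"):
            return np.loadtxt(file_path, delimiter=",")
        # libsvm/svmlight format returns a sparse matrix; densify for numpy-based variants
        X, _ = load_svmlight_file(file_path)
        return X.toarray()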

Implement percentile metrics for treelite

The current inferencing report only provides percentile metrics for the LightGBM C API, because that is the only variant implementing latency measurement at the request level. The general issue is that per-request predictions through the Python API will measure a lot of overhead (which might be a good thing to surface).
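For illustration, this is the kind of per-request measurement the other variants would need in order to report percentile metrics (function and variable names are placeholders):

import time
import numpy as np

def measure_request_latencies(predict_fn, requests):
    """Times predict_fn on each request individually; returns latencies in seconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        predict_fn(request)
        latencies.append(time.perf_counter() - start)
    return latencies

# latencies = measure_request_latencies(booster.predict, rows)
# p50, p90, p99 = np.percentile(latencies, [50, 90, 99])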

Lightgbm (python): implement learning curves using callbacks()

The current lightgbm_python/train.py does not record metrics during training. We should use callbacks to do that efficiently.

Here's a proposed solution:

  • create a class for handling metrics and calling MetricsLogger.log_metric()
  • add callback from class in the callbacks list of lightgbm.train()
  • support iteration (or step) as an optional argument in MetricsLogger.log_metric() (link)

See example implementation below.

import lightgbm

class MetricsCallbackHandler():
    """Collects evaluation results at each boosting iteration."""
    def __init__(self):
        self.metrics = {}

    def callback(self, env: lightgbm.callback.CallbackEnv) -> None:
        # env.evaluation_result_list is a list of tuples
        # (dataset_name, eval_name, result, is_higher_better)
        self.metrics[env.iteration] = env.evaluation_result_list
        # pass it on to MetricsLogger.log_metric(), using env.iteration as the step

# ...

metrics_handler = MetricsCallbackHandler()

booster = lightgbm.train(
    lgbm_params,
    train_data,
    valid_sets=[val_data],
    callbacks=[metrics_handler.callback]
)

Sweep: build on the draft sweep pipeline to extend parameters and options

The current setup of the sweep pipeline uses a minimal set of options:

  num_iterations: "choice(100, 200)"
  num_leaves: "choice(10,20,30)"
  min_data_in_leaf: 20
  learning_rate: 0.1
  max_bin: 255
  feature_fraction: 1.0

Instead, we could use this to promote specific parameters for sweeping, and run more interesting experiments.

Let's work on a set of interesting options to report on.
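For example, a possible extension could look like the following (parameter ranges are purely illustrative, and the exact sweep expressions must match what the pipeline's sweep configuration supports):

  num_iterations: "choice(100, 500, 1000)"
  num_leaves: "choice(31, 63, 127)"
  min_data_in_leaf: "choice(20, 50, 100)"
  learning_rate: "uniform(0.01, 0.3)"
  max_bin: "choice(255, 511)"
  feature_fraction: "uniform(0.5, 1.0)"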

Add option to report confidence intervals for benchmark results

We might want to consider adding an option to run the same benchmark several times, in order to quantify the natural variance in LightGBM training/inferencing times. This would allow us to report confidence intervals in benchmark results.

I have measured training times that varied by about 20% when running distributed LightGBM repeatedly under identical conditions. In some cases, this intrinsic variance might be larger than the time differences measured across variants that we are benchmarking. Having confidence intervals in benchmark results would be especially beneficial for these cases.
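As a sketch, a confidence interval could be computed from the repeated runs like this, assuming we collect one wall-clock time per run (values below are placeholders):

import numpy as np
from scipy import stats

run_times = np.array([412.0, 455.3, 391.8, 470.2, 430.5])  # seconds, one per repeated run

mean = run_times.mean()
# 95% confidence interval on the mean, using the t distribution (small sample)
ci_low, ci_high = stats.t.interval(
    0.95, df=len(run_times) - 1, loc=mean, scale=stats.sem(run_times)
)
print(f"training time: {mean:.1f}s (95% CI [{ci_low:.1f}, {ci_high:.1f}])")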

Platform-dependent behavior of subprocess.run command in src/scripts/lightgbm_cli/score.py

subprocess.run is used in src/scripts/lightgbm_cli/score.py to execute the LightGBM binary on the local machine. The path of the LightGBM binary, as well as the task, path of the data, path of the model, and output verbosity, are passed as arguments to the function. These arguments can be passed as a sequence, or as a single string or path-like object. If the arguments are passed as a string, the interpretation is platform-dependent: on POSIX, a string is interpreted as the name or path of the program to execute. As a result, while the script executes correctly on Windows, it crashes on Linux because the entire argument string (including task, data, model, etc.) is interpreted as the path of the program to execute.

Proposed fix: pass the arguments as a list to subprocess.run, which is interpreted consistently across platforms.
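A sketch of the fix (binary path and file names below are placeholders): the command and its key=value arguments are passed as a list, which subprocess.run handles the same way on Windows and Linux.

import subprocess

lightgbm_exec = "/usr/local/bin/lightgbm"  # placeholder path to the LightGBM binary

command = [
    lightgbm_exec,
    "task=prediction",
    "data=inference_data.csv",
    "input_model=model.txt",
    "output_result=predictions.txt",
    "verbosity=2",
]
subprocess.run(command, check=True)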

Generate TB of synthetic data to test distributed lightgbm

The current implementation using sklearn make_classification or make_regression generates all data in memory. We'd like to produce data that cannot fit in memory, in order to test scalability, multi-node training, etc.

The goal is to create a duplicate of src/scripts/data_processing/generate_data/generate.py that allows for the generation of "any" size of data, in particular on the order of TBs. This will be validated by running LightGBM distributed training.

The reason we can't naively make multiple calls to make_regression is that each time sklearn produces that data, it creates a new random regression problem and generates data accordingly. Calling the function multiple times would therefore yield distinct datasets drawn from distinct regression problems.

Two solutions:

  • modify the current behavior of sklearn to allow for reusing the coefs generated by a previous call.
  • create a new data generation scheme for our purpose (see the sketch below).
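Here's a sketch of the second option: draw the regression coefficients once with a fixed seed, then generate and write the feature matrix in chunks so memory stays bounded (all sizes, paths, and noise levels below are illustrative).

import numpy as np

n_features = 100
chunk_rows = 100_000
n_chunks = 1000  # total rows = chunk_rows * n_chunks

rng = np.random.default_rng(seed=42)
coefs = rng.normal(size=n_features)  # the same regression problem across all chunks

with open("data/large_synthetic.csv", "w") as out:
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_rows, n_features))
        y = X @ coefs + rng.normal(scale=0.1, size=chunk_rows)
        chunk = np.column_stack([y, X])
        np.savetxt(out, chunk, delimiter=",", fmt="%.6f")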

Learning Goals

By working on this project you'll be able to learn:

  • how to write components and pipelines for AzureML (component sdk + shrike)
  • how to run LightGBM distributed training

Expected Deliverable:

To complete this task, you need to deliver:

  • 1 working python script to generate a large quantity of data
  • a successful run of the lightgbm distributed training pipeline

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/synthetictbdata (or something) for your own work (commit often!).
  3. Copy src/scripts/data_processing/generate_data/ into a new subfolder data_processing/large_data_generate/.

Local development

Feel free to start working on a local python script. Once you have the right behavior, implementing it in AzureML should be straightforward (see the following sections).

Here are a couple of constraints we'll ask you to follow:

  • create your script as a class that inherits from RunnableScript (an illustrative sketch follows this list)
  • use proper argument parsing using argparse (see the get_arg_parser() method)
  • keep reporting some metrics on the data you're generating (see the example from generate_data/)
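As an illustration only (the actual interface of RunnableScript lives in this repo's src/common/ helpers, and the method names and signatures below are assumptions based on the existing generate_data/ script):

import argparse

class LargeDataGenerateScript:  # in practice, inherit from RunnableScript
    @classmethod
    def get_arg_parser(cls, parser=None):
        """Adds this script's arguments to an argparse parser (assumed pattern)."""
        parser = parser or argparse.ArgumentParser()
        parser.add_argument("--output", required=True, type=str)
        parser.add_argument("--train_samples", required=True, type=int)
        parser.add_argument("--train_partitions", required=False, type=int, default=1)
        return parser

    def run(self, args):
        """Generates the data chunk by chunk and reports metrics (row counts, timing)."""
        ...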

Develop for AzureML

Component specification

  1. First, unit tests. Edit tests/aml/test_components.py and look for the list of components. Add the relative path to your component spec to this list.

    You can test your component by running

    pytest tests/aml/test_components.py -v -k large_data_generate
  2. Edit the file spec.yaml in the directory of your component (copied from sample) and align its arguments with the expected arguments of your component until you pass the unit tests.

Integration in the data generation pipeline

WORK IN PROGRESS
