
lightgbm-benchmark's Introduction

LightGBM benchmarking suite

Badges: AzureML Pipelines Validation · Benchmark scripts gated build

The LightGBM benchmark aims to provide tools and automation to compare implementations of LightGBM and other boosting-tree-based algorithms for both training and inferencing. The focus is on production use cases, evaluating both model quality (validation metrics) and computing performance (training speed, compute hours, inferencing latency, etc.).

The goal is to support the community of developers of LightGBM by providing tools and a methodology for evaluating new releases of LightGBM on a standard and reproducible benchmark.

Documentation

Please find the full documentation of this project at microsoft.github.io/lightgbm-benchmark

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

lightgbm-benchmark's People

Contributors

alimahmoudzadeh, dependabot[bot], dkmiller, jfomhover, marsupialtail, microsoft-github-operations[bot], microsoftopensource, perezbecker, piyushmadan, ruizhuanguw, ulfk-ms


lightgbm-benchmark's Issues

Excessive memory usage when creating synthetic data

When running the script to generate synthetic data, the WSL2 instance on my laptop (which is capped at 12GB of RAM) runs out of memory and the process is killed. As there are users who will presumably be running this script on laptops with 8GB of RAM or less, I suggest reducing the number of train, test, and inferencing samples by a factor of 3. With the reduced settings, the script uses about 4GB of RAM, which I believe is a reasonable compromise.

Partitioning: implement multiple partitioning methods

Let's discuss the assumptions behind partitioning, and which options we should cover in the partitioning data module.

The current module assumes records are independent, which will be wrong in cases where they are grouped.

Implement an LGBM->ONNX model conversion + inferencing

The goal of this task is to add another variant to the inferencing benchmark for LightGBM. We are already comparing lightgbm python, the lightgbm C API, and treelite. We'd like to try onnxruntime, as it seems applicable.

In particular, we'd like to reproduce the results in this post on hummingbird and onnxruntime for classical ML models.

Feel free to reach out to the posters of the blog for collaboration.

The expected impact of this task:

  • increase the value of the benchmark for the lightgbm community, in particular for production scenarios
  • identify better production inferencing technologies

⚠️ It is unknown at this point whether hummingbird allows the conversion of lightgbm>=v3 models to onnx. If that turns out to be impossible, it's still a good thing to know, and to report in the hummingbird issues.
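For illustration, here's a minimal sketch of what the conversion + inferencing pair could look like, assuming hummingbird's convert() API and a scikit-learn-style LightGBM model; whether this works for lightgbm>=v3 models (or raw Boosters) is exactly what this task should verify, and all shapes and names below are placeholders.

import numpy as np
import onnxruntime as ort
from hummingbird.ml import convert
from lightgbm import LGBMRegressor

# train a tiny model just for the sake of the sketch
X = np.random.rand(1000, 40).astype(np.float32)
y = np.random.rand(1000)
model = LGBMRegressor(n_estimators=10).fit(X, y)

# convert to ONNX; the "onnx" backend needs sample input to infer shapes
onnx_container = convert(model, "onnx", X[:10])

# run the converted model through onnxruntime
# (onnx_container.model is assumed here to expose the converted ONNX model)
session = ort.InferenceSession(onnx_container.model.SerializeToString())
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: X[:10]})[0]
print(predictions)

In the actual deliverable, the conversion and the onnxruntime inferencing would live in two separate scripts, with the converted model saved to and loaded from a directory output/input.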

Learning Goals

By working on this project you'll be able to learn:

  • how to use onnxruntime for classical ML models
  • how to compare inferencing technologies in a benchmark
  • how to write components and pipelines for AzureML (component sdk + shrike)

Expected Deliverable:

To complete this task, you need to deliver:

  • 2 working python scripts: one to convert lightgbm models into onnx (using hummingbird?), one to use onnxruntime for inferencing
  • their corresponding working AzureML components
  • a successful run of the lightgbm inferencing benchmark pipeline

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/onnxruntime (or something) for your own work (commit often!).
  3. In src/scripts/model_transformation create a folder lightgbm_to_onnx/ and copy the content of src/scripts/samples/ in it.

Local development

Let's start locally first.

To iterate on your python script, you need to consider a couple of constraints:

  • Follow the instructions in the sample script to modify and make your own.
  • Please consider using inputs and outputs that are provided as directories, not single files. There's a helper function that lets you automatically select the unique file contained in a directory (see function input_file_path in src/common/io.py); a usage sketch follows this list.
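For illustration, a hypothetical usage sketch (check src/common/io.py for the exact signature and how the existing scripts wire it into argparse):

import argparse
from common.io import input_file_path  # assumed import path

parser = argparse.ArgumentParser()
# the input is provided as a directory (as mounted by AzureML), not a single file
parser.add_argument("--data", required=True, type=str,
                    help="directory containing a unique data file")
args = parser.parse_args()

# resolve the unique file contained in the input directory
data_file = input_file_path(args.data)
print(f"resolved input file: {data_file}")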

Here are a couple of pointers to get you started:

Feel free to check out the current treelite modules (model_conversion/treelite_compile and inferencing/treelite_python). They behave similarly. You can also adapt some unit tests from tests/scripts/test_treelite_python.py.

Develop for AzureML

Component specification

  1. First, unit tests. Edit tests/aml/test_components.py and look for the list of components. Add the relative path to your component spec to this list.
    You can test your component by running

    pytest tests/aml/test_components.py -v -k name_of_component
  2. Edit the file spec.yaml in the directory of your component (copied from sample) and align its arguments with the expected arguments of your component until you pass the unit tests.

Integration in the inferencing pipeline

WORK IN PROGRESS

Implement Ray's lightgbm distributed training to compare against regular lightgbm distributed

Ray is a new parallelization framework: "Ray is an open source project that makes it simple to scale any compute-intensive Python workload — from deep learning to production model serving.".

RayML provides a LightGBM integration. We'd like to compare this version against vanilla distributed LightGBM (MPI).
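As a starting point, here's a rough sketch of what the core of the training script might look like, assuming the lightgbm_ray package (Ray's LightGBM integration); the exact API, return types, and parameter names should be confirmed against the lightgbm_ray documentation.

from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import make_regression

# small synthetic dataset just for the sketch
X, y = make_regression(n_samples=100_000, n_features=40)
train_set = RayDMatrix(X, y)

result = train(
    {"objective": "regression", "num_leaves": 31, "learning_rate": 0.1},
    train_set,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),  # 2 distributed workers
)
# the returned object is assumed to wrap a regular lightgbm booster
result.booster_.save_model("model.txt")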

The expected impact of this task:

  • increase the value of the benchmark for the lightgbm community
  • assess Ray's viability as a library for supporting scalable ML workloads

Learning Goals

By working on this project you'll be able to learn:

  • how to use RayML
  • how to compare training in a consistent benchmark

Expected Deliverable:

To complete this task, you need to deliver:

  • 1 working python script to train LightGBM on Ray

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/lightgbmonray (or something) for your own work (commit often!).
  3. In src/scripts/training create a folder lightgbm_on_ray/ and copy the content of src/scripts/samples/ in it.

Note: it might be worth copying heavily from the lightgbm python training script.

Local development

Let's start locally first.

To iterate on your python script, you need to consider a couple of constraints:

  • Follow the instructions in the sample script to modify and make your own.
  • Please consider using inputs and outputs that are provided as directories, not single files. There's a helper function that lets you automatically select the unique file contained in a directory (see function input_file_path in src/common/io.py).

Develop for AzureML

Component specification

  1. First, unit tests. Edit tests/aml/test_components.py and look for the list of components. Add the relative path to your component spec to this list.
    You can test your component by running

    pytest tests/aml/test_components.py -v -k name_of_component
  2. Edit the file spec.yaml in the directory of your component (copied from sample) and align its arguments with the expected arguments of your component until you pass the unit tests.

Integration in the training pipeline

WORK IN PROGRESS

Implement pre-processing and training on LETOR dataset

The goal of this task is to reproduce the results observed in the paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree and test LightGBM on a publicly available, well-known benchmark dataset, to ensure our benchmark's reproducibility. We already have a generic training script for LightGBM, so this task consists of writing a pre-processor for this particular dataset, identifying the right parameters for running LightGBM on this sample dataset, and running it in AzureML.
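As a hedged sketch of what the pre-processor could do, assuming the raw LETOR files are in svmlight format with qid annotations (the exact file layout must be checked against the downloaded dataset, and the path below is a placeholder):

import numpy as np
import lightgbm
from sklearn.datasets import load_svmlight_file

# load the ranking data; qid identifies the query each row belongs to
X, y, qid = load_svmlight_file("data/letor/train.txt", query_id=True)

# LightGBM ranking objectives need group sizes (documents per query);
# this assumes rows of the same query are contiguous in the file
boundaries = np.flatnonzero(np.diff(qid)) + 1
group_sizes = np.diff(np.concatenate(([0], boundaries, [len(qid)])))

train_set = lightgbm.Dataset(X, label=y, group=group_sizes)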

The expected impact of this task is to:

  • establish trust in our benchmark by obtaining comparable results with existing reference benchmarks
  • increase the value of this benchmark for the community by providing reproducible results on standard data

Learning Goals

By working on this project you'll be able to learn:

  • how to write components and pipelines for AzureML (component sdk + shrike)
  • how to use lightgbm in practice on a sample dataset
  • how to use mlflow and AzureML run history to report metrics

Expected Deliverable:

To complete this task, you need to deliver:

  • a working python script to parse the original LETOR dataset to feed into LightGBM
  • a working AzureML component
  • [stretch] a working pipeline with pre-processing and training, reporting training metrics

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/letor (or something) for your own work (commit often!).
  3. In src/scripts/ create a folder preprocess_letor/ and copy the content of src/scripts/samples/ in it.
  4. Download the LETOR dataset from the original source, unzip it if necessary and put it in a subfolder under data/ at the root of the repo (git ignored).

Local development

Let's start locally first...

WORK IN PROGRESS

Develop for AzureML

WORK IN PROGRESS

Create lightgbm cpu versus gpu pipeline

Per discussion with the LightGBM team, create a pipeline to compare CPU and GPU executions.

Currently LightGBM has two different implementations for GPU training. The first is specified by device_type=gpu and is implemented with OpenCL. The second is specified by device_type=cuda and is implemented with CUDA. To use a GPU with LightGBM today, we need to build LightGBM from source by following the guidelines in the LightGBM GPU Tutorial (LightGBM 3.2.1.99 documentation), and then enable GPU training by setting the parameter device_type as described above.

The LightGBM team is implementing a new CUDA version to replace the old one, but the build process and parameter settings for this new CUDA version will be the same as for the old one. They'll be merging this new CUDA version soon. You can follow the progress in the PR:
[CUDA] New CUDA version by shiyu1994 · Pull Request #4528 · microsoft/LightGBM (github.com)
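For reference, the parameter difference between the two variants boils down to a single LightGBM parameter (values below are illustrative, and a GPU-enabled build is required):

import lightgbm

# base parameters shared by both variants
base_params = {"objective": "regression", "num_leaves": 31}

# OpenCL-based GPU training
params_gpu = dict(base_params, device_type="gpu")

# CUDA-based GPU training
params_cuda = dict(base_params, device_type="cuda")

# booster = lightgbm.train(params_gpu, train_data)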

Implement common data loader to support csv/libsvm in all inferencing

Lightgbm_python and treelite_python both consume inference data.

That data could be in either csv or libsvm format, but because we're using numpy in treelite_python, that script doesn't support libsvm.

To avoid writing the same data loading code twice, we want to implement a common class that handles loading through numpy, libsvm, or lightgbm itself (see the sketch below).
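A possible shape for that common loader, given as a sketch only (the class name, method, and the exact set of supported formats are open for discussion):

import numpy as np
from sklearn.datasets import load_svmlight_file

class InferencingDataLoader:
    """Loads inferencing data from either csv or libsvm files."""
    def load(self, file_path: str) -> np.ndarray:
        if file_path.endswith(".csv"):
            return np.loadtxt(file_path, delimiter=",")
        # libsvm/svmlight format returns a sparse matrix; densify for numpy-based variants
        X, _ = load_svmlight_file(file_path)
        return X.toarray()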

Implement percentile metrics for treelite

The current inferencing report only provides percentile metrics for the LightGBM C API, because that is the only variant implementing latency measurement at the request level. The general issue is that per-request predictions through the Python API will measure a lot of overhead (which might be a good thing to surface).
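For illustration, this is the kind of per-request measurement the other variants would need in order to report percentile metrics (function and variable names are placeholders):

import time
import numpy as np

def measure_request_latencies(predict_fn, requests):
    """Times predict_fn on each request individually; returns latencies in seconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        predict_fn(request)
        latencies.append(time.perf_counter() - start)
    return latencies

# latencies = measure_request_latencies(booster.predict, rows)
# p50, p90, p99 = np.percentile(latencies, [50, 90, 99])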

Lightgbm (python): implement learning curves using callbacks()

The current lightgbm_python/train.py does not record metrics during training. We should use callbacks to do that efficiently.

Here's a proposed solution:

  • create a class for handling metrics and calling MetricsLogger.log_metric()
  • add callback from class in the callbacks list of lightgbm.train()
  • support iteration (or step) as an optional argument in MetricsLogger.log_metric() (link)

See example implementation below.

import lightgbm

class MetricsCallbackHandler():
    """Collects evaluation results at each boosting iteration."""
    def __init__(self):
        self.metrics = {}

    def callback(self, env: lightgbm.callback.CallbackEnv) -> None:
        # env.evaluation_result_list is a list of tuples
        # (dataset_name, eval_name, result, is_higher_better)
        self.metrics[env.iteration] = env.evaluation_result_list
        # pass it on to MetricsLogger.log_metric(), using env.iteration as the step

# ...

metrics_handler = MetricsCallbackHandler()

booster = lightgbm.train(
    lgbm_params,
    train_data,
    valid_sets=[val_data],
    callbacks=[metrics_handler.callback]
)

Sweep: build on the draft sweep pipeline to extend parameters and options

The current setup of the sweep pipeline uses a minimal set of options:

  num_iterations: "choice(100, 200)"
  num_leaves: "choice(10,20,30)"
  min_data_in_leaf: 20
  learning_rate: 0.1
  max_bin: 255
  feature_fraction: 1.0

Instead, we could use this to promote specific parameters for sweeping, and run more interesting experiments.

Let's work on a set of interesting options to report on.
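For example, a possible extension could look like the following (parameter ranges are purely illustrative, and the exact sweep expressions must match what the pipeline's sweep configuration supports):

  num_iterations: "choice(100, 500, 1000)"
  num_leaves: "choice(31, 63, 127)"
  min_data_in_leaf: "choice(20, 50, 100)"
  learning_rate: "uniform(0.01, 0.3)"
  max_bin: "choice(255, 511)"
  feature_fraction: "uniform(0.5, 1.0)"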

Add option to report confidence intervals for benchmark results

We might want to consider adding an option to run the same benchmark several times, in order to quantify the natural variance in LightGBM training/inferencing times. This would allow us to report confidence intervals in benchmark results.

I have measured training times that varied by about 20% when running distributed LightGBM repeatedly under identical conditions. In some cases, this intrinsic variance might be larger than the time differences measured across variants that we are benchmarking. Having confidence intervals in benchmark results would be especially beneficial for these cases.
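As a sketch, a confidence interval could be computed from the repeated runs like this, assuming we collect one wall-clock time per run (values below are placeholders):

import numpy as np
from scipy import stats

run_times = np.array([412.0, 455.3, 391.8, 470.2, 430.5])  # seconds, one per repeated run

mean = run_times.mean()
# 95% confidence interval on the mean, using the t distribution (small sample)
ci_low, ci_high = stats.t.interval(
    0.95, df=len(run_times) - 1, loc=mean, scale=stats.sem(run_times)
)
print(f"training time: {mean:.1f}s (95% CI [{ci_low:.1f}, {ci_high:.1f}])")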

Platform-dependent behavior of subprocess.run command in src/scripts/lightgbm_cli/score.py

subprocess.run is used in src/scripts/lightgbm_cli/score.py to execute the LightGBM binary on the local machine. The path of the LightGBM binary, as well as the task, path of the data, path of the model, and output verbosity, are passed as arguments to the function. These arguments can be passed as a sequence, or as a single string or path-like object. If the arguments are passed as a string, the interpretation is platform-dependent: on POSIX, a string is interpreted as the name or path of the program to execute. As a result, while the script executes correctly on Windows, it crashes on Linux because the entire argument string (including task, data, model, etc.) is interpreted as the path of the program to execute.

Proposed fix: pass the arguments as a list to subprocess.run, which is interpreted consistently across platforms.
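A sketch of the fix (binary path and file names below are placeholders): the command and its key=value arguments are passed as a list, which subprocess.run handles the same way on Windows and Linux.

import subprocess

lightgbm_exec = "/usr/local/bin/lightgbm"  # placeholder path to the LightGBM binary

command = [
    lightgbm_exec,
    "task=prediction",
    "data=inference_data.csv",
    "input_model=model.txt",
    "output_result=predictions.txt",
    "verbosity=2",
]
subprocess.run(command, check=True)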

Generate TB of synthetic data to test distributed lightgbm

The current implementation using sklearn make_classification or make_regression generates all data in memory. We'd like to produce data that cannot fit in memory, in order to test scalability, multi-node training, etc.

The goal is to create a duplicate of src/scripts/data_processing/generate_data/generate.py that allows for the generation of "any" size of data, in particular on the order of TBs. This will be validated by running LightGBM distributed training.

The reason we can't naively make multiple calls to make_regression is that each time sklearn produces that data, it creates a new random regression problem and generates data accordingly. Calling the function multiple times would therefore yield distinct datasets drawn from distinct regression problems.

Two solutions:

  • modify the current behavior of sklearn to allow for reusing the coefs generated by a previous call.
  • create a new data generation scheme for our purpose (see the sketch below).
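Here's a sketch of the second option: draw the regression coefficients once with a fixed seed, then generate and write the feature matrix in chunks so memory stays bounded (all sizes, paths, and noise levels below are illustrative).

import numpy as np

n_features = 100
chunk_rows = 100_000
n_chunks = 1000  # total rows = chunk_rows * n_chunks

rng = np.random.default_rng(seed=42)
coefs = rng.normal(size=n_features)  # the same regression problem across all chunks

with open("data/large_synthetic.csv", "w") as out:
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_rows, n_features))
        y = X @ coefs + rng.normal(scale=0.1, size=chunk_rows)
        chunk = np.column_stack([y, X])
        np.savetxt(out, chunk, delimiter=",", fmt="%.6f")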

Learning Goals

By working on this project you'll be able to learn:

  • how to write components and pipelines for AzureML (component sdk + shrike)
  • how to run LightGBM distributed training

Expected Deliverable:

To complete this task, you need to deliver:

  • 1 working python script to generate a large quantity of data
  • a successful run of the lightgbm distributed training pipeline

Instructions

Prepare for coding

  1. Follow the installation process; please report any issues you encounter, that will help!
  2. Clone this repo, create your own branch username/synthetictbdata (or something) for your own work (commit often!).
  3. Copy src/scripts/data_processing/generate_data/ into a new subfolder data_processing/large_data_generate/.

Local development

Feel free to start working on a local python script. Once you have the right behavior, implementing it in AzureML should be straightforward (see the following sections).

Here are a couple of constraints we'll ask you to follow:

  • create your script as a class that inherits from RunnableScript (an illustrative sketch follows this list)
  • use proper argument parsing using argparse (see the get_arg_parser() method)
  • keep reporting some metrics on the data you're generating (see the example from generate_data/)
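As an illustration only (the actual interface of RunnableScript lives in this repo's src/common/ helpers, and the method names and signatures below are assumptions based on the existing generate_data/ script):

import argparse

class LargeDataGenerateScript:  # in practice, inherit from RunnableScript
    @classmethod
    def get_arg_parser(cls, parser=None):
        """Adds this script's arguments to an argparse parser (assumed pattern)."""
        parser = parser or argparse.ArgumentParser()
        parser.add_argument("--output", required=True, type=str)
        parser.add_argument("--train_samples", required=True, type=int)
        parser.add_argument("--train_partitions", required=False, type=int, default=1)
        return parser

    def run(self, args):
        """Generates the data chunk by chunk and reports metrics (row counts, timing)."""
        ...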

Develop for AzureML

Component specification

  1. First, unit tests. Edit tests/aml/test_components.py and look for the list of components. Add the relative path to your component spec to this list.

    You can test your component by running

    pytest tests/aml/test_components.py -v -k large_data_generate
  2. Edit the file spec.yaml in the directory of your component (copied from sample) and align its arguments with the expected arguments of your component until you pass the unit tests.

Integration in the data generation pipeline

WORK IN PROGRESS
