intelpython / scikit-learn_bench

scikit-learn_bench benchmarks various implementations of machine learning algorithms across data analytics frameworks. It currently supports the scikit-learn, DAAL4PY, cuML, and XGBoost frameworks for commonly used machine learning algorithms.

License: Apache License 2.0

Python 100.00%
machine-learning-benchmarks machine-learning benchmarks scikit-learn-benchmarks daal4py hacktoberfest

scikit-learn_bench's Issues

xgboost benchmark datasets missing

When benchmarking with xgb_cpu_main_config.json, the following datasets are missing:

WARNING: Dataset mlsr could not be loaded.
Check the correct name or expand the download in the folder dataset.
INFO: gbt algorithm: 1 case(s), 1 dataset(s)

WARNING: Dataset mortgage1Q could not be loaded.
Check the correct name or expand the download in the folder dataset.
INFO: gbt algorithm: 1 case(s), 1 dataset(s)

WARNING: Dataset plasticc could not be loaded.
Check the correct name or expand the download in the folder dataset.
INFO: gbt algorithm: 1 case(s), 1 dataset(s)

WARNING: Dataset santander could not be loaded.
Check the correct name or expand the download in the folder dataset.

HistGradientBoostingEstimator

Hi!
Is there a reason HistGradientBoostingEstimator from sklearn is not included in the benchmark? It should be about as fast as XGBoost.

Make use of "--device(s)" for XGBoost

The CPU and GPU configs for XGBoost differ in only a few places: data-format (pandas vs cudf) and tree-method (hist vs gpu_hist). Dispatching between them with the --device(s) argument would simplify the configs.
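
A minimal sketch of the dispatch idea, assuming a single base case is expanded into one case per device. The names `DEVICE_SETTINGS` and `expand_devices`, and the keys of the example case, are hypothetical and not taken from the repository; only the pandas/cudf and hist/gpu_hist pairs come from the description above.

```python
# Hypothetical mapping from device name to the two settings that differ
# between the CPU and GPU XGBoost configs.
DEVICE_SETTINGS = {
    "cpu": {"data-format": "pandas", "tree-method": "hist"},
    "gpu": {"data-format": "cudf", "tree-method": "gpu_hist"},
}


def expand_devices(base_case, devices):
    """Return one benchmark case per device, filling in device-specific keys."""
    cases = []
    for device in devices:
        if device not in DEVICE_SETTINGS:
            raise ValueError(f"Unknown device: {device!r}")
        case = dict(base_case)  # copy so the shared base case is not mutated
        case["device"] = device
        case.update(DEVICE_SETTINGS[device])
        cases.append(case)
    return cases


# Illustrative base case; the keys are assumptions, not the real config schema.
base = {"algorithm": "gbt", "dataset": "example", "max-depth": 8}
cases = expand_devices(base, ["cpu", "gpu"])
```

With this shape, a single config entry listing both devices would replace two nearly identical config files.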

svm.py fails with IndexError

Following the instructions in https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544, I get errors in all cases. A typical error message is:

INFO: python sklearn_bench/svm.py --arch mericas --data-format pandas --data-order F --dtype float64 --max-cache-size 2 --probability -C 1.0 --kernel rbf --device none --file-X-train data/klaverjas_x_train.npy --file-y-train data/klaverjas_y_train.npy --file-X-test data/klaverjas_x_test.npy --file-y-test data/klaverjas_y_test.npy --dataset-name klaverjas

WARNING: Error in benchmark:

Traceback (most recent call last):
File "/home/mericas/scikit-learn_bench-master/sklearn_bench/svm.py", line 107, in <module>
bench.run_with_context(params, main)
File "/home/mericas/scikit-learn_bench-master/bench.py", line 572, in run_with_context
function()
File "/home/mericas/scikit-learn_bench-master/sklearn_bench/svm.py", line 63, in main
train_acc = bench.accuracy_score(y_train, y_pred)
File "/home/mericas/scikit-learn_bench-master/bench.py", line 347, in accuracy_score
return columnwise_score(y_true, y_pred, lambda y1, y2: np.mean(y1 == y2))
File "/home/mericas/scikit-learn_bench-master/bench.py", line 342, in columnwise_score
return [score_func(y[i], yp[i]) for i in range(y.shape[1])]
IndexError: tuple index out of range
CASE sklearn,svm --data-format pandas --data-order F --dtype float64 --max-cache-size 2 --probability -C 1.0 --kernel rbf --device none JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

Also note that svm fails when using scikit-learn_bench/blob/master/configs/blogs/skl_2021_3.json
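
The `IndexError` in the traceback is what NumPy raises when `y.shape[1]` is read from a one-dimensional array, whose shape tuple has a single entry. A minimal sketch of a guard for that case follows. It assumes NumPy inputs and is not the repository's actual fix; in particular, columns are selected with `y[:, i]` here, whereas the original `y[i]` may rely on pandas column indexing.

```python
import numpy as np


def columnwise_score(y, yp, score_func):
    """Score each output column; treat 1-D labels as a single column."""
    y = np.asarray(y)
    yp = np.asarray(yp)
    if y.ndim == 1:
        # y.shape is (n_samples,), so y.shape[1] would raise IndexError here.
        return score_func(y, yp)
    return [score_func(y[:, i], yp[:, i]) for i in range(y.shape[1])]


def accuracy_score(y_true, y_pred):
    """Fraction of matching labels, computed per column for 2-D input."""
    return columnwise_score(
        y_true, y_pred, lambda y1, y2: float(np.mean(y1 == y2))
    )


# 1-D labels: a single score is returned instead of raising IndexError.
single = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])

# 2-D labels: one score per column.
multi = accuracy_score(np.array([[0, 1], [1, 1]]), np.array([[0, 0], [1, 1]]))
```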

The error messages when running benchmark are ignored in some cases

According to https://github.com/IntelPython/scikit-learn_bench/blob/master/runner.py#L306-L310, the runner decides whether a benchmark succeeded by checking whether 'daal4py' appears in stderr. However, when a benchmark fails, stderr can also contain 'daal4py' in some cases.

This issue may affect the CI tests: errors can still be present even though CI passes.

To reproduce, you can add print(stderr) under the if statement at https://github.com/IntelPython/scikit-learn_bench/blob/master/runner.py#L306, and run python runner.py --configs configs/testing/sklearn.json, then you will get errors like:

Traceback (most recent call last):
  File "sklearn_bench/pca.py", line 66, in <module>
    bench.run_with_context(params, main)
  File "/home/zhaojieh/sklearn-benchmark-lulin/bench.py", line 567, in run_with_context
    function()
  File "sklearn_bench/pca.py", line 37, in main
    fit_time, _ = bench.measure_function_time(pca.fit, X_train, params=params)
  File "/home/zhaojieh/sklearn-benchmark-lulin/bench.py", line 267, in measure_function_time
    return time_box_filter(func, *args,
  File "/home/zhaojieh/sklearn-benchmark-lulin/bench.py", line 276, in time_box_filter
    val = func(*args, **kwargs)
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 435, in fit
    self._fit(X)
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/daal4py/sklearn/decomposition/_pca.py", line 260, in _fit
    result = self._fit_full(X, n_components)
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/daal4py/sklearn/decomposition/_pca.py", line 153, in _fit_full
    self._fit_full_daal4py(X, min(X.shape))
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/daal4py/sklearn/decomposition/_pca.py", line 142, in _fit_full_daal4py
    self.n_samples_, self.n_features_ = n_samples, n_features
AttributeError: can't set attribute
CASE sklearn,pca  --data-format pandas --data-order F --dtype float64 --device none JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

I found this issue while developing #134; you can check the CI logs in that PR for more error messages.

That PR uses the return code to check whether a benchmark run failed, so the error message is displayed correctly. The errors in the benchmarks themselves may need to be fixed separately.
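
A minimal sketch of the return-code approach, assuming each benchmark case is launched as a child process. `run_case` is a hypothetical helper and the two child commands are stand-ins, not the runner's real code; they only show that the exit status separates a failure from a harmless message even when both mention 'daal4py' on stderr.

```python
import subprocess
import sys


def run_case(cmd):
    """Run one benchmark command and judge success by its exit status."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout, proc.stderr


# Stand-in for a failing benchmark whose stderr mentions daal4py.
failing = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('daal4py/sklearn/_pca.py: error\\n'); sys.exit(1)",
]
# Stand-in for a passing benchmark that only logs a message to stderr.
passing = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('daal4py info message\\n')",
]

ok_fail, _, err_fail = run_case(failing)
ok_pass, _, err_pass = run_case(passing)
```

Both stderr streams contain 'daal4py', so a substring check cannot tell the two runs apart, while the exit status can.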

Error installing reqs for scikit-learn bench

I ran the following commands:

conda create -n intelpybench -c intel python=3.7
conda activate intelpybench
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm

That final instruction gives me an error:

ERROR conda.core.link:_execute(698): An error occurred while installing package 'intel::scikit-learn-0.24.1-py37h1590dfa_3'.
Rolling back transaction: done

LinkError: post-link script failed for package intel::scikit-learn-0.24.1-py37h1590dfa_3
location of failed script: C:\Users\Colin\.conda\envs\intelpybench3\Scripts\.scikit-learn-post-link.bat
==> script messages <==
<None>
==> script output <==
stdout:
stderr: The syntax of the command is incorrect.

return code: 255

I'm guessing this is due to scikit-learn-intelex being pretty much brand new and still in development.

Change C to higher value

The C parameter for the SVC call in the SVM benchmark is set to 0.01. This results in a large number of support vectors in the solution, which leads to excessive run times.

The task is to find a more appropriate value of C.

Benchmark linear models in higher dimensions

The current benchmarks only use 50 features for 1e6 samples. I would argue that this is not a case where one would use a linear model, as it would under-fit, and the same test accuracy could probably be reached much faster with 1e3 data points instead of 1e6, yielding a speed-up on the order of 1000x.

It would therefore be more interesting to benchmark linear regression, ridge regression and logistic regression in regimes in the order of 1e3 to 1e5 features.

In particular, Ridge regression is likely to be most useful in cases where num_features >> n_samples, otherwise, Linear regression (no penalty) is likely to give the same result.

reporting format of benchmarks

As discussed here: scikit-learn/scikit-learn#14247 (comment)

I think the current report is very hard to read.
It might be helpful to specify very clearly what the baseline is, that is the meaning of 1 in all the plots - it's your own C++ implementation.

For a comparison with scikit-learn, I think reporting sklearn speed / your C++ speed would be easier to read, as it shows your speedup factor, not our slow-down factor.

Finally, I don't see the number of cores in your benchmark, which is pretty crucial since most of our implementations are single-threaded. Yes, that's a big issue, but saying "we're 100x faster" without saying "on 100 CPUs instead of 1" is quite misleading.
It might be helpful to have a chart of speedup vs number of CPUs.
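
A minimal sketch of the suggested reporting, assuming wall-clock times in seconds are already available per thread count. The function names are hypothetical and the timings are made-up placeholders, not measurements.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Speedup factor of the candidate relative to the baseline (>1 is faster)."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


def speedup_by_threads(baseline_seconds, candidate_by_threads):
    """Map each thread count to the speedup over a fixed baseline timing."""
    return {
        n_threads: speedup(baseline_seconds, seconds)
        for n_threads, seconds in sorted(candidate_by_threads.items())
    }


# Placeholder numbers for illustration only: a 10 s single-threaded baseline
# compared against a candidate timed at 1, 4, and 16 threads.
table = speedup_by_threads(10.0, {1: 5.0, 4: 2.0, 16: 1.0})
```

Reporting the thread count next to each factor makes it clear how much of the speedup comes from parallelism.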

lot of memory allocations becomes bottleneck

I captured perf data for most of the algorithms and see that a large number of memory allocations happen during the run, which becomes a bottleneck. Please refer to the attached screenshot.

Is there a way to fine-tune the memory allocations, for example through an environment variable or command-line arguments?

[Screenshot: perf data for nusvc]

Facing issues while running benchmark on Ubuntu 18.04

Hi,

I am facing the issue below while running the ridge benchmark on Ubuntu 18.04. It looks like I may not be using the right versions of dependent packages such as pandas, numpy, scipy, scikit-learn, and scikit-learn-intelex. I installed scikit-learn and its dependencies with pip3 install -r sklearn_bench/requirements.txt. The versions installed on my system are listed after the error logs. Are there any specific package version requirements for Ubuntu 18.04? Thanks!

===============================================================

~/scikit-learn_bench$ python3 runner.py --configs configs/sklearn/performance/ridge.json --output-file results/ridge.json
INFO: Datasets folder is not set, using local folder
INFO: Config: configs/sklearn/performance/ridge.json
INFO: ridge algorithm: 2 case(s), 2 dataset(s)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float32 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-10000000x20.npy --file-y-train data/synthetic-regression-y-train-10000000x20.npy --file-X-test data/synthetic-regression-X-train-10000000x20.npy --file-y-test data/synthetic-regression-y-train-10000000x20.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float32 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float64 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-10000000x20.npy --file-y-train data/synthetic-regression-y-train-10000000x20.npy --file-X-test data/synthetic-regression-X-train-10000000x20.npy --file-y-test data/synthetic-regression-y-train-10000000x20.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float64 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float32 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-2000000x100.npy --file-y-train data/synthetic-regression-y-train-2000000x100.npy --file-X-test data/synthetic-regression-X-train-2000000x100.npy --file-y-test data/synthetic-regression-y-train-2000000x100.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float32 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float64 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-2000000x100.npy --file-y-train data/synthetic-regression-y-train-2000000x100.npy --file-X-test data/synthetic-regression-X-train-2000000x100.npy --file-y-test data/synthetic-regression-y-train-2000000x100.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float64 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

WARNING: benchmark running had runtime errors
~/scikit-learn_bench$

===============================================================

~/scikit-learn_bench$ pip3 list
Package Version


alabaster 0.7.12
apipkg 1.4
apturl 0.5.2
asn1crypto 0.24.0
astroid 1.6.0
asv 0.5.1
attrs 21.4.0
Babel 2.11.0
beautifulsoup4 4.6.0
breathe 4.7.3
Brlapi 0.6.6
certifi 2022.6.15
chardet 3.0.4
charset-normalizer 2.0.12
click 6.7
colorama 0.3.7
command-not-found 0.3
commonmark 0.9.1
cryptography 2.1.4
cupshelpers 1.0
daal 2021.5.3
daal4py 2021.5.3
dataclasses 0.8
decorator 4.1.2
defer 1.0.6
distro-info 0.18ubuntu0.18.04.1
docker 5.0.3
docutils 0.18.1
et-xmlfile 1.1.0
execnet 1.4.1
html5lib 0.999999999
httplib2 0.9.2
idna 3.3
imagesize 1.4.1
importlib-metadata 4.8.3
importlib-resources 5.4.0
iniconfig 1.1.1
isort 4.3.4
Jinja2 3.0.3
joblib 1.1.1
keyring 10.6.0
keyrings.alt 3.0
language-selector 0.1
launchpadlib 1.10.6
lazr.restfulclient 0.13.5
lazr.uri 1.0.3
lazy-object-proxy 1.3.1
logilab-common 1.4.1
louis 3.5.0
lxml 4.2.1
macaroonbakery 1.1.3
Mako 1.0.7
MarkupSafe 2.0.1
mccabe 0.6.1
meson 0.56.2
netifaces 0.10.4
numpy 1.19.5
oauth 1.0.1
olefile 0.45.1
openpyxl 3.0.10
packaging 21.3
pandas 1.1.5
pexpect 4.2.1
Pillow 8.4.0
pip 21.3.1
pluggy 1.0.0
prompt-toolkit 1.0.14
protobuf 3.0.0
psutil 5.9.1
py 1.11.0
pycairo 1.16.2
pycrypto 2.6.1
pycups 1.9.73
pyelftools 0.28
Pygments 2.13.0
PyGObject 3.26.1
PyInquirer 1.0.3
pylint 1.8.3
pymacaroons 0.13.0
PyNaCl 1.1.2
pyparsing 3.0.9
pyRFC3339 1.0
pytest 7.0.1
pytest-forked 0.2
pytest-xdist 1.22.1
python-apt 1.6.5+ubuntu0.7
python-dateutil 2.8.2
python-debian 0.1.32
pytz 2018.3
pyxdg 0.25
PyYAML 6.0
recommonmark 0.7.1
regex 2022.8.17
reportlab 3.4.0
requests 2.27.1
requests-unixsocket 0.1.5
roman 2.0.0
scikit-learn 0.24.2
scikit-learn-intelex 2021.5.3
scipy 1.5.4
SecretStorage 2.3.1
setuptools 59.6.0
simplejson 3.13.2
six 1.11.0
snowballstemmer 2.2.0
Sphinx 1.8.0
sphinx-rtd-theme 1.1.1
sphinxcontrib-serializinghtml 1.1.5
sphinxcontrib-websupport 1.2.4
ssh-import-id 5.7
system-service 0.3
systemd-python 234
tbb 2021.8.0
threadpoolctl 3.1.0
toml 0.10.2
tomli 1.2.3
tomli_w 0.4.0
torch 1.10.1
torchvision 0.11.2
tqdm 4.64.1
typing_extensions 4.1.1
ubuntu-drivers-common 0.0.0
ufw 0.36
unattended-upgrades 0.1
urllib3 1.26.10
usb-creator 0.3.3
wadllib 1.3.2
wcwidth 0.2.5
webencodings 0.5
websocket-client 1.3.1
wheel 0.30.0
wrapt 1.9.0
xkit 0.0.0
zipp 3.6.0
zope.interface 4.3.2
~/scikit-learn_bench$
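
The `SyntaxError` above points at an f-string, a syntax that exists only in Python 3.6 and later. The log also shows the runner, started with `python3`, launching the benchmark as `python sklearn_bench/ridge.py`. One possible explanation, judging from the log alone and not confirmed by the maintainers, is that `python` resolves to an older interpreter on this system. A minimal sketch for checking which version a given executable reports follows; `interpreter_version` and `supports_fstrings` are hypothetical helpers.

```python
import subprocess
import sys

MIN_VERSION = (3, 6)  # f-strings were added in Python 3.6


def interpreter_version(executable):
    """Ask an interpreter for its (major, minor) version."""
    # The child code avoids print() so it also runs under Python 2.
    code = "import sys; sys.stdout.write('%d %d' % sys.version_info[:2])"
    out = subprocess.run(
        [executable, "-c", code],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    ).stdout
    major, minor = out.split()
    return int(major), int(minor)


def supports_fstrings(executable):
    """True if the interpreter is new enough to parse f-strings."""
    return interpreter_version(executable) >= MIN_VERSION


# Check the interpreter running this script; pass "python" instead to
# check whatever that name resolves to on the PATH.
current_ok = supports_fstrings(sys.executable)
```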

dataset sizes for benchmarks

It would be great if you could run benchmarks with different dataset sizes and with tall, wide, and sparse data where possible, and report where these are not supported by your solvers.

Unable to run scikit-learn_bench on EMR system

When device=cpu is used, I get the following error. This happens on every benchmark. I have not had any issues running the same on other systems (Intel Xeon Sapphire Rapids based or AMD Zen4 based systems).
Sample command: python sklearn_bench/df_clsf.py --arch sys-abcd_os s --data-format pandas --data-order F --dtype float32 --max-features sqrt --device cpu --num-trees 100 --max-depth 8 --file-X-train data/susy_x_train.npy --file-y-train data/susy_y_train.npy --file-X-test data/susy_x_test.npy --file-y-test data/susy_y_test.npy --dataset-name susy

Traceback (most recent call last):
File "/mlperf/scikit-bench/scikit-learn_bench/sklearn_bench/df_clsf.py", line 98, in <module>
bench.run_with_context(params, main)
File "/mlperf/scikit-bench/scikit-learn_bench/bench.py", line 564, in run_with_context
with sycl_context(params.device):
File "/home/amd/miniconda3/envs/mkl_env/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "src/oneapi/oneapi.pyx", line 118, in sycl_context
File "src/oneapi/oneapi.pyx", line 46, in daal4py._oneapi.sycl_execution_context.__cinit__
RuntimeError: No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
CASE sklearn,df_clsf --data-format pandas --data-order F --dtype float64 --max-features sqrt --device cpu --num-trees 10 --max-depth 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)
