scikit-learn_bench issue discussions from IntelPython

requirements list for environment setup.

It would be great to add something like a requirements.txt, so that a new environment able to run the benchmarks can be created easily.

xgboost benchmark datasets missing

When benchmarking with xgb_cpu_main_config.json, the following datasets are missing:

WARNING: Dataset mlsr could not be loaded.
Check the correct name or expand the download in the folder dataset.
INFO: gbt algorithm: 1 case(s), 1 dataset(s)

WARNING: Dataset mortgage1Q could not be loaded.
Check the correct name or expand the download in the folder dataset.
INFO: gbt algorithm: 1 case(s), 1 dataset(s)

WARNING: Dataset plasticc could not be loaded.
Check the correct name or expand the download in the folder dataset.
INFO: gbt algorithm: 1 case(s), 1 dataset(s)

WARNING: Dataset santander could not be loaded.
Check the correct name or expand the download in the folder dataset.

Code quality improvements - address findings from www.codefactor.io

Please create PRs for addressing code quality issues reported by codefactor scans

https://www.codefactor.io/repository/github/IntelPython/scikit-learn_bench

Some problems are simple fixes, and I would expect them to be fixed across the entire repo rather than one by one.
Others, such as code complexity, might be very complex, and I don't think they can be fixed without serious refactoring.

Integrate support for competitive model compilation frameworks (TVM and ONNX)

Tools such as TVM and ONNX can optimize existing models to achieve better inference performance. It would be beneficial to know their performance, use cases, and limitations.

HistGradientBoostingEstimator

Hi!
Is there a reason HistGradientBoostingEstimator from sklearn is not included in the benchmark? It should be about as fast as XGBoost.

Make use of "--device(s)" for XGBoost

CPU and GPU configs for XGBoost have only a few differences: data format (pandas vs cudf) and tree method (hist vs gpu_hist). Dispatching between them with the --device(s) argument would simplify the configs.
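The dispatching idea could look roughly like the sketch below. It is only an illustration: the `DEVICE_DEFAULTS` table and the `expand_case` helper are hypothetical names, and only the pandas/cudf and hist/gpu_hist pairs come from the issue text. The shared parameters are made up for the example.

```python
# Hypothetical device-to-settings table; only the pandas/cudf and
# hist/gpu_hist pairs are taken from the issue text.
DEVICE_DEFAULTS = {
    "cpu": {"data-format": "pandas", "tree-method": "hist"},
    "gpu": {"data-format": "cudf", "tree-method": "gpu_hist"},
}


def expand_case(shared_params, device):
    """Merge device-specific settings into one shared XGBoost case."""
    if device not in DEVICE_DEFAULTS:
        raise ValueError(f"Unsupported device: {device!r}")
    case = dict(shared_params)          # copy so the shared dict is not mutated
    case.update(DEVICE_DEFAULTS[device])
    case["device"] = device
    return case


# Illustrative shared parameters (not taken from a real config file).
shared = {"algorithm": "gbt", "max-depth": 8}
for device in ("cpu", "gpu"):
    print(expand_case(shared, device))
```

With a table like this, a single config could list the shared parameters once and let the device argument select the rest.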

Add support for single row inference cases

Please add support for single-row inference measurements in the benchmarks. So far, all cases are oriented toward batch computation only.
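A minimal sketch of what a single-row measurement might look like, assuming a `predict` callable that accepts a batch of rows. The helper name `measure_single_row_latency` and the `dummy_predict` stand-in are hypothetical and are not part of the benchmark code.

```python
import statistics
import time


def measure_single_row_latency(predict, rows, warmup=3):
    """Call predict() on one row at a time; return per-call latencies in seconds."""
    for row in rows[:warmup]:
        predict([row])                  # warm-up calls are not timed
    latencies = []
    for row in rows:
        start = time.perf_counter()
        predict([row])                  # a batch of size one
        latencies.append(time.perf_counter() - start)
    return latencies


def dummy_predict(batch):
    # Stand-in for a fitted estimator's predict method.
    return [sum(row) > 1.0 for row in batch]


rows = [[0.1 * i, 0.2 * i] for i in range(100)]
latencies = measure_single_row_latency(dummy_predict, rows)
print(f"median latency: {statistics.median(latencies):.2e} s over {len(latencies)} calls")
```

Reporting a median (or percentiles) over many calls is usually more informative here than a single total, since per-call overhead dominates at batch size one.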

Benchmarks silently execute stock version if scikit-learn-intelex is not installed

If I try to run benchmarks with a command like

python runner.py --configs configs/skl_xpu_config.json

it should run a patched version of the scikit-learn algorithms. However, if the scikit-learn-intelex package is not installed, the benchmarks cannot be patched and should print an error or warning.

I cannot see the message in https://github.com/IntelPython/scikit-learn_bench/blob/master/bench.py#L204 for some reason.
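One way such a check could be made explicit is sketched below. It assumes that scikit-learn-intelex is imported as `sklearnex`; the helper names `is_installed` and `warn_if_unpatched` are hypothetical and do not reflect how bench.py is actually written.

```python
import importlib.util
import warnings


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def warn_if_unpatched(module_name="sklearnex"):
    """Return True if patching is possible; emit a visible warning otherwise."""
    if is_installed(module_name):
        return True
    warnings.warn(
        f"'{module_name}' is not installed; benchmarks will run stock scikit-learn",
        RuntimeWarning,
    )
    return False


if __name__ == "__main__":
    print("patching available:", warn_if_unpatched())
```

A runner could call such a check once before launching any case, so that the message is not lost in per-benchmark output.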

svm.py fails with IndexError

Following the instructions in https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544, I get errors in all cases. A typical error message is:

INFO: python sklearn_bench/svm.py --arch mericas --data-format pandas --data-order F --dtype float64 --max-cache-size 2 --probability -C 1.0 --kernel rbf --device none --file-X-train data/klaverjas_x_train.npy --file-y-train data/klaverjas_y_train.npy --file-X-test data/klaverjas_x_test.npy --file-y-test data/klaverjas_y_test.npy --dataset-name klaverjas

WARNING: Error in benchmark:

Traceback (most recent call last):
File "/home/mericas/scikit-learn_bench-master/sklearn_bench/svm.py", line 107, in <module>
bench.run_with_context(params, main)
File "/home/mericas/scikit-learn_bench-master/bench.py", line 572, in run_with_context
function()
File "/home/mericas/scikit-learn_bench-master/sklearn_bench/svm.py", line 63, in main
train_acc = bench.accuracy_score(y_train, y_pred)
File "/home/mericas/scikit-learn_bench-master/bench.py", line 347, in accuracy_score
return columnwise_score(y_true, y_pred, lambda y1, y2: np.mean(y1 == y2))
File "/home/mericas/scikit-learn_bench-master/bench.py", line 342, in columnwise_score
return [score_func(y[i], yp[i]) for i in range(y.shape[1])]
IndexError: tuple index out of range
CASE sklearn,svm --data-format pandas --data-order F --dtype float64 --max-cache-size 2 --probability -C 1.0 --kernel rbf --device none JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

Also note that svm fails when using scikit-learn_bench/blob/master/configs/blogs/skl_2021_3.json
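The last frame of the traceback indexes `y.shape[1]`, which raises this exact IndexError when the labels are a 1-D array. The sketch below reproduces that failure mode with NumPy and shows one possible guard. `columnwise_score_safe` is a hypothetical helper written for illustration. It is not the repository's fix, and it assumes array-like inputs rather than the pandas objects the benchmark may pass.

```python
import numpy as np


def columnwise_score_safe(y, yp, score_func):
    """Score each output column; 1-D inputs are treated as a single column."""
    y = np.asarray(y)
    yp = np.asarray(yp)
    if y.ndim == 1:
        y = y.reshape(-1, 1)
    if yp.ndim == 1:
        yp = yp.reshape(-1, 1)
    return [score_func(y[:, i], yp[:, i]) for i in range(y.shape[1])]


y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# A 1-D array has a one-element shape tuple, so shape[1] does not exist.
try:
    y_true.shape[1]
except IndexError as exc:
    print("1-D labels:", exc)

scores = columnwise_score_safe(y_true, y_pred, lambda a, b: np.mean(a == b))
print("accuracy per column:", scores)
```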

The error messages when running benchmark are ignored in some cases

According to https://github.com/IntelPython/scikit-learn_bench/blob/master/runner.py#L306-L310, the benchmark is said to be successful when 'daal4py' is not in stderr, but when a benchmark fails, stderr also contains 'daal4py' in some cases.

This issue may affect CI tests: errors can still occur even though CI passes.

To reproduce, add print(stderr) under the if statement at https://github.com/IntelPython/scikit-learn_bench/blob/master/runner.py#L306 and run python runner.py --configs configs/testing/sklearn.json. You will then get errors like:

Traceback (most recent call last):
  File "sklearn_bench/pca.py", line 66, in <module>
    bench.run_with_context(params, main)
  File "/home/zhaojieh/sklearn-benchmark-lulin/bench.py", line 567, in run_with_context
    function()
  File "sklearn_bench/pca.py", line 37, in main
    fit_time, _ = bench.measure_function_time(pca.fit, X_train, params=params)
  File "/home/zhaojieh/sklearn-benchmark-lulin/bench.py", line 267, in measure_function_time
    return time_box_filter(func, *args,
  File "/home/zhaojieh/sklearn-benchmark-lulin/bench.py", line 276, in time_box_filter
    val = func(*args, **kwargs)
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 435, in fit
    self._fit(X)
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/daal4py/sklearn/decomposition/_pca.py", line 260, in _fit
    result = self._fit_full(X, n_components)
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/daal4py/sklearn/decomposition/_pca.py", line 153, in _fit_full
    self._fit_full_daal4py(X, min(X.shape))
  File "/home/zhaojieh/miniconda3/envs/sklearn-bench/lib/python3.8/site-packages/daal4py/sklearn/decomposition/_pca.py", line 142, in _fit_full_daal4py
    self.n_samples_, self.n_features_ = n_samples, n_features
AttributeError: can't set attribute
CASE sklearn,pca  --data-format pandas --data-order F --dtype float64 --device none JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

I found this issue when developing #134, and you can check CI logs in this PR for more error messages.

This PR uses the return code to check whether a benchmark run produced errors, so that the error message is displayed correctly. The errors in the benchmarks themselves may need to be fixed separately.
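The return-code approach mentioned above can be sketched as follows. This is a simplified illustration, not the code from the PR: `run_case` is a hypothetical helper, and the two child commands are small stand-ins for real benchmark scripts.

```python
import subprocess
import sys


def run_case(cmd):
    """Run one benchmark command; success is decided by the exit code only."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout, proc.stderr


# Stand-in for a benchmark that crashes with 'daal4py' in its stderr.
failing = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('daal4py: boom\\n'); sys.exit(1)",
]
# Stand-in for a benchmark that succeeds but prints a 'daal4py' notice.
noisy_ok = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('daal4py notice\\n')",
]

for name, cmd in (("failing", failing), ("noisy_ok", noisy_ok)):
    ok, _, err = run_case(cmd)
    print(f"{name}: success={ok}, stderr={err.strip()!r}")
```

Both stand-ins write 'daal4py' to stderr, yet only the exit code separates the crash from the harmless notice, which is why a substring test on stderr alone cannot distinguish them.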

Some links in README.md are broken

Links in the following sections of README.md are broken:

How to create conda environment for benchmarking
Algorithms parameters

Error installing reqs for scikit-learn bench

I ran the following commands:

conda create -n intelpybench -c intel python=3.7
conda activate intelpybench
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm

That final instruction gives me an error:

ERROR conda.core.link:_execute(698): An error occurred while installing package 'intel::scikit-learn-0.24.1-py37h1590dfa_3'.
Rolling back transaction: done

LinkError: post-link script failed for package intel::scikit-learn-0.24.1-py37h1590dfa_3
location of failed script: C:\Users\Colin\.conda\envs\intelpybench3\Scripts\.scikit-learn-post-link.bat
==> script messages <==
<None>
==> script output <==
stdout:
stderr: The syntax of the command is incorrect.

return code: 255

I'm guessing this is due to scikit-learn-intelex being pretty much brand new and still in development.

Change C to higher value

The C parameter for the SVC call in the SVM benchmark is set to 0.01. This results in a large number of support vectors in the solution, leading to excessive run times.

The issue is to find a more appropriate value of C.

benchmarking the linear kernel in SVC

Using kernel='linear' in SVC basically makes no sense. You should be using LinearSVC, probably with dual=False.
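For illustration, the two estimators can be set up side by side as below. The synthetic dataset and its size are assumptions chosen only to keep the example fast; they are not the benchmark's data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Small synthetic problem, chosen only so the example runs quickly.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Kernel SVM restricted to a linear kernel (libsvm-based solver).
kernel_svc = SVC(kernel="linear", C=1.0).fit(X, y)

# Dedicated linear solver (liblinear-based), solving the primal problem.
linear_svc = LinearSVC(dual=False, C=1.0).fit(X, y)

kernel_acc = kernel_svc.score(X, y)
linear_acc = linear_svc.score(X, y)
print(f"SVC(kernel='linear') train accuracy: {kernel_acc:.3f}")
print(f"LinearSVC(dual=False) train accuracy: {linear_acc:.3f}")
```

The two models are not numerically identical: with default settings LinearSVC uses the squared hinge loss and regularizes the intercept, so small differences in accuracy are expected.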

Benchmark linear models in higher dimensions

The current benchmarks only use 50 features for 1e6 samples. I would argue that this is not a case where one would use a linear model, as it would under-fit, and the same test accuracy could probably be reached much faster with 1e3 data points instead of 1e6, yielding a speed-up on the order of 1000x.

It would therefore be more interesting to benchmark linear regression, ridge regression and logistic regression in regimes in the order of 1e3 to 1e5 features.

In particular, ridge regression is likely to be most useful in cases where n_features >> n_samples; otherwise, linear regression (no penalty) is likely to give the same result.

reporting format of benchmarks

As discussed here: scikit-learn/scikit-learn#14247 (comment)

I think the current report is very hard to read.
It might be helpful to specify very clearly what the baseline is, that is the meaning of 1 in all the plots - it's your own C++ implementation.

For a comparison with scikit-learn, I think sklearn time / your C++ time would be easier to read, as it shows your speedup factor, not our slow-down factor.

Finally, I don't see the number of cores in your benchmark, which is pretty crucial since most of our implementations are single-threaded. Yes, that's a big issue, but saying "we're 100x faster" without saying "on 100 CPUs instead of 1" is quite misleading.
It might be helpful to have a chart of speedup vs number of CPUs.
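The point about core counts can be made concrete with a small calculation. The helper names and the timings below are invented purely for illustration, and the per-core figure assumes ideal linear scaling, which real workloads rarely achieve.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


def per_core_speedup(baseline_seconds, baseline_cores,
                     candidate_seconds, candidate_cores):
    """Speedup in core-seconds, i.e. normalized by the cores each run used."""
    return speedup(baseline_seconds * baseline_cores,
                   candidate_seconds * candidate_cores)


# Invented timings: 100 s on 1 core versus 1 s on 100 cores.
raw = speedup(100.0, 1.0)
normalized = per_core_speedup(100.0, 1, 1.0, 100)
print(f"raw speedup: {raw:.0f}x, per-core speedup: {normalized:.1f}x")
```

In this made-up case the raw figure is 100x while the per-core figure is 1x, which is the kind of gap a speedup-versus-CPU-count chart would expose.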

Large number of memory allocations becomes a bottleneck

I captured perf data for most of the algorithms and see that many memory allocations happen during the run, which becomes a bottleneck. Please refer to the attached screenshot.

Is there a way to fine-tune the memory allocations, such as an environment variable or command-line argument?

Facing issues while running benchmark on Ubuntu 18.04

Hi,

I am facing the issue below while running the ridge algorithm on Ubuntu 18.04. It looks like I may not be using the right versions of dependent packages such as pandas, numpy, scipy, scikit-learn, and scikit-learn-intelex. I installed scikit-learn and the dependent packages using the command pip3 install -r sklearn_bench/requirements.txt. The versions installed on my system are listed after the error logs. I would like to know whether there are any specific package version requirements for Ubuntu 18.04. Thanks!

===============================================================

~/scikit-learn_bench$ python3 runner.py --configs configs/sklearn/performance/ridge.json --output-file results/ridge.json
INFO: Datasets folder is not set, using local folder
INFO: Config: configs/sklearn/performance/ridge.json
INFO: ridge algorithm: 2 case(s), 2 dataset(s)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float32 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-10000000x20.npy --file-y-train data/synthetic-regression-y-train-10000000x20.npy --file-X-test data/synthetic-regression-X-train-10000000x20.npy --file-y-test data/synthetic-regression-y-train-10000000x20.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float32 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float64 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-10000000x20.npy --file-y-train data/synthetic-regression-y-train-10000000x20.npy --file-X-test data/synthetic-regression-X-train-10000000x20.npy --file-y-test data/synthetic-regression-y-train-10000000x20.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float64 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float32 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-2000000x100.npy --file-y-train data/synthetic-regression-y-train-2000000x100.npy --file-X-test data/synthetic-regression-X-train-2000000x100.npy --file-y-test data/synthetic-regression-y-train-2000000x100.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float32 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

INFO: python sklearn_bench/ridge.py --arch intel-WilsonCity --data-format pandas --data-order F --dtype float64 --device none --alpha 5 --file-X-train data/synthetic-regression-X-train-2000000x100.npy --file-y-train data/synthetic-regression-y-train-2000000x100.npy --file-X-test data/synthetic-regression-X-train-2000000x100.npy --file-y-test data/synthetic-regression-y-train-2000000x100.npy --dataset-name synthetic_regression

WARNING: Error in benchmark:
Traceback (most recent call last):
File "sklearn_bench/ridge.py", line 19, in <module>
import bench
File "/home/intel/scikit-learn_bench/bench.py", line 38
raise ValueError(f'Impossible to get data type of {type(data)}')
^
SyntaxError: invalid syntax
CASE sklearn,ridge --data-format pandas --data-order F --dtype float64 --device none --alpha 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

WARNING: benchmark running had runtime errors
~/scikit-learn_bench$

===============================================================

~/scikit-learn_bench$ pip3 list
Package Version

alabaster 0.7.12
apipkg 1.4
apturl 0.5.2
asn1crypto 0.24.0
astroid 1.6.0
asv 0.5.1
attrs 21.4.0
Babel 2.11.0
beautifulsoup4 4.6.0
breathe 4.7.3
Brlapi 0.6.6
certifi 2022.6.15
chardet 3.0.4
charset-normalizer 2.0.12
click 6.7
colorama 0.3.7
command-not-found 0.3
commonmark 0.9.1
cryptography 2.1.4
cupshelpers 1.0
daal 2021.5.3
daal4py 2021.5.3
dataclasses 0.8
decorator 4.1.2
defer 1.0.6
distro-info 0.18ubuntu0.18.04.1
docker 5.0.3
docutils 0.18.1
et-xmlfile 1.1.0
execnet 1.4.1
html5lib 0.999999999
httplib2 0.9.2
idna 3.3
imagesize 1.4.1
importlib-metadata 4.8.3
importlib-resources 5.4.0
iniconfig 1.1.1
isort 4.3.4
Jinja2 3.0.3
joblib 1.1.1
keyring 10.6.0
keyrings.alt 3.0
language-selector 0.1
launchpadlib 1.10.6
lazr.restfulclient 0.13.5
lazr.uri 1.0.3
lazy-object-proxy 1.3.1
logilab-common 1.4.1
louis 3.5.0
lxml 4.2.1
macaroonbakery 1.1.3
Mako 1.0.7
MarkupSafe 2.0.1
mccabe 0.6.1
meson 0.56.2
netifaces 0.10.4
numpy 1.19.5
oauth 1.0.1
olefile 0.45.1
openpyxl 3.0.10
packaging 21.3
pandas 1.1.5
pexpect 4.2.1
Pillow 8.4.0
pip 21.3.1
pluggy 1.0.0
prompt-toolkit 1.0.14
protobuf 3.0.0
psutil 5.9.1
py 1.11.0
pycairo 1.16.2
pycrypto 2.6.1
pycups 1.9.73
pyelftools 0.28
Pygments 2.13.0
PyGObject 3.26.1
PyInquirer 1.0.3
pylint 1.8.3
pymacaroons 0.13.0
PyNaCl 1.1.2
pyparsing 3.0.9
pyRFC3339 1.0
pytest 7.0.1
pytest-forked 0.2
pytest-xdist 1.22.1
python-apt 1.6.5+ubuntu0.7
python-dateutil 2.8.2
python-debian 0.1.32
pytz 2018.3
pyxdg 0.25
PyYAML 6.0
recommonmark 0.7.1
regex 2022.8.17
reportlab 3.4.0
requests 2.27.1
requests-unixsocket 0.1.5
roman 2.0.0
scikit-learn 0.24.2
scikit-learn-intelex 2021.5.3
scipy 1.5.4
SecretStorage 2.3.1
setuptools 59.6.0
simplejson 3.13.2
six 1.11.0
snowballstemmer 2.2.0
Sphinx 1.8.0
sphinx-rtd-theme 1.1.1
sphinxcontrib-serializinghtml 1.1.5
sphinxcontrib-websupport 1.2.4
ssh-import-id 5.7
system-service 0.3
systemd-python 234
tbb 2021.8.0
threadpoolctl 3.1.0
toml 0.10.2
tomli 1.2.3
tomli_w 0.4.0
torch 1.10.1
torchvision 0.11.2
tqdm 4.64.1
typing_extensions 4.1.1
ubuntu-drivers-common 0.0.0
ufw 0.36
unattended-upgrades 0.1
urllib3 1.26.10
usb-creator 0.3.3
wadllib 1.3.2
wcwidth 0.2.5
webencodings 0.5
websocket-client 1.3.1
wheel 0.30.0
wrapt 1.9.0
xkit 0.0.0
zipp 3.6.0
zope.interface 4.3.2
~/scikit-learn_bench$
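One thing that may be worth checking, offered here as an assumption rather than a confirmed diagnosis: the failing line in bench.py is an f-string, which needs Python 3.6 or newer, and the log shows the cases being launched with `python` while the runner itself was started with `python3`. The sketch below prints which interpreters those names resolve to. `supports_fstrings` is a hypothetical helper.

```python
import shutil
import sys

MIN_VERSION = (3, 6)  # f-strings were added in Python 3.6


def supports_fstrings(version_info):
    """Return True if the given (major, minor, ...) version can parse f-strings."""
    return tuple(version_info[:2]) >= MIN_VERSION


print("current interpreter:", sys.executable, tuple(sys.version_info[:2]))
print("`python` on PATH resolves to:", shutil.which("python"))
print("`python3` on PATH resolves to:", shutil.which("python3"))

if not supports_fstrings(sys.version_info):
    raise SystemExit("This interpreter is too old to parse f-strings.")
```

If `python` and `python3` point to different interpreters, running the benchmarks inside a virtual or conda environment where both names resolve to the same recent Python would rule this cause out.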

dataset sizes for benchmarks

It would be great if you could run benchmarks with different dataset sizes and with tall, wide, and sparse data where possible, and report where these are not supported by your solvers.

Unable to run scikit-learn_bench on EMR system

When device=cpu is used, I get the following error. This happens on every benchmark. I have not had any issues running the same on other systems (Intel Xeon Sapphire Rapids based or AMD Zen4 based systems).
Sample command: python sklearn_bench/df_clsf.py --arch sys-abcd_os s --data-format pandas --data-order F --dtype float32 --max-features sqrt --device cpu --num-trees 100 --max-depth 8 --file-X-train data/susy_x_train.npy --file-y-train data/susy_y_train.npy --file-X-test data/susy_x_test.npy --file-y-test data/susy_y_test.npy --dataset-name susy

Traceback (most recent call last):
File "/mlperf/scikit-bench/scikit-learn_bench/sklearn_bench/df_clsf.py", line 98, in <module>
bench.run_with_context(params, main)
File "/mlperf/scikit-bench/scikit-learn_bench/bench.py", line 564, in run_with_context
with sycl_context(params.device):
File "/home/amd/miniconda3/envs/mkl_env/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "src/oneapi/oneapi.pyx", line 118, in sycl_context
File "src/oneapi/oneapi.pyx", line 46, in daal4py._oneapi.sycl_execution_context.__cinit__
RuntimeError: No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
CASE sklearn,df_clsf --data-format pandas --data-order F --dtype float64 --max-features sqrt --device cpu --num-trees 10 --max-depth 5 JSON DECODING ERROR:
Expecting value: line 1 column 1 (char 0)

Datasets used for producing benchmarks in scikit-learn intelex

Hello,
Can I get information on the datasets used to produce the benchmark results (speedup values) for the different scikit-learn algorithms, as shown in the figure under the Acceleration subsection at https://github.com/intel/scikit-learn-intelex? The image is also attached here:
