mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

Home Page: https://mljar.com

License: MIT License

Python 100.00%
automl machine-learning automatic-machine-learning mljar data-science scikit-learn hyperparameter-optimization feature-engineering xgboost random-forest

mljar-supervised's People

Contributors

aakarsh1011, abtheo, adrienpacifico, ajschumacher, aplonska, danielavdar, danielr59, diogosilva30, fortierq, hk669, kuhung, lijm1358, maciekmalachowski, molspace, neilmehta31, partrita, pplonski, rafad5, shahules786, suryathiru, uditswaroopa, zacchaeus00


mljar-supervised's Issues

Refactor preprocessing

Refactor the PreprocessingStep code. Right now there are run and transform methods; please provide a fit method instead of run.
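
A minimal sketch of the requested interface (class and method bodies are illustrative, not the actual mljar-supervised code):

class PreprocessingStep:
    def fit(self, X, y=None):
        # learn whatever statistics the transformation needs (fill values, encodings, ...)
        return self

    def transform(self, X):
        # apply the learned transformation to new data
        return X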

Import error when installed in fresh venv

I had trouble importing the library in my current working environment, so I created a fresh venv for the test. The results were the same.

Terminal output:

Successfully installed Keras-2.2.4 absl-py-0.9.0 astor-0.8.1 catboost-0.13.1 enum34-1.1.6 gast-0.3.2 grpcio-1.26.0 h5py-2.10.0 joblib-0.14.1 keras-applications-1.0.8 keras-preprocessing-1.1.0 lightgbm-2.2.3 markdown-3.1.1 mljar-supervised-0.1.7 mock-3.0.5 numpy-1.18.1 pandas-0.25.3 protobuf-3.11.2 python-dateutil-2.8.1 pytz-2019.3 pyyaml-5.3 scikit-learn-0.22.1 scipy-1.4.1 six-1.13.0 tensorboard-1.13.1 tensorflow-1.13.1 tensorflow-estimator-1.13.0 termcolor-1.1.0 tqdm-4.31.1 werkzeug-0.16.0 wheel-0.33.6 xgboost-0.80

python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from supervised.automl import AutoML
...\automl\env\lib\site-packages\tqdm\_tqdm.py:605: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  from pandas import Panel
Traceback (most recent call last):
  File "..\automl\env\lib\site-packages\tqdm\_tqdm.py", line 613, in pandas
    from pandas.core.groupby.groupby import DataFrameGroupBy, \
ImportError: cannot import name 'DataFrameGroupBy' from 'pandas.core.groupby.groupby' (...\automl\env\lib\site-packages\pandas\core\groupby\groupby.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\automl\env\lib\site-packages\supervised\__init__.py", line 3, in <module>
    from supervised.automl import AutoML
  File "...\automl\env\lib\site-packages\supervised\automl.py", line 10, in <module>
    tqdm.pandas()
  File "...\automl\env\lib\site-packages\tqdm\_tqdm.py", line 616, in pandas
    from pandas.core.groupby import DataFrameGroupBy, \
ImportError: cannot import name 'PanelGroupBy' from 'pandas.core.groupby' (C:\PKZ\Synced_dirs\Devel\Python\automl\env\lib\site-packages\pandas\core\groupby\__init__.py)
...

According to the tqdm issue log, this seems to be fixed in a newer version of that library; however, the pip installation of mljar-supervised installs the pre-fix release.

Add option to not shuffle rows

When dealing with time series data it is important not to shuffle rows when performing cross-validation. It would be good to have that as an option.
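
Not mljar's API, just an illustration of the split behavior being asked for, using scikit-learn's TimeSeriesSplit on rows that are already in time order:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)          # rows already sorted by time
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in tscv.split(X):
    # each validation fold comes strictly after its training rows, with no shuffling
    print(train_idx[-1], valid_idx)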

warning when importing MLJ

from supervised.automl import AutoML generates this message:

/home/ubuntu/anaconda3/envs/mlj/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=DeprecationWarning)

Thanks

Problem with results_path behavior

Hi Piotr. When setting model_path in the AutoML definition, if the path already exists this generates an error: the system looks for a .json file that does not exist (unless the model has already been fit, I assume). Better behavior might be that if the path exists but no .json exists, it proceeds as if the directory had just been created?
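
A rough sketch of the check being proposed; the params.json file name and the helper name are illustrative, not the actual mljar-supervised internals:

import os

def prepare_results_path(path):
    if not os.path.exists(path):
        os.makedirs(path)
        return
    # the directory exists: only treat it as a fitted model if its params file is there
    if not os.path.exists(os.path.join(path, "params.json")):
        return  # behave as if the directory had just been created
    # otherwise, load the existing model from disk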

Thanks,

Predicting probability for all models

Hello,
great AutoML tool!
But I have encountered a problem.

When the final model is RF, the predict function returns classes (values equal to 0 or 1), but for the other models (CatBoost, Xgboost, LightGBM, NN) the predict function returns probabilities.
Is it possible to get probabilities when the best model is Random Forest?
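
Under the hood, scikit-learn's Random Forest exposes both outputs; the request is for AutoML's predict to use the probability path. A minimal illustration with plain scikit-learn (not mljar code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
rf = RandomForestClassifier(n_estimators=50).fit(X, y)

classes = rf.predict(X)                     # 0/1 labels, what the issue reports
probabilities = rf.predict_proba(X)[:, 1]   # probability of the positive class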

Problem with cross validation

Hi Piotr:

Cross-validation no longer works (with or without shuffle). For instance:

automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": False, "stratify": True}
used to work but now generates this error:

AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Xgboost']
AutoML will optimize for metric: logloss
AutoML will try to check about 28 models
Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

ValueError Traceback (most recent call last)
in

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/automl.py in fit(self, X_train, y_train, X_validation, y_validation)
520
521 for params in generated_params:
--> 522 self.train_model(params)
523 # hill climbing
524 for params in tuner.get_hill_climbing_params(self._models):

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/automl.py in train_model(self, params)
262 raise AutoMLException(f"Cannot create directory {model_path}")
263
--> 264 mf.train() # {"train": {"X": X, "y": y}})
265
266 mf.save(model_path)

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/model_framework.py in train(self)
107 np.random.seed(self.learner_params["seed"])
108
--> 109 self.validation = ValidationStep(self.validation_params)
110
111 for k_fold in range(self.validation.get_n_splits()):

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/validation/validation_step.py in init(self, params)
21
22 if self.validation_type == "kfold":
---> 23 self.validator = KFoldValidator(params)
24 else:
25 raise Exception("Other validation types are not implemented yet!")

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/validation/validator_kfold.py in init(self, params)
57
58 for fold_cnt, (train_index, validation_index) in enumerate(
---> 59 self.skf.split(X, y)
60 ):
61

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
728 to an integer.
729 """
--> 730 y = check_array(y, ensure_2d=False, dtype=None)
731 return super().split(X, y, groups)
732

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-nan')
579
580 if ensure_min_samples > 0:

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
58 msg_err.format
59 (type_err,
---> 60 msg_dtype if msg_dtype is not None else X.dtype)
61 )
62 # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": True, "stratify": True} also generates an error:

AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Xgboost']
AutoML will optimize for metric: logloss
AutoML will try to check about 28 models

ValueError Traceback (most recent call last)
(identical traceback to the first one above)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

thanks

add test for models

Each model should have a test suite that checks:

  • fit
  • predict
  • save and load

Compute more metrics for classifier

  1. It will be nice to compute:
  • F1 score
  • AUC
  • Precision and Recall
  • Matthews correlation coefficient
  2. Compute the threshold that maximizes the F1 score (a sketch follows this list).
  3. Provide a confusion matrix.
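
A minimal sketch of point 2, assuming binary classification with predicted probabilities for the positive class:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # elementwise F1, guard against 0/0
best_threshold = thresholds[np.argmax(f1[:-1])]              # last precision/recall pair has no threshold
print(best_threshold)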

Add option to treat any column as categorical

Right now, only columns that are detected by the system as categorical are converted to numbers. It would be nice to have an option to manually select which columns should be treated as categorical.
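
A possible user-side workaround in the meantime, assuming (not confirmed) that detection keys on the pandas dtype, so casting a numeric-looking column to strings makes it count as categorical:

import pandas as pd

df = pd.DataFrame({"zip_code": [94107, 10001, 60601]})
# cast to strings so the column is no longer numeric
df["zip_code"] = df["zip_code"].astype(str)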

How best to get diversity in final ensemble?

Hi. I'm running an experiment with standard Xgboost parameters. In the end I am surprised to see that only 2 models are retained for the ensemble, and usually one is weighted significantly more than the other (11 to 1). What parameters can I change in order to get more models into the ensemble (assuming just Xgboost)? So far I have tried 10 and 15 initial models, as well as 5 hill-climbing steps with 5 models retained for improvement, but with little difference. Thanks

final output can be confusing

This is an example of the output after all models have finished running, but there are no column headers to explain what it shows.

0 10000000000000.0 0.6530156150511344
1 0.6530156150511344 0.6530156150511344
2 0.6530156150511344 0.6530156151127985
3 0.6530156150511344 0.6529712545924589
4 0.6529712545924589 0.652852026991749
5 0.652852026991749 0.6528075224163983
6 0.6528075224163983 0.6527930299503968

Use tree_method='hist' for Xgboost

Just a suggestion:

Consider using the histogram method for Xgboost instead of the default 'auto'. See this description:
dmlc/xgboost#1950

In practice it is 5-10x faster and leads to less overfitting.

It is also helpful to limit thread usage to the number of physical cores of the system instead of the default of all virtual cores.
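
A minimal sketch with the XGBoost scikit-learn wrapper (parameter names as in recent xgboost releases, not mljar's internal config):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

model = xgb.XGBClassifier(
    tree_method="hist",   # histogram-based split finding instead of the exact/auto method
    n_jobs=4,             # e.g. the number of physical cores on the machine
)
model.fit(X, y)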

Add LightGBM support

The LightGBM algorithm is already available in the code. Make sure that it works with the following (a minimal smoke test for the binary case follows this list):

  • binary classification
  • multiclass classification
  • regression
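
A minimal smoke test for the binary case, using the LightGBM scikit-learn wrapper directly rather than mljar's internal API:

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

clf = lgb.LGBMClassifier(objective="binary", n_estimators=50)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # probability of the positive class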

Saving mljar automl model for future use

Hi,
Traditionally I have been using the pickle package to save models to a .pkl file and re-use them continuously on live data. I see the mljar model has to_json and from_json methods. Could you please create a small PoC or example with documentation showing how we could re-use a model for daily / live data? Thanks. :)
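
Until such an example exists in the docs, a hedged sketch of the pickle route the reporter describes; whether the fitted AutoML object pickles cleanly is an assumption here:

import pickle
from supervised.automl import AutoML

automl = AutoML()
# automl.fit(X_train, y_train)       # fit on training data first

with open("automl.pkl", "wb") as f:
    pickle.dump(automl, f)           # assumes the AutoML object is picklable

with open("automl.pkl", "rb") as f:
    loaded = pickle.load(f)
# predictions = loaded.predict(X_new)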

Select number of cross-validation folds

It would be great to be able to set the number of cross-validation folds. On the MLJAR.com platform I always use 15. Ideally we could set any number, as long as the number of folds <= the number of rows. Thanks
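
For reference, other issues on this page set the fold count through the internal _validation attribute; a sketch of that approach (undocumented, so subject to change):

from supervised.automl import AutoML

automl = AutoML()
# internal attribute as used elsewhere in this issue list, not a public API
automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": True, "stratify": True}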

When trying to import AutoML it aborts

>>> import pandas as pd
>>> from supervised.automl import AutoML
/root/python/MLwebsite/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Aborted

This takes about 30 seconds to complete and ends with me returning to bash rather than Python. I am just attempting to run the quick example in the README, and I have Python 3.6.3 in a brand-new venv with nothing but this package installed via pip.

Problem with learner_time_limit

Hi Piotr:

Thanks for the changes made in the new version. I'm now getting an error when setting learner_time_limit. Did you take this option out? I see in your code that you are setting it automatically as a function of total_time_limit. Could this be the behavior only when learner_time_limit is None?

Does it work with regression problems?

Does it work with regression problems? When I try to use it, it treats the labels as classes.
Could you show me a regression example in Python?
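
A sketch of what such an example might look like; whether the constructor accepts an ml_task argument in the installed version is an assumption:

import numpy as np
import pandas as pd
from supervised.automl import AutoML

X = pd.DataFrame({"x1": np.random.rand(100), "x2": np.random.rand(100)})
y = X["x1"] * 3.0 + np.random.rand(100)    # continuous target

automl = AutoML(ml_task="regression")      # the ml_task argument name is an assumption here
automl.fit(X, y)                           # trains models; may take a while
predictions = automl.predict(X)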

provide labels for true classes

When working with imbalanced datasets, a class may be underrepresented to the point where y_true and y_pred nearly always contain a different number of classes (for example, one class is missing from the predicted values). Because of this, mljar often cannot be used on imbalanced datasets.

I have attached the error below:

MLJAR AutoML:   0%|          | 0/80 [00:00<?, ?model/s]Traceback (most recent call last):
...
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 256, in fit
    self.not_so_random_step(X, y)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 207, in not_so_random_step
    m = self.train_model(params, X, y)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 164, in train_model
    il.train({"train": {"X": X, "y": y}})
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/iterative_learner_framework.py", line 75, in train
    self.predictions(learner, train_data, validation_data),
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/callbacks/callback_list.py", line 23, in on_iteration_end
    cb.on_iteration_end(logs, predictions)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/callbacks/early_stopping.py", line 59, in on_iteration_end
    predictions.get("y_train_true"), predictions.get("y_train_predicted")
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/metric.py", line 58, in __call__
    return self.metric(y_true, y_predicted)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/metric.py", line 24, in logloss
    ll = log_loss(y_true, y_predicted)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1809, in log_loss
    lb.classes_))
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
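
On the scikit-learn side, the fix the error message points at is passing the full label set to log_loss; a minimal illustration (plain scikit-learn, not mljar code):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])            # class 2 is missing from this fold
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.8, 0.1]])

# without labels=, log_loss sees only two classes in y_true and raises a
# "different number of classes" error like the one quoted above
ll = log_loss(y_true, y_pred, labels=[0, 1, 2])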

Add multiclass classification

Add support for multiclass classification. The machine learning task should be automatically detected or set manually in AutoML constructor.

Latest version error with 'start_random_models'

Just got this error during:

automl = AutoML(results_path="mlj_v2_res_1", total_time_limit=360,algorithms=model_types,train_ensemble=True,start_random_models=20,hill_climbing_steps=2,top_models_to_improve=2)

TypeError: __init__() got an unexpected keyword argument 'start_random_models'

Imbalanced classes in multi-class

  1. To reproduce, get the iris data set.
  2. Add one new label in the target column.
  3. Run the analysis; it should break because of the one extra class during cross-validation (a reproduction sketch follows this list).
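
A reproduction sketch following the steps above (the fit call is expected to fail during cross-validation):

import pandas as pd
from sklearn.datasets import load_iris
from supervised.automl import AutoML

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
y.iloc[0] = 3                 # inject a single extra class into the target

automl = AutoML()
automl.fit(X, y)              # should break on the fold that misses the rare class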
