mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

Home Page: https://mljar.com

License: MIT License

Python 100.00%
automl machine-learning automatic-machine-learning mljar data-science scikit-learn hyperparameter-optimization feature-engineering xgboost random-forest

mljar-supervised's People

Contributors

aakarsh1011, abtheo, adrienpacifico, ajschumacher, aplonska, danielavdar, danielr59, diogosilva30, fortierq, hk669, kuhung, lijm1358, maciekmalachowski, molspace, neilmehta31, partrita, pplonski, rafad5, shahules786, suryathiru, uditswaroopa, zacchaeus00


mljar-supervised's Issues

Refactor preprocessing

Refactor the PreprocessingStep code. Right now there are run and transform methods; please provide a fit method instead of run.
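
A minimal sketch of the requested interface (class and method bodies are illustrative, not the actual mljar-supervised code):

class PreprocessingStep:
    def fit(self, X, y=None):
        # learn whatever statistics the transformation needs (fill values, encodings, ...)
        return self

    def transform(self, X):
        # apply the learned transformation to new data
        return X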

Import error when installed in fresh venv

I had trouble importing the library in my current working environment, so I created a fresh venv for the test. The results were the same.

Terminal output:

Successfully installed Keras-2.2.4 absl-py-0.9.0 astor-0.8.1 catboost-0.13.1 enum34-1.1.6 gast-0.3.2 grpcio-1.26.0 h5py-2.10.0 joblib-0.14.1 keras-applications-1.0.8 keras-preprocessing-1.1.0 lightgbm-2.2.3 markdown-3.1.1 mljar-supervised-0.1.7 mock-3.0.5 numpy-1.18.1 pandas-0.25.3 protobuf-3.11.2 python-dateutil-2.8.1 pytz-2019.3 pyyaml-5.3 scikit-learn-0.22.1 scipy-1.4.1 six-1.13.0 tensorboard-1.13.1 tensorflow-1.13.1 tensorflow-estimator-1.13.0 termcolor-1.1.0 tqdm-4.31.1 werkzeug-0.16.0 wheel-0.33.6 xgboost-0.80

python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from supervised.automl import AutoML
...\automl\env\lib\site-packages\tqdm\_tqdm.py:605: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  from pandas import Panel
Traceback (most recent call last):
  File "..\automl\env\lib\site-packages\tqdm\_tqdm.py", line 613, in pandas
    from pandas.core.groupby.groupby import DataFrameGroupBy, \
ImportError: cannot import name 'DataFrameGroupBy' from 'pandas.core.groupby.groupby' (...\automl\env\lib\site-packages\pandas\core\groupby\groupby.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\automl\env\lib\site-packages\supervised\__init__.py", line 3, in <module>
    from supervised.automl import AutoML
  File "...\automl\env\lib\site-packages\supervised\automl.py", line 10, in <module>
    tqdm.pandas()
  File "...\automl\env\lib\site-packages\tqdm\_tqdm.py", line 616, in pandas
    from pandas.core.groupby import DataFrameGroupBy, \
ImportError: cannot import name 'PanelGroupBy' from 'pandas.core.groupby' (C:\PKZ\Synced_dirs\Devel\Python\automl\env\lib\site-packages\pandas\core\groupby\__init__.py)
...

According to the tqdm issue log, this seems to be fixed in a newer version of that library; however, the pip installation of mljar-supervised installs the pre-fix release.

Add option to not shuffle rows

When dealing with time series data it is important not to shuffle rows when performing cross-validation. It would be good to have that as an option.
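
Not mljar's API, just an illustration of the split behavior being asked for, using scikit-learn's TimeSeriesSplit on rows that are already in time order:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)          # rows already sorted by time
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in tscv.split(X):
    # each validation fold comes strictly after its training rows, with no shuffling
    print(train_idx[-1], valid_idx)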

warning when importing MLJ

from supervised.automl import AutoML generates this message:

/home/ubuntu/anaconda3/envs/mlj/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=DeprecationWarning)

Thanks

Problem with results_path behavior

Hi Piotr. When setting model_path in the AutoML definition, if the path already exists this generates an error: the system looks for a .json file that does not exist (unless the model has already been fit, I assume). Better behavior might be that if the path exists but no .json exists, it proceeds as if the directory had just been created?
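
A rough sketch of the check being proposed; the params.json file name and the helper name are illustrative, not the actual mljar-supervised internals:

import os

def prepare_results_path(path):
    if not os.path.exists(path):
        os.makedirs(path)
        return
    # the directory exists: only treat it as a fitted model if its params file is there
    if not os.path.exists(os.path.join(path, "params.json")):
        return  # behave as if the directory had just been created
    # otherwise, load the existing model from disk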

Thanks,

Predicting probability for all models

Hello,
great AutoML tool!
But I have encountered a problem.

When the final model is RF, the predict function returns classes (values equal to 0 or 1), but for the other models (CatBoost, Xgboost, LightGBM, NN) the predict function returns probabilities.
Is it possible to get probabilities when the best model is Random Forest?
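
Under the hood, scikit-learn's Random Forest exposes both outputs; the request is for AutoML's predict to use the probability path. A minimal illustration with plain scikit-learn (not mljar code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
rf = RandomForestClassifier(n_estimators=50).fit(X, y)

classes = rf.predict(X)                     # 0/1 labels, what the issue reports
probabilities = rf.predict_proba(X)[:, 1]   # probability of the positive class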

Problem with cross validation

Hi Piotr:

Cross-validation no longer works (with or without shuffle). For instance:

automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": False, "stratify": True}
used to work but now generates this error:

AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Xgboost']
AutoML will optimize for metric: logloss
AutoML will try to check about 28 models
Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

ValueError Traceback (most recent call last)
in

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/automl.py in fit(self, X_train, y_train, X_validation, y_validation)
520
521 for params in generated_params:
--> 522 self.train_model(params)
523 # hill climbing
524 for params in tuner.get_hill_climbing_params(self._models):

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/automl.py in train_model(self, params)
262 raise AutoMLException(f"Cannot create directory {model_path}")
263
--> 264 mf.train() # {"train": {"X": X, "y": y}})
265
266 mf.save(model_path)

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/model_framework.py in train(self)
107 np.random.seed(self.learner_params["seed"])
108
--> 109 self.validation = ValidationStep(self.validation_params)
110
111 for k_fold in range(self.validation.get_n_splits()):

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/validation/validation_step.py in init(self, params)
21
22 if self.validation_type == "kfold":
---> 23 self.validator = KFoldValidator(params)
24 else:
25 raise Exception("Other validation types are not implemented yet!")

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/validation/validator_kfold.py in init(self, params)
57
58 for fold_cnt, (train_index, validation_index) in enumerate(
---> 59 self.skf.split(X, y)
60 ):
61

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
728 to an integer.
729 """
--> 730 y = check_array(y, ensure_2d=False, dtype=None)
731 return super().split(X, y, groups)
732

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-nan')
579
580 if ensure_min_samples > 0:

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
58 msg_err.format
59 (type_err,
---> 60 msg_dtype if msg_dtype is not None else X.dtype)
61 )
62 # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": True, "stratify": True} also generates an error:

AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Xgboost']
AutoML will optimize for metric: logloss
AutoML will try to check about 28 models

ValueError Traceback (most recent call last)
(identical traceback to the first one above)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

thanks

add test for models

Each model should have a test suite that checks:

  • fit
  • predict
  • save and load

Compute more metrics for classifier

  1. It will be nice to compute:
  • F1 score
  • AUC
  • Precision and Recall
  • Matthews correlation coefficient
  2. Compute the threshold that maximizes the F1 score (a sketch follows this list).
  3. Provide a confusion matrix.
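
A minimal sketch of point 2, assuming binary classification with predicted probabilities for the positive class:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # elementwise F1, guard against 0/0
best_threshold = thresholds[np.argmax(f1[:-1])]              # last precision/recall pair has no threshold
print(best_threshold)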

Add option to treat any column as categorical

Right now, only columns that are detected by the system as categorical are converted to numbers. It would be nice to have an option to manually select which columns should be treated as categorical.
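
A possible user-side workaround in the meantime, assuming (not confirmed) that detection keys on the pandas dtype, so casting a numeric-looking column to strings makes it count as categorical:

import pandas as pd

df = pd.DataFrame({"zip_code": [94107, 10001, 60601]})
# cast to strings so the column is no longer numeric
df["zip_code"] = df["zip_code"].astype(str)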

How best to get diversity in final ensemble?

Hi. I'm running an experiment with standard Xgboost parameters. In the end I am surprised to see that only 2 models are retained for the ensemble, and usually one is weighted significantly more than the other (11 to 1). What parameters can I change in order to get more models into the ensemble (assuming just Xgboost)? So far I have tried 10 and 15 initial models, as well as 5 hill-climbing steps with 5 models retained for improvement, but with little difference. Thanks

final output can be confusing

This is an example of the output after all models have finished running, but there are no column headers to explain what it shows.

0 10000000000000.0 0.6530156150511344
1 0.6530156150511344 0.6530156150511344
2 0.6530156150511344 0.6530156151127985
3 0.6530156150511344 0.6529712545924589
4 0.6529712545924589 0.652852026991749
5 0.652852026991749 0.6528075224163983
6 0.6528075224163983 0.6527930299503968

Use tree_method='hist' for Xgboost

Just a suggestion:

Consider using the histogram method for Xgboost instead of the default 'auto'. See this description:
dmlc/xgboost#1950

In practice it is 5-10x faster and leads to less overfitting.

It is also helpful to limit thread usage to the number of physical cores of the system instead of the default of all virtual cores.
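
A minimal sketch with the XGBoost scikit-learn wrapper (parameter names as in recent xgboost releases, not mljar's internal config):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

model = xgb.XGBClassifier(
    tree_method="hist",   # histogram-based split finding instead of the exact/auto method
    n_jobs=4,             # e.g. the number of physical cores on the machine
)
model.fit(X, y)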

Add LightGBM support

The LightGBM algorithm is already available in the code. Make sure that it works with the following (a minimal smoke test for the binary case follows this list):

  • binary classification
  • multiclass classification
  • regression
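
A minimal smoke test for the binary case, using the LightGBM scikit-learn wrapper directly rather than mljar's internal API:

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

clf = lgb.LGBMClassifier(objective="binary", n_estimators=50)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # probability of the positive class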

Saving mljar automl model for future use

Hi,
Traditionally I have been using the pickle package to save models to a .pkl file and re-use them continuously on live data. I see the mljar model has to_json and from_json methods. Could you please create a small PoC or example with documentation showing how we could re-use a model for daily / live data? Thanks. :)
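
Until such an example exists in the docs, a hedged sketch of the pickle route the reporter describes; whether the fitted AutoML object pickles cleanly is an assumption here:

import pickle
from supervised.automl import AutoML

automl = AutoML()
# automl.fit(X_train, y_train)       # fit on training data first

with open("automl.pkl", "wb") as f:
    pickle.dump(automl, f)           # assumes the AutoML object is picklable

with open("automl.pkl", "rb") as f:
    loaded = pickle.load(f)
# predictions = loaded.predict(X_new)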

Select number of cross-validation folds

It would be great to be able to set the number of cross-validation folds. On the MLJAR.com platform I always use 15. Ideally we could set any number, as long as the number of folds <= the number of rows. Thanks
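
For reference, other issues on this page set the fold count through the internal _validation attribute; a sketch of that approach (undocumented, so subject to change):

from supervised.automl import AutoML

automl = AutoML()
# internal attribute as used elsewhere in this issue list, not a public API
automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": True, "stratify": True}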

When trying to import AutoML it aborts

>>> import pandas as pd
>>> from supervised.automl import AutoML
/root/python/MLwebsite/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Aborted

This takes about 30 seconds to complete and ends with me returning to bash rather than Python. I am just attempting to run the quick example in the README, and I have Python 3.6.3 in a brand-new venv with nothing but this package installed via pip.

Problem with learner_time_limit

Hi Piotr:

Thanks for the changes made in the new version. I'm now getting an error when setting learner_time_limit. Did you take this option out? I see in your code that you are setting it automatically as a function of total_time_limit. Could this be the behavior only when learner_time_limit is None?

Does it work with regression problems?

Does it work with regression problems? When I try to use it, it treats the labels as classes.
Could you show me a regression example in Python?
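
A sketch of what such an example might look like; whether the constructor accepts an ml_task argument in the installed version is an assumption:

import numpy as np
import pandas as pd
from supervised.automl import AutoML

X = pd.DataFrame({"x1": np.random.rand(100), "x2": np.random.rand(100)})
y = X["x1"] * 3.0 + np.random.rand(100)    # continuous target

automl = AutoML(ml_task="regression")      # the ml_task argument name is an assumption here
automl.fit(X, y)                           # trains models; may take a while
predictions = automl.predict(X)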

provide labels for true classes

When working with imbalanced datasets, a class may be underrepresented to the point where y_true and y_pred nearly always contain a different number of classes (for example, one class is missing from the predicted values). Because of this, mljar often cannot be used on imbalanced datasets.

I have attached the error below:

MLJAR AutoML:   0%|          | 0/80 [00:00<?, ?model/s]Traceback (most recent call last):
...
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 256, in fit
    self.not_so_random_step(X, y)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 207, in not_so_random_step
    m = self.train_model(params, X, y)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 164, in train_model
    il.train({"train": {"X": X, "y": y}})
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/iterative_learner_framework.py", line 75, in train
    self.predictions(learner, train_data, validation_data),
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/callbacks/callback_list.py", line 23, in on_iteration_end
    cb.on_iteration_end(logs, predictions)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/callbacks/early_stopping.py", line 59, in on_iteration_end
    predictions.get("y_train_true"), predictions.get("y_train_predicted")
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/metric.py", line 58, in __call__
    return self.metric(y_true, y_predicted)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/metric.py", line 24, in logloss
    ll = log_loss(y_true, y_predicted)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1809, in log_loss
    lb.classes_))
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
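
On the scikit-learn side, the fix the error message points at is passing the full label set to log_loss; a minimal illustration (plain scikit-learn, not mljar code):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])            # class 2 is missing from this fold
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.8, 0.1]])

# without labels=, log_loss sees only two classes in y_true and raises a
# "different number of classes" error like the one quoted above
ll = log_loss(y_true, y_pred, labels=[0, 1, 2])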

Add multiclass classification

Add support for multiclass classification. The machine learning task should be automatically detected or set manually in AutoML constructor.

Latest version error with 'start_random_models'

Just got this error during:

automl = AutoML(results_path="mlj_v2_res_1", total_time_limit=360,algorithms=model_types,train_ensemble=True,start_random_models=20,hill_climbing_steps=2,top_models_to_improve=2)

TypeError: __init__() got an unexpected keyword argument 'start_random_models'

Imbalanced classes in multi-class

  1. To reproduce, get the iris data set.
  2. Add one new label in the target column.
  3. Run the analysis; it should break because of the one extra class during cross-validation (a reproduction sketch follows this list).
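
A reproduction sketch following the steps above (the fit call is expected to fail during cross-validation):

import pandas as pd
from sklearn.datasets import load_iris
from supervised.automl import AutoML

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
y.iloc[0] = 3                 # inject a single extra class into the target

automl = AutoML()
automl.fit(X, y)              # should break on the fold that misses the rare class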
