
core's Introduction

SapientML

Generative AutoML for Tabular Data

SapientML is an AutoML technology that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset.


NEW: Available on 🤗 HuggingFace Spaces! (Open in Spaces)

Installation

From the PyPI repository:

pip install sapientml

From source code:

git clone https://github.com/sapientml/sapientml.git
cd sapientml
pip install poetry
poetry install

Getting Started

Please see our Documentation for further details.

Run AutoML

import pandas as pd
from sapientml import SapientML
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(train_data)   # hold out a test split
y_true = test_data["survived"].reset_index(drop=True)  # keep ground truth for scoring
test_data.drop(["survived"], axis=1, inplace=True)     # hide the target from prediction

cls = SapientML(["survived"])  # target column(s) to predict

cls.fit(train_data)            # generate a pipeline and train it
y_pred = cls.predict(test_data)

y_pred = y_pred["survived"].rename("survived_pred")
print(f"F1 score: {f1_score(y_true, y_pred)}")

Obtain and Run Generated Code

After fit finishes, you can access the model field to obtain a model backed by the generated code. model provides fit, predict, and save methods: fit trains a model by running the generated code, predict makes predictions on test data with the generated code, and save writes the generated code to a designated folder.

model = cls.fit(train_data, codegen_only=True).model

model.fit(X_train, y_train)  # train a model on different data with the same generated code

y_pred = model.predict(X_test)  # predict with the generated code

model.save("/path/to/output")  # save the generated code to `/path/to/output`
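
Note that X_train, y_train, and X_test are not defined in the snippet above. A minimal sketch of one way to prepare them, reusing the Titanic data from Getting Started (an illustrative continuation, not part of the original example):

X_train = train_data.drop(["survived"], axis=1)  # features only
y_train = train_data["survived"]                 # target column
X_test = test_data                               # "survived" was already dropped earlier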

Examples

Dataset                     Task            Target     Code
Titanic Dataset             classification  survived   Open In Colab
Hotel Cancellation          classification  Status     Open In Colab
Housing Prices              regression      SalePrice  Open In Colab
Medical Insurance Charges   regression      charges    Open In Colab

Publications

The technology of this software originates from the following research paper published at the International Conference on Software Engineering (ICSE), one of the premier conferences on software engineering.

Ripon K. Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul R. Prasad (2022, May). SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions. In Proceedings of the 44th International Conference on Software Engineering (pp. 1932-1944).

@inproceedings{10.1145/3510003.3510226,
author = {Saha, Ripon K. and Ura, Akira and Mahajan, Sonal and Zhu, Chenguang and Li, Linyi and Hu, Yang and Yoshida, Hiroaki and Khurshid, Sarfraz and Prasad, Mukul R.},
title = {SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions},
year = {2022},
isbn = {9781450392211},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3510003.3510226},
doi = {10.1145/3510003.3510226},
abstract = {Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses meta-learning to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using a pipeline dataflow model derived from the corpus. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1,094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 4 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances. This difference is amplified on the 10 most challenging benchmarks, where SapientML wins on 9 instances with the other tools failing to produce pipelines on 4 or more benchmarks.},
booktitle = {Proceedings of the 44th International Conference on Software Engineering},
pages = {1932–1944},
numpages = {13},
keywords = {AutoML, program synthesis, program analysis, machine learning},
location = {Pittsburgh, Pennsylvania},
series = {ICSE '22}
}

core's People

Contributors

akiraura, arima-tsukasa, dependabot[bot], ganeshfg, ihkao, kimusaku, kodai-toyota, kubo-hiroto, sun-ming-fujitsu, tashiro-akira


core's Issues

A ValueError occurs during hyperparameter tuning in the candidate script using GradientBoosting

Describe the bug
If the hyperparameter 'loss' is 'exponential' in GradientBoostingClassifier, the AdaBoost algorithm is applied. AdaBoost builds weak learners that separate two classes, so a ValueError occurs when the target is multiclass.
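
A minimal sketch reproducing the underlying scikit-learn error outside SapientML (the iris data stands in for any 3-class target; this is an illustration, not SapientML code):

# Repro sketch: 'exponential' loss (AdaBoost) supports only binary classification.
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)  # 3-class target
clf = GradientBoostingClassifier(loss="exponential")
clf.fit(X, y)  # raises ValueError: ExponentialLoss requires 2 classes; got 3 class(es)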

To Reproduce
Steps to reproduce the behavior:

  1. Show your code calling generate_code().
script
    cls = SapientML(
        target_columns=["species"],
        add_explanation=True,
        split_train_size=0.75,
        hyperparameter_tuning=True,
        hyperparameter_tuning_n_trials=10,
        hyperparameter_tuning_timeout=600,
    )
    
    model = cls.fit(train_data_all).model
  2. Attach the datasets or dataframes input to generate_code() if possible.
    https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

  3. Show the generated code, such as 1_default.py, if it was generated.

generated code
# *** GENERATED PIPELINE ***

# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"/home/sugawara/PoC/mobilePF/outputs/training.pkl")

# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state)
    return train_dataset, test_dataset	
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)

# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['species'],
    task_type='classification'
)

test_dataset = validation_dataset


# DETACH TARGET
TARGET_COLUMNS = ['species']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()

# HYPERPARAMETER OPTIMIZATION
import optuna
from sklearn.ensemble import GradientBoostingClassifier
# NEED CV: ex.) optuna.integration.OptunaSearchCV()
class Objective(object):
    def __init__(self, feature_train, target_train, feature_test, target_test, __random_state):
        self.feature_train = feature_train
        self.target_train = target_train
        self.feature_test = feature_test
        self.target_test = target_test 
        self.__random_state = __random_state
    def __call__(self, trial):
        def set_hyperparameters(trial):
            params = {}
            params['loss'] =  trial.suggest_categorical('loss', ['log_loss', 'deviance', 'exponential']) # log_loss 
            params['n_estimators'] =  trial.suggest_int('n_estimators', 10, 1000, log=True) # 100
            params['subsample'] = trial.suggest_float('subsample', 0.2, 1) # 1  
            params['criterion'] = trial.suggest_categorical('criterion', ['friedman_mse', 'squared_error']) # 'friedman_mse'
            params['min_samples_leaf'] = trial.suggest_int('min_samples_leaf', 1, 32, log=True) # 1
            params['max_features'] = trial.suggest_categorical('max_features', ['sqrt','log2', None]) # None 
            return params
        
        # SET DATA
        import numpy as np
    
        if isinstance(self.feature_train, pd.DataFrame):
            feature_train = self.feature_train
        elif isinstance(self.feature_train, np.ndarray):
            feature_train = pd.DataFrame(self.feature_train)
        else:
            feature_train = pd.DataFrame(self.feature_train.toarray())
    
        if isinstance(self.target_train, pd.DataFrame):
            target_train = self.target_train
        elif isinstance(self.target_train, np.ndarray):
            target_train = pd.DataFrame(self.target_train)
        else:
            target_train = pd.DataFrame(self.target_train.toarray())
    
        if isinstance(self.feature_test, pd.DataFrame):
            feature_test = self.feature_test
        elif isinstance(self.feature_test, np.ndarray):
            feature_test = pd.DataFrame(self.feature_test)
        else:
            feature_test = pd.DataFrame(self.feature_test.toarray())
    
        if isinstance(self.target_test, pd.DataFrame):
            target_test = self.target_test
        elif isinstance(self.target_test, np.ndarray):
            target_test = pd.DataFrame(self.target_test)
        else:
            target_test = pd.DataFrame(self.target_test.toarray())
        # MODEL 
        params = set_hyperparameters(trial)
        model = GradientBoostingClassifier(random_state=self.__random_state, **params)
        model.fit(feature_train, target_train.values.ravel())
        y_pred = model.predict(feature_test)
        
        from sklearn import metrics
        score = metrics.f1_score(target_test, y_pred, average='macro')
        
        return score
    
n_trials = 10
timeout = 600 
random_state = 1023 
random_state_model = 42 
direction = 'maximize' 
    
study = optuna.create_study(direction=direction,
                sampler=optuna.samplers.TPESampler(seed=random_state)) 
default_hyperparameters = {'criterion': 'friedman_mse', 'loss': 'log_loss', 'max_features': None, 'min_samples_leaf': 1, 'n_estimators': 100, 'subsample': 1.0}
study.enqueue_trial(default_hyperparameters)
study.optimize(Objective(feature_train, target_train, feature_test, target_test, random_state_model), 
                n_trials=n_trials, 
                timeout=timeout)
best_params = study.best_params
print("best params:", best_params)
print("RESULT: f1: " + str(study.best_value))
  4. Show the messages of SapientML and/or the generated code.
ValueError: ExponentialLoss requires 2 classes; got 3 class(es)

Expected behavior

Environment (please complete the following information):

  • SapientML Version: 0.4.12.post0

Additional context
If the target is multiclass, the hyperparameter "loss" must be restricted to "log_loss" only, as sketched below.
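
One possible fix, sketched here: make the 'loss' search space depend on the number of target classes. The variable names follow the generated Objective above; n_classes is an assumed helper value, not existing SapientML code.

# Sketch of a possible fix inside set_hyperparameters (assumes target_train
# is in scope, as in the generated Objective class above).
n_classes = target_train.iloc[:, 0].nunique()
if n_classes > 2:
    loss_choices = ['log_loss']  # 'exponential' (AdaBoost) supports only binary targets
else:
    loss_choices = ['log_loss', 'exponential']
params['loss'] = trial.suggest_categorical('loss', loss_choices)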

A ValueError occurs during hyperparameter tuning in the candidate script using MLPRegressor

Describe the bug
When MLPRegressor is the candidate ML algorithm, a ValueError occurs if the hyperparameter alpha is set to a large value. The cause is an overly large value of the L2 regularization coefficient alpha.

To Reproduce
Steps to reproduce the behavior:

  1. Show your code calling generate_code().
script
    cls = SapientML(
        target_columns=["SalePrice"],
        add_explanation=True,
        split_train_size=0.75,
        hyperparameter_tuning=True,
        hyperparameter_tuning_n_trials=10,
        hyperparameter_tuning_timeout=600,
    )
    
    model = cls.fit(train_data_all).model
  2. Attach the datasets or dataframes input to generate_code() if possible.
    house-price-prediction-using-regression

  3. Show the generated code, such as 1_default.py, if it was generated.

generated code
# *** GENERATED PIPELINE ***

# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"/outputs/training.pkl")

# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state)
    return train_dataset, test_dataset	
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)

# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['SalePrice'],
    task_type='regression'
)

test_dataset = validation_dataset


# PREPROCESSING-1
# Component: Preprocess:SimpleImputer
# Efficient Cause: Preprocess:SimpleImputer is required in this pipeline since the dataset has ['feature:missing_values_presence']. The relevant features are: ['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'Fence', 'FireplaceQu', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'PoolQC'].
# Purpose: Imputation transformer for completing missing values
# Form:
#   Input: array of shape (n_features,)
#   Key hyperparameters used: 
#		 "missing_values: int, float, str, np.nan or None, default=np.nan" :: The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
#		 "strategy: str, default='mean'" :: The imputation strategy. If "mean", then replace missing values using the mean along each column. Can only be used with numeric data. If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
# Alternatives: Although  can also be used for this dataset, Preprocess:SimpleImputer is used because it has more  than .
# Order: Preprocess:SimpleImputer should be applied  
import numpy as np
from sklearn.impute import SimpleImputer
NUMERIC_COLS_WITH_MISSING_VALUES = ['GarageYrBlt', 'LotFrontage', 'MasVnrArea']
simple_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train_dataset[NUMERIC_COLS_WITH_MISSING_VALUES] = simple_imputer.fit_transform(train_dataset[NUMERIC_COLS_WITH_MISSING_VALUES])
test_dataset[NUMERIC_COLS_WITH_MISSING_VALUES] = simple_imputer.transform(test_dataset[NUMERIC_COLS_WITH_MISSING_VALUES])

# PREPROCESSING-2
import numpy as np
from sklearn.impute import SimpleImputer
STRING_COLS_WITH_MISSING_VALUES = ['BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'FireplaceQu', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'MasVnrType']
simple_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
train_dataset[STRING_COLS_WITH_MISSING_VALUES] = simple_imputer.fit_transform(train_dataset[STRING_COLS_WITH_MISSING_VALUES])
test_dataset[STRING_COLS_WITH_MISSING_VALUES] = simple_imputer.transform(test_dataset[STRING_COLS_WITH_MISSING_VALUES])
STRING_ALMOST_MISSING_COLS = ['Alley', 'Fence', 'MiscFeature', 'PoolQC']
train_dataset[STRING_ALMOST_MISSING_COLS] = train_dataset[STRING_ALMOST_MISSING_COLS].astype(str)
test_dataset[STRING_ALMOST_MISSING_COLS] = test_dataset[STRING_ALMOST_MISSING_COLS].astype(str)
train_dataset[STRING_ALMOST_MISSING_COLS] = train_dataset[STRING_ALMOST_MISSING_COLS].fillna('')
test_dataset[STRING_ALMOST_MISSING_COLS] = test_dataset[STRING_ALMOST_MISSING_COLS].fillna('')

# PREPROCESSING-3
# Component: Preprocess:OrdinalEncoder
# Efficient Cause: Preprocess:OrdinalEncoder is required in this pipeline since the dataset has ['feature:str_category_presence', 'feature:str_category_small_presence', 'feature:str_category_binary_presence']. The relevant features are: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'].
# Purpose: Encode categorical features as an integer array
# Form:
#   Input: list of arrays
#   Key hyperparameters used: 
#		 "handle_unknown: {'error', 'use_encoded_value'}, default='error'" :: When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform, an unknown category will be denoted as None.
#		 "unknown_value: int or np.nan, default=None" :: When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
# Alternatives: Although [Preprocess:OneHotEncoder] can also be used for this dataset, Preprocess:OrdinalEncoder is used because it has more feature:str_category_small_presence than feature:str_category_binary_presence.
# Order: Preprocess:OrdinalEncoder should be applied  
from sklearn.preprocessing import OrdinalEncoder
CATEGORICAL_COLS = ['Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood', 'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'Utilities']
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_dataset[CATEGORICAL_COLS] = ordinal_encoder.fit_transform(train_dataset[CATEGORICAL_COLS])
test_dataset[CATEGORICAL_COLS] = ordinal_encoder.transform(test_dataset[CATEGORICAL_COLS])

# PREPROCESSING-4
# Component: Preprocess:Log
# Efficient Cause: Preprocess:Log is required in this pipeline since the dataset has ['feature:target_imbalance_score', 'feature:normalized_variation_across_columns', 'feature:max_normalized_stddev']. The relevant features are: ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', '1stFlrSF', 'GrLivArea', 'TotRmsAbvGrd', 'GarageYrBlt', 'MoSold', 'YrSold', 'SalePrice'].
# Purpose: Return the natural logarithm of one plus the input array, element-wise.
# Form:
#   Input: array_like
#   Key hyperparameters used: None
# Alternatives: Although [Preprocess:StandardScaler] can also be used for this dataset, Preprocess:Log is used because it has more feature:target_imbalance_score than feature:max_skewness.
# Order: Preprocess:Log should be applied  
import numpy as np
NUMERIC_COLS_TO_SCALE = ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', '1stFlrSF', 'GrLivArea', 'TotRmsAbvGrd', 'GarageYrBlt', 'MoSold', 'YrSold', 'SalePrice']
train_dataset[NUMERIC_COLS_TO_SCALE] = np.log1p(train_dataset[NUMERIC_COLS_TO_SCALE])
NUMERIC_COLS_TO_SCALE_FOR_TEST = list(set(test_dataset.columns) & set(NUMERIC_COLS_TO_SCALE))
test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST] = np.log1p(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST])

# DETACH TARGET
TARGET_COLUMNS = ['SalePrice']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()

# HYPERPARAMETER OPTIMIZATION
import optuna
from sklearn.neural_network import MLPRegressor
# NEED CV: ex.) optuna.integration.OptunaSearchCV()
class Objective(object):
    def __init__(self, feature_train, target_train, feature_test, target_test, __random_state):
        self.feature_train = feature_train
        self.target_train = target_train
        self.feature_test = feature_test
        self.target_test = target_test 
        self.__random_state = __random_state
    def __call__(self, trial):
        def set_hyperparameters(trial):
            params = {}
            params['activation'] =  trial.suggest_categorical('activation', ['identity', 'logistic', 'tanh', 'relu']) # relu 
            params['solver'] = trial.suggest_categorical('solver', ['lbfgs','sgd', 'adam']) # adam 
            params['alpha'] = trial.suggest_loguniform('alpha', 1e-6, 1.0) # 0.0001 
            return params
        
        # SET DATA
        import numpy as np
    
        if isinstance(self.feature_train, pd.DataFrame):
            feature_train = self.feature_train
        elif isinstance(self.feature_train, np.ndarray):
            feature_train = pd.DataFrame(self.feature_train)
        else:
            feature_train = pd.DataFrame(self.feature_train.toarray())
    
        if isinstance(self.target_train, pd.DataFrame):
            target_train = self.target_train
        elif isinstance(self.target_train, np.ndarray):
            target_train = pd.DataFrame(self.target_train)
        else:
            target_train = pd.DataFrame(self.target_train.toarray())
    
        if isinstance(self.feature_test, pd.DataFrame):
            feature_test = self.feature_test
        elif isinstance(self.feature_test, np.ndarray):
            feature_test = pd.DataFrame(self.feature_test)
        else:
            feature_test = pd.DataFrame(self.feature_test.toarray())
    
        if isinstance(self.target_test, pd.DataFrame):
            target_test = self.target_test.copy()
        elif isinstance(self.target_test, np.ndarray):
            target_test = pd.DataFrame(self.target_test)
        else:
            target_test = pd.DataFrame(self.target_test.toarray())
        # MODEL 
        params = set_hyperparameters(trial)
        model = MLPRegressor(random_state=self.__random_state, **params)
        model.fit(feature_train, target_train.values.ravel())
        y_pred = model.predict(feature_test)
        # INVERSE TARGET
        import numpy as np
        COLS_TO_BE_INVERSED = list(set(NUMERIC_COLS_TO_SCALE) & set(TARGET_COLUMNS))
        target_test[COLS_TO_BE_INVERSED] = np.expm1(target_test[COLS_TO_BE_INVERSED])
        y_pred = pd.DataFrame(data=y_pred, columns=TARGET_COLUMNS, index=feature_test.index)
        y_pred[COLS_TO_BE_INVERSED] = np.expm1(y_pred[COLS_TO_BE_INVERSED])
        y_pred = y_pred.to_numpy()
        
        from sklearn import metrics
        score = metrics.r2_score(target_test, y_pred)
        
        return score
    
n_trials = 10
timeout = 600 
random_state = 1023 
random_state_model = 42 
direction = 'maximize' 
    
study = optuna.create_study(direction=direction,
                sampler=optuna.samplers.TPESampler(seed=random_state)) 
default_hyperparameters = {'activation': 'relu', 'alpha': 0.0001, 'solver': 'adam'}
study.enqueue_trial(default_hyperparameters)
study.optimize(Objective(feature_train, target_train, feature_test, target_test, random_state_model), 
                n_trials=n_trials, 
                timeout=timeout)
best_params = study.best_params
print("best params:", best_params)
print("RESULT: r2: " + str(study.best_value))
  4. Show the messages of SapientML and/or the generated code.
ValueError: Solver produced non-finite parameter weights. The input data may contain large values and need to be preprocessed.

Expected behavior

Environment (please complete the following information):

  • SapientML Version: 0.4.12.post0

Additional context
The search space ('alpha', 1e-6, 1.0) should be narrowed, for example to ('alpha', 1e-6, 1e-3), as sketched below.
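
A sketch of the suggested change in set_hyperparameters (the 1e-3 upper bound is the example from this report, not a tuned value):

# Narrow alpha's range so a huge L2 penalty cannot drive the solver to
# non-finite weights; suggest_float(..., log=True) also replaces the
# deprecated suggest_loguniform.
params['alpha'] = trial.suggest_float('alpha', 1e-6, 1e-3, log=True)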

For multiclass targets, SMOTE is not recommended even if there is an imbalance

Describe the bug
In _get_target_imbalance_score(), if the target column is multiclass, the imbalance score is calculated as 0, so SMOTE is never recommended as a preprocessing step.

def _get_target_imbalance_score(Y):

To Reproduce
Steps to reproduce the behavior:

  1. Show your code calling generate_code().
script
# Paste your code here. The following is an example.
from sapientml import SapientMLGenerator
sml = SapientMLGenerator()
sml.generate_code('your arguments')
  2. Attach the datasets or dataframes input to generate_code() if possible.
  3. Show the generated code, such as 1_default.py, if it was generated.
generated code
# Paste the generated code here.
  4. Show the messages of SapientML and/or the generated code.

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please complete the following information):

  • OS: [e.g. Ubuntu 20.04]
  • Docker Version (if applicable): [Docker version 20.10.17, build 100c701]
  • Python Version: [e.g. 3.9.12]
  • SapientML Version: 0.5.4

Additional context

  • For the following code at line 1020, fix the condition to vc.shape[0] > 10 (to follow the comment) or delete it; see the sketch after this list.
        # if there are more than 10 categories, probably it is a regression problem
        if vc.shape[0] > 2:
            return 0
  • In addition, offline learning is required.
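
A sketch of the corrected condition from the first bullet, matching the comment's intent (vc is assumed to be the value counts of the target, as in the snippet quoted above):

# if there are more than 10 categories, probably it is a regression problem
if vc.shape[0] > 10:
    return 0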
