Feature Selection using Metaheuristics Made Easy: Open Source MAFESE Library in Python

Home Page: https://mafese.readthedocs.io

License: GNU General Public License v3.0


MAFESE



MAFESE (Metaheuristic Algorithms for FEature SElection) is the largest Python library for the feature selection (FS) problem using meta-heuristic algorithms.

  • Free software: GNU General Public License (GPL) V3 license
  • Total wrapper-based methods (metaheuristic algorithms): > 200
  • Total filter-based methods (statistical-based): > 15
  • Total embedded-based methods (tree and lasso): > 10
  • Total unsupervised-based methods: ≥ 4
  • Total datasets: 54 (47 classification and 7 regression)
  • Total performance metrics: 61 (45 regression and 16 classification)
  • Total objective functions (as fitness functions): 61 (45 regression and 16 classification)
  • Documentation: https://mafese.readthedocs.io/en/latest/
  • Python versions: ≥ 3.7
  • Dependencies: numpy, scipy, scikit-learn, pandas, mealpy, permetrics, plotly, kaleido

Citation Request

Please include these citations if you plan to use this incredible library:

@article{van2024feature,
  title={Feature selection using metaheuristics made easy: Open source MAFESE library in Python},
  author={Van Thieu, Nguyen and Nguyen, Ngoc Hung and Heidari, Ali Asghar},
  journal={Future Generation Computer Systems},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/j.future.2024.06.006},
  url={https://doi.org/10.1016/j.future.2024.06.006},
}

@article{van2023mealpy,
  title={MEALPY: An open-source library for latest meta-heuristic algorithms in Python},
  author={Van Thieu, Nguyen and Mirjalili, Seyedali},
  journal={Journal of Systems Architecture},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.sysarc.2023.102871}
}

Usage

Goals

  • Our library provides a comprehensive set of state-of-the-art feature selection methods:
    • Unsupervised-based FS
    • Filter-based FS
    • Embedded-based FS
      • Regularization (Lasso-based)
      • Tree-based methods
    • Wrapper-based FS
      • Sequential-based: forward and backward
      • Recursive-based
      • MHA-based: Metaheuristic Algorithms
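To make the wrapper-based idea concrete, here is a minimal sketch of sequential forward selection using plain scikit-learn (not MAFESE's own implementation; MAFESE's SequentialSelector wraps the same idea behind a unified API). The dataset, estimator, and parameter choices below are illustrative only.

```python
# Wrapper-based sequential forward selection, sketched with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time, keeping whichever addition
# improves cross-validated KNN accuracy the most.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)

mask = sfs.get_support()   # boolean mask over the 4 iris features
print(mask.sum())          # 2 features selected
```

A metaheuristic (MHA-based) wrapper replaces this greedy loop with a population-based search over feature subsets, which is what MAFESE's MhaSelector provides.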

Installation

$ pip install mafese

After installation, you can import MAFESE and check its installed version:

$ python
>>> import mafese
>>> mafese.__version__

Lib's structure

docs
examples
mafese
    data/
        cls/
            aggregation.csv
            Arrhythmia.csv
            ...
        reg/
            boston-housing.csv
            diabetes.csv
            ...
    wrapper/
        mha.py
        recursive.py
        sequential.py
    embedded/
        lasso.py
        tree.py
    filter.py
    unsupervised.py
    utils/
        correlation.py
        data_loader.py
        encoder.py
        estimator.py
        mealpy_util.py
        transfer.py
        validator.py
    __init__.py
    selector.py
README.md
setup.py

Examples

Let's go through some examples.

1. First, load a dataset. You can use one of the datasets bundled with MAFESE:

# Load available dataset from MAFESE
from mafese import get_dataset

# Try an unknown dataset name
get_dataset("unknown")
# Enter: 1      -> This will list all available datasets

data = get_dataset("Arrhythmia")
  • Or you can load your own dataset:
import pandas as pd
from mafese import Data

# load X and y
# NOTE mafese accepts numpy arrays only, hence the .values attribute
dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]
data = Data(X, y)

2. Next, prepare your dataset

2.1 Split dataset into train and test set

data.split_train_test(test_size=0.2, inplace=True)
print(data.X_train[:2].shape)
print(data.y_train[:2].shape)

2.2 Feature Scaling

data.X_train, scaler_X = data.scale(data.X_train, scaling_methods=("standard", "minmax"))
data.X_test = scaler_X.transform(data.X_test)

data.y_train, scaler_y = data.encode_label(data.y_train)   # for classification problems only
data.y_test = scaler_y.transform(data.y_test)
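The pattern behind the scale()/encode_label() calls above appears to be the standard fit-on-train, transform-on-test idiom: statistics and label mappings come from the training split only, so no test-set information leaks into training. A minimal sketch of that idiom with plain scikit-learn (the arrays are made up for illustration):

```python
# Fit scaler and label encoder on the training split only, then reuse
# them on the test split.
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 25.0]])
y_train = np.array(["cat", "dog", "cat"])

scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same statistics applied to test

encoder = LabelEncoder().fit(y_train)
y_train_e = encoder.transform(y_train)   # classes sorted: cat -> 0, dog -> 1
```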

3. Next, choose the Selector that you want to use by importing it:

## First way (recommended)
from mafese import UnsupervisedSelector, FilterSelector, LassoSelector, TreeSelector
from mafese import SequentialSelector, RecursiveSelector, MhaSelector, MultiMhaSelector

## Second way
from mafese.unsupervised import UnsupervisedSelector
from mafese.filter import FilterSelector
from mafese.embedded.lasso import LassoSelector
from mafese.embedded.tree import TreeSelector
from mafese.wrapper.sequential import SequentialSelector
from mafese.wrapper.recursive import RecursiveSelector
from mafese.wrapper.mha import MhaSelector, MultiMhaSelector

4. Next, create an instance of the Selector class you want to use:

feat_selector = UnsupervisedSelector(problem='classification', method='DR', n_features=5)

feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)

feat_selector = LassoSelector(problem="classification", estimator="lasso", estimator_paras={"alpha": 0.1})

feat_selector = TreeSelector(problem="classification", estimator="tree")

feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=3, direction="forward")

feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)

feat_selector = MhaSelector(problem="classification", estimator="knn",
                            optimizer="BaseGA", optimizer_paras=None,
                            transfer_func="vstf_01", obj_name="AS")

list_optimizers = ("OriginalWOA", "OriginalGWO", "OriginalTLO", "OriginalGSKA")
list_paras = [{"epoch": 10, "pop_size": 30}, ]*4
feat_selector = MultiMhaSelector(problem="classification", estimator="knn",
                            list_optimizers=list_optimizers, list_optimizer_paras=list_paras,
                            transfer_func="vstf_01", obj_name="AS")

5. Fit the model to X_train and y_train

feat_selector.fit(data.X_train, data.y_train)

6. Get the information

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

7. Call transform() on the X that you want to filter down to the selected features

X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
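Conceptually, transform() keeps only the columns flagged as selected. A plain NumPy sketch of that operation (the mask values here are made up for illustration):

```python
# Keep only the columns flagged True in a selected-feature mask.
import numpy as np

X = np.arange(12).reshape(3, 4)              # 3 samples, 4 features
mask = np.array([True, False, True, False])  # e.g. a selected-feature mask

X_selected = X[:, mask]                      # shape (3, 2)
print(X_selected.shape)                      # (3, 2)
```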

8. You can build your own evaluation method or use ours.

If you use our method, don't transform the data.

8.1 You can use a different estimator than the one used in the feature selection process

feat_selector.evaluate(estimator="svm", data=data, metrics=["AS", "PS", "RS"])

## Here, we pass the data object loaded above, which contains both the train and test sets, so the results will look like this:
{'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}
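Assuming AS, PS, and RS map to accuracy, precision, and recall score (as they appear to in the permetrics library that MAFESE uses for metrics), the hand computation with scikit-learn would look like this; the labels and the macro averaging below are illustrative assumptions, not MAFESE's exact defaults:

```python
# Accuracy, precision, and recall on a toy prediction, mirroring what
# AS/PS/RS are assumed to report.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

AS = accuracy_score(y_true, y_pred)                     # 0.8
PS = precision_score(y_true, y_pred, average="macro")
RS = recall_score(y_true, y_pred, average="macro")
```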

8.2 You can use the same estimator used in the feature selection process

X_test, y_test = data.X_test, data.y_test
feat_selector.evaluate(estimator=None, data=data, metrics=["AS", "PS", "RS"])

For more usage examples, please look at the examples folder.

Support

Some popular questions

  1. Where do I find the supported metrics like ["AS", "PS", "RS"] above? What are they?

You can find it here: https://github.com/thieu1995/permetrics or use this

from mafese import MhaSelector 

print(MhaSelector.SUPPORTED_REGRESSION_METRICS)
print(MhaSelector.SUPPORTED_CLASSIFICATION_METRICS)
  2. How do I know which estimators and which methods my Selector supports?
print(feat_selector.SUPPORT) 

Or, better, read the documentation at: https://mafese.readthedocs.io/en/latest/

  3. I got this type of error. How do I solve it?
raise ValueError("Existed at least one new label in y_pred.")
ValueError: Existed at least one new label in y_pred.

This occurs only when you are working on a classification problem with a small dataset that has many classes. For instance, the "Zoo" dataset contains only 101 samples, but it has 7 classes. If you split the dataset into a training and testing set with a ratio of around 80% - 20%, there is a chance that one or more classes may appear in the testing set but not in the training set. As a result, when you calculate the performance metrics, you may encounter this error. You cannot predict or assign new data to a new label because you have no knowledge about the new label. There are several solutions to this problem.
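The failure mode described above can be demonstrated in a few lines: a label that occurs only in the test split is unknown to any model fitted on the train split. The arrays below are made up for illustration:

```python
# A label present only in the test split cannot be predicted by a model
# trained on the train split.
import numpy as np

y_train = np.array([0, 1, 2, 0, 1, 2])
y_test = np.array([0, 1, 6])   # class 6 was never seen during training

unseen = set(y_test) - set(y_train)
print(unseen)                  # {6}
```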

  • 1st: Use the SMOTE method to address imbalanced data and ensure that all classes have the same number of samples.
from imblearn.over_sampling import SMOTE
import pandas as pd
from mafese import Data

dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]

X_new, y_new = SMOTE().fit_resample(X, y)
data = Data(X_new, y_new)
  • 2nd: Use a different random_state number in the split_train_test() function.
import pandas as pd 
from mafese import Data 

dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]
data = Data(X, y)
data.split_train_test(test_size=0.2, random_state=10)   # Try different random_state value 
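A third option worth knowing is a stratified split, which preserves the class proportions of y in both splits, so a class can only go missing from the train set if it has a single sample. This sketch uses scikit-learn directly (it does not assume MAFESE's split_train_test supports stratification), on made-up data:

```python
# Stratified split: every class appears in both splits when it has
# enough samples.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.repeat([0, 1, 2, 3], 5)   # 4 classes, 5 samples each

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(set(y_tr) == set(y_te))    # True: every class is in both splits
```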

Related Documents

  1. https://neptune.ai/blog/feature-selection-methods
  2. https://www.blog.trainindata.com/feature-selection-machine-learning-with-python/
  3. https://github.com/LBBSoft/FeatureSelect
  4. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2754-0
  5. https://github.com/scikit-learn-contrib/boruta_py
  6. https://elki-project.github.io/
  7. https://sci2s.ugr.es/keel/index.php
  8. https://archive.ics.uci.edu/datasets
  9. https://python-charts.com/distribution/box-plot-plotly/
  10. https://plotly.com/python/box-plots/#box-plot-styling-mean--standard-deviation

mafese's People

Contributors

aliasgharheidaricom, ngochungnguyenlg, thieu1995


mafese's Issues

[BUG]: TypeError: __init__() takes from 2 to 3 positional arguments but 4 were given

Description of the bug

Getting this error while calling the feature selection fit function
TypeError: __init__() takes from 2 to 3 positional arguments but 4 were given

Steps To Reproduce

In
feat_selector.fit(data.X_train, data.y_train)

I'm
Getting this error while running the code
TypeError: __init__() takes from 2 to 3 positional arguments but 4 were given

It happened in the file "mealpy_util.py", in the __init__ function, when calling super().__init__(lb, ub, minmax, **kwargs).

I tried changing the superclass Problem's __init__ to receive lb, ub as bounds, but I am still getting the error.

Any help?

Additional Information

No response

Problem in using Metaheuristic-based feature selection

Whenever I am trying to use the following line of code-
feat_selector.fit(data.X_train, data.y_train, fit_weights=(0.9, 0.1), verbose=True)

I am getting the following error-
ValueError: Found array with 0 feature(s) (shape=(5854, 0)) while a minimum of 1 is required by KNeighborsClassifier.

Getting ERROR while executing MhaSelector(...).fit

While executing the following code in colab

from mafese.wrapper.mha import MhaSelector
from mafese import get_dataset

data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape)  # (361, 279) (91, 279)

# define mafese feature selection method
feat_selector = MhaSelector(problem="classification", estimator="knn",
                            optimizer="BaseGA", optimizer_paras=None,
                            transfer_func="vstf_01", obj_name="AS")

# find all relevant features
feat_selector.fit(data.X_train, data.y_train, fit_weights=(0.9, 0.1), verbose=True)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)

I am getting the following error

Requested CLASSIFICATION dataset: Arrhythmia found and loaded!
(361, 279) (91, 279)


TypeError Traceback (most recent call last)

in <cell line: 13>()
11 transfer_func="vstf_01", obj_name="AS")
12 # find all relevant features
---> 13 feat_selector.fit(data.X_train, data.y_train, fit_weights=(0.9, 0.1), verbose=True)
14
15 # check selected features - True (or 1) is selected, False (or 0) is not selected

1 frames

/usr/local/lib/python3.10/dist-packages/mafese/utils/mealpy_util.py in __init__(self, lb, ub, minmax, data, estimator, transfer_func, obj_name, metric_class, fit_weights, fit_sign, obj_paras, name, **kwargs)
     25         self.obj_paras = obj_paras
     26         self.name = name
---> 27         super().__init__(lb, ub, minmax, **kwargs)
     28
     29     def amend_position(self, position=None, lb=None, ub=None):

TypeError: Problem.__init__() takes from 2 to 3 positional arguments but 4 were given

This happens only with MhaSelector and MultiMhaSelector.

Question related to Mafese and Mealpy (in Telegram group)

Hello group, I have some fundamental questions regarding mafese and mealpy.
I am using mafese's wrapper-based feature selection package. The goal is to find out
which features have the most impact in my dataset.
The problem is regression, where I have between 120 and 150 independent features
and 1 dependent feature, which is the label.
The independent variables are either 0 or 1, and the label is a continuous float value.

So my questions are:

  1. Which optimizer should I use (from the metaheuristic algorithms)?
  2. What should the optimizer_paras be? What are epoch and pop_size, and how do they affect my feature-selection result?
  3. I am currently using obj_name="MSE"; should I use other metrics?
  4. What are the current best and global best?
  5. What are the bounds, i.e. the lb and ub (upper and lower bounds, default is 8 and -8)? Should I change those,
    and how do they affect the feature-selection process?
  6. I also see logger output with the statements
    "Solving 2-objective optimization problem with weights: [1. 0.]."
    and "Problem: P, Epoch: 1, Current best: 0.07396610841061782, Global best: 0.0739661084106".
    What do these mean?
  7. After running the fit function I see selected_feature_indexes, which shows the important columns.
    What if I want to choose the top-k important features, like the top 8 or 9, etc.?

Thanks
