scutr's Introduction

scutr: SMOTE and Cluster-Based Undersampling Technique in R

Imbalanced training datasets impede many popular classifiers. Balancing such data typically requires both oversampling minority classes and undersampling majority classes. This package implements the SCUT (SMOTE and Cluster-based Undersampling Technique) algorithm, which uses model-based clustering and synthetic oversampling to balance multiclass training datasets.
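
As a rough illustration of the cluster-based undersampling idea, the sketch below clusters one class and draws about the same number of rows from each cluster. It is not the scutr implementation: `undersample_by_cluster` is a hypothetical helper, the data are synthetic, and k-means stands in for model-based clustering only so that the example runs in base R.

```r
# Sketch only: undersample one class by clustering it with k-means and
# drawing roughly the same number of rows from every cluster.
# `undersample_by_cluster` is a hypothetical helper, not a scutr function.
undersample_by_cluster <- function(x, m, k = 3) {
  x <- as.matrix(x)
  clusters <- kmeans(x, centers = k)$cluster
  # Row indices grouped by cluster label.
  idx_by_cluster <- split(seq_len(nrow(x)), clusters)
  # Spread the m draws across the k clusters as evenly as possible.
  quota <- rep(m %/% k, k)
  extra <- seq_len(m %% k)
  quota[extra] <- quota[extra] + 1
  keep <- unlist(Map(function(idx, n) {
    # A cluster smaller than its quota contributes all of its rows,
    # so the result can contain fewer than m rows.
    idx[sample.int(length(idx), min(n, length(idx)))]
  }, idx_by_cluster, quota))
  x[keep, , drop = FALSE]
}

set.seed(1)
majority <- matrix(rnorm(400), ncol = 2)  # 200 synthetic majority-class rows
reduced <- undersample_by_cluster(majority, m = 110, k = 7)
nrow(reduced)
```

Sampling per cluster rather than from the class as a whole is what keeps sparse regions of the class represented after it has been shrunk.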

This implementation only works on numeric training data and works best when there are more than two classes. For binary classification problems, other packages may be better suited.

The original SCUT paper uses SMOTE (essentially linear interpolation between points) for oversampling and expectation maximization clustering, which fits a mixture of Gaussian distributions to the data. These are the default methods in scutr, but random oversampling as well as some distance-based undersampling techniques are available.
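
The "linear interpolation between points" description of SMOTE can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than the code scutr runs: `smote_point` is a hypothetical helper, the neighbour count `k = 5` is an arbitrary choice, and the minority-class data are synthetic.

```r
# Sketch only: one SMOTE-style synthetic point placed on the segment between
# a minority-class row and one of its k nearest neighbours in the same class.
# `smote_point` is a hypothetical helper, not the scutr implementation.
smote_point <- function(x, i, k = 5) {
  x <- as.matrix(x)
  d <- as.matrix(stats::dist(x))[i, ]   # distances from row i to every row
  d[i] <- Inf                           # a row is not its own neighbour
  k <- min(k, nrow(x) - 1)              # cannot use more neighbours than other rows
  neighbours <- order(d)[seq_len(k)]
  j <- neighbours[sample.int(k, 1)]     # pick one neighbour at random
  gap <- runif(1)                       # position along the segment, in [0, 1]
  x[i, ] + gap * (x[j, ] - x[i, ])
}

set.seed(42)
minority <- matrix(rnorm(40), ncol = 2)  # 20 synthetic minority-class rows
new_point <- smote_point(minority, i = 1)
new_point
```

Because every synthetic point is a convex combination of two existing rows, it always falls inside the range already covered by the class.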

Installation

You can install the released version of scutr from CRAN with:

install.packages("scutr")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("s-kganz/scutr")

Example Usage

We start with an imbalanced dataset that comes with the package.

library(scutr)
data(imbalance)
imbalance <- imbalance[imbalance$class %in% c(2, 3, 19, 20), ]
imbalance$class <- as.numeric(imbalance$class)

plot(imbalance$V1, imbalance$V2, col=imbalance$class)

table(imbalance$class)
#> 
#>   2   3  19  20 
#>  20  30 190 200

Then, we apply SCUT with SMOTE oversampling and k-means clustering with seven clusters.

scutted <- SCUT(imbalance, "class", undersample = undersample_kmeans,
                usamp_opts = list(k=7))
plot(scutted$V1, scutted$V2, col=scutted$class)

table(scutted$class)
#> 
#>   2   3  19  20 
#> 110 110 110 110

The dataset is now balanced and we have retained the distribution of the data.
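
The balanced size of 110 per class equals the mean of the original class counts, which is consistent with each class being resampled toward the mean class size. The arithmetic, using only the counts printed above:

```r
# Class counts copied from the first table() output above.
counts <- c("2" = 20, "3" = 30, "19" = 190, "20" = 200)

# Mean class size: total rows divided by number of classes.
target <- sum(counts) / length(counts)
target
#> [1] 110

# Classes below the target need oversampling, those above need undersampling.
action <- ifelse(counts < target, "oversample", "undersample")
action
```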

scutr's People

Contributors

s-kganz

scutr's Issues

options in usamp_opts/osamp_opts do not work

The options for SCUT do not work. I get this error:
Error in dist(data[, -which(names(data) == cls_col)], ...) :
  unused argument (k = 7)

This is the code I ran:
library(scutr)
data(imbalance)
imbalance <- imbalance[imbalance$class %in% c(2, 3, 19, 20), ]
imbalance$class <- as.numeric(imbalance$class)

plot(imbalance$V1, imbalance$V2, col=imbalance$class)

scutted <- SCUT(imbalance, "class", undersample = undersample_kmeans,
                usamp_opts = list(k=7))

It also does not work on a specific dataset I am working with, and I have problems with the options in general.
Could you explain whether there is a problem in the function or whether I am doing something wrong?
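
One thing the message does show is that `k = 7` reached a call to `dist()`, which has no `k` parameter and no `...` to absorb it. The base-R sketch below, independent of scutr, triggers the same kind of error. It does not establish which scutr code path forwarded the option, only which function rejected it.

```r
# Sketch only: stats::dist() rejects arguments it does not define.
x <- matrix(1:6, ncol = 2)

msg <- tryCatch(
  {
    stats::dist(x, k = 7)   # `k` is not a parameter of stats::dist()
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```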

smote did not work for small number of cases

Here is the number of cases in each class (22 classes, 460 samples in total, about 21 per class on average):
class:   1   2   3   4   5   6   7   8   9  10  11
count:   3   4   2  44  27  10   7  15  15  10   5
class:  12  13  14  15  16  17  18  19  20  21  22
count:   8 115   8   3  46  38   6   4  58   3  29
I ran SCUT using the defaults.
Command: scutted <- SCUT(zz, "class")
I get the following error:
Error in get.knnx(data, query, k, algorithm) : ANN: ERROR------->
Calls: SCUT ... -> SMOTE -> knearest -> -> get.knnx
In addition: Warning message:
In get.knnx(data, query, k, algorithm) : k should be less than sample size!

It seems that for SMOTE to work, k must be less than the sample size of the class (2 in my case). But to generate 20 samples, k has to be greater than 2. Is there a way to get around this? Thanks!
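
Going only by the warning text (k should be less than the sample size), the classes that would trip the neighbour search can be listed from the counts above before calling SCUT. This is a diagnostic sketch, not a fix: `k = 5` is a hypothetical neighbour count chosen for illustration, and the value scutr actually uses is not stated in this thread.

```r
# Class counts copied from the table above (classes 1 to 22).
counts <- c(3, 4, 2, 44, 27, 10, 7, 15, 15, 10, 5,
            8, 115, 8, 3, 46, 38, 6, 4, 58, 3, 29)
names(counts) <- seq_along(counts)

# Mean class size, rounded: 460 / 22 is about 21.
target <- round(sum(counts) / length(counts))

# Hypothetical neighbour count; an assumption, not a documented default.
k <- 5

# Classes that need oversampling but have no more than k rows.
too_small <- names(counts)[counts < target & counts <= k]
too_small

# Largest k that every class could support (smallest class size minus one).
largest_safe_k <- min(counts) - 1
largest_safe_k
```

For classes this small, the random oversampling that the README mentions as an alternative to SMOTE does not depend on a neighbour search.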
