
categoryEncodings

Badges: R-CMD-check · AppVeyor build status · Codecov test coverage · Lifecycle: stable · License: GPL v3 · CRAN status

categoryEncodings aims to provide a fast way to encode ‘factor’ or qualitative variables through various methods. The package uses data.table as the backend for speed, with as few other dependencies as possible. Most of the methods are based on the paper by Johannemann et al. (2019), Sufficient Representations for Categorical Variables (arXiv:1908.09874).

The current version features automatic inference of factors, using a very simple heuristic to decide how to encode them, while also allowing manual control.

Installation

You can install the latest version of categoryEncodings from GitHub using the devtools package:

devtools::install_github("JSzitas/categoryEncodings")

NOTE: The latest stable version available from CRAN contains features that are deprecated in the current development version - I hope to resolve this soon and publish the development version.
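
If you prefer the CRAN release instead (with the caveat above in mind), installation is the standard:

install.packages("categoryEncodings")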

Example

Here we want to encode all of the factors in a given data.frame.

library(categoryEncodings)
# create some example data
data_fm <- cbind(data.frame(matrix(rnorm(5 * 100), ncol = 5)),
                 sample(sample(letters, 10), 100, replace = TRUE),
                 sample(sample(letters, 20), 100, replace = TRUE),
                 sample(sample(1:10, 5), 100, replace = TRUE),
                 sample(sample(1:50, 35), 100, replace = TRUE),
                 sample(1:2, 100, replace = TRUE))
colnames(data_fm)[6:10] <- c( "few_letters",  "many_letters",
                              "some_numbers", "many_numbers",
                              "binary" ) 
# It does not matter how many factor variables there are, whether they are encoded as factors,
# or whether you supply a method to encode them by - simple inference of factors is done
# based on the number of distinct values in each variable: variables with few enough
# distinct values are deemed essentially factors and treated as such for conversion.
# You will be notified of which variables are being converted via a warning.
result <- encoder(data_fm)
#> Warning in handle_factors(X, fact): 
#>  Inferring factors: 
#>  few_letters 
#> many_letters 
#> some_numbers 
#> many_numbers 
#> binary
# note that due to the data.table back-end, the result has to be saved to an object to be 
# visible: otherwise printing is suppressed.   
head(result$encoded)
#>            X1         X2          X3          X4          X5
#> 1:  0.9991667 -0.8563340 -0.02101734 -1.18764199 -0.09672291
#> 2: -0.0408273  0.6100193  1.05773261  0.07433868 -1.51293569
#> 3: -3.0223377  0.2533907 -0.29310384 -0.67700946 -1.17672457
#> 4:  0.2860257  0.2831005 -1.04582237  0.12541452  1.10729575
#> 5:  1.0268237 -0.1994257  0.13540853 -2.24425666 -0.48221181
#> 6:  2.8180022 -0.1213061 -1.07908356  0.72342565  1.35827814
#>    few_letters_X1_mean few_letters_X2_mean few_letters_X3_mean
#> 1:           0.1410635         -0.16953195          -0.4281999
#> 2:          -0.0593829          0.30228766          -0.2516727
#> 3:          -0.3437670          0.12934365          -0.2391281
#> 4:          -0.2300727          0.03448788          -0.5145622
#> 5:          -0.3437670          0.12934365          -0.2391281
#> 6:           0.4468059          0.21750095          -0.1616724
#>    few_letters_X4_mean few_letters_X5_mean many_letters_X1_mean
#> 1:        -0.007318328         -0.03573149         -0.009508356
#> 2:        -0.444374549         -0.33955771          0.752536873
#> 3:        -0.376740256          0.11779062         -0.009508356
#> 4:         0.275041665         -0.10717482         -0.127222281
#> 5:        -0.376740256          0.11779062          1.060956129
#> 6:         0.207355045         -0.12851764          0.425306025
#>    many_letters_X2_mean many_letters_X3_mean many_letters_X4_mean
#> 1:           0.41121640          -0.30025118         0.0508638206
#> 2:          -0.10888477          -0.03649697        -0.0004609365
#> 3:           0.41121640          -0.30025118         0.0508638206
#> 4:          -0.46090763          -0.86687255        -0.4073961849
#> 5:           0.39560841          -0.25271095        -0.2260046598
#> 6:           0.05403458          -0.01756019         0.0760504513
#>    many_letters_X5_mean some_numbers_1_SPCA some_numbers_2_SPCA
#> 1:          -0.58924951          0.22504803           0.1600847
#> 2:          -0.77765543         -0.16802721           0.2750081
#> 3:          -0.58924951          0.06959357           0.1655578
#> 4:           0.46077097          0.06959357           0.1655578
#> 5:           0.01249808         -0.51984950          -0.2600594
#> 6:           0.41030929         -0.51984950          -0.2600594
#>    some_numbers_3_SPCA some_numbers_4_SPCA some_numbers_5_SPCA
#> 1:          0.15298987        -0.094904684         0.005508219
#> 2:         -0.15234943        -0.009941846         0.003210824
#> 3:          0.05426824         0.126029279        -0.003644279
#> 4:          0.05426824         0.126029279        -0.003644279
#> 5:          0.02925956         0.017447113         0.029312642
#> 6:          0.02925956         0.017447113         0.029312642
#>    many_numbers_X1_mean many_numbers_X2_mean many_numbers_X3_mean
#> 1:           -0.1255713           -0.3799080           0.02312113
#> 2:           -0.9257131            0.3821702          -0.09373120
#> 3:           -0.9257131            0.3821702          -0.09373120
#> 4:           -0.9257131            0.3821702          -0.09373120
#> 5:            1.4593446           -0.1602822          -0.04870327
#> 6:            1.4593446           -0.1602822          -0.04870327
#>    many_numbers_X4_mean many_numbers_X5_mean binary_1_SPCA binary_2_SPCA
#> 1:           -0.1330829            0.4755544    -0.1327683    0.01185374
#> 2:           -0.1590854           -0.5274548    -0.1327683    0.01185374
#> 3:           -0.1590854           -0.5274548    -0.1327683    0.01185374
#> 4:           -0.1590854           -0.5274548    -0.1327683    0.01185374
#> 5:           -0.6368855            0.1839555    -0.1327683    0.01185374
#> 6:           -0.6368855            0.1839555    -0.1327683    0.01185374

We also recover a function closure which we can reuse to encode new data, as long as it conforms to the same format:

# to fit to any dataset you can either call it directly - it is a single argument function
data_fm_encoded <- result$fitted_encoder(data_fm)

# or rename it, and stash it away for later use
encoding_function <- result$fitted_encoder
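
For example, assuming the new data shares the original columns and factor levels, the stored function can be applied directly (new_data below is just a hypothetical resample of the example data):

# hypothetical new data with the same structure as data_fm
new_data <- data_fm[sample(nrow(data_fm), 20), ]
new_encoded <- encoding_function(new_data)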

You also get a “de-encoding” function -

deencoder <- result$fitted_deencoder

This undoes the “encoding”, effectively returning the original data. This can be quite useful for interpretability methods, where interpretation is easier on un-encoded data. Note that this sadly does not preserve the row order of the original data, and some attributes may be lost. Nonetheless, the recovered data is almost the same, and equivalent for all practical purposes:

original <- data.table::data.table(data_fm)
deencoded <- deencoder( result$encoded ) 


all.equal( data.table::setorder(original), 
           data.table::setorder(deencoded), 
           check.attributes = FALSE )
#> [1] TRUE

Contributing

Please do contribute to the project - all contributions are welcome, as long as people keep things civil; there is no need for negativity, hatred, or rudeness. Also, please refrain from adding unnecessary dependencies (e.g. the pipe) to the package - pull requests that add an unnecessary dependency will be denied or suspended until the code can be made dependency-free. This package aims to be as lightweight as possible, even if that means the code is a bit harder to write and maintain.


Known issues

Random crashes when a factor has only one level

None of the methods currently check that the factors specified by fact actually have more than one level. When such a factor is present, the function crashes loudly with a misleading error message about assigning variable names to an empty vector.

The fix would be to explicitly check for this situation (a categorical variable with a single level) and, optionally, remove such variables.
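
A minimal sketch of such a guard, assuming the candidate factor columns are in a character vector fact and the data is a data.table X (names follow the issue text; this is not the package's actual implementation):

# drop factors that have fewer than two distinct values before encoding
single_level <- fact[vapply(fact, function(col) data.table::uniqueN(X[[col]]) < 2, logical(1))]
if (length(single_level) > 0) {
  warning("Removing single-level factors: ", paste(single_level, collapse = ", "))
  fact <- setdiff(fact, single_level)
}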

Joins in encode categories reduce the number of rows

The joins in encode categories do not perform a full outer join in some cases, which leads to a loss of rows. The culprit seems to be:

res <- X[final[[1]], on = all_factor[1]]

if (length(all_factor) > 1) {
  for (i in 2:length(all_factor)) {
    res <- res[final[[i]], on = all_factor[i]]
  }
}

Here the [] join implicitly drops rows of X that have no match in final[[i]], rather than performing a full outer join. A quick fix which seems to resolve the issue so far (more testing is needed) would be:

res <- merge(X, final[[1]], by = all_factor[1], all = TRUE)

if (length(all_factor) > 1) {
  for (i in 2:length(all_factor)) {
    res <- merge(res, final[[i]], by = all_factor[i], all = TRUE)
  }
}

i.e. replacing the implicit join with an explicit merge().
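
A toy illustration of the difference (the tables here are hypothetical, purely to show the behaviour):

library(data.table)
# hypothetical toy tables, not from the package
X     <- data.table(grp = c("a", "b", "c"), x = 1:3)
final <- data.table(grp = c("a", "b"), enc = c(0.1, 0.2))

X[final, on = "grp"]                     # row "c" of X is dropped
merge(X, final, by = "grp", all = TRUE)  # full outer join keeps row "c", with enc = NA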
