
thomasp85 / lime

Stars: 479 · Watchers: 31 · Forks: 107 · Size: 10.88 MB

Local Interpretable Model-Agnostic Explanations (R port of original Python package)

Home Page: https://lime.data-imaginist.com/

License: Other

R 87.86% C++ 10.55% CSS 1.05% JavaScript 0.26% C 0.28%
Topics: r, model-checking, modeling, model-evaluation, caret

lime's Introduction

lime

R-CMD-check Codecov test coverage CRAN_Release_Badge CRAN_Download_Badge

There once was a package called lime,

Whose models were simply sublime,

It gave explanations for their variations,

one observation at a time.

lime-rick by Mara Averick


This is an R port of the Python lime package (https://github.com/marcotcr/lime) developed by the authors of the lime (Local Interpretable Model-agnostic Explanations) approach for black-box model explanations. All credit for the invention of the approach goes to the original developers.

The purpose of lime is to explain the predictions of black-box classifiers. What this means is that, for any given prediction and any given classifier, it is able to determine a small set of features in the original data that have driven the outcome of the prediction. To learn more about the methodology of lime, read the paper and visit the repository of the original implementation.

The lime package for R does not aim to be a line-by-line port of its Python counterpart. Instead, it takes the ideas laid out in the original code and implements them in an API that is idiomatic to R.

An example

Out of the box, lime supports a wide range of models, e.g. those created with caret, parsnip, and mlr. Support for other models is easy to achieve by adding a predict_model and model_type method for the given model, as sketched below.
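To illustrate what such methods look like, here is a minimal sketch for a hypothetical model class my_model whose predict() method is assumed to return a list with class and posterior elements (the class name and return shape are assumptions for the example, not part of lime):

# model_type() tells lime whether the model does classification or regression
model_type.my_model <- function(x, ...) {
  "classification"
}

# predict_model() must return a data.frame: a single Response column for
# type = 'raw', or one column of class probabilities for type = 'prob'
predict_model.my_model <- function(x, newdata, type, ...) {
  res <- predict(x, newdata = newdata)
  switch(type,
    raw = data.frame(Response = res$class, stringsAsFactors = FALSE),
    prob = as.data.frame(res$posterior, check.names = FALSE)
  )
}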

The following shows how a random forest model is trained on the iris data set and how lime is then used to explain a set of new observations:

library(caret)
library(lime)

# Split up the data set
iris_test <- iris[1:5, 1:4]
iris_train <- iris[-(1:5), 1:4]
iris_lab <- iris[[5]][-(1:5)]

# Create Random Forest model on iris data
model <- train(iris_train, iris_lab, method = 'rf')

# Create an explainer object
explainer <- lime(iris_train, model)

# Explain new observation
explanation <- explain(iris_test, explainer, n_labels = 1, n_features = 2)

# The output is provided in a consistent tabular format and includes the
# output from the model.
explanation
#> # A tibble: 10 × 13
#>    model_type   case  label label_prob model_r2 model_intercept model_prediction
#>    <chr>        <chr> <chr>      <dbl>    <dbl>           <dbl>            <dbl>
#>  1 classificat… 1     seto…          1    0.695           0.118            0.991
#>  2 classificat… 1     seto…          1    0.695           0.118            0.991
#>  3 classificat… 2     seto…          1    0.680           0.123            0.974
#>  4 classificat… 2     seto…          1    0.680           0.123            0.974
#>  5 classificat… 3     seto…          1    0.668           0.134            0.972
#>  6 classificat… 3     seto…          1    0.668           0.134            0.972
#>  7 classificat… 4     seto…          1    0.668           0.132            0.980
#>  8 classificat… 4     seto…          1    0.668           0.132            0.980
#>  9 classificat… 5     seto…          1    0.691           0.125            0.980
#> 10 classificat… 5     seto…          1    0.691           0.125            0.980
#> # … with 6 more variables: feature <chr>, feature_value <dbl>,
#> #   feature_weight <dbl>, feature_desc <chr>, data <list>, prediction <list>

# And can be visualised directly
plot_features(explanation)

lime also supports explaining image and text models. For image explanations the relevant areas in an image can be highlighted:

explanation <- .load_image_example()

plot_image_explanation(explanation)

Here we see that the second most probable class is hardly plausible, but results from the model picking up on waxy areas of the produce and interpreting them as a wax-like surface.

For text the explanation can be shown by highlighting the important words. It even includes a shiny application for interactively exploring text models:

interactive text explainer
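Assuming a text explainer has already been built, the application shown above can be launched with lime's interactive_text_explanations() function (a sketch; explainer here stands in for your own text explainer object):

# Launch the shiny app for interactively exploring a text model
interactive_text_explanations(explainer)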

Installation

lime is available on CRAN and can be installed using the standard approach:

install.packages('lime')

To get the development version, install from GitHub instead:

# install.packages('devtools')
devtools::install_github('thomasp85/lime')

Code of Conduct

Please note that the ‘lime’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

lime's People

Contributors

batpigandme, chrismuir, christophm, emilhvitfeldt, jeroen, jonmcalder, ledgerw, maelle, martinju, mdancho84, millerjoey, nielsenmarkus11, pkopper, pommedeterresautee, samleegithub, thomasp85


lime's Issues

Error in explain function with H2O GBM regression model - Error in if (r2 > max) { : missing value where TRUE/FALSE needed

Hi, can you please check into issue #46 again?
Just out of curiosity I tried dropping the month.lbl variable, and now I don't get the warning message but still have the same error message, even though my training data covers the full feature space.

library(tidyverse)
library(h2o)
library(lime)

dataset_url <- "https://www.dropbox.com/s/t3o1zvzq0t7emz4/sales.RDS?raw=1"
sales_aug <- readRDS(gzcon(url(dataset_url)))

sales_aug <- sales_aug %>% select(-month.lbl) # Dropping factor variable with non full feature range

train <- sales_aug %>% filter(month <= 8)
valid <- sales_aug %>% filter(month == 9)
test <- sales_aug %>% filter(month >= 10)

h2o.init()
h2o.no_progress()
train <- as.h2o(train)
valid <- as.h2o(valid)
test <- as.h2o(test)

y <- "amount"
x <- setdiff(names(train), y)

leaderboard <- h2o.automl(x, y, training_frame = train, validation_frame = valid, leaderboard_frame = test, max_runtime_secs = 30, stopping_metric = "MSE", seed = 12345)
gbm_model <- leaderboard@leader

explainer <- lime(as.data.frame(train), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test[1:5,]), explainer, n_features = 5)
#> Error in if (r2 > max) {: missing value where TRUE/FALSE needed

Error using H2O xgboost model

Opening this new issue as requested in issue #40.

When using lime::explain with the H2O xgboost model it throws the following error:

Error: java.lang.IllegalArgumentException: Given domain has 0 classes, but predictions have 3 columns (per-class probabilities) for multinomial metrics.

Here is a reproducible example using the Iris dataset:

library(h2o)
h2o.init()

# First show that it's successful without any missing data
full_iris_frame <- as.h2o(iris)
full_mdl <- h2o.xgboost(training_frame = full_iris_frame, y = "Species")

full_explainer <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_mdl)

full_explanation <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
                             full_explainer, n_labels = 3 , n_features = 3)

Error in explain function with H2O GBM regression model - Error in if (r2 > max) { : missing value where TRUE/FALSE needed

I'm trying to use the package with an H2O gbm regression model, but I get this error:

explainer <- lime(as.data.frame(train_h2o), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test_h2o[1:10,]), explainer, n_features = 5)
#> Error in if (r2 > max) { : missing value where TRUE/FALSE needed
#> In addition: Warning message:
#> In `[<-.factor`(`*tmp*`, iseq, value = c(1L, 1L, 1L, 1L, 1L, 1L,  :
#>  invalid factor level, NA generated

I have no NA values in my dataset, and the structure looks like this:

str(head(train))
'data.frame':	6 obs. of  25 variables:
 $ monto    : num  958 363 340 299 382 ...
 $ feriado  : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1
 $ cumple   : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1
 $ index.num: num  1.48e+09 1.48e+09 1.48e+09 1.48e+09 1.48e+09 ...
 $ year     : num  2017 2017 2017 2017 2017 ...
 $ year.iso : num  2016 2017 2017 2017 2017 ...
 $ half     : num  1 1 1 1 1 1
 $ quarter  : num  1 1 1 1 1 1
 $ month    : num  1 1 1 1 1 1
 $ month.xts: num  0 0 0 0 0 0
 $ month.lbl: Factor w/ 8 levels "Abril","Agosto",..: 3 3 3 3 3 3
 $ day      : num  1 2 3 4 5 6
 $ wday     : num  1 2 3 4 5 6
 $ wday.xts : num  0 1 2 3 4 5
 $ wday.lbl : Factor w/ 7 levels "domingo","jueves",..: 1 3 4 5 2 7
 $ mday     : num  1 2 3 4 5 6
 $ qday     : num  1 2 3 4 5 6
 $ yday     : num  1 2 3 4 5 6
 $ mweek    : num  5 1 1 1 1 1
 $ week     : num  1 1 1 1 1 1
 $ week.iso : num  52 1 1 1 1 1
 $ week2    : num  1 1 1 1 1 1
 $ week3    : num  1 1 1 1 1 1
 $ week4    : num  1 1 1 1 1 1
 $ mday7    : num  1 1 1 1 1 1

I don't understand the error message; any clue as to what is happening here?

Error in y[1, ] : incorrect number of dimensions

Hello, I think the lime methodology and package are great. But I am getting the error "Error in y[1, ] : incorrect number of dimensions" when I try to use lime on a randomForest model. Here is my code and error:

I think the issue is how I am defining "predict_model.randomForest", but I'm not sure how to fix the "y[1, ]" error. Help would be very much appreciated. Thank you.

library(lime)
library(randomForest)
library(MASS)

model <- randomForest(medv ~ ., data = Boston, keep.forest = TRUE)

predict_model.randomForest <- function(x, newdata, type, ...) {
  res <- predict(x, newdata = newdata, type = ifelse(type == "raw", "response", type))
  switch(type,
         raw = data.frame(Response = res, stringsAsFactors = FALSE),
         prob = as.data.frame(res, check.names = FALSE)
  )
}

model_type.randomForest <- function(x, ...) "regression"

explanation <- lime(Boston[, 1:13], model)
explanations <- explain(Boston[1:5, 1:13], explanation, n_features = 3, n_labels = 1)

**Error in y[1, ] : incorrect number of dimensions**

Font family not found in Windows font database

Hi Thomas,

I am trying to use your lime package, and whenever I run the plot_features function I get the following warning in the R console multiple times:
Warning messages:
1: In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font family not found in Windows font database

I get the same warning around 8 to 10 times. Is there some specific font that you are using that is only available on Linux machines?

It would be great if you could suggest a workaround or change this in the code. The warnings are a bit annoying.

Regards,

Dependency on magrittr and purrr

For comfort reasons only, I have heavily used these two packages.
I can easily remove them, but it will make the code less readable and longer.

However, we have too many dependencies right now, and the package is still small. Some of them will probably break in the future (Hadley's packages have beautiful APIs but a tendency to evolve in the long term), and so on... (this also explains why I want to remove the other dependencies).

So, what do you think about cleaning up the package?
It could happen after the API change you have planned.

Error in plot_explanations() function with a regression model - Error in combine_vars(data, params$plot_env, vars, drop = params$drop) : At least one layer must contain all variables used for facetting

Hi, does plot_explanations() work with regression models? If it does, what am I doing wrong?

Here is a reproducible example:

library(tidyverse)
library(h2o)
library(lime)

dataset_url <- "https://www.dropbox.com/s/m2a0at1vk0xf3cr/sales.rds?raw=1"
sales_aug <- readRDS(gzcon(url(dataset_url)))

train <- sales_aug %>% filter(month <= 8)
valid <- sales_aug %>% filter(month == 9)
test <- sales_aug %>% filter(month == 10)

h2o.init()
h2o.no_progress()
train <- as.h2o(train)
valid <- as.h2o(valid)
test <- as.h2o(test)

y <- "amount"
x <- setdiff(names(train), y)

leaderboard <- h2o.automl(x, y, training_frame = train, validation_frame = valid, leaderboard_frame = test, max_runtime_secs = 30, stopping_metric = "deviance", seed = 12345)

model <- leaderboard@leader

explainer <- lime(as.data.frame(train[,-1]), model)
#> Warning: Data contains numeric columns with zero variance
explanation <- explain(as.data.frame(test[1:5,-1]), explainer, n_features = 5)

plot_explanations(explanation)
#> Error in combine_vars(data, params$plot_env, vars, drop = params$drop): At least one layer must contain all variables used for facetting

Lime Error in y[1, ] incorrect number of dimensions

Hi

Sorry in advance, as I expect this is a daft error on my part, but I keep getting an incorrect number of dimensions error. Code:

library(caret)
library(lime)
x = runif(100)
y = runif(100)
traindata = data.frame(x,y)
x = runif(100)
y = runif(100)
testdata = data.frame(x,y)
summary(testdata)
model <- train(y ~ x, data = traindata, method = 'lm')
summary(model)
prediction <- predict(model, testdata)
table(prediction, testdata$y)
explainer <- lime(traindata, model)
explanation <- explain(testdata, explainer, labels = c('x'), n_features = 2)

Error in y[1, ] : incorrect number of dimensions
In addition: Warning message:
In explain.data.frame(testdata, explainer, labels = c("x"), n_features = 2) :
"labels" and "n_labels" arguments are ignored when explaining regression models

Regression Example

Hi Thomas, thanks for your work here! I wanted to test the regression functionality similar to the python lime example here.

Everything seems to work correctly until the call to explain. Here is my reprex:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(lime)

# load data
boston <- MASS::Boston

# Split up the data set
boston_test <- boston[1:100, 1:13]
boston_train <- boston[-(1:100), 1:13]
boston_lab <- boston[[14]][-(1:100)]

# Create Random Forest model on boston data
model_reg <- train(boston_train, boston_lab, method = 'rf')
#> randomForest 4.6-12
#> Type rfNews() to see new features/changes/bug fixes.
#> 
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:ggplot2':
#> 
#>     margin

# Create an explainer object
explainer_reg <- lime(boston_train, model_reg)

# Explain new observation
explanation_reg <- explain(
  boston_test, 
  explainer_reg, 
  labels = NULL, 
  n_labels = 1, 
  n_features = 5
  )
#> Warning in explain.data.frame(boston_test, explainer_reg, labels = NULL, :
#> "labels" and "n_labels" arguments are ignored when explaining regression
#> models
#> Error in y[1, ]: incorrect number of dimensions

I may be missing something obvious, but I haven't discovered it yet. Also, as a note, if I attempt to leave the labels and n_labels arguments out of the function call, it throws an error. Thanks again.

Error loading lime

I installed lime on ubuntu 16.04, but I am getting the following error when I try to load in Rstudio:

Error: package or namespace load failed for ‘lime’ in get(Info[i, 1], envir = env):
lazy-load database '/usr/local/lib/R/site-library/dplyr/R/dplyr.rdb' is corrupt
In addition: Warning message:
In get(Info[i, 1], envir = env) : internal error -3 in R_decompress1

Explanations should be returned as a list and not a data.frame

The content of several columns is repeated on each row for a given case. The format of the result data is not adapted to the content of the explanations.
Instead, we should return a nested list:

  • each case is represented by a slot in the list
  • each slot is a list itself containing several slots:
    • a data.frame with information related to the selected features (feature name, description, weight, ...)
    • a scalar with label_prob
    • model type
    • model intercept
    • data
    • prediction

This would improve readability of the explanations in a natural way (because the representation would look like what it means), and the modification would be easy to implement without breaking everything. A sketch of the format is shown below.
Let me know if you want me to PR that.
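For illustration, the proposed return format might look like this (a sketch with made-up values; the slot names follow the list above):

explanation <- list(
  case_1 = list(
    features = data.frame(
      feature = c("Petal.Width", "Petal.Length"),
      feature_desc = c("Petal.Width <= 0.4", "Petal.Length <= 1.6"),
      feature_weight = c(0.72, 0.65),
      stringsAsFactors = FALSE
    ),
    label_prob = 1,
    model_type = "classification",
    model_intercept = 0.118,
    data = iris[1, 1:4],
    prediction = "setosa"
  )
)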

Support for time series recurrent neural networks

Hi @thomasp85, thanks for this super nice package!
I hope it is appropriate to ask a quick question here.

Does your implementation already support explainers for time series recurrent neural networks?
I am a bit new to keras and lime in R, but this use case would make it quite attractive to dive a bit deeper into this.

Regression support

I saw that regression support was added to the Python implementation, and that you commented there but received no response. Are you planning on adding support for regression models to this package?

Remove dependency on dplyr

We are using only the bind_rows() function from dplyr. Maybe we want to replace it with an rbind call:

df <- Reduce(rbind, listOfDataFrames)
# or
df <- do.call("rbind", listOfDataFrames)

Error: from glmnet Fortran code (error code 7777); All used predictors have zero variance

Hi!

Thanks for the great implementation of LIME in R. I think packages such as this will become extremely popular in the future. Maybe this question is best directed at the developer of glmnet...but I thought I would start here.

I am attempting to use LIME on one of my projects but keep bumping into the error below when I run the created 'explain' function after having run the lime() function:
"Error: from glmnet Fortran code (error code 7777); All used predictors have zero variance"

I haven't been able to figure out the reason for the above error; it just seems to pop up when I add additional features (or records) to my training data. All of the columns I use have some variance, so I don't think the error is correct. Googling the error also hasn't helped so far. Note that I don't believe the problem is with the test data, since I can change that and the error persists.

Below is a (hopefully) reproducible example that shows how adding one extra feature (column) to the training data can trigger the error.


# Load libraries
library(caret)
library(lime)

# Manually create example training data
col1 <- c(0.25, 0.25, 0.89, -1.65, 0.89, -0.38, -1.65, 0.89, -0.38, 0.89)
col2 <- c(-1.25, 0.99, 0.44, -0.28, -1.86, 0.82, 0.82, -0.28, 0.99, -0.28)
col3 <- c(-0.92, -0.92, -0.92, 0.97, 0.42, -0.92, -0.92, 1.31, 0.66, 1.23)
col4 <- c(0.43, 0.79, -2.77, -0.07, 0.41, 0.12, 0.11, 0.32, 0.32, 0.32)
col5 <- c(0.65, 0.51, -2.78, 0.28, 0.46, 0, -0.05, 0.34, 0.28, 0.3)
col6 <- c(0.17, 1.07, 0.17, 0.17, 0.39, -2.54, 0.17, 0.7, -0.7, 0.41)
col7 <- c(0.42,0.56, 0.09, 0.32, 0.38, 0.46, -0.09, -2.8, 0.31, 0.34)
col8 <- c(0.21,0.47, 0.21, 0.21, 0.36, -0.18, 0.21, 0.8, -2.75, 0.47)
col9 <- c(0.28, 0.28, 0.28, 0.28, -2.84, 0.28, 0.28, 0.44, 0.28, 0.46)

# Combine training data into data frames
data1Train <- data.frame(col1, col2, col3, col4, col5, col6, col7, col8) # LIME works with this data
data2Train <- data.frame(col1, col2, col3, col4, col5, col6, col7, col8, col9) # LIME doesn't work with this data

# Create training data target variable
targetTrain <- as.factor(c(1, 0, 1, 0, 1, 0, 0, 0, 1, 1))

# Manually create example test data
col1 <- c(4.69, 4.69)
col2 <- c(-0.59, -0.94)
col3 <- c(1.35, -0.92)
col4 <- c(-2.77, -0.7)
col5 <- c(-2.78, 0.21)
col6 <- c(-2.54, 0.02)
col7 <- c(-2.8, 0.07)
col8 <- c(-2.75, 0.18)
col9 <- c(-2.84, -2.84)

# Combine test data into data frames
data1Test <- data.frame(col1, col2, col3, col4, col5, col6, col7, col8)
data2Test <- data.frame(col1, col2, col3, col4, col5, col6, col7, col8, col9)

# Create test data target variable (note that this data is not used below)
targetTest <- as.factor(c(0, 1))

# Run first model and explain predictions using lime
model1       <- train(data1Train, targetTrain, method = 'rf', tuneLength = 1, trControl = trainControl(method = "none"))
explain1     <- lime(data1Train, model1)
explanation1 <- explain1(data1Test, n_labels = 1, n_features = 2)
plot_features(explanation1, ncol = 1)

# Run second model and explain predictions using lime
model2       <- train(data2Train, targetTrain, method = 'rf', tuneLength = 1, trControl = trainControl(method = "none"))
explain2     <- lime(data2Train, model2)
explanation2 <- explain2(data2Test, n_labels = 1, n_features = 2)
plot_features(explanation2, ncol = 1)

Using lime() on xgboost object

Hi, and thank you for an excellent package!

I am trying to apply the lime package to a model fitted with xgboost (using the original xgboost package), but the lime function does not seem to accept the input format, even though the predict function works fine.

Example using both an xgb.DMatrix and a regular matrix:

library(xgboost)
library(lime)

x = matrix(rnorm(100 * 10), ncol = 10)
y = rnorm(100)

xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)

mod = xgb.train(data = xgbDMatrix.obj, nrounds = 100) # Variant 1, using the xgb.DMatrix format of data input
# predict(mod, x, type = "prob") # works fine
explain <- lime(x = x, model = mod) # Throws error

mod = xgboost(data = x, label = y, nrounds = 100) # Variant 2, using a regular matrix + vector as data input
# predict(mod, x, type = "prob") # works fine
explain <- lime(x = x, model = mod) # Throws error

In the README you mention manually building a predict function.
If that is the solution here, could you please provide some guidelines on how to do that?
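A sketch of what such methods could look like for a regression booster (note that a later issue below discusses lime's own predict_model.xgb.Booster, which converts the input similarly; this version is only an illustration and assumes newdata arrives as a data.frame or matrix of numeric columns):

# Declare the model type and prediction wrapper for xgb.Booster by hand
model_type.xgb.Booster <- function(x, ...) "regression"

predict_model.xgb.Booster <- function(x, newdata, type, ...) {
  res <- predict(x, xgboost::xgb.DMatrix(as.matrix(newdata)))
  # lime expects regression predictions as a one-column data.frame
  data.frame(Response = res)
}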

explaining rpart

Hello,

I am currently trying out your library, since I think it would be great if one could use lime in R. However, when using rpart I am getting a weird error. Here is the code to reproduce it:

library(rpart)
library(lime)

iris_test <- iris[52, 1:4]
iris_train <- iris[-(52), 1:4]
iris_lab <- iris[[5]][-(52)]
rpart_train <- iris[-(52),]

# I want to explain the 52nd row of the dataset

model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = rpart_train,
               method = 'class')

explain <- lime(iris_train, model, bin_continuous = TRUE, n_bins = 5)
explanation <- explain(iris_test, n_labels = 2, n_features = 3)

when I run the last line, I get the following error:

Error in `$<-.data.frame`(`*tmp*`, "case", value = "52") :
  replacement has 1 row, data has 0
In addition: Warning messages:
1: Unknown or uninitialised column: 'feature'.
2: Unknown or uninitialised column: 'feature'.

I'm not sure whether it's because I am doing something wrong, since I am fairly new to R, or whether it's an actual issue.

lime::lime error

Hi,

I'm running an lgb.train model, and when I do it with the iris data I have no problem running the lime function. However, when I use my dataset (~50k observations and 115 variables) I get the following error:

# Error in cut.default(x[[i]], unique(explainer$bin_cuts[[i]]), labels = FALSE, :
#   invalid number of intervals

I tried stepping through the function's code line by line, but found no clues.
My target variable is binary (0 and 1) and my whole dataset is numeric.

my code is:

model <- lgb.train(data = train, 
                   label = target,
                   obj = "binary", 
                   eval = "binary_logloss",
                   nrounds = 100,
                   early_stopping = 100)

expl <- lime(train, model)

I also tried to transform train into a data.frame, no luck though.

P.S.: my dataset is too large to post here.
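One way to hunt for the offending column is sketched below; the assumption (inferred from the error text, which points at unique(explainer$bin_cuts[[i]])) is that a near-constant numeric column makes the quantile cuts collapse to too few values for cut():

# Flag numeric columns whose quantile cuts collapse; lime bins with
# quantile(x, seq(0, 1, length.out = n_bins + 1)) and n_bins defaults to 4
train_df <- as.data.frame(train)
suspect <- vapply(train_df, function(col) {
  is.numeric(col) &&
    length(unique(quantile(col, seq(0, 1, length.out = 5)))) < 3
}, logical(1))
names(suspect)[suspect]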

cannot install lime package under Linux

permutations.cpp: In function ‘Rcpp::List get_index_permutations(Rcpp::IntegerVector, int)’:
permutations.cpp:15:80: error: ‘sample’ was not declared in this scope
IntegerVector nb = sample(original_document.size(), number_permutations, true);
^
make: *** [permutations.o] Error 1
ERROR: compilation failed for package ‘lime’

Error using H2O with xgboost (and gbm)....

Hello,

I tried to reproduce what dkincaid reported in issue #50.
It seemed there that h2o.gbm() had no issue, but at least for me it produces an error too.

I am using h2o version 3.14.0.3 although I got the same behaviour with the latest nightly build 3.15.0.4104

Just for the sake of clarification let me reproduce the two examples:

For h2o.xgboost()

library(lime)
library(h2o)
h2o.init()

# XGBOOST
# First show that it's successful without any missing data
full_iris_frame <- as.h2o(iris)
full_xgb <- h2o.xgboost(training_frame = full_iris_frame, y = "Species")

full_explainer <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_xgb)

full_explanation <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
                                  full_explainer, n_labels = 3 , n_features = 3)

That on my Mac produces this error:

> # XGBOOST
> # First show that it's successful without any missing data
> full_iris_frame <- as.h2o(iris)
  |=========================================================================================| 100%
> full_xgb <- h2o.xgboost(training_frame = full_iris_frame, y = "Species")
  |=========================================================================================| 100%
> 
> full_explainer <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_xgb)
> 
> full_explanation <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
+                                   full_explainer, n_labels = 3 , n_features = 3)
Error: The class of model must have a model_type method. Models other than those from `caret` and `mlr` must have a `model_type` method defined manually e.g. model_type.mymodelclass <- function(x, ...) "classification"

And for the h2o.gbm model:

 GBM
# First show that it's successful without any missing data
full_iris_frame <- as.h2o(iris)
full_gbm <- h2o.gbm(training_frame = full_iris_frame, y = "Species")

full_explain_gbm <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_gbm)

full_explanation_gbm <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
                                  full_explain_gbm, n_labels = 3 , n_features = 3)

That generates the same type of error:

> # GBM
> # First show that it's successful without any missing data
> full_iris_frame <- as.h2o(iris)
  |=========================================================================================| 100%
> full_gbm <- h2o.gbm(training_frame = full_iris_frame, y = "Species")
  |=========================================================================================| 100%
> 
> full_explain_gbm <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_gbm)
> 
> full_explanation_gbm <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
+                                   full_explain_gbm, n_labels = 3 , n_features = 3)
Error: The class of model must have a model_type method. Models other than those from `caret` and `mlr` must have a `model_type` method defined manually e.g. model_type.mymodelclass <- function(x, ...) "classification"

If we run the example included in the explain() help, everything runs smoothly.

> # LDA
> # Explaining a model based on tabular data
> library(MASS)
> iris_test <- iris[1, 1:4]
> iris_train <- iris[-1, 1:4]
> iris_lab <- iris[[5]][-1]
> # Create linear discriminant model on iris data
> model <- lda(iris_train, iris_lab)
> # Create explanation object
> explanation <- lime(iris_train, model)
> 
> # This can now be used together with the explain method
> lda_explain <- explain(iris_test, explanation, n_labels = 1, n_features = 2)

There is a clear difference between the structure of the full_explain_gbm and explanation objects, in particular what is included in full_explain_gbm$model and explanation$model.

full_explain_gbm$model

> str(full_explain_gbm$model)
Formal class 'H2OMultinomialModel' [package "h2o"] with 5 slots
  ..@ model_id     : chr "GBM_model_R_1511024279025_2"
  ..@ algorithm    : chr "gbm"
  ..@ parameters   :List of 6
  .. ..$ model_id      : chr "GBM_model_R_1511024279025_2"
  .. ..$ training_frame: chr "iris"
  .. ..$ seed          : num -8.9e+18
  .. ..$ distribution  : chr "multinomial"
  .. ..$ x             : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
  .. ..$ y             : chr "Species"
  ..@ allparameters:List of 44
  .. ..$ model_id                             : chr "GBM_model_R_1511024279025_2"
  .. ..$ training_frame                       : chr "iris"
  .. ..$ nfolds                               : int 0
  .. ..$ keep_cross_validation_predictions    : logi FALSE
  .. ..$ keep_cross_validation_fold_assignment: logi FALSE
  .. ..$ score_each_iteration                 : logi FALSE
  .. ..$ score_tree_interval                  : int 0
  .. ..$ fold_assignment                      : chr "AUTO"
  .. ..$ ignore_const_cols                    : logi TRUE
  .. ..$ balance_classes                      : logi FALSE
  .. ..$ max_after_balance_size               : num 5
  .. ..$ max_confusion_matrix_size            : int 20
  .. ..$ max_hit_ratio_k                      : int 0
  .. ..$ ntrees                               : int 50
  .. ..$ max_depth                            : int 5
  .. ..$ min_rows                             : num 10
  .. ..$ nbins                                : int 20
  .. ..$ nbins_top_level                      : int 1024
  .. ..$ nbins_cats                           : int 1024
  .. ..$ r2_stopping                          : num 1.8e+308
  .. ..$ stopping_rounds                      : int 0
  .. ..$ stopping_metric                      : chr "AUTO"
  .. ..$ stopping_tolerance                   : num 0.001
  .. ..$ max_runtime_secs                     : num 0
  .. ..$ seed                                 : num -8.9e+18
  .. ..$ build_tree_one_node                  : logi FALSE
  .. ..$ learn_rate                           : num 0.1
  .. ..$ learn_rate_annealing                 : num 1
  .. ..$ distribution                         : chr "multinomial"
  .. ..$ quantile_alpha                       : num 0.5
  .. ..$ tweedie_power                        : num 1.5
  .. ..$ huber_alpha                          : num 0.9
  .. ..$ sample_rate                          : num 1
  .. ..$ col_sample_rate                      : num 1
  .. ..$ col_sample_rate_change_per_level     : num 1
  .. ..$ col_sample_rate_per_tree             : num 1
  .. ..$ min_split_improvement                : num 1e-05
  .. ..$ histogram_type                       : chr "AUTO"
  .. ..$ max_abs_leafnode_pred                : num 1.8e+308
  .. ..$ pred_noise_bandwidth                 : num 0
  .. ..$ categorical_encoding                 : chr "AUTO"
  .. ..$ calibrate_model                      : logi FALSE
  .. ..$ x                                    : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
  .. ..$ y                                    : chr "Species"
  ..@ model        :List of 17
  .. ..$ cross_validation_models                      : NULL
  .. ..$ cross_validation_predictions                 : NULL
  .. ..$ cross_validation_holdout_predictions_frame_id: NULL
  .. ..$ cross_validation_fold_assignment_frame_id    : NULL
  .. ..$ model_summary                                : Classes 'H2OTable' and 'data.frame':	1 obs. of  9 variables:
  .. .. ..$ number_of_trees         : num 50
  .. .. ..$ number_of_internal_trees: num 150
  .. .. ..$ model_size_in_bytes     : num 28293
  .. .. ..$ min_depth               : num 1
  .. .. ..$ max_depth               : num 5
  .. .. ..$ mean_depth              : num 4.92
  .. .. ..$ min_leaves              : num 2
  .. .. ..$ max_leaves              : num 13
  .. .. ..$ mean_leaves             : num 10.1
  .. .. ..- attr(*, "header")= chr "Model Summary"
  .. .. ..- attr(*, "formats")= chr [1:9] "%d" "%d" "%d" "%d" ...
  .. .. ..- attr(*, "description")= chr ""
  .. ..$ scoring_history                              : Classes 'H2OTable' and 'data.frame':	51 obs. of  6 variables:
  .. .. ..$ timestamp                    : chr [1:51] "2017-11-18 17:58:54" "2017-11-18 17:58:54" "2017-11-18 17:58:54" "2017-11-18 17:58:54" ...
  .. .. ..$ duration                     : chr [1:51] " 0.025 sec" " 0.193 sec" " 0.225 sec" " 0.255 sec" ...
  .. .. ..$ number_of_trees              : num [1:51] 0 1 2 3 4 5 6 7 8 9 ...
  .. .. ..$ training_rmse                : num [1:51] 0.667 0.604 0.546 0.495 0.448 ...
  .. .. ..$ training_logloss             : num [1:51] 1.099 0.927 0.791 0.684 0.595 ...
  .. .. ..$ training_classification_error: num [1:51] 0.64 0.04 0.0467 0.0467 0.0467 ...
  .. .. ..- attr(*, "header")= chr "Scoring History"
  .. .. ..- attr(*, "formats")= chr [1:6] "%s" "%s" "%d" "%.5f" ...
  .. .. ..- attr(*, "description")= chr ""
  .. ..$ training_metrics                             :Formal class 'H2OMultinomialMetrics' [package "h2o"] with 5 slots
  .. .. .. ..@ algorithm: chr "gbm"
  .. .. .. ..@ on_train : logi TRUE
  .. .. .. ..@ on_valid : logi FALSE
  .. .. .. ..@ on_xval  : logi FALSE
  .. .. .. ..@ metrics  :List of 17
  .. .. .. .. ..$ __meta              :List of 3
  .. .. .. .. .. ..$ schema_version: int 3
  .. .. .. .. .. ..$ schema_name   : chr "ModelMetricsMultinomialV3"
  .. .. .. .. .. ..$ schema_type   : chr "ModelMetricsMultinomial"
  .. .. .. .. ..$ model               :List of 4
  .. .. .. .. .. ..$ __meta:List of 3
  .. .. .. .. .. .. ..$ schema_version: int 3
  .. .. .. .. .. .. ..$ schema_name   : chr "ModelKeyV3"
  .. .. .. .. .. .. ..$ schema_type   : chr "Key<Model>"
  .. .. .. .. .. ..$ name  : chr "GBM_model_R_1511024279025_2"
  .. .. .. .. .. ..$ type  : chr "Key<Model>"
  .. .. .. .. .. ..$ URL   : chr "/3/Models/GBM_model_R_1511024279025_2"
  .. .. .. .. ..$ model_checksum      : num 2.36e+18
  .. .. .. .. ..$ frame               :List of 4
  .. .. .. .. .. ..$ __meta:List of 3
  .. .. .. .. .. .. ..$ schema_version: int 3
  .. .. .. .. .. .. ..$ schema_name   : chr "FrameKeyV3"
  .. .. .. .. .. .. ..$ schema_type   : chr "Key<Frame>"
  .. .. .. .. .. ..$ name  : chr "iris"
  .. .. .. .. .. ..$ type  : chr "Key<Frame>"
  .. .. .. .. .. ..$ URL   : chr "/3/Frames/iris"
  .. .. .. .. ..$ frame_checksum      : num 8.13e+18
  .. .. .. .. ..$ description         : NULL
  .. .. .. .. ..$ model_category      : chr "Multinomial"
  .. .. .. .. ..$ scoring_time        : num 1.51e+12
  .. .. .. .. ..$ predictions         : NULL
  .. .. .. .. ..$ MSE                 : num 0.00284
  .. .. .. .. ..$ RMSE                : num 0.0533
  .. .. .. .. ..$ nobs                : int 150
  .. .. .. .. ..$ r2                  : num 0.996
  .. .. .. .. ..$ hit_ratio_table     : Classes 'H2OTable' and 'data.frame':	3 obs. of  2 variables:
  .. .. .. .. .. ..$ k        : chr [1:3] "1" "2" "3"
  .. .. .. .. .. ..$ hit_ratio: num [1:3] 1 1 1
  .. .. .. .. .. ..- attr(*, "header")= chr "Top-3 Hit Ratios"
  .. .. .. .. .. ..- attr(*, "formats")= chr [1:2] "%s" "%f"
  .. .. .. .. .. ..- attr(*, "description")= chr ""
  .. .. .. .. ..$ cm                  :List of 2
  .. .. .. .. .. ..$ __meta:List of 3
  .. .. .. .. .. .. ..$ schema_version: int 3
  .. .. .. .. .. .. ..$ schema_name   : chr "ConfusionMatrixV3"
  .. .. .. .. .. .. ..$ schema_type   : chr "ConfusionMatrix"
  .. .. .. .. .. ..$ table : Classes 'H2OTable' and 'data.frame':	4 obs. of  5 variables:
  .. .. .. .. .. .. ..$ setosa    : num [1:4] 50 0 0 50
  .. .. .. .. .. .. ..$ versicolor: num [1:4] 0 50 0 50
  .. .. .. .. .. .. ..$ virginica : num [1:4] 0 0 50 50
  .. .. .. .. .. .. ..$ Error     : num [1:4] 0 0 0 0
  .. .. .. .. .. .. ..$ Rate      : chr [1:4] "0 / 50" "0 / 50" "0 / 50" "0 / 150"
  .. .. .. .. .. .. ..- attr(*, "header")= chr "Confusion Matrix"
  .. .. .. .. .. .. ..- attr(*, "formats")= chr [1:5] "%d" "%d" "%d" "%.4f" ...
  .. .. .. .. .. .. ..- attr(*, "description")= chr "Row labels: Actual class; Column labels: Predicted class"
  .. .. .. .. ..$ logloss             : num 0.0188
  .. .. .. .. ..$ mean_per_class_error: num 0
  .. ..$ validation_metrics                           :Formal class 'H2OMultinomialMetrics' [package "h2o"] with 5 slots
  .. .. .. ..@ algorithm: chr "gbm"
  .. .. .. ..@ on_train : logi FALSE
  .. .. .. ..@ on_valid : logi TRUE
  .. .. .. ..@ on_xval  : logi FALSE
  .. .. .. ..@ metrics  : NULL
  .. ..$ cross_validation_metrics                     :Formal class 'H2OMultinomialMetrics' [package "h2o"] with 5 slots
  .. .. .. ..@ algorithm: chr "gbm"
  .. .. .. ..@ on_train : logi FALSE
  .. .. .. ..@ on_valid : logi FALSE
  .. .. .. ..@ on_xval  : logi TRUE
  .. .. .. ..@ metrics  : NULL
  .. ..$ cross_validation_metrics_summary             : NULL
  .. ..$ status                                       : NULL
  .. ..$ start_time                                   : num 1.51e+12
  .. ..$ end_time                                     : num 1.51e+12
  .. ..$ run_time                                     : int 1030
  .. ..$ help                                         :List of 21
  .. .. ..$ validation_metrics                           : chr "Validation data model metrics"
  .. .. ..$ cross_validation_metrics_summary             : chr "Cross-validation model metrics summary"
  .. .. ..$ run_time                                     : chr "Runtime in milliseconds"
  .. .. ..$ status                                       : chr "Job status"
  .. .. ..$ domains                                      : chr "Domains for categorical columns"
  .. .. ..$ model_category                               : chr "Category of the model (e.g., Binomial)"
  .. .. ..$ __meta                                       : chr "Metadata on this schema instance, to make it self-describing."
  .. .. ..$ variable_importances                         : chr "Variable Importances"
  .. .. ..$ model_summary                                : chr "Model summary"
  .. .. ..$ scoring_history                              : chr "Scoring history"
  .. .. ..$ help                                         : chr "Help information for output fields"
  .. .. ..$ end_time                                     : chr "End time in milliseconds"
  .. .. ..$ names                                        : chr "Column names"
  .. .. ..$ cross_validation_fold_assignment_frame_id    : chr "Cross-validation fold assignment (each row is assigned to one holdout fold)"
  .. .. ..$ start_time                                   : chr "Start time in milliseconds"
  .. .. ..$ training_metrics                             : chr "Training data model metrics"
  .. .. ..$ cross_validation_models                      : chr "Cross-validation models (model ids)"
  .. .. ..$ cross_validation_metrics                     : chr "Cross-validation model metrics"
  .. .. ..$ cross_validation_predictions                 : chr "Cross-validation predictions, one per cv model (deprecated, use cross_validation_holdout_predictions_frame_id instead)"
  .. .. ..$ init_f                                       : chr "The Intercept term, the initial model function value to which trees make adjustments"
  .. .. ..$ cross_validation_holdout_predictions_frame_id: chr "Cross-validation holdout predictions (full out-of-sample predictions on training data)"
  .. ..$ variable_importances                         : Classes 'H2OTable' and 'data.frame':	4 obs. of  4 variables:
  .. .. ..$ variable           : chr [1:4] "Petal.Width" "Petal.Length" "Sepal.Width" "Sepal.Length"
  .. .. ..$ relative_importance: num [1:4] 258.86 195.48 2.89 2.33
  .. .. ..$ scaled_importance  : num [1:4] 1 0.75517 0.01117 0.00901
  .. .. ..$ percentage         : num [1:4] 0.56327 0.42536 0.00629 0.00508
  .. .. ..- attr(*, "header")= chr "Variable Importances"
  .. .. ..- attr(*, "formats")= chr [1:4] "%s" "%5f" "%5f" "%5f"
  .. .. ..- attr(*, "description")= chr ""
  .. ..$ init_f                                       : num 0

and explanation$model

> str(explanation$model)
List of 8
 $ prior  : Named num [1:3] 0.329 0.336 0.336
  ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
 $ counts : Named int [1:3] 49 50 50
  ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
 $ means  : num [1:3, 1:4] 5 5.94 6.59 3.43 2.77 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
  .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
 $ scaling: num [1:4, 1:2] 0.8281 1.5296 -2.1952 -2.8042 -0.0214 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
  .. ..$ : chr [1:2] "LD1" "LD2"
 $ lev    : chr [1:3] "setosa" "versicolor" "virginica"
 $ svd    : num [1:2] 48.18 4.56
 $ N      : int 149
 $ call   : language lda(x = iris_train, iris_lab)
 - attr(*, "class")= chr "lda"

While both objects have the same class:

> class(full_explain_gbm)
[1] "data_frame_explainer" "explainer"            "list"                
> class(explanation)
[1] "data_frame_explainer" "explainer"            "list"  

My current machine's setup is this:

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.2                         
year           2017                        
month          09                          
day            28                          
svn rev        73368                       
language       R                           
version.string R version 3.4.2 (2017-09-28)
nickname       Short Summer                
> 

Thanks,
Carlos.

Error: assert_that: assertion must return a logical value

Kindly requesting that you prioritize this.

library(MASS)
library(lime)
iris_test <- iris[1, 1:4]
iris_train <- iris[-1, 1:4]
iris_lab <- iris[[5]][-1]
model <- lda(iris_train, iris_lab)
explanation <- lime(iris_train, model)
explain(iris_test, explanation, n_labels = 1, n_features = 2)

R 3.3.2 64-bit
Lime 0.3.0

When I looked at the traceback:

[screenshot of the traceback]

This is only an example showing that the 'explain' function of lime is not working on my system. I have similar work which requires the 'explain' function, and with the same parameters it throws a similar error:

Error: assert_that: assertion must return a logical value

assert_that(is.null(labels) + is.null(n_labels) == 1, msg = "You need to choose between labels and n_labels parameters.")

Thanks in Advance

Dependency on tibble

I noticed that tibble is used just once in the source code and could easily be replaced by a data.frame call.
Do we need a dependency on it?

lime/R/lime.R, line 84 (commit 71521e8):

tibble(label = label, feature = names(coefs), feature_weight = unname(coefs), model_r2 = r2, model_intercept = intercept)
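The drop-in replacement would presumably be the same call with data.frame (a sketch; stringsAsFactors = FALSE is added so the character columns keep behaving like they do in a tibble):

data.frame(label = label, feature = names(coefs), feature_weight = unname(coefs),
           model_r2 = r2, model_intercept = intercept, stringsAsFactors = FALSE)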

Error in UseMethod("lime") : no applicable method for 'lime' applied to an object of class "H2OFrame"

I'm trying to set up the explainer, and I'm having trouble with my code. The automl_leader model is the leading h2o model, and the train_h2o data set is 15% of the overall dataset, with the first column being a binary (yes/no) variable I would like to predict. I've tried both lines of code below with no success. Any thoughts?

Option 1:

explainer <- lime::lime(
  as.data.frame(train_h2o[,-1]),
  model = automl_leader,
  bin_continuous = FALSE)

Option 2:

explainer <- lime(train_h2o, automl_leader)

error

Dear all,

I noticed that if there is an imbalanced variable in the model (in my case, a GLM from the binomial family, and a non-response variable with 300000 TRUE and 11 FALSE), then it will build an explainer, but it will not build an explanation, failing with a contrasts error. As soon as I delete the imbalanced variables, it does work.

Has anyone encountered this?

Regards, Daan

Error with h2o.xgboost()...

Yes, thanks for the pointer (#51): after updating the package, h2o.gbm() does not produce any error. But h2o.xgboost() gives this:

> # XGBOOST
> # First show that it's successful without any missing data
> full_iris_frame <- as.h2o(iris)
  |========================================================================================| 100%
> full_xgb <- h2o.xgboost(training_frame = full_iris_frame, y = "Species")
  |========================================================================================| 100%
> 
> full_explainer <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_xgb)
> 
> full_explanation <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
+                                   full_explainer, n_labels = 3 , n_features = 3)
  |========================================================================================| 100%
  |========================================================================================| 100%
Error in glmnet(x[, c(features, j), drop = FALSE], y, weights = weights,  : 
  x should be a matrix with 2 or more columns
> 

Thanks,
Carlos.

Problems with missing values (NA) with caret created GBM model

Still having some issues with missing values throwing errors. There was a fix in #45 for H2O models, but a similar problem is happening with a caret-created GBM model as well. It gives the following error:

Error in if (all(weights[-1] == 0)) { : 
  missing value where TRUE/FALSE needed

Here is a code snippet to reproduce:

# Create a data frame from the Iris data and randomly set some values to NA
myIris <- purrr::map_df(iris[,-5], function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})

myIris <- cbind(myIris, Species=iris$Species)

library(caret)
trainIndex <- createDataPartition(myIris$Species, p = 0.8, list = FALSE)
trainDataNa <- myIris[trainIndex,]
testDataNa <- myIris[-trainIndex,]

mdlNa <- train(Species ~ ., data = trainData, method = "gbm", na.action = na.pass, verbose = FALSE)

explainerNa <- lime::lime(dplyr::select(trainDataNa, -Species), mdlNa)
explanationNa <- lime::explain(dplyr::select(testDataNa, -Species), explainerNa, 
                             n_labels = 3, n_features = 3)

Interestingly, if I use trainDataNa in the explain I get a different error:

Error in glm.fit(x = x_fit, y = y[[label]], weights = weights, family = gaussian()) : 
  NA/NaN/Inf in 'y'

Checklist for CRAN submission

  • Settle on return format
  • Decide on image inclusion
  • Remove unwanted dependencies
  • Improve unit tests
  • Improve plot_text_explanation
  • Vignette
  • Logo

Add image explanation

This should be the focus (beyond bug fixes) for the next version. It will bring the R version on par with the Python one.

One of the biggest challenges of this is the superpixel segmentation - a crude implementation in R can be seen here

Another possible challenge is general memory usage - image data is much larger, so the permutations will take both time and space.

Off the top of my head, I believe the input should be image files rather than in-memory images. We can then provide a preprocessor function, as with text analysis, to allow the user to get the image data into the format they need for the model. This will solve the memory issue of permutations, as well as the fact that there seems to be no common image class with widespread use in modeling in R.

I believe the magick package should provide all the infrastructure needed for the lime side of things...
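As a sketch of the kind of preprocessor described above (the function name and target size are hypothetical, assuming a model that wants 224x224 RGB arrays scaled to [0, 1]):

img_preprocess <- function(paths) {
  lapply(paths, function(p) {
    img <- magick::image_read(p)
    img <- magick::image_scale(img, "224x224!")
    # channels-first raw bitmap -> numeric array in [0, 1]
    bmp <- magick::image_data(img, channels = "rgb")
    array(as.integer(bmp) / 255, dim = dim(bmp))
  })
}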

@pommedeterresautee you are free to comment and suggest things for this - I'll take the implementation upon me.

Problem using LIME with XGBoost

Hi thomasp85, and thank you for this great package.

There is a problem when using XGBoost, specifically in the predict_model.xgb.Booster function. If the input to that function is a data.frame, the function executes:

newdata <- xgboost::xgb.DMatrix(as.matrix(newdata))

If newdata contains columns with type factor, then as.matrix(newdata) will be a character matrix, and xgboost::xgb.DMatrix will crash.

I may be wrong, but in my opinion there is a deeper problem here. In my particular case, I one-hot encoded the factor features before training the model. So what should the input to the explain function be? If I use the test set without one-hot encoding, xgboost will crash when trying to predict. On the other hand, if I use the one-hot encoded test set, will the permutations done by the package make sense?

In general, the LIME paper separates the features used to train the model from the human-understandable "variables" that explain the model locally. How do we achieve that separation using this package?


EDIT: I am looking at the Python implementation. They let you pass the prediction function as a parameter. I think that would be a nice (flexible) solution to the problem. That way, we can define the required transformations inside a custom prediction function.


EDIT2: Ok, I did not read this:

Out of the box, lime supports models created using the caret and mlr frameworks. Support for other models is easy to achieve by adding a predict_model and model_type method for the given model.

This solves my problem. I'm sorry :)

Error in bin_cuts[[i]] : subscript out of bounds

Hello Thomas,

Great work you are doing here. I had an error when creating the explain object:

Error in bin_cuts[[i]] : subscript out of bounds

In my case, I have just one numerical feature to explain a multiclass label using an h2o gbm object.

Here is my (non-reproducible) code:

explainer = lime::lime(trainDF, model = model_gbm, bin_continuous = TRUE)
explanation = explain(test_lime, explainer = explainer, n_features = 1, n_labels = 10)

I could be doing something wrong and would love to find out what it could be. Many thanks.

Handling NAs

Hi there,

Lime currently does not seem to support NAs in data. Here's an example:

library(caret)
library(lime)

set.seed(123)

x = as.data.frame(matrix(rnorm(100*10), ncol=10))
x$V1 = ifelse(x$V2 > 0, NA, x$V1) # introduce random NAs in V1
y = round(runif(100))
y = as.factor(y)
levels(y) = c("no", "yes")
data = cbind(x, target = y)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           allowParallel = TRUE,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

XGBModel = train(target ~ ., 
                 data = data,
                 trControl = fitControl, 
                 method = "xgbTree",
                 search = "random", 
                 metric = "ROC",
                 na.action = na.pass) # force XGB to take NAs into account

prediction = predict.train(XGBModel, data, na.action = na.pass, type = "prob") # works fine

explain = lime(data, XGBModel, bin_continuous = T, n_permutations = 1000) # error

# Error in quantile.default(x[[i]], seq(0, 1, length.out = n_bins + 1)) : 
#   missing values and NaN's not allowed if 'na.rm' is FALSE

explain = lime(na.omit(data), XGBModel, bin_continuous = T, n_permutations = 1000) # works fine

It would be great if NAs could be handled like XGB does.
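Until then, one workaround I can think of (a sketch; median imputation is just one arbitrary choice) is to impute before building the explainer, so lime's binning never sees NAs:

impute_median <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

explain = lime(impute_median(data), XGBModel, bin_continuous = T, n_permutations = 1000)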

Thanks guys for your work!

Explain README sentence

I am not sure I understand the sentence: "The global model explanation using submodular picks is not supported in either package."
Can we make it clearer? Add context?
(I imagine it's related to generalizing the methodology, but I am not sure.)

lime function with date columns

The explain function errors whenever I have a date column in my dataset. This is a minor issue but I thought I should flag it anyways.

Error during explain with H2O GBM/XGB models - "NA/NaN/Inf in 'x'"

I'm trying to use the package with an H2O xgboost model (I've also tried it with GBM and get the same thing). The error is:

Error in glm.fit(x = x_fit, y = y, weights = weights, family = gaussian()) : 
  NA/NaN/Inf in 'x'

Here is the code I'm running:

explainer <- lime::lime(as.data.frame(wellnessTrain), mdl)

explanation <- lime::explain(as.data.frame(wellnessTest),
                       explainer, n_labels = 1, n_features = 2)

This is caused by having some NA values in the data frame, but I thought that had already been fixed in issue #8. I verified this by removing, as a test, the three columns that have NA values. These NA values are meaningful, and H2O's GBM and XGBoost handle them by creating a category for the missing value after binning the non-missing feature values. Is there an easy fix here?

scope of lime package in R

Hi, thank you for the LIME package, it is very interesting to me.
I have some questions: it seems to me that the current version of the LIME package in R doesn't support text and image data, is that right? And apart from random forests, what other kinds of classifiers does it support?

Integrate `h2o` with `lime`

@thomasp85
Thanks for your work getting lime setup in R. Already had several use cases where I implemented h2o and lime together... the two work really well. If you're OK with it, I'll work on an integration to get most of the classes (at least the major h2o models) into lime.
-Matt
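For illustration, a sketch of what such an integration could look like (written here as an example only; the methods lime eventually ships may differ, and H2OMultinomialModel is just one of the h2o model classes):

# Tell lime that h2o multinomial models are classifiers
model_type.H2OMultinomialModel <- function(x, ...) "classification"

# Wrap h2o's predict so lime receives a plain data.frame
predict_model.H2OMultinomialModel <- function(x, newdata, type, ...) {
  pred <- as.data.frame(h2o::h2o.predict(x, h2o::as.h2o(newdata)))
  switch(type,
    raw = data.frame(Response = pred$predict, stringsAsFactors = FALSE),
    prob = pred[, -1, drop = FALSE]  # drop the class column, keep probabilities
  )
}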

using LIME with caret and method="nnet"

Lime R-pkg is great!
Thanks to your suggestions,
the code now works fine.

2 quick Questions

Q1:
Is there a way to make plot_features(explanation[1:8, ]) display the plot cases (15, 18, 25, 7) in the same order as the cases in the test.set (7, 15, 18, 25, ...)?

Reason: it's easier to present results to the end user if the case numbers are in the same order as in the test.set file.

Q2:
Would it be possible to include, inside or next to each colored plot bar, the actual value of the column? (i.e. next to the condition Petal.Length <= 1.6 you would display the actual value for Case 7: 1.4)

Reason: avoids the user having to consult the test.set file for each case in the plot. The tested value is right there, in the plot... :-)

Complete example code below:

library(caret)
library(lime)

inTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE) # 75% for the train.set
train.set <- iris[inTrain,]
test.set <- iris[-inTrain,]

model <- train(Species ~ ., train.set, method = 'nnet', trace = FALSE, preProc = c("center", "scale"))
prediction <- predict(model, test.set[-5])
table(prediction, test.set$Species)
prediction <- predict(model, test.set[-5], type = "prob")

# now LIME!
# Create an explainer object
explainer <- lime(train.set, model)

# Explain new observations:
explanation <- explain(test.set[,-5], explainer, n_labels = 1, n_features = 2)
plot_features(explanation[1:8, ])

head(test.set)
Sepal.Length Sepal.Width Petal.Length
7 4.6 3.4 1.4
15 5.8 4.0 1.2
18 5.1 3.5 1.4
25 4.8 3.4 1.9
28 5.2 3.5 1.5
29 5.2 3.4 1.4

Thanks Thomas!!

Integer variables converted to double during LIME

I'll try to produce a minimal reproducible example. I tried a few caret methods, and it did not go wrong for any of them. However, it went wrong when using the ctree classifier from the party library, since it seems strict about not accepting that the test set has doubles instead of integers.

library(magrittr) # for the %>% pipe used below

makeGeneric <- function(ctreemodel){
  return(structure(list(ctreemodel), class = "myclass"))
}

predict.myclass <- function(model, newdata, type = "prob", ...){
  stopifnot(type == "prob")
  predict(model[[1]], newdata, type = "prob") %>% data.frame %>% t %>%
    data.frame("false" = 1 - ., "true" = .)
}

model_type.myclass <- function(x, ...) "classification"

FT <- read.csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv")
FT <- na.omit(FT)
FT$Age <- as.integer(FT$Age)

ctreemodel <- party::ctree(Survived ~ PClass + Sex + Age, FT[-1,])
genericModel <- makeGeneric(ctreemodel) 

explainer <- lime::lime(FT[-1,], genericModel)
explanation <- lime::explain(FT[1,], explainer, n_labels = 1, n_features = 2) 

As you can see, I make a class that is used to predict using the ctreemodel. The data.frame("false" = 1 - ., "true" = .) only works for binary classification, but that is not the issue here, since it is easy to extend to multiclass classification. party proceeds to throw the following error:

 Error in checkData(oldData, RET) : 
  Classes of new data do not match original data 

Note that this error does not occur when I comment out the FT$Age <- as.integer(FT$Age) line.
I used browser() during the predict.myclass function, and it turned out that the newdata passed by lime had its integer variable replaced by a double. Then, it goes wrong during the predict(model[[1]], newdata, type = "prob") code, since this function expects Age to be an integer, but lime converted it to a double somehow.
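One hedged workaround (a sketch at the caller's side, not a fix in lime itself) is to coerce the affected column back to integer inside the prediction method, before party's class check runs:

predict.myclass <- function(model, newdata, type = "prob", ...){
  stopifnot(type == "prob")
  # lime's permutations arrive as doubles; coerce Age back to integer so
  # party::ctree's checkData() accepts the new data (Age is specific to
  # this example -- a general version would coerce every integer column
  # of the training data)
  newdata$Age <- as.integer(round(newdata$Age))
  predict(model[[1]], newdata, type = "prob") %>% data.frame %>% t %>% 
    data.frame("false" = 1 - ., "true" = .)
}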

Error: All permutations have no similarity to the original observation. Try setting bin_continuous to TRUE and/or increase kernel_size

Hi Thomas, there is an error at the final stage of this analysis. When running the explain() function on an h2o model, I get the following error:

Error: All permutations have no similarity to the original observation. Try setting bin_continuous to TRUE and/or increase kernel_size

I have tried both of the suggestions in the error. If I set bin_continuous to TRUE, lime() does not work, and other kernel sizes do not work either. Any thoughts on how to solve this so that I can get the results with the plot_features() function?
Thanks in advance!

library(dplyr)
library(readxl)
library(httr)
library(h2o)
library(lime)

# Download the IBM HR employee-attrition data set
GET("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx",
    write_disk(tf <- tempfile(fileext = ".xls")))
hr_data_raw <- read_xlsx(tf)

hr_data <- hr_data_raw %>%
  mutate_if(is.character, as.factor) %>%
  select(Attrition, everything())

h2o.init()
h2o.no_progress()

hr_data_h2o <- as.h2o(hr_data)
split_h2o <- h2o.splitFrame(hr_data_h2o, c(0.7, 0.15), seed = 1234)
train_h2o <- h2o.assign(split_h2o[[1]], "train") # 70%
valid_h2o <- h2o.assign(split_h2o[[2]], "valid") # 15%
test_h2o  <- h2o.assign(split_h2o[[3]], "test")  # 15%

y <- "Attrition"
x <- setdiff(names(train_h2o), y)
automl_models_h2o <- h2o.automl(
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs = 30)

automl_leader <- automl_models_h2o@leader

explainer <- lime::lime(
  as.data.frame(train_h2o[, -1]),
  model = automl_leader,
  bin_continuous = FALSE)

explanation <- lime::explain(
  as.data.frame(test_h2o[1:10, -1]),
  explainer = explainer,
  n_labels = 1,
  n_features = 4)
#> Error: All permutations have no similarity to the original observation.
#> Try setting bin_continuous to TRUE and/or increase kernel_size

# Cannot continue
plot_features(explanation)
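One thing worth trying, sketched under the assumption that your lime version exposes a kernel_width argument in explain() (the error message says kernel_size, but the documented argument is kernel_width):

# Hedged sketch: a wider kernel gives the permutations non-zero similarity
# weights; 3 is an arbitrary illustrative value, not a recommended default
explanation <- lime::explain(
  as.data.frame(test_h2o[1:10, -1]),
  explainer = explainer,
  n_labels = 1,
  n_features = 4,
  kernel_width = 3)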

Error when defining custom model_type which returns regression predictions as a data.frame

Hi,

Thank you for your R implementation of lime! I'm keen to use lime with a custom model I have built. I have defined my predict_model and model_type functions as recommended in the help pages; however, when using them in conjunction with the lime and explain functions I receive the error:

Error in y[1, ] : incorrect number of dimensions

The code I'm running is simply:

explanation <- lime(data.frame(train), model)
out <- explain(test, explanation, n_labels = 1, n_features = 2, n_permutations = 1000, feature_select = "highest_weights")

My problem is a regression problem, and the issue seems to be triggered by n_labels, but I can't run anything unless I set either that parameter or the labels parameter. The issue arises within explain.data.frame, during the construction of the res object.

When the case_res object is created, it appears to be done via a call to the predict_model function I have defined. By default, my predict function returns a data frame with a single column of regression predictions. In the call to model_permutations, case_res[i, ] is passed as the y argument, and at this stage it is converted to a vector, whereby the code fails, since model_permutations seems to expect a data frame at that point.

Do you have any recommendations regarding what the output of my predict function should specifically look like, or whether there is anything glaringly obvious about how I'm attempting to implement the custom functions? My predict_model and model_type functions are included below. I'm afraid I can't provide any more specific details of the attributes of the model itself.

Kind regards,
Khalim

predict_model.model <- function(x, newdata, type = "raw", ...) {
  res <- data.frame(predict(x, newdata = newdata, ...))

  switch(
    type,
    raw = data.frame(Response = res, stringsAsFactors = FALSE),
    prob = as.data.frame(res$posterior, check.names = FALSE)
  )
}

model_type.model <- function(x, ...) {
  'regression'
}

Example for predicting success/failure and multiclass ordinal problems

Hi Thomas, would you entertain a documentation PR on this? I often use LIME for 2-class success/failure problems and sometimes multiclass ordinal ones. Yesterday I had some UX questions from a new colleague. We were looking at plot_features() output for several cases of a binary classification problem. He asked: "How do you read the combination of supports/contradicts, label, and probability?" His question wasn't trivial to answer from the plot text alone, because out of habit I had used n_labels = 1 instead of fixing the single specific label "success" in explain(). It was my fault, of course, but I bet it's a common enough use case that a paragraph following the iris example would be helpful.

What happened in more detail: when framed in success/failure terms, the label changing between plots forces a direction switch in the interpretation (which is hard to grok, especially when looking at more than 3-4 features). The plot reading becomes: "The current label is failure, so high probability is bad, and the green-bar feature that supports high probability of failure is (in the context of this case for my problem) a negative, so a green bar in this plot is bad." That reading reverses for the success label. The simple solution is to specify explain(labels = "success", ...) or, for multiclass ordinal, labels = c("high", "medium", "low"), etc., as in the sketch below.
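Concretely, something like this sketch (test_cases and explainer are placeholders for your own objects):

# Hedged sketch: pin the explanation to one label so every facet of
# plot_features() reads in the same direction
explanation <- explain(test_cases, explainer,
                       labels = "success",   # instead of n_labels = 1
                       n_features = 4)
plot_features(explanation)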

I'm really enjoying lime and the others, thank you! I'm using more than three of your packages on an average day lately.

Slow with thousands of features

Thanks for porting lime to R!

I'm trying to explain a large xgboost model with thousands of predictors from a TF-IDF matrix. Creating an "explainer" is fast, but explaining single observations using lime::explain takes hours, making it unfeasible for production. Is this a side effect of the implementation?

(Unfortunately, I can't provide a reprex).
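For anyone hitting the same wall, a hedged sketch of the two knobs that usually dominate explain()'s runtime (new_obs and explainer are placeholders, and the values are illustrative, not recommendations):

# Fewer permutations cut runtime roughly linearly (the default
# n_permutations is 5000), and a cheap selector avoids running forward
# selection over thousands of columns
explanation <- explain(new_obs, explainer,
                       n_features = 10,
                       n_permutations = 500,
                       feature_select = "highest_weights")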

Thanks,
J

Various questions

I have used the R package lime with great enthusiasm. This has generated some questions:

  1. For continuous variables with bin_continuous = FALSE, it seems like what is shown in the plot generated by plot_features() are the coefficients of the linear regression model, and not the coefficients multiplied by the corresponding feature values. Is this correct? If yes, wouldn't it be more intuitive to plot the coefficients multiplied by the corresponding feature values, since this is the real contribution of each feature? (Or to plot the standardized coefficients from the regression, if these are available.)

  2. For continuous variables with bin_continuous = FALSE, it seems like the distance (used when computing the weights) is computed using scaled variables, while the ridge regression is performed on unscaled values. Is this correct? If yes, I assume this is because standardization is performed in glmnet() by default?

  3. For continuous variables with bin_continuous = TRUE, it seems like the data set used for the regression (and for the computation of weights) consists of zeroes and ones only, where the value in row i for variable j is 1 if the bin for this variable in this row equals the bin for the same variable in the observation vector, and 0 otherwise (see the sketch after this list). Is this correct? If yes, doesn't one then discard a lot of information, since there obviously is a larger distance between, e.g., bins 1 and 5 than there is between bins 1 and 2?

  4. In the ridge regression you seem to have hard-coded the value of lambda to 0.001. Is there a particular reason for choosing this value?

  5. From the R code it seems like you generate a new data set for each observation you want to explain. As you write yourself in https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html, the permuted data set is independent of the observation to be explained. Hence, wouldn't it then be logical to use the same permuted data set for all cases to be explained?

  6. From the R code it seems like it is possible to use time series data, and that what lime does with such data is to generate the permuted data set by sampling from the training data (i.e. no noise). Is that correct? Then, when fitting the linear model, the different observations in the time series seem to be regarded as independent data. Is that also correct?
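To make the encoding in question 3 concrete, here is a minimal illustrative sketch (the bin numbers are made up):

# Bins of the observation being explained
obs_bins <- c(Sepal.Length = 2, Petal.Length = 4)

# Bins of three permuted rows
perm_bins <- data.frame(Sepal.Length = c(2, 1, 2),
                        Petal.Length = c(4, 4, 1))

# 1 when a permuted value falls in the same bin as the observation, else 0
encoded <- as.data.frame(t(apply(perm_bins, 1, function(r) as.integer(r == obs_bins))))
names(encoded) <- names(obs_bins)
encoded
#>   Sepal.Length Petal.Length
#> 1            1            1
#> 2            0            1
#> 3            1            0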

explain() gives a maximum of only 9 features for explanation

Hello,

I have a regression data set with 174 features and 10 observations (samples).

explanation <- explain(dataset_test, explainer, n_features = 20, feature_select = "lasso_path")

I am trying to get the model to explain at least 20 features for each observation, so the resulting data set should have 20 rows per case, together with the respective feature_value for each case.

Since I have 10 observations (the data set is a regression one), the resulting explanation data set should have 20 * 10 rows. Instead, the number of features used to explain each case seems to be capped at 9, even though I have set n_features = 20 (see the sketch below).
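A diagnostic sketch, under the assumption that the cap comes from the lasso path entering fewer than n_features variables before it terminates, rather than from explain() itself:

# Hedged sketch: try a selection strategy that should return exactly
# n_features features (when that many exist) to see whether the 9-feature
# cap is specific to "lasso_path"
explanation <- explain(dataset_test, explainer,
                       n_features = 20,
                       feature_select = "highest_weights")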

Is my understanding correct, or is this a bug?

Please let me know.

Regards
Sourabh
