edarf's People

Contributors

arfon, christophergandrud, flinder, zmjones


edarf's Issues

Cores

mclapply with detectCores does not work on Windows because it detects several cores but can't use them. I don't completely understand the parallel code you wrote yet, so I don't want to mess with it, but we could do

if (.Platform$OS.type == "windows") CORES <- 1

and that should work.
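A minimal sketch of that guard (CORES is a placeholder for whatever variable the package actually uses; on Windows, mclapply can only run with a single core):

library(parallel)

## fall back to a single core on Windows, where mclapply cannot fork
CORES <- if (.Platform$OS.type == "windows") 1L else detectCores()

## with mc.cores = 1, mclapply simply runs sequentially
res <- mclapply(1:4, function(i) i^2, mc.cores = CORES)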

potential bug in pd inner

I'm not 100% sure, but I think that in

pred <- predict(fit, newdata = df, outcome = "test")$predicted.oob

the outcome argument should be "train". The relevant part of the documentation is a bit cryptic:

"If outcome="test", the predictor is calculated by using y-outcomes from the test data (outcome information must be present). In this case, the terminal nodes from the grow-forest are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but where the predicted value is based on the test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples for illustration"

But I don't think we want to recalculate the terminal nodes of the original forest; we just want to drop the test data down the trees and get a prediction. I noticed this because I got different and counterintuitive results with outcome = "test", but matching and intuitive results with outcome = "train".

I changed it in the package code to do some tests.
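A minimal comparison sketch, assuming the fit comes from randomForestSRC (the $predicted and $predicted.oob fields follow the documented return value of predict.rfsrc):

library(randomForestSRC)

fit <- rfsrc(Fertility ~ ., data = swiss)

## outcome = "test": terminal nodes are recalculated from the y-values in newdata
p_test <- predict(fit, newdata = swiss, outcome = "test")$predicted.oob

## outcome = "train": newdata is simply dropped down the existing forest
p_train <- predict(fit, newdata = swiss, outcome = "train")$predicted

summary(p_test - p_train)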

drop binary factor

When doing binary classification with type = "prob" in partial_dependence, we should drop one factor level (how we decide which one I don't know; maybe an option?), set the prob attribute to FALSE, and give the column for the remaining level the name of the target variable. It then gets handled like regression in plot_pd; a sketch follows.
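A hypothetical sketch of that post-processing; pd stands for the partial_dependence output, and the target name, the retained level, and the prob attribute below are all placeholders:

target <- "diagnosis"                        # placeholder target name
keep <- "malignant"                          # which level to keep would need an option
drop <- setdiff(c("benign", "malignant"), keep)

pd[[drop]] <- NULL                           # drop one of the two probability columns
names(pd)[names(pd) == keep] <- target       # rename the remaining column to the target
attr(pd, "prob") <- FALSE                    # so plot_pd falls back to the regression path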

randomForest.default

randomForest(X, y), which calls randomForest.default, is broken. Maybe we should have a separate method for this? It needs to be added to the tests as well.

party survival

Implement support for party survival forests in partial_dependence.RandomForest.

class matching

Some classes are still output incorrectly by partial_dependence. With the iris data, numeric variables are sometimes output as integers. I haven't figured out why this is the case.

dependencies failed to install

Just installed edarf on a new machine and the automatic installation of dependencies failed. I could, however, install the dependencies manually (specifically, in this order: 'zoo', 'sandwich', 'strucchange'; dependencies of party?).
Not a big deal, but it's weird and I'm not sure whether it is system specific.

patch or remove party var_est

To use the variance estimator with cforest we need the patched version of cforest. Either the patch to party needs to be accepted, we need to find a workaround, or this feature should be removed.

readme

You have extract_proximity twice in the readme. Moreover, the readme mentions randomforest_distance, but the function is called randomforest_dist.
You do not mention randomforest_dist in the paper, but maybe that is not a problem. I really do not know what happens inside randomforest_dist; you could add a reference or a better explanation to its help page.
Please also add references to the help pages of the other functions.

Sorry for being so critical, but I think this could really improve the documentation.

list output

partial_dependence returns a data.frame of list columns when the number of variables is greater than 1.
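A possible workaround sketch, assuming pd is the offending partial_dependence output and that each list element has length one; it flattens the list columns back into atomic vectors:

pd[] <- lapply(pd, function(col) if (is.list(col)) unlist(col) else col)
str(pd)   # all columns should now be atomic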

[feature request] interpreting forests by extracting simple rules

Using complex machine learning models for exploratory data analysis is a very interesting approach.

There is some work that tries to make tree ensembles interpretable by extracting simple and relevant rules from them.

I think this work would also be useful for exploratory data analysis.

Are you interested in integrating these methods in edarf package?

add randomForest function

it will return partial dependence on one predictor (an option in the call to randomForest), but doesn't do multivariate partial dependence.

Example does not work

Something is wrong with "gridsize":

> library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
> library(edarf)
> 
> data(iris)
> data(swiss)
> 
> ## classification
> fit <- randomForest(Species ~ ., iris)
> pd <- partial_dependence(fit, iris, c("Petal.Width", "Sepal.Length"))
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found
> pd_int <- partial_dependence(fit, iris, c("Petal.Width", "Sepal.Length"), interaction = TRUE)
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found
> 
> ## Regression
> fit <- randomForest(Fertility ~ ., swiss)
> pd <- partial_dependence(fit, swiss, "Education")
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found
> pd_int <- partial_dependence(fit, swiss, c("Education", "Catholic"), interaction = TRUE)
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found

Support for random forest fit by ranger

First of all, your package is awesome! I merely have a friendly feature request.

The ranger package, which you are probably familiar with, has huge speed improvements over randomForest and randomForestSRC. I often find the latter two too slow for large (or even medium-sized) data sets.

Do you have any plans to implement support for random forest fit by ranger, or do you know if anybody else is working on that?

'uniform = FALSE' does not work

Using the example from the help:

library(randomForest); library(edarf)
data(iris); data(swiss)

## classification
fit <- randomForest(Species ~ ., iris)

## WORKING ##
pd <- partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, -ncol(iris)])

## NOT WORKING ##
pd <- partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, -ncol(iris)], uniform = FALSE)
## Error in partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, :
## Assertion on 'y' failed: Must be of type 'data.frame', not 'double'.

Cheers, thanks for having written this useful package!

plot functions

We need to give all the output from partial_dependence a simple S3 class, and then there should be a plot generic for the bivariate pd output. The output from partial_dependence should still behave like a normal data.frame, though; a sketch of the idea follows.
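A minimal sketch of that idea (the class name "pd" and the stub plot method are placeholders, not edarf code):

library(randomForest); library(edarf)

fit <- randomForest(Fertility ~ ., swiss)
pd <- partial_dependence(fit, swiss, "Education")

class(pd) <- c("pd", "data.frame")   # still prints and subsets like a data.frame

plot.pd <- function(x, ...) {
  ## inspect x (number of predictors, class probabilities, ...) and
  ## build the appropriate plot here
  invisible(x)
}

plot(pd)   # dispatches to plot.pd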

license missing

I have to check this:

  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

You have a license file, but it does not contain the text of an OSI-approved software license.

refactor/expand plot functions

plot_pd especially needs this; it is very messy. We can probably make it as simple as plotPartialPrediction in mlr. plot_imp probably needs a refactor as well.

Additionally, contour/tile plots for regression and survival tasks should be implemented (again, see the code in plotPartialPrediction).

plot_twoway_partial: argument "grid" is missing

Hi Zack!

Congratulations on your package! It is indeed very convenient.

I was testing the code today and found a bug. I ran the examples you provided here and got this message:

plot_twoway_partial(pd_reg$Education, pd_reg$pred, smooth = TRUE)
Error in noNA(grid) : argument "grid" is missing, with no default

I'm pretty sure it's an easy fix, but I'm not sure how to do it (and maybe you want to update the example too). I'm using Revolution R Open 8.0 and R 3.1.1 on Ubuntu 14.04, if that matters.

Thanks!

OOB parameter for variable importance when using party

Just a quick note that when the variable importance is computed via the party package (at least the current version, 1.0-25), the OOB parameter does not work, i.e. you cannot get the code to use out-of-bag samples.

This is because the party "predict" code that you (eventually) call (line 138 of https://github.com/cran/party/blob/R-3.0.3/R/RandomForest.R) forces OOB to FALSE when new data is passed, which it is, since you permute the data. Modifying RandomForest.R in the party code to prevent this forced behaviour seems to fix it, but I have not looked into why the behaviour is there in the first place, so I cannot be sure the code subsequently does the right thing. In any case, since this is not your code, perhaps it is just worth noting in the documentation that the OOB parameter does not work in this case?

multivariate tests

Multivariate regression/classification using party is not currently tested. It should be easy to create a small simulated dataset and test it (especially since multivariate forests are only possible with party); a sketch follows.
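A sketch of such a test, assuming cforest accepts a multivariate response on the left-hand side of the formula (as ctree does) and using the partial_dependence argument order from the examples elsewhere in these issues:

library(party)

set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y1 <- df$x1 + rnorm(100, sd = 0.1)
df$y2 <- df$x2 + rnorm(100, sd = 0.1)

fit <- cforest(y1 + y2 ~ x1 + x2, data = df,
               controls = cforest_unbiased(ntree = 50L))
pd <- partial_dependence(fit, df, "x1")
str(pd)   # expect one column per response plus the grid for x1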

column classes in output

As can be seen in the tests, some column classes in the output are wrong. I think fix_columns could fix this, but I haven't done it yet. To reproduce, run the tests:

library(devtools)
test() # from root of package

cran check

I have been working on sorting all of this out over the past day or two; see 5afbba8.

One error I haven't solved yet is:

In addition: Warning messages:
1: replacing previous import by 'party::proximity' when loading 'non-package environment' 
2: replacing previous import by 'party::varimp' when loading 'non-package environment' 
3: replacing previous import by 'party::varimpAUC' when loading 'non-package environment' 

The googling I've done suggests this comes from importing something twice, but as far as I can tell I only import each thing once. I somehow introduced this problem in the past day or so.

All of the tests are now passing. New tests were added too.

For the S3 methods to be consistent, they all have to have the same arguments as the generic. There are some cases where this is nonsensical, e.g. one of the methods can take additional arguments but the other two can't, so the generic has ... which is unused in the two methods. I'm not sure how to document that; a small illustration follows.
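A minimal illustration (not edarf code) of the consistency rule: every method must accept the generic's arguments, including ..., even when a particular method ignores them:

describe <- function(x, ...) UseMethod("describe")

## ... is present but unused here; dropping it triggers an R CMD check warning
describe.numeric <- function(x, ...) c(mean = mean(x), sd = sd(x))

## this method actually takes an extra argument, which is why the generic needs ...
describe.data.frame <- function(x, digits = 2, ...) {
  round(vapply(Filter(is.numeric, x), mean, numeric(1)), digits)
}

describe(rnorm(10))
describe(swiss, digits = 1)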

error with install from devtools

Hi there. I read your paper; interesting stuff. I would like to play around with your package, but I'm getting an install error on my Mac using devtools. Have you encountered this?

devtools::install_github('zmjones/edarf')

Downloading github repo zmjones/edarf@master
Installing edarf
Installing dependencies for edarf:
RcppArmadillo

The downloaded binary packages are in
/var/folders/29/dj4wm1714q7flrdgvg7src8h0000gn/T//RtmpxEpsRy/downloaded_packages
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL
'/private/var/folders/29/dj4wm1714q7flrdgvg7src8h0000gn/T/RtmpxEpsRy/devtoolsa4220b25ed7/zmjones-edarf-e39a4c5'
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests

  • installing source package ‘edarf’ ...
    ** libs
    clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/RcppArmadillo/include" -fPIC -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
    clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/RcppArmadillo/include" -fPIC -Wall -mtune=core2 -g -O2 -c edarf.cpp -o edarf.o
    clang++ -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o edarf.so RcppExports.o edarf.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2 -lgfortran -lquadmath -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
    ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
    ld: library not found for -lgfortran
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    make: *** [edarf.so] Error 1
    ERROR: compilation failed for package ‘edarf’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/edarf’
    Error: Command failed (1)

survival

Is simply outputting the CHF (cumulative hazard function) sufficient?

I don't think that it is; we should try to use every possible output if possible. We should also do that with party.

multivariate plotting

I think you can handle multivariate pd plots in the same way that you handle class probabilities or interactions. The only difficulty I can think of is when an interaction plot for multivariate outcomes is requested.

range ivar_points

If a variable has a lot of unique values and empirical = TRUE, it can happen that partial dependence is only calculated over a small portion of the variable's range.

ci plotting

We should add the option to use error bars instead of a ribbon. The ribbon is somewhat misleading since it is a point-wise confidence interval; both options are sketched below.
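A sketch of the two options in ggplot2; pd stands for a partial_dependence result, and the column names (Education, prediction, lower, upper) are placeholders for whatever it actually returns:

library(ggplot2)

## current style: point-wise ribbon
ggplot(pd, aes(Education, prediction)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2)

## proposed style: explicit error bars at each grid point
ggplot(pd, aes(Education, prediction)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1)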

Possible bug: Edarf functions give error when using a ranger object with categorical variables

Dear Mr. Jones,

Thank you for developing edarf; it seems very useful, but I am getting errors from both variable_importance and partial_dependence when using a model with categorical variables. The problem may have to do with the fact that several levels of these categorical variables are dropped from the model because they have no observations in the training data. For variable_importance the error is:

variable_importance(ranger_model, vars=c("paper", "genus"), data=training_data)
Error in variable_importance(ranger_model, vars = c("paper", "genus"), :
Assertion on 'y' failed: Must have length 1, but has length 3.

For partial_dependence, the error is:
Warning messages:
1: In names(mp)[ncol(mp)] = target :
number of items to replace is not a multiple of replacement length
2: In names(mp)[ncol(mp)] = target :
number of items to replace is not a multiple of replacement length

The graphical functions do not work in the presence of these errors. Is there anything I can do to prevent them?

Sincerely,
Caspar

plot_pd: no geom_line-lines if predictor ordered factor

First of all: I ❤️ your package! :-)

When using plot_pd, no lines are drawn if the predictor is an ordered factor. See the interaction plot in this example using the BreastCancer data set from mlbench (section 5.2).

Or is this a feature and not a bug? That is, is it an assumption in the ggplot2 philosophy that points from ordered factors shouldn't be connected?

This could possibly be addressed using the group argument in aes? See here.
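A minimal illustration (plain ggplot2, not edarf code) of why geom_line draws nothing when x is a factor unless a group aesthetic is supplied:

library(ggplot2)

d <- data.frame(
  x = factor(c("low", "mid", "high"),
             levels = c("low", "mid", "high"), ordered = TRUE),
  y = c(1, 3, 2)
)

ggplot(d, aes(x, y)) + geom_point() + geom_line()              # no line: each level is its own group
ggplot(d, aes(x, y, group = 1)) + geom_point() + geom_line()   # one group, so the points are connected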

regression plotting

library(randomForest); library(edarf)
data(swiss)

fit <- randomForest(Fertility ~ ., swiss)
pd <- partial_dependence(fit, swiss, "Education")
plot(pd)

gives bad output

class probabilities

I am pulling class probabilities from predict.randomForest. This is not consistent with the behavior of classification via party, and maybe not with randomForestSRC either. The package should either produce only one value (the argmax over the class probabilities) or have an option to choose; if the latter, the plotting function needs a corresponding option. The current bar plot for class probabilities is pretty lame. The argmax collapse is sketched below.
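A sketch of the argmax collapse, assuming pd holds one probability column per class (the iris class names here are placeholders for whatever columns the output actually has):

prob_cols <- c("setosa", "versicolor", "virginica")

## replace the per-class probability columns with the single most probable class
pd$Species <- prob_cols[max.col(as.matrix(pd[, prob_cols]))]
pd[prob_cols] <- NULL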

class probability bug

error when running the iris example with class probability output

library(randomForest)
library(edarf)
data(iris)
fit <- randomForest(Species ~ ., iris)
pd <- partial_dependence(fit, iris, "Petal.Width", type = "prob")
plot(pd)
Error in as.character(x$label) : 
  cannot coerce type 'closure' to vector of type 'character'

discretization

We should have the ability to discretize a variable that we want to use for faceting in the partial dependence plots. We could do this in the plot method or in the pd methods; the basic idea is sketched below.
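The basic idea, sketched with base R's cut outside of edarf (the variable and the number of bins are arbitrary):

data(iris)

## discretize a continuous variable into three bins for faceting
iris$Petal.Length.bin <- cut(iris$Petal.Length, breaks = 3)
table(iris$Petal.Length.bin)

## the binned variable could then drive faceting (e.g. facet_wrap) in the plot method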

foreign function call

In var_est.RandomForest I call R_predictRF_weights, which is a C function in party. R CMD check generates a note for this reason, which means we can't submit to CRAN even if randomForestCI is there and randomForest gets updated as well. If we want to fix this, we can either write some C++ that computes this given the ensemble (i.e. reimplement R_predictRF_weights) but for one tree at a time (which might give the best performance), or modify the party code and hope Hothorn accepts the change.

Examples in the help files

Some things I noticed while looking at the help files:

  • extract_proximity has no example
  • extract_prox does not exist, but it is in your readme file?
  • Why do you have ## Not run in the plot_prox examples?
