edarf's People

Contributors

arfon, christophergandrud, flinder, zmjones


edarf's Issues

Cores

mclapply with detectCores does not work on Windows because it detects several cores but can't use them. I don't completely understand the parallel code you wrote yet, so I don't want to mess with it, but we could do

if (.Platform$OS.type == "windows") CORES <- 1

and that should work.
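A minimal sketch of that guard (CORES is a placeholder for whatever variable the package actually uses; on Windows, mclapply can only run with a single core):

library(parallel)

## fall back to a single core on Windows, where mclapply cannot fork
CORES <- if (.Platform$OS.type == "windows") 1L else detectCores()

## with mc.cores = 1, mclapply simply runs sequentially
res <- mclapply(1:4, function(i) i^2, mc.cores = CORES)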

potential bug in pd inner

I'm not 100% sure, but I think that in

pred <- predict(fit, newdata = df, outcome = "test")$predicted.oob

the outcome argument should be "train". The relevant part of the documentation is a bit cryptic:

"If outcome="test", the predictor is calculated by using y-outcomes from the test data (outcome information must be present). In this case, the terminal nodes from the grow-forest are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but where the predicted value is based on the test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples for illustration"

But I don't think we want to recalculate the terminal nodes of the original forest; we just want to drop the test data down the trees and get a prediction. I noticed this because I got different and counterintuitive results with outcome = "test", but matching and intuitive results with outcome = "train".

I changed it in the package code to do some tests.
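A minimal comparison sketch, assuming the fit comes from randomForestSRC (the $predicted and $predicted.oob fields follow the documented return value of predict.rfsrc):

library(randomForestSRC)

fit <- rfsrc(Fertility ~ ., data = swiss)

## outcome = "test": terminal nodes are recalculated from the y-values in newdata
p_test <- predict(fit, newdata = swiss, outcome = "test")$predicted.oob

## outcome = "train": newdata is simply dropped down the existing forest
p_train <- predict(fit, newdata = swiss, outcome = "train")$predicted

summary(p_test - p_train)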

drop binary factor

When doing binary classification with type = "prob" in partial_dependence, we should drop one factor level (how we decide which one I don't know; maybe an option?), set the prob attribute to FALSE, and give the column for the remaining level the name of the target variable. It then gets handled like regression in plot_pd; a sketch follows.
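A hypothetical sketch of that post-processing; pd stands for the partial_dependence output, and the target name, the retained level, and the prob attribute below are all placeholders:

target <- "diagnosis"                        # placeholder target name
keep <- "malignant"                          # which level to keep would need an option
drop <- setdiff(c("benign", "malignant"), keep)

pd[[drop]] <- NULL                           # drop one of the two probability columns
names(pd)[names(pd) == keep] <- target       # rename the remaining column to the target
attr(pd, "prob") <- FALSE                    # so plot_pd falls back to the regression path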

randomForest.default

randomForest(X, y), which calls randomForest.default, is broken. Maybe we should have a separate method for this? It needs to be added to the tests as well.

party survival

Implement support for party survival forests in partial_dependence.RandomForest.

class matching

Some classes are still output incorrectly by partial_dependence. With the iris data, numeric variables are sometimes output as integers. I haven't figured out why this is the case.

dependencies failed to install

Just installed edarf on a new machine and the automatic installation of dependencies failed. I could, however, install the dependencies manually (specifically, in this order: 'zoo', 'sandwich', 'strucchange'; dependencies of party?).
Not a big deal, but it's weird and I'm not sure whether it is system specific.

patch or remove party var_est

To use the variance estimator with cforest we need the patched version of cforest. Either the patch to party needs to be accepted, we need to find a workaround, or this feature should be removed.

readme

You have extract_proximity twice in the readme. Moreover, the readme mentions randomforest_distance, but the function is called randomforest_dist.
You do not mention randomforest_dist in the paper, but maybe that is not a problem. I really do not know what happens inside randomforest_dist; you could add a reference or a better explanation to its help page.
Please also add references to the help pages of the other functions.

Sorry for being so critical, but I think this could really improve the documentation.

list output

partial_dependence returns a data.frame of list columns when the number of variables is greater than 1.
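A possible workaround sketch, assuming pd is the offending partial_dependence output and that each list element has length one; it flattens the list columns back into atomic vectors:

pd[] <- lapply(pd, function(col) if (is.list(col)) unlist(col) else col)
str(pd)   # all columns should now be atomic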

[feature request] interpreting forests by extracting simple rules

Using complex machine learning models for exploratory data analysis is a very interesting approach.

There is some work that tries to make tree ensembles interpretable by extracting simple and relevant rules from them.

I think this work would also be useful for exploratory data analysis.

Are you interested in integrating these methods in edarf package?

add randomForest function

it will return partial dependence on one predictor (an option in the call to randomForest), but doesn't do multivariate partial dependence.

Example does not work

Something is wrong with "gridsize":

> library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
> library(edarf)
> 
> data(iris)
> data(swiss)
> 
> ## classification
> fit <- randomForest(Species ~ ., iris)
> pd <- partial_dependence(fit, iris, c("Petal.Width", "Sepal.Length"))
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found
> pd_int <- partial_dependence(fit, iris, c("Petal.Width", "Sepal.Length"), interaction = TRUE)
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found
> 
> ## Regression
> fit <- randomForest(Fertility ~ ., swiss)
> pd <- partial_dependence(fit, swiss, "Education")
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found
> pd_int <- partial_dependence(fit, swiss, c("Education", "Catholic"), interaction = TRUE)
Error in ifelse(gridsize >= nunique, nunique, gridsize) : 
  object 'gridsize' not found

Support for random forest fit by ranger

First of all, your package is awesome! I merely have a friendly feature request.

The ranger package, which you are probably familiar with, has huge speed improvements over randomForest and randomForestSRC. I often find the latter two too slow for large (or even medium-sized) data sets.

Do you have any plans to implement support for random forest fit by ranger, or do you know if anybody else is working on that?

'uniform = FALSE' does not work

Using the example from the help:

library(randomForest); library(edarf)
data(iris); data(swiss)

## classification
fit <- randomForest(Species ~ ., iris)

## WORKING ##
pd <- partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, -ncol(iris)])

## NOT WORKING ##
pd <- partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, -ncol(iris)], uniform = FALSE)
## Error in partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, :
## Assertion on 'y' failed: Must be of type 'data.frame', not 'double'.

Cheers, thanks for having written this useful package!

plot functions

We need to give all the output from partial_dependence a simple S3 class, and then there should be a plot generic for the bivariate pd output. The output from partial_dependence should still behave like a normal data.frame, though; a sketch of the idea follows.
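A minimal sketch of that idea (the class name "pd" and the stub plot method are placeholders, not edarf code):

library(randomForest); library(edarf)

fit <- randomForest(Fertility ~ ., swiss)
pd <- partial_dependence(fit, swiss, "Education")

class(pd) <- c("pd", "data.frame")   # still prints and subsets like a data.frame

plot.pd <- function(x, ...) {
  ## inspect x (number of predictors, class probabilities, ...) and
  ## build the appropriate plot here
  invisible(x)
}

plot(pd)   # dispatches to plot.pd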

license missing

I have to check this:

  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

You have a license file, but it does not contain the text of an OSI-approved software license.

refactor/expand plot functions

plot_pd especially needs this; it is very messy. We can probably make it as simple as plotPartialPrediction in mlr. plot_imp probably needs a refactor as well.

Additionally, contour/tile plots for regression and survival tasks should be implemented (again, see the code in plotPartialPrediction).

plot_twoway_partial: argument "grid" is missing

Hi Zack!

Congratulations on your package! It is indeed very convenient.

I was testing the code today and found a bug. I ran the examples you provided here and got this message:

plot_twoway_partial(pd_reg$Education, pd_reg$pred, smooth = TRUE)
Error in noNA(grid) : argument "grid" is missing, with no default

I'm pretty sure it's an easy fix, but I'm not sure how to do it (and maybe you want to update the example too). I'm using Revolution R Open 8.0 and R 3.1.1 on Ubuntu 14.04, if that matters.

Thanks!

OOB parameter for variable importance when using party

Just a quick note that when the variable importance is computed via the party package (at least the current version, 1.0-25), the OOB parameter does not work, i.e. you cannot get the code to use out-of-bag samples.

This is because the party "predict" code that you (eventually) call (line 138 of https://github.com/cran/party/blob/R-3.0.3/R/RandomForest.R) forces OOB to FALSE when new data is passed, which it is, since you permute the data. Modifying RandomForest.R in the party code to prevent this forced behaviour seems to fix it, but I have not looked into why the behaviour is there in the first place, so I cannot be sure the code subsequently does the right thing. In any case, since this is not your code, perhaps it is just worth noting in the documentation that the OOB parameter does not work in this case?

multivariate tests

Multivariate regression/classification using party is not currently tested. It should be easy to create a small simulated dataset and test it (especially since multivariate forests are only possible with party); a sketch follows.
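A sketch of such a test, assuming cforest accepts a multivariate response on the left-hand side of the formula (as ctree does) and using the partial_dependence argument order from the examples elsewhere in these issues:

library(party)

set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y1 <- df$x1 + rnorm(100, sd = 0.1)
df$y2 <- df$x2 + rnorm(100, sd = 0.1)

fit <- cforest(y1 + y2 ~ x1 + x2, data = df,
               controls = cforest_unbiased(ntree = 50L))
pd <- partial_dependence(fit, df, "x1")
str(pd)   # expect one column per response plus the grid for x1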

column classes in output

As can be seen in the tests, some column classes in the output are wrong. I think fix_columns could fix this, but I haven't done it yet. To reproduce, run the tests:

library(devtools)
test() # from root of package

cran check

I have been working on sorting all of this out over the past day or two; see 5afbba8.

One error I haven't solved yet is:

In addition: Warning messages:
1: replacing previous import by 'party::proximity' when loading 'non-package environment' 
2: replacing previous import by 'party::varimp' when loading 'non-package environment' 
3: replacing previous import by 'party::varimpAUC' when loading 'non-package environment' 

The googling I've done suggests this comes from importing something twice, but as far as I can tell I only import each thing once. I somehow introduced this problem in the past day or so.

All of the tests are now passing. New tests were added too.

For the S3 methods to be consistent, they all have to have the same arguments as the generic. There are some cases where this is nonsensical, e.g. one of the methods can take additional arguments but the other two can't, so the generic has ... which is unused in the two methods. I'm not sure how to document that; a small illustration follows.
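A minimal illustration (not edarf code) of the consistency rule: every method must accept the generic's arguments, including ..., even when a particular method ignores them:

describe <- function(x, ...) UseMethod("describe")

## ... is present but unused here; dropping it triggers an R CMD check warning
describe.numeric <- function(x, ...) c(mean = mean(x), sd = sd(x))

## this method actually takes an extra argument, which is why the generic needs ...
describe.data.frame <- function(x, digits = 2, ...) {
  round(vapply(Filter(is.numeric, x), mean, numeric(1)), digits)
}

describe(rnorm(10))
describe(swiss, digits = 1)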

error with install from devtools

Hi there. I read your paper; interesting stuff. I would like to play around with your package, but I'm getting an install error on my Mac using devtools. Have you encountered this?

devtools::install_github('zmjones/edarf')

Downloading github repo zmjones/edarf@master
Installing edarf
Installing dependencies for edarf:
RcppArmadillo

The downloaded binary packages are in
/var/folders/29/dj4wm1714q7flrdgvg7src8h0000gn/T//RtmpxEpsRy/downloaded_packages
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL
'/private/var/folders/29/dj4wm1714q7flrdgvg7src8h0000gn/T/RtmpxEpsRy/devtoolsa4220b25ed7/zmjones-edarf-e39a4c5'
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests

  • installing source package ‘edarf’ ...
    ** libs
    clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/RcppArmadillo/include" -fPIC -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
    clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/RcppArmadillo/include" -fPIC -Wall -mtune=core2 -g -O2 -c edarf.cpp -o edarf.o
    clang++ -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o edarf.so RcppExports.o edarf.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2 -lgfortran -lquadmath -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
    ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
    ld: library not found for -lgfortran
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    make: *** [edarf.so] Error 1
    ERROR: compilation failed for package ‘edarf’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/edarf’
    Error: Command failed (1)

survival

Is simply outputting the CHF (cumulative hazard function) sufficient?

I don't think that it is; we should try to use every possible output if possible. We should also do that with party.

multivariate plotting

I think you can handle multivariate pd plots in the same way that you handle class probabilities or interactions. The only difficulty I can think of is when an interaction plot for multivariate outcomes is requested.

range ivar_points

If a variable has a lot of unique values and empirical = TRUE, it can happen that partial dependence is only calculated over a small portion of the variable's range.

ci plotting

We should add the option to use error bars instead of a ribbon. The ribbon is somewhat misleading since it is a point-wise confidence interval; both options are sketched below.
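A sketch of the two options in ggplot2; pd stands for a partial_dependence result, and the column names (Education, prediction, lower, upper) are placeholders for whatever it actually returns:

library(ggplot2)

## current style: point-wise ribbon
ggplot(pd, aes(Education, prediction)) +
  geom_line() +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2)

## proposed style: explicit error bars at each grid point
ggplot(pd, aes(Education, prediction)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1)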

Possible bug: Edarf functions give error when using a ranger object with categorical variables

Dear Mr. Jones,

Thank you for developing edarf; it seems very useful, but I am getting errors from both variable_importance and partial_dependence when using a model with categorical variables. The problem may have to do with the fact that several levels of these categorical variables are dropped from the model because they have no observations in the training data. For variable_importance the error is:

variable_importance(ranger_model, vars=c("paper", "genus"), data=training_data)
Error in variable_importance(ranger_model, vars = c("paper", "genus"), :
Assertion on 'y' failed: Must have length 1, but has length 3.

For partial_dependence, the error is:
Warning messages:
1: In names(mp)[ncol(mp)] = target :
number of items to replace is not a multiple of replacement length
2: In names(mp)[ncol(mp)] = target :
number of items to replace is not a multiple of replacement length

The graphical functions do not work in the presence of these errors. Is there anything I can do to prevent them?

Sincerely,
Caspar

plot_pd: no geom_line-lines if predictor ordered factor

First of all: I ❤️ your package! :-)

When using plot_pd, no lines are drawn if the predictor is an ordered factor. See the interaction plot in this example using the BreastCancer data set from mlbench (section 5.2).

Or is this a feature and not a bug? That is, is it an assumption in the ggplot2 philosophy that points from ordered factors shouldn't be connected?

This could possibly be addressed using the group argument in aes? See here.
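A minimal illustration (plain ggplot2, not edarf code) of why geom_line draws nothing when x is a factor unless a group aesthetic is supplied:

library(ggplot2)

d <- data.frame(
  x = factor(c("low", "mid", "high"),
             levels = c("low", "mid", "high"), ordered = TRUE),
  y = c(1, 3, 2)
)

ggplot(d, aes(x, y)) + geom_point() + geom_line()              # no line: each level is its own group
ggplot(d, aes(x, y, group = 1)) + geom_point() + geom_line()   # one group, so the points are connected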

regression plotting

library(randomForest); library(edarf)
data(swiss)

fit <- randomForest(Fertility ~ ., swiss)
pd <- partial_dependence(fit, swiss, "Education")
plot(pd)

gives bad output

class probabilities

I am pulling class probabilities from predict.randomForest. This is not consistent with the behavior of classification via party, and maybe not with randomForestSRC either. The package should either produce only one value (the argmax over the class probabilities) or have an option to choose; if the latter, the plotting function needs a corresponding option. The current bar plot for class probabilities is pretty lame. The argmax collapse is sketched below.
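A sketch of the argmax collapse, assuming pd holds one probability column per class (the iris class names here are placeholders for whatever columns the output actually has):

prob_cols <- c("setosa", "versicolor", "virginica")

## replace the per-class probability columns with the single most probable class
pd$Species <- prob_cols[max.col(as.matrix(pd[, prob_cols]))]
pd[prob_cols] <- NULL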

class probability bug

error when running the iris example with class probability output

library(randomForest)
library(edarf)
data(iris)
fit <- randomForest(Species ~ ., iris)
pd <- partial_dependence(fit, iris, "Petal.Width", type = "prob")
plot(pd)
Error in as.character(x$label) : 
  cannot coerce type 'closure' to vector of type 'character'

discretization

We should have the ability to discretize a variable that we want to use for faceting in the partial dependence plots. We could do this in the plot method or in the pd methods; the basic idea is sketched below.
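The basic idea, sketched with base R's cut outside of edarf (the variable and the number of bins are arbitrary):

data(iris)

## discretize a continuous variable into three bins for faceting
iris$Petal.Length.bin <- cut(iris$Petal.Length, breaks = 3)
table(iris$Petal.Length.bin)

## the binned variable could then drive faceting (e.g. facet_wrap) in the plot method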

foreign function call

In var_est.RandomForest I call R_predictRF_weights, which is a C function in party. R CMD check generates a note for this reason, which means we can't submit to CRAN even if randomForestCI is there and randomForest gets updated as well. If we want to fix this, we can either write some C++ that computes this given the ensemble (i.e. reimplement R_predictRF_weights) but for one tree at a time (which might give the best performance), or modify the party code and hope Hothorn accepts the change.

Examples in the help files

Some things I noticed while looking at the help files:

  • extract_proximity has no example
  • extract_prox does not exist, but it is in your readme file?
  • Why do you have ## Not run in the plot_prox examples?
