grf-labs / grf Goto Github PK
View Code? Open in Web Editor NEWGeneralized Random Forests
Home Page: https://grf-labs.github.io/grf/
License: GNU General Public License v3.0
Generalized Random Forests
Home Page: https://grf-labs.github.io/grf/
License: GNU General Public License v3.0
Hello,
I am just wondering why the honesty parameter is set to TRUE by default ?! And do you think for next releases we could use grf with caret ?
Thank you
We should likely implement the approach suggested in ESL where in each node, factor variables are ordered by their mean outcome before performing the split. This should be properly generalized to handle non-regression forests.
A few items in particular:
grf.Rd
), as it is currently quite bare-bones. #516I know that Windows isn't officially supported, but it was fairly easy to get working so I thought I'd post how. The problem is that there are symlinks in r-package/gradient.forest
to files elsewhere and while the Windows filesystem can make symlinks, it is now well supported in tools like unzip and some git implementations. The ranger
package had a similar issue. They fixed it by reversing the symlinks (having the real files in the package area and then symlinks in the other folders pointing there).
To work around this issue on Windows, just:
r-package/gradient.forest
with the real files and directories (replace the src/src
and src/third_party
directories, and copy over the files from bindings
to src
).r-package/build_package.R
install.packages("path/to/gradient.forest_X.X.X.X.tar.gz", repos = NULL, type="source")
It would be helpful to have the ability in the grf package to generate a doubly-robust ATE estimate for a separate test dataset. Right now I can use the predict statement to generate predictions of tau in a test dataset; however, taking the average of the predictions may yield very different results than the doubly-robust estimator for the ATE in situations where there is large shrinkage bias (as is the case in my data).
Hello,
I trying to understand the output information of one tree from a causal forest.
In this example, the number of training samples is 772.
tau.forest = causal_forest(x, y, z, num.trees = 500)
tree1=get_tree(tau.forest, 1)
I print out one of the final node of this tree and I get this output
tree1$nodes[5]
[[1]]
[[1]]$is_leaf
[1] TRUE
[[1]]$samples
[1] 113 694 712 705 232 584 704 0 718 19 700 109
What does it mean that one of the samples of this final node is zero?
Is it possible to have access to the source code of '_grf_instrumental_train'?
Thanks.
I think this package is really interesting. It would be nice if there was a function to do a varImpPlot, similar to what is available in the randomForest package. Right now there is nothing I am aware of that would allow me to identify which variables have the largest impact on the heterogeneity of the effect after fitting a causal forest.
We made several design trade-offs for causal forests that improve statistical accuracy and UX at the cost of performance. For example,
We should create an 'optimized' mode for simple prediction-only use cases that favors raw performance in terms of speed + memory usage.
Hi,
I've noticed that with the causal_forest
function the memory spikes during the estimation routine. With a multi-core specification, it the memory spike is right after the cores are done running and processes collapse to a single core. The memory spike also occurs when it is run with one core. In comparison, the causalForest
function in the causalTree
package does not have this issue.
For a data set with 300k observations, a binary treatment vector, and a outcome vector, the causal_forest
function run with 15 cores initially takes ~3.1 GB of memory. Then, as the cores finish and the processes is collapses to one core, the memory spikes past 50 GB. This 50 GB is measured and hard capped by a Slurm job scheduler.
Do you know why the memory spikes at that stage in the causal_forest
estimation and is there any way to circumvent the memory spike?
This may be vaguely related to Issue 137.
Thank you for your help and for making this package available.
Hello,
About the OptimizedPredictionStrategy. The code says "only 'optimized' prediction strategies can provide variance estimates" and "OptimizedPredictionStrategy does not predict using a list of neighboring samples and weights".
Following your example:
#Predict with confidence intervals; growing more trees is now recommended.
c.forest = causal_forest(X, Y, W, num.trees = 4000)
c.pred = predict(c.forest, X.test, estimate.variance = TRUE)
Does this mean that cpred$predictions were not calculated using the weights (i.e. equation 19 in "Generalized Random Forest" paper was not used)?
Many thanks.
A user expressed interest in running an experiment involving ~20 million samples and ~100 features. We should consider supporting growing trees + predicting across multiple machines, likely by splitting the data across these shards.
We should provide a method to convert a grf_tree to an rpart object, for the convenience of plotting and predicting with a single tree.
Hi,
Let's say that you have a parametric distribution where the goal is to let the parameter(s) be a function of covariates. If its likelihood and its gradient are tractable, it fits neatly within your framework through the score equation(s). For instance, I think that Stefan used forests to estimate the parameters of a GPD as functions of covariates.
I wondered if you could eventually provide an implementation example of CustomRelabelingStrategy/CustomPredictionStrategy for this kind of setup, which encompass, in my opinion, many cases of practical interest.
Show the grid of output from cross-validation together with MSE and std dev of MSE, rather than just the result of the tuning. Make clear that result of tuning is available in forest$tunable.params when tuning is done as part of the forest estimation.
It may make sense to integrate with the pdp
or mlr
packages, which provide this functionality for the models they support.
Thank you for making this package available. This package is incredibly useful for some projects on which I'm working.
I am running into trouble when I call regression_forest with a single regressor. For example,
n = 50; p = 1
X = matrix(rnorm(n*p), n, p)
Y = X * rnorm(n)
r.forest = regression_forest(X, Y)
r.forest
reports the error
Error in rowSums(split.freq) :
'x' must be an array of at least two dimensions
Hello:
I am working on a project with a large number of observations (around 300,000 or so). In this project, there are approximately 20 explanatory variables that are useful in determining the treatment of the individual (W). I have tried to use the causal_forest and estimate_average_effect functions and, sometimes, it returns an NaN response.
While I cannot go into the nuances of the data too much, approximately 30% of the sample receives the treatment. The Y variable is binary, so we are looking at the percentage change in behavior based on the treatment. For this outcome variable, we see an overall rate (for both treatment and control groups combined) of 1 being around 14%, so the outcome being "positive" is quite low across the entire sample. Do you have any suggestions on how I might be able to avoid the NaN? As I mentioned, sometimes I will get an estimated treatment effect, other times the NaN will be returned. Should I increase the sample fraction? The min node size?
My thought is that there must be a tree in the sample that is causing the NaN issue (since there are occasions where I do get a numerical response rather than NaN). Is there a way I can view each tree in isolation and take out any with an NaN?
Any suggestion would be appreciated. Thank you for your time.
I have been applying causal forest to datas that have tens millions of observations, and I've been encountering memory issues. When building a forest with only 1000 trees (lower than the recommended amount), the whole forest takes more than 200 GB of memory, even though the raw data is only 2 GB. I think the reason is each tree stores a copy of a fraction of the data (both in the "nodes" and "oob_samples" attributes).
One strategy I had of reducing the memory intensity is to use a small "sample.fraction" parameter. However, when I set a low value, the memory size for an individual tree actually went up. I think the reason is the size of tree$oob_samples went up, which makes sense. Is there a way we can not save the out of sample observations for each tree?
Several interfaces are currently missing documentation, and it can be difficult to understand expected parameters + return values, etc.
Some pieces of the repo set-up are non-obvious, and could use documentation. We should at least document the following:
I am very interested in using your algorithm to estimate heterogeneous treatment effects, but cannot find any documentation on whether and how it can be applied to panel data. Is there a way of implementing your model that is compatible with the standard fixed effects unbiasedness assumptions?
While migrating from the previous -causalTree- package to -grf-, I noticed a curious case for -grf- when running the following causal_forest example. When I expand the range of true treatment effect (changed X.test[,1] to be a sequence between -2 and 4), the estimated Treatment effect is systematically below the truth when TE is >2. Any suggestion why this might happen is hugely appreciated.
Below from the usage example:
library(grf)
n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
X.test = matrix(0, 101, p)
X.test[,1] = seq(-2, 4, length.out = 101) # this is the only line I changed from the example
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, num.trees = 4000)
tau.hat = predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat = sqrt(tau.hat$variance.estimates)
plot(X.test[,1], tau.hat$predictions, ylim = range(tau.hat$predictions + 1.96 * sigma.hat, tau.hat$predictions - 1.96 * sigma.hat, 0, 4), xlab = "x", ylab = "tau", type = "l")
lines(X.test[,1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[,1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[,1], pmax(0, X.test[,1]), col = 2, lty = 1)
Currently, there are three different structures that track information about the input data:
Data
: the full input data matrix, including X values, as well as Y, W, Z, etc. values if they're available. Optimized for use in tree splitting. Not stored as part of the serialized forest.Observables
: a map from observable ID (like TREATMENT
) to its index in the data matrix (like 11
). Not stored as part of the serialized forest.Observations
: a map from observable ID and sample ID to the value for that observable. Note that this data structure duplicates the information stored in Data
and Observables
in a convenient format. Stored as part of a serialized forest, as it is made available during prediction to the various 'prediction strategies'.Ideally, Data
would encode all of this information, to avoid having to keep track of various data structures. In this case, we could either store the entire Data
object during serialization, or instead move to a model where no data is stored during serialization (and the original data must always be supplied again during prediction). The second option likely makes the most sense, as it will be quite memory-intensive to serialize the whole data matrix. It also has a natural conceptual justification: forests models are non-parametric, and are composed of both the original data, and the learned forest structure.
For the causal forest algorithm to work, there must be sufficient overlap among similar samples in terms of their treatment. Currently when given data with very little overlap, causal_forest
may fail with an error along the lines of Error in t.default(T) : argument is not a matrix
.
We should aim to detect the case where there is not sufficient overlap, and fail with an informative error message. By default before training a causal forest, we calculate treatment propensities (W.hat
in the code), and this may be a natural point to test for low overlap.
After C++ training completes, we serialize the contents of the forest to a byte stream, and pass this up to R. This serialized forest is then passed back to subsequent C++ functions for prediction, analysis, etc.
We currently an extra copy of the forest during serialization. It's not straightforward to remove this extra copy, because of a limitation in R/ Rcpp's support of variable-sized byte streams. In particular, Rcpp raw vectors need to be presized before they are filled, so we first need to calculate the size of the serialized payload, then copy over the data (https://github.com/swager/grf/blob/master/r-package/grf/bindings/RcppUtilities.cpp#L29).
Looking at performance profiles, these extra copies cause total memory usage to spike when training has completed, and results are being piped up into R. This is likely a big contributor to the memory issues observed in #179. We should take a closer look here to see if there are any optimizations or compromises we can make.
Several classes were repurposed from ranger, or contain code that was quickly written for experimentation purposes. The following classes in particular could use a refactor/ clean-up pass: *Data
, globals
, utility
, ForestTrainer
, *SplittingRule
, Tree
, TreeTrainer
, RandomSampler
.
Some known areas for improvement:
Data
require initialization calls like set_outcome
and sort
after their construction.TreeOptions
.DISALLOW_COPY_AND_ASSIGN
macro.override
.I'm curious as to what your roadmap looks like for binary outcome variables?
If I understand the package correctly, all the methods produce a forest in the end. I am interested in understanding which variables drive the most heterogeneity in treatment effects. So that I can then show 1 group 1 message, and another group another message - it would be easier to implement from an infrastructure standpoint than predicting a treatment effect for each individual.
I am interested in adding the ability to output feature importances, would this be difficult? Or alternatively, is there a way to create 1 well-grown tree? Thank you in advance!
Just curious if there was bandwidth/interest in adding sample weights for training? If not, would you be open to a PR?
Thanks for your time!
Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.
It seems like it would be helpful to create a function similar to estimate_average_effect that estimates the average partial effects and its variance for continuous treatments.
Hello,
Thank you, once again, for answering my question a few months back. It has been immensely helpful. Though, I do have a new question for you. I am trying to see if there is a way to open up the tree/forest structure to update the predicted (Y) values while maintaining the tree/forest structure for a given regression_forest object.
For example, assume that you use a set of X variables and a response of Y, which gives you forest object F. This forest object has the samples at each terminal node (shown from the forest$node). Is there a way to insert a new set of Y's, say Y', keeping the same object F? This is for an iterative project where I want to maintain the flexible, predictive nature of the forest object, but I need to update Y's recursively. I am finding that, with re-estimation of new forest objects at each iteration causes very minute instability that I believe can be resolved if a forest object can be maintained where the predicted values from the leaves can be updated.
I hope this is clear enough.
We use 'const' in critical places, but we should do a general const correctness pass in the core C++ files.
Following randomForest
and ranger
, it would be nice to implement Breiman & Cutler's permutation importance metric. This would just require calculating the mean difference in OOB error between a feature's real and permuted observations across all trees. This value may optionally be scaled by the original feature's standard deviation to get something like a z-score.
Currently, only one component of prediction is parallelized across multiple threads: determining which tree leaves each test sample falls in.
We should also parallelize the step in which predictions are collected given this leaf membership information. A non-trivial amount of work is performed during this step, especially with the more complex strategies like LocalLinearPredictionStrategy
, and since we may additionally compute variance estimates and debiased errors.
Note that in this form of parallelization, test samples would be distributed across threads. This differs from the current instances of parallelization, where trees are split across threads.
Does GRF support datasets where the treatment variable is a factor with more than 2 levels? For example W = factor('treatment', 'control', 'treatment2', 'treatment3')
For the average treatment effect, I understand a tree is used to get heterogeneity, then a lm
call is used to get the average treatment effect. I think if W is a factor with > 2 levels lm would still work and report the ATE for the non "control" levels of the factor.
Because the package was initially based on ranger, it includes a mix of styles (both classic pointers and shared_ptr
). We should settle on a single approach that's in line with current best C++11 practices.
Should I expect that if I set min.node.size = 10
, every leaf of every tree in a causal_forest
should have at least 10 sample
's? Or am misunderstanding what min.node.size
is supposed to mean? I have found that setting positive values for min.node.size
does not result in leaves with at least that many samples.
Here is an MWE inspired by the examples in the documentation:
library(grf)
library(purrr)
leafsamples <- function(t) {
leaves <-
t %>%
pluck("nodes") %>%
keep("is_leaf") %>%
map(pluck("samples"))
return(leaves)
}
n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, min.node.size = 4, seed = 1)
leaves <- leafsamples(get_tree(tau.forest, 1))
On my machine, only 53 of the 159 leaves in the first tree of this forest have 4 or more samples.
I'm generating simple linear data (in logit space), but the predicted TEs from causal_forest
seem to be uncorrelated with the true oracle TEs. I would expect much better results than I'm currently getting. Here's my data-generating process:
simple_oracle_data <- function(nsamp, beta=c(2,-1,0,0,0), te=1, hete=1,
seed=3) {
set.seed(seed)
x <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(beta)))
s <- rbinom(n=nsamp, size=1, prob=0.5)
t <- rbinom(n=nsamp, size=1, prob=0.5)
# x drives main effects, s drives HETE
z <- x %*% beta + t*te + s*t*hete
# each example gets Z(t=0)
z_ctl <- x %*% beta
# each example gets Z(t=1)
z_tmt <- x %*% beta + s*hete + te
# Create tibble
df <- tibble::as_tibble(x)
# Rename "V#" to "x#" for numbered covariates
names(df) <- gsub("V", "x", names(df))
# Add subgroup indicator to df
df[, "s"] <- as.double(s)
# Add treatment indicator to df
df[, "t"] <- as.factor(t)
df[, "y"] <- as.double(rbinom(n=nsamp, size=1, prob=plogis(z)))
df[, "p_true"] <- as.vector(plogis(z))
df[, "p_ctl"] <- as.vector(plogis(z_ctl))
df[, "p_tmt"] <- as.vector(plogis(z_tmt))
df[, "te_true"] <- as.vector(plogis(z_tmt) - plogis(z_ctl))
list("data" = df)
}
I generate train and test samples using this function, and then make predictions on the test set as follows:
oracle_test <- function(ntrain=1000, ntest=10000, add_hete=TRUE, seed=7) {
odtrain <- simple_oracle_data(nsamp = ntrain, seed = seed)
# Pull out X, W, and Y
xtrain <- model.matrix(~ ., data = select(odtrain$data, starts_with("x"),
starts_with("s")) )[, -1]
wtrain <- as.numeric(odtrain$data$t)
ytrain <- as.numeric(odtrain$data$y)
# Train the model
cf <-causal_forest(xtrain, wtrain, ytrain, seed=seed)
# Get test data
odtest <- simple_oracle_data(nsamp = ntest, seed = seed + 1)
# Pull out X, W, and Y
xtest <- model.matrix(~ ., data=select(odtest$data, starts_with("x"),
starts_with("s")))[, -1]
ytest <- as.numeric(odtest$data$y)
ypred = predict(cf, xtest)
cbind(tibble(ytrue = odtest$data$y, tetrue = odtest$data$te_true,
ypred = as.numeric(ypred$predictions)),
select(odtest$data, starts_with("s")))
}
When comparing the predicted TEs in ypred
to the known oracle TEs in the test set tetrue
, I get a weak correlation:
> cor(ot$tetrue, ot$ypred)
[1] 0.0984858
With a different random seed, this correlation comes out negative. Here's a scatter plot of the predictions from causal_forest
versus the true TEs:
I would expect it to be fairly easy for the forest to pick up heterogeneity in this case. I've checked things over fairly carefully, and a linear model returns reasonable coefficients:
> summary(glm(y ~ x1 + x2 + x3 + x4 + x5 + s*t, data = odtrain$data, family = 'binomial'))
Call:
glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + s * t, family = "binomial",
data = odtrain$data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4840 -0.6913 0.2513 0.6656 2.7131
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.07221 0.15444 -0.468 0.640
x1 1.80200 0.12282 14.672 < 2e-16 ***
x2 -0.93426 0.09889 -9.448 < 2e-16 ***
x3 0.06619 0.08389 0.789 0.430
x4 0.06071 0.08354 0.727 0.467
x5 -0.03834 0.08456 -0.453 0.650
s 0.34898 0.23363 1.494 0.135
t1 1.20393 0.24501 4.914 8.93e-07 ***
s:t1 0.15643 0.34158 0.458 0.647
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1352.99 on 999 degrees of freedom
Residual deviance: 868.98 on 991 degrees of freedom
AIC: 886.98
Number of Fisher Scoring iterations: 5
It seems like the causal forest does a poor job of predicting treatment effect, however I'm fairly new to R and causal forestes, so even more than your typical issue, this is likely due to user error.
The underlying CRAN package could be made lighterweight by re-using the Eigen headers as shipped by RcppEigen. Many other packages do (currently 127 reverse depends). Code duplication is generally seen as something to avoid. Obviously, as for other rules there we may be good exceptions and I may be overlooking something here.
But I tested this real quick by starting from yesterday's CRAN upload; just making in 'LinkingTo: Rcpp, RcppEigen' was all it took; I could then remove the src/third_party
subtree. Happy to send a PR if you're interested.
I installed successfully gradient-forest package on my Mac, but was getting an error while running test example provided in the repository documentation.
tau.forest = causal.forest(X, Y, W)
Error in Y - Y.hat : non-numeric argument to binary operator
error comes from causal.forest.R function, line 96 of the code:
input.data <- as.matrix(cbind(X, Y - Y.hat, W - W.hat))
I am interested in using causal_forest with a treatment variable, W, and covariates X that are categorical variables. I would like to use sparse.model.matrix(~ ., data = X)
to generate the covariates since the 1 hot encoding will be very sparse.
I believe the training function ultimately calls this cpp function https://github.com/swager/grf/blob/15c7af3b1cf39fc1ae231fe325a8d800ad78ff14/r-package/grf/bindings/RegressionForestBindings.cpp#L12. The Rcpp NumericMatrix class is for dense matrices, is it possible to get this for sparse data?
ci.group.size
was used in training the forest, so that it can be reused during prediction. Currently, we force the client to pass in the same ci.group.size
during both training and prediction, which is quite fragile.ci.group.size
is set to 1. This is non-intuitive -- there should be a more robust way of turning off confidence intervals.Currently, we don't prevent splits from occurring that could result in nodes with size less than min.node.size
. Instead, the algorithm simply stops splitting if a node's size is less than or equal to min.node.size
.
Although this is also the current behavior in both ranger and the original randomForest package, it's not intuitive and can be quite misleading. We should look into updating the splitting algorithm so that min.node.size
is enforced as a true minimum.
After reading a few of the papers on causal forests, it seems like the idea is the trees will decide to split more often on variables that cause the treatment effect to vary (that is, moderators of the experimental effect). However, the object returned by causal_forest
seems to work like any other machine learning algorithm focused on prediction—not causal statistical inference—in the way the predict
function associated with it works.
How can I use the causal_forest
function to find where the treatment varies?
For example, I simulate a randomized experiment with a binary outcome where there is a positive effect for women and a negative effect for black men. There are also nuisance variables in here. How would I use causal_forest
to tell me that, "The trees are tending to split most on gender, and then when it leads to male, it the trees also are more likely to split on race." This would help show me where the treatment effects are occurring.
Here are the data:
set.seed(1839)
n <- 5000
X <- data.frame(
condition = factor(sample(c("control", "treatment"), n, TRUE)),
gender = factor(sample(c("male", "female"), n, TRUE)),
race = factor(sample(c("black", "white", "hispanic", "asian"), n, TRUE)),
generation = factor(sample(c("millennial", "x", "babyboomer"), n, TRUE)),
has_kids = factor(sample(c("no", "yes"), n, TRUE))
)
y <- factor(ifelse(
X$gender == "female" & X$condition == "treatment",
sample(c("positive", "negative"), n, TRUE, c(.50, .50)),
ifelse(X$race == "black" & X$condition == "treatment",
sample(c("positive", "negative"), n, TRUE, c(.12, .88)),
sample(c("positive", "negative"), n, TRUE, c(.40, .60))
)
))
And then I fit a model using grf::causal_forest
:
library(grf)
mod <- causal_forest(
model.matrix(~ -1 + ., X[, -1]), # convert factors to dummies
(as.numeric(y) - 1), # convert outcome to 0 or 1
(as.numeric(X[, 1]) - 1), # convert treatment to 0 or 1
num.trees = 4000
)
Looking at mod
, I see some variable importance issues that hints at where the splits are occurring most:
Variable importance:
1 2 3 4 5 6 7 8
0.233 0.203 0.267 0.061 0.055 0.046 0.081 0.054
However, this doesn't capture how multiple X
variables may depend on one another in their interaction with W
on Y
(what may be called a three-way interaction in the general linear model world).
If I use the predict()
function, I can only look at individual-level conditional treatment effects and their associated variance estimates. For example, just the first row of my data:
predict(mod, model.matrix(~ -1 + ., X[1, -1]), estimate.variance = TRUE)
However, how could I estimate the treatment effect and variance for the category women, collapsing across all other variables? Or the combination of male and Black, collapsing across all other variables? Is this as simple as averaging the treatment effects on the training set within the groups of interest? And if so, how are confidence intervals calculated from those?
Additionally, can the causal_forest
function tell me where these variations are most likely to occur? It seems like this is possible from the papers I have read on causal forests (as well as earlier papers on causal trees and transformed outcome trees, etc.), but if I may very well be mistaken.
There's a corner case where R crashes when num.trees is less than 8 on my computer:
library(grf)
n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
X.test = matrix(0, 101, p)
X.test[,1] = seq(-2, 2, length.out = 101)
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, num.trees=7)
R then aborts on my computer instantly.
It should be possible provide a weight for each variable, that will be used during tree growing when sampling the variables to consider for a split. This way, we can prohibit trees from splitting on certain variables, prefer to split more frequently on others, etc.
Hi,
I was trying to use your method to predict sales in a grocery store, and my data contains more than a few factor predictors. Some of them can easily be converted to numeric (e.g., hour of day), but some of them should really be treated as categorical (e.g. holiday or not).
An example is the following:
library(gradient.forest)
# Sample size
n = 1e3
# A data.frame with customers/holidays, respectively numeric/factor predictors
predictors = data.frame(customers = rnorm(n),
holidays = factor(rbinom(n, 1, 1/10)))
# Sales as a function of customers and holidays
sales = X$customers * rnorm(n) + as.numeric(levels(X$holidays)) * rnorm(n)
q.forest = quantile.forest(predictors, sales, quantiles=c(0.1, 0.5, 0.9))
Unfortunately, I get:
Error in quantile_train(quantiles, input.data, outcome.index.zeroindexed, :
not compatible with requested type
One obvious fix is to convert the factor predictors to numeric, which means that they will be treated as continuous variables, but is it really "correct" to do that?
These take up a substantial amount of memory, and could instead be calculated on-the-fly from Tree#get_leaf_samples or something similar.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.