grf-labs / grf Goto Github PK

View Code? Open in Web Editor NEW

953.0 46.0 249.0 62.91 MB

Generalized Random Forests

Home Page: https://grf-labs.github.io/grf/

License: GNU General Public License v3.0

CMake 0.10% R 33.88% C++ 65.70% TeX 0.28% JavaScript 0.05%

causal-inference random-forest econometrics machine-learning statistics causal-forest

grf's People

Contributors

Stargazers

Watchers

Forkers

alxsoares jmzeng jmeller rinafriedberg egnha dpuelz ma0511 qianyangzzz seekshreyas seoncheolpark nanaakwasiabayieboateng gweehwa guhjy lminer walterwzhang mlikelihood jjchern kingsharaman kcrum afcarl sanjaybasu turji chirayukong yatharth ebaruw rajkumarkarthik mindis buyannemekh rugilmartin gannawag statmixedml clu0 jiacli mingyal kendonb hulleywood jakobzeitler 775371 singularitysun davidahirshberg jec2017 vandith hulalazz itamarcaspi btskinner mnwright evanmunro pedrojma ras44 erikcs hzhz2020 sitakantapanda maxghenis klosins dgai91 conradwangt xinxin-wang-0128 adeldaoud ozaltun leofrota markmusumba edenhuangsh ganzeb linyingyang fpli-mbr adamwanggang krogerkai davidczy caiwei-zhang joeyqiang949 xychenunc lucylulu chaiyunhai dasfawg ryz0521 phoebetwong zhan-gao pp33pant ginward cwalker4 witmanyin yilongliang ericzepeda cmarmst emossotto uos-hgig albertocenzato rebeccah16 alexiajm chenhongqi0716 fabrizioleone ryurko jeverding jbdatascience zminmin buptdbj mavpanos yuhsienliu tj1994 predt

grf's Issues

grf with caret ?!

Hello,

I am just wondering why the honesty parameter is set to TRUE by default ?! And do you think for next releases we could use grf with caret ?

Thank you

Add support for factor variables.

We should likely implement the approach suggested in ESL where in each node, factor variables are ordered by their mean outcome before performing the split. This should be properly generalized to handle non-regression forests.

Improve package documentation.

A few items in particular:

Improve the main help file (grf.Rd), as it is currently quite bare-bones. #516
Do a pass through the method documentation, making sure parameters are fully explained and that they're clearly marked 'experimental' as needed.
Make sure examples on the methods are helpful and capture the primary use case.
Explain algorithmic details not obvious from the 'Generalized Random Forests' paper, for example honesty and split penalization. #290

Windows installation

I know that Windows isn't officially supported, but it was fairly easy to get working so I thought I'd post how. The problem is that there are symlinks in r-package/gradient.forest to files elsewhere and while the Windows filesystem can make symlinks, it is now well supported in tools like unzip and some git implementations. The ranger package had a similar issue. They fixed it by reversing the symlinks (having the real files in the package area and then symlinks in the other folders pointing there).

To work around this issue on Windows, just:

Replace the symlinks in r-package/gradient.forest with the real files and directories (replace the src/src and src/third_party directories, and copy over the files from bindings to src).
Run r-package/build_package.R
Then in R, install.packages("path/to/gradient.forest_X.X.X.X.tar.gz", repos = NULL, type="source")

estimate_average_effect function for test sample

It would be helpful to have the ability in the grf package to generate a doubly-robust ATE estimate for a separate test dataset. Right now I can use the predict statement to generate predictions of tau in a test dataset; however, taking the average of the predictions may yield very different results than the doubly-robust estimator for the ATE in situations where there is large shrinkage bias (as is the case in my data).

What does it mean that one of the samples in a final node of a tree is zero?

Hello,

I trying to understand the output information of one tree from a causal forest.
In this example, the number of training samples is 772.

tau.forest = causal_forest(x, y, z, num.trees = 500)
tree1=get_tree(tau.forest, 1)
I print out one of the final node of this tree and I get this output
tree1$nodes[5]
[[1]]
[[1]]$is_leaf
[1] TRUE
[[1]]$samples
[1] 113 694 712 705 232 584 704 0 718 19 700 109

What does it mean that one of the samples of this final node is zero?
Is it possible to have access to the source code of '_grf_instrumental_train'?
Thanks.

varImpPlot for causal trees

I think this package is really interesting. It would be nice if there was a function to do a varImpPlot, similar to what is available in the randomForest package. Right now there is nothing I am aware of that would allow me to identify which variables have the largest impact on the heterogeneity of the effect after fitting a causal forest.

Create an optional 'optimized' mode for the prediction-only use case.

We made several design trade-offs for causal forests that improve statistical accuracy and UX at the cost of performance. For example,

we automatically perform 'local centering', which involves training two regression forests before the causal one
we store information about which samples were out-of-bag (OOB) for each tree in order to be able to make OOB predictions

We should create an 'optimized' mode for simple prediction-only use cases that favors raw performance in terms of speed + memory usage.

Memory Spike in Causal Forest Estimation

Hi,

I've noticed that with the causal_forest function the memory spikes during the estimation routine. With a multi-core specification, it the memory spike is right after the cores are done running and processes collapse to a single core. The memory spike also occurs when it is run with one core. In comparison, the causalForest function in the causalTree package does not have this issue.

For a data set with 300k observations, a binary treatment vector, and a outcome vector, the causal_forest function run with 15 cores initially takes ~3.1 GB of memory. Then, as the cores finish and the processes is collapses to one core, the memory spikes past 50 GB. This 50 GB is measured and hard capped by a Slurm job scheduler.

Do you know why the memory spikes at that stage in the causal_forest estimation and is there any way to circumvent the memory spike?

This may be vaguely related to Issue 137.

Thank you for your help and for making this package available.

About predicted values

Hello,
About the OptimizedPredictionStrategy. The code says "only 'optimized' prediction strategies can provide variance estimates" and "OptimizedPredictionStrategy does not predict using a list of neighboring samples and weights".

Following your example:

#Predict with confidence intervals; growing more trees is now recommended.
c.forest = causal_forest(X, Y, W, num.trees = 4000)
c.pred = predict(c.forest, X.test, estimate.variance = TRUE)

Does this mean that cpred$predictions were not calculated using the weights (i.e. equation 19 in "Generalized Random Forest" paper was not used)?
Many thanks.

Add support for parallelizing across multiple shards.

A user expressed interest in running an experiment involving ~20 million samples and ~100 features. We should consider supporting growing trees + predicting across multiple machines, likely by splitting the data across these shards.

Add a conversion from grf_tree to rpart objects.

We should provide a method to convert a grf_tree to an rpart object, for the convenience of plotting and predicting with a single tree.

Add example of custom forest for log-likelihood maximization

Hi,

Let's say that you have a parametric distribution where the goal is to let the parameter(s) be a function of covariates. If its likelihood and its gradient are tractable, it fits neatly within your framework through the score equation(s). For instance, I think that Stefan used forests to estimate the parameters of a GPD as functions of covariates.

I wondered if you could eventually provide an implementation example of CustomRelabelingStrategy/CustomPredictionStrategy for this kind of setup, which encompass, in my opinion, many cases of practical interest.

Improve output from tuning

Show the grid of output from cross-validation together with MSE and std dev of MSE, rather than just the result of the tuning. Make clear that result of tuning is available in forest$tunable.params when tuning is done as part of the forest estimation.

Add support for partial dependence plots.

It may make sense to integrate with the pdp or mlr packages, which provide this functionality for the models they support.

Using regression_forest with a single regressor

Thank you for making this package available. This package is incredibly useful for some projects on which I'm working.

I am running into trouble when I call regression_forest with a single regressor. For example,

n = 50; p = 1
X = matrix(rnorm(n*p), n, p)
Y = X * rnorm(n)
r.forest = regression_forest(X, Y)

r.forest

reports the error

Error in rowSums(split.freq) : 
  'x' must be an array of at least two dimensions

Receiving an NaN estimate and standard error in estimate of average treatment effect for the treated

Hello:

I am working on a project with a large number of observations (around 300,000 or so). In this project, there are approximately 20 explanatory variables that are useful in determining the treatment of the individual (W). I have tried to use the causal_forest and estimate_average_effect functions and, sometimes, it returns an NaN response.

While I cannot go into the nuances of the data too much, approximately 30% of the sample receives the treatment. The Y variable is binary, so we are looking at the percentage change in behavior based on the treatment. For this outcome variable, we see an overall rate (for both treatment and control groups combined) of 1 being around 14%, so the outcome being "positive" is quite low across the entire sample. Do you have any suggestions on how I might be able to avoid the NaN? As I mentioned, sometimes I will get an estimated treatment effect, other times the NaN will be returned. Should I increase the sample fraction? The min node size?

My thought is that there must be a tree in the sample that is causing the NaN issue (since there are occasions where I do get a numerical response rather than NaN). Is there a way I can view each tree in isolation and take out any with an NaN?

Any suggestion would be appreciated. Thank you for your time.

Memory Issue with Gradient Forest for Large Data Set (sample.fraction)

I have been applying causal forest to datas that have tens millions of observations, and I've been encountering memory issues. When building a forest with only 1000 trees (lower than the recommended amount), the whole forest takes more than 200 GB of memory, even though the raw data is only 2 GB. I think the reason is each tree stores a copy of a fraction of the data (both in the "nodes" and "oob_samples" attributes).

One strategy I had of reducing the memory intensity is to use a small "sample.fraction" parameter. However, when I set a low value, the memory size for an individual tree actually went up. I think the reason is the size of tree$oob_samples went up, which makes sense. Is there a way we can not save the out of sample observations for each tree?

Round out documentation on C++ interfaces.

Several interfaces are currently missing documentation, and it can be difficult to understand expected parameters + return values, etc.

Add better C++ / R developer documentation.

Some pieces of the repo set-up are non-obvious, and could use documentation. We should at least document the following:

The recommended development set-up, including IDEs for both C++ and R, and how to run tests.
How the C++ / R build works and can be modified.

Panel data implementation

I am very interested in using your algorithm to estimate heterogeneous treatment effects, but cannot find any documentation on whether and how it can be applied to panel data. Is there a way of implementing your model that is compatible with the standard fixed effects unbiasedness assumptions?

Causal_Forest: systematically underestimate treatment effect when TE is large in the usage example

While migrating from the previous -causalTree- package to -grf-, I noticed a curious case for -grf- when running the following causal_forest example. When I expand the range of true treatment effect (changed X.test[,1] to be a sequence between -2 and 4), the estimated Treatment effect is systematically below the truth when TE is >2. Any suggestion why this might happen is hugely appreciated.

Below from the usage example:

library(grf)

Generate data.

n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
X.test = matrix(0, 101, p)
X.test[,1] = seq(-2, 4, length.out = 101) # this is the only line I changed from the example

Perform treatment effect estimation.

W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, num.trees = 4000)
tau.hat = predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat = sqrt(tau.hat$variance.estimates)
plot(X.test[,1], tau.hat$predictions, ylim = range(tau.hat$predictions + 1.96 * sigma.hat, tau.hat$predictions - 1.96 * sigma.hat, 0, 4), xlab = "x", ylab = "tau", type = "l")
lines(X.test[,1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[,1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[,1], pmax(0, X.test[,1]), col = 2, lty = 1)

Can see that the estimate systematically goes below the truth (in red).

Add roxygen examples for instrumental forests.

Simplify the internal handling around data, observations, and 'observables'.

Currently, there are three different structures that track information about the input data:

Data: the full input data matrix, including X values, as well as Y, W, Z, etc. values if they're available. Optimized for use in tree splitting. Not stored as part of the serialized forest.
Observables: a map from observable ID (like TREATMENT) to its index in the data matrix (like 11). Not stored as part of the serialized forest.
Observations: a map from observable ID and sample ID to the value for that observable. Note that this data structure duplicates the information stored in Data and Observables in a convenient format. Stored as part of a serialized forest, as it is made available during prediction to the various 'prediction strategies'.

Ideally, Data would encode all of this information, to avoid having to keep track of various data structures. In this case, we could either store the entire Data object during serialization, or instead move to a model where no data is stored during serialization (and the original data must always be supplied again during prediction). The second option likely makes the most sense, as it will be quite memory-intensive to serialize the whole data matrix. It also has a natural conceptual justification: forests models are non-parametric, and are composed of both the original data, and the learned forest structure.

Fail gracefully if there is not sufficient treatment overlap.

For the causal forest algorithm to work, there must be sufficient overlap among similar samples in terms of their treatment. Currently when given data with very little overlap, causal_forest may fail with an error along the lines of Error in t.default(T) : argument is not a matrix.

We should aim to detect the case where there is not sufficient overlap, and fail with an informative error message. By default before training a causal forest, we calculate treatment propensities (W.hat in the code), and this may be a natural point to test for low overlap.

Avoid making extra copies of the forest when serializing to and from R.

After C++ training completes, we serialize the contents of the forest to a byte stream, and pass this up to R. This serialized forest is then passed back to subsequent C++ functions for prediction, analysis, etc.

We currently an extra copy of the forest during serialization. It's not straightforward to remove this extra copy, because of a limitation in R/ Rcpp's support of variable-sized byte streams. In particular, Rcpp raw vectors need to be presized before they are filled, so we first need to calculate the size of the serialized payload, then copy over the data (https://github.com/swager/grf/blob/master/r-package/grf/bindings/RcppUtilities.cpp#L29).

Looking at performance profiles, these extra copies cause total memory usage to spike when training has completed, and results are being piped up into R. This is likely a big contributor to the memory issues observed in #179. We should take a closer look here to see if there are any optimizations or compromises we can make.

Bring all classes up to style and code quality standards.

Several classes were repurposed from ranger, or contain code that was quickly written for experimentation purposes. The following classes in particular could use a refactor/ clean-up pass: *Data, globals, utility, ForestTrainer, *SplittingRule, Tree, TreeTrainer, RandomSampler.

Some known areas for improvement:

Make sure objects are immutable as much as possible. Currently, several objects like Data require initialization calls like set_outcome and sortafter their construction.
Avoid trivial getters and setters on data classes like TreeOptions.
Explicitly delete copy constructors instead of defining DISALLOW_COPY_AND_ASSIGN macro.
Mark overridden methods using override.

Causal forests with binary outcomes

I'm curious as to what your roadmap looks like for binary outcome variables?

Feature importance

If I understand the package correctly, all the methods produce a forest in the end. I am interested in understanding which variables drive the most heterogeneity in treatment effects. So that I can then show 1 group 1 message, and another group another message - it would be easier to implement from an infrastructure standpoint than predicting a treatment effect for each individual.

I am interested in adding the ability to output feature importances, would this be difficult? Or alternatively, is there a way to create 1 well-grown tree? Thank you in advance!

Sample weights

Just curious if there was bandwidth/interest in adding sample weights for training? If not, would you be open to a PR?

Thanks for your time!

Account for hierarchical structure

Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.

Average Partial Effects estimates and variances using causal_forest

It seems like it would be helpful to create a function similar to estimate_average_effect that estimates the average partial effects and its variance for continuous treatments.

Updating Predicted Values While Maintaining Current Tree (Forest) Structure with Regression_Forest

Hello,

Thank you, once again, for answering my question a few months back. It has been immensely helpful. Though, I do have a new question for you. I am trying to see if there is a way to open up the tree/forest structure to update the predicted (Y) values while maintaining the tree/forest structure for a given regression_forest object.

For example, assume that you use a set of X variables and a response of Y, which gives you forest object F. This forest object has the samples at each terminal node (shown from the forest$node). Is there a way to insert a new set of Y's, say Y', keeping the same object F? This is for an iterative project where I want to maintain the flexible, predictive nature of the forest object, but I need to update Y's recursively. I am finding that, with re-estimation of new forest objects at each iteration causes very minute instability that I believe can be resolved if a forest object can be maintained where the predicted values from the leaves can be updated.

I hope this is clear enough.

Ensure const correctness.

We use 'const' in critical places, but we should do a general const correctness pass in the core C++ files.

Add permutation importance for GRFs

Following randomForest and ranger, it would be nice to implement Breiman & Cutler's permutation importance metric. This would just require calculating the mean difference in OOB error between a feature's real and permuted observations across all trees. This value may optionally be scaled by the original feature's standard deviation to get something like a z-score.

Fully parallelize prediction collection.

Currently, only one component of prediction is parallelized across multiple threads: determining which tree leaves each test sample falls in.

We should also parallelize the step in which predictions are collected given this leaf membership information. A non-trivial amount of work is performed during this step, especially with the more complex strategies like LocalLinearPredictionStrategy, and since we may additionally compute variance estimates and debiased errors.

Note that in this form of parallelization, test samples would be distributed across threads. This differs from the current instances of parallelization, where trees are split across threads.

Multiple Treatment Regimes

Does GRF support datasets where the treatment variable is a factor with more than 2 levels? For example W = factor('treatment', 'control', 'treatment2', 'treatment3')

For the average treatment effect, I understand a tree is used to get heterogeneity, then a lm call is used to get the average treatment effect. I think if W is a factor with > 2 levels lm would still work and report the ATE for the non "control" levels of the factor.

Standardize C++ pointer style.

Because the package was initially based on ranger, it includes a mix of styles (both classic pointers and shared_ptr). We should settle on a single approach that's in line with current best C++11 practices.

What does min.node.size actually do?

Should I expect that if I set min.node.size = 10, every leaf of every tree in a causal_forest should have at least 10 sample's? Or am misunderstanding what min.node.size is supposed to mean? I have found that setting positive values for min.node.size does not result in leaves with at least that many samples.

Here is an MWE inspired by the examples in the documentation:

library(grf)
library(purrr)

leafsamples <- function(t) {
  leaves <-
    t %>%
    pluck("nodes") %>%
    keep("is_leaf") %>%
    map(pluck("samples"))
  return(leaves)
}

n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, min.node.size = 4, seed = 1)

leaves <- leafsamples(get_tree(tau.forest, 1))

On my machine, only 53 of the 159 leaves in the first tree of this forest have 4 or more samples.

Simple oracle test doesn't seem to give good results

I'm generating simple linear data (in logit space), but the predicted TEs from causal_forest seem to be uncorrelated with the true oracle TEs. I would expect much better results than I'm currently getting. Here's my data-generating process:

simple_oracle_data <- function(nsamp, beta=c(2,-1,0,0,0), te=1, hete=1, 
                               seed=3) {
  set.seed(seed)
  x <- mvtnorm::rmvnorm(nsamp, mean=rep(0, length(beta)))
  s <- rbinom(n=nsamp, size=1, prob=0.5)
  t <- rbinom(n=nsamp, size=1, prob=0.5)
  # x drives main effects, s drives HETE
  z <- x %*% beta + t*te + s*t*hete 
  # each example gets Z(t=0)
  z_ctl <- x %*% beta
  # each example gets Z(t=1)
  z_tmt <- x %*% beta + s*hete + te
  
  # Create tibble
  df <- tibble::as_tibble(x)
  # Rename "V#" to "x#" for numbered covariates
  names(df) <- gsub("V", "x", names(df))
  # Add subgroup indicator to df
  df[, "s"] <- as.double(s)
  # Add treatment indicator to df
  df[, "t"] <- as.factor(t)
    
  df[, "y"] <- as.double(rbinom(n=nsamp, size=1, prob=plogis(z)))
  df[, "p_true"] <- as.vector(plogis(z))
  df[, "p_ctl"] <- as.vector(plogis(z_ctl))
  df[, "p_tmt"] <- as.vector(plogis(z_tmt))
  df[, "te_true"] <- as.vector(plogis(z_tmt) - plogis(z_ctl))
  list("data" = df)
}

I generate train and test samples using this function, and then make predictions on the test set as follows:

oracle_test <- function(ntrain=1000, ntest=10000, add_hete=TRUE, seed=7) {
  odtrain <- simple_oracle_data(nsamp = ntrain, seed = seed)
  # Pull out X, W, and Y
  xtrain <- model.matrix(~ ., data = select(odtrain$data, starts_with("x"), 
                                            starts_with("s")) )[, -1]
  wtrain <- as.numeric(odtrain$data$t) 
  ytrain <- as.numeric(odtrain$data$y) 
  # Train the model
  cf <-causal_forest(xtrain, wtrain, ytrain, seed=seed)
  
  # Get test data
  odtest <- simple_oracle_data(nsamp = ntest, seed = seed + 1)
  # Pull out X, W, and Y
  xtest <- model.matrix(~ ., data=select(odtest$data, starts_with("x"), 
                                         starts_with("s")))[, -1]
  ytest <- as.numeric(odtest$data$y) 
  ypred = predict(cf, xtest)
  cbind(tibble(ytrue = odtest$data$y, tetrue = odtest$data$te_true, 
               ypred = as.numeric(ypred$predictions)), 
        select(odtest$data, starts_with("s")))
}

When comparing the predicted TEs in ypred to the known oracle TEs in the test set tetrue, I get a weak correlation:

> cor(ot$tetrue, ot$ypred)
[1] 0.0984858

With a different random seed, this correlation comes out negative. Here's a scatter plot of the predictions from causal_forest versus the true TEs:

I would expect it to be fairly easy for the forest to pick up heterogeneity in this case. I've checked things over fairly carefully, and a linear model returns reasonable coefficients:

> summary(glm(y ~ x1 + x2 + x3 + x4 + x5 + s*t, data = odtrain$data, family = 'binomial'))

Call:
glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + s * t, family = "binomial", 
    data = odtrain$data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4840  -0.6913   0.2513   0.6656   2.7131  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.07221    0.15444  -0.468    0.640    
x1           1.80200    0.12282  14.672  < 2e-16 ***
x2          -0.93426    0.09889  -9.448  < 2e-16 ***
x3           0.06619    0.08389   0.789    0.430    
x4           0.06071    0.08354   0.727    0.467    
x5          -0.03834    0.08456  -0.453    0.650    
s            0.34898    0.23363   1.494    0.135    
t1           1.20393    0.24501   4.914 8.93e-07 ***
s:t1         0.15643    0.34158   0.458    0.647    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1352.99  on 999  degrees of freedom
Residual deviance:  868.98  on 991  degrees of freedom
AIC: 886.98

Number of Fisher Scoring iterations: 5

It seems like the causal forest does a poor job of predicting treatment effect, however I'm fairly new to R and causal forestes, so even more than your typical issue, this is likely due to user error.

Remove embedded copy of Eigen3 and use 'LinkingTo: RcppEigen'

The underlying CRAN package could be made lighterweight by re-using the Eigen headers as shipped by RcppEigen. Many other packages do (currently 127 reverse depends). Code duplication is generally seen as something to avoid. Obviously, as for other rules there we may be good exceptions and I may be overlooking something here.

But I tested this real quick by starting from yesterday's CRAN upload; just making in 'LinkingTo: Rcpp, RcppEigen' was all it took; I could then remove the src/third_party subtree. Happy to send a PR if you're interested.

getting an error from causal.forest running an example

I installed successfully gradient-forest package on my Mac, but was getting an error while running test example provided in the repository documentation.

tau.forest = causal.forest(X, Y, W)
Error in Y - Y.hat : non-numeric argument to binary operator

error comes from causal.forest.R function, line 96 of the code:
input.data <- as.matrix(cbind(X, Y - Y.hat, W - W.hat))

Does causal_forest support a sparse.model.matrix?

I am interested in using causal_forest with a treatment variable, W, and covariates X that are categorical variables. I would like to use sparse.model.matrix(~ ., data = X) to generate the covariates since the 1 hot encoding will be very sparse.

I believe the training function ultimately calls this cpp function https://github.com/swager/grf/blob/15c7af3b1cf39fc1ae231fe325a8d800ad78ff14/r-package/grf/bindings/RegressionForestBindings.cpp#L12. The Rcpp NumericMatrix class is for dense matrices, is it possible to get this for sparse data?

De-hackify the confidence interval implementation.

The serialized forest should store what ci.group.size was used in training the forest, so that it can be reused during prediction. Currently, we force the client to pass in the same ci.group.size during both training and prediction, which is quite fragile.
To indicate that no confidence intervals should be calculated, the ci.group.size is set to 1. This is non-intuitive -- there should be a more robust way of turning off confidence intervals.

Ensure that min.node.size is strictly enforced.

Currently, we don't prevent splits from occurring that could result in nodes with size less than min.node.size. Instead, the algorithm simply stops splitting if a node's size is less than or equal to min.node.size.

Although this is also the current behavior in both ranger and the original randomForest package, it's not intuitive and can be quite misleading. We should look into updating the splitting algorithm so that min.node.size is enforced as a true minimum.

Using causal_forest to estimate average treatment effects over subgroups

After reading a few of the papers on causal forests, it seems like the idea is the trees will decide to split more often on variables that cause the treatment effect to vary (that is, moderators of the experimental effect). However, the object returned by causal_forest seems to work like any other machine learning algorithm focused on prediction—not causal statistical inference—in the way the predict function associated with it works.

How can I use the causal_forest function to find where the treatment varies?

For example, I simulate a randomized experiment with a binary outcome where there is a positive effect for women and a negative effect for black men. There are also nuisance variables in here. How would I use causal_forest to tell me that, "The trees are tending to split most on gender, and then when it leads to male, it the trees also are more likely to split on race." This would help show me where the treatment effects are occurring.

Here are the data:

set.seed(1839)
n <- 5000
X <- data.frame(
  condition = factor(sample(c("control", "treatment"), n, TRUE)),
  gender = factor(sample(c("male", "female"), n, TRUE)),
  race = factor(sample(c("black", "white", "hispanic", "asian"), n, TRUE)),
  generation = factor(sample(c("millennial", "x", "babyboomer"), n, TRUE)),
  has_kids = factor(sample(c("no", "yes"), n, TRUE))
)
y <- factor(ifelse(
  X$gender == "female" & X$condition == "treatment",
  sample(c("positive", "negative"), n, TRUE, c(.50, .50)),
  ifelse(X$race == "black" & X$condition == "treatment",
    sample(c("positive", "negative"), n, TRUE, c(.12, .88)),
    sample(c("positive", "negative"), n, TRUE, c(.40, .60))
  )
))

And then I fit a model using grf::causal_forest:

library(grf)
mod <- causal_forest(
  model.matrix(~ -1 + ., X[, -1]), # convert factors to dummies
  (as.numeric(y) - 1), # convert outcome to 0 or 1
  (as.numeric(X[, 1]) - 1), # convert treatment to 0 or 1
  num.trees = 4000
)

Looking at mod, I see some variable importance issues that hints at where the splits are occurring most:

Variable importance: 
    1     2     3     4     5     6     7     8 
0.233 0.203 0.267 0.061 0.055 0.046 0.081 0.054

However, this doesn't capture how multiple X variables may depend on one another in their interaction with W on Y (what may be called a three-way interaction in the general linear model world).

If I use the predict() function, I can only look at individual-level conditional treatment effects and their associated variance estimates. For example, just the first row of my data:

predict(mod, model.matrix(~ -1 + ., X[1, -1]), estimate.variance = TRUE)

However, how could I estimate the treatment effect and variance for the category women, collapsing across all other variables? Or the combination of male and Black, collapsing across all other variables? Is this as simple as averaging the treatment effects on the training set within the groups of interest? And if so, how are confidence intervals calculated from those?

Additionally, can the causal_forest function tell me where these variations are most likely to occur? It seems like this is possible from the papers I have read on causal forests (as well as earlier papers on causal trees and transformed outcome trees, etc.), but if I may very well be mistaken.

Corner Case when setting num.trees

There's a corner case where R crashes when num.trees is less than 8 on my computer:

library(grf)

Generate data.

n = 2000; p = 10
X = matrix(rnorm(n*p), n, p)
X.test = matrix(0, 101, p)
X.test[,1] = seq(-2, 2, length.out = 101)

Perform treatment effect estimation.

W = rbinom(n, 1, 0.5)
Y = pmax(X[,1], 0) * W + X[,2] + pmin(X[,3], 0) + rnorm(n)
tau.forest = causal_forest(X, Y, W, num.trees=7)

R then aborts on my computer instantly.

Add support for variable weights when sampling possible split variables.

It should be possible provide a weight for each variable, that will be used during tree growing when sampling the variables to consider for a split. This way, we can prohibit trees from splitting on certain variables, prefer to split more frequently on others, etc.

Factor predictors

Hi,

I was trying to use your method to predict sales in a grocery store, and my data contains more than a few factor predictors. Some of them can easily be converted to numeric (e.g., hour of day), but some of them should really be treated as categorical (e.g. holiday or not).

An example is the following:

library(gradient.forest)

# Sample size
n = 1e3

# A data.frame with customers/holidays, respectively numeric/factor predictors
predictors = data.frame(customers = rnorm(n), 
                        holidays = factor(rbinom(n, 1, 1/10)))

# Sales as a function of customers and holidays
sales = X$customers * rnorm(n) + as.numeric(levels(X$holidays)) * rnorm(n)

q.forest = quantile.forest(predictors, sales, quantiles=c(0.1, 0.5, 0.9))

Unfortunately, I get:

Error in quantile_train(quantiles, input.data, outcome.index.zeroindexed,  : 
  not compatible with requested type

One obvious fix is to convert the factor predictors to numeric, which means that they will be treated as continuous variables, but is it really "correct" to do that?

Avoid storing the out-of-bag samples for each tree.

These take up a substantial amount of memory, and could instead be calculated on-the-fly from Tree#get_leaf_samples or something similar.

grf-labs / grf Goto Github PK

grf's People

Contributors

Stargazers

Watchers

Forkers

grf's Issues

Generate data.

Perform treatment effect estimation.

Can see that the estimate systematically goes below the truth (in red).

Generate data.

Perform treatment effect estimation.

Recommend Projects

Recommend Topics

Recommend Org