Make DoubleML available for R (≥ 4.0.2)

Describe the feature you want to propose or implement

I cannot install DoubleML in the configuration below.

SessionInfo (Microsoft R Open 4.0.2)

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.3 (Ootpa)

Matrix products: default
BLAS:   /opt/microsoft/ropen/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RevoUtils_11.0.2     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] compiler_4.0.2 parallel_4.0.2 tools_4.0.2

Commands run and their output

> install.packages("DoubleML")
Installing package into ‘<XXX>/app/R40_Library’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘DoubleML’ is not available (for R version 4.0.2)

Dimensions of properties like the estimated coefficients in the documentation (R)

This issue might be relevant for R as well:

DoubleML/doubleml-for-py#85

We should ...

... briefly explain the dimensions of all_coef (etc.) in the documentation
... add column and row names to the dimensions of the object

In R we should think of the (currently private) get__...-methods in this context, for example:

doubleml-for-r/R/double_ml.R

Line 784 in 8790659

get__psi_a = function() self$psi_a[, private$i_rep, private$i_treat],

Intended behavior (and exception handling / warning) of repetitive calls to `set_ml_nuisance_params`

See DoubleML/doubleml-for-py#99

Reduce unit test times

Some of our unit tests take too long. With ON_CRAN='false', the test suite takes around 30 minutes on GitHub Actions. We should find a better parametrization while keeping a similar level of coverage.

Extensions and refinements for the trimming of propensity scores (IRM & IIVM)

This is the corresponding issue for the R package for extensions & refinements of the propensity scores in the IRM & IIVM

Original issue: DoubleML/doubleml-for-py#109

Failing unit test on CRAN solaris

See https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/DoubleML-00check.html

checking tests ... [139s/138s] ERROR
  Running ‘testthat_regression_tests.R’ [138s/138s]
Running the tests in ‘tests/testthat_regression_tests.R’ failed.
Complete output:
  >
  > library("testthat")
  > library("patrick")
  > library("DoubleML")
  >
  > testthat::test_check("DoubleML")
  ── ERROR (test-double_ml_iivm.R:33:3): Unit tests for IIVM: cv_glmnet_dml2_LATE_
  Error: 'NA' indices are not (yet?) supported for sparse Matrices
  Backtrace:
       █
    1. ├─rlang::eval_tidy(code, args)
    2. └─DoubleML:::dml_irmiv(...) test-double_ml_iivm.R:33:2
    3. └─mlr3::resample(task_p, ml_p, resampling_p, store_models = TRUE) helper-11-dml_irmiv.R:91:2
    4. └─future.apply::future_lapply(...)
    5. └─future.apply:::future_xapply(...)
    6. ├─future::value(fs)
    7. └─future:::value.list(fs)
    8. ├─future::resolve(...)
    9. └─future:::resolve.list(...)
   10. └─future:::signalConditionsASAP(obj, resignal = FALSE, pos = ii)
   11. └─future:::signalConditions(...)
  
  ── ERROR (test-double_ml_irm.R:33:3): Unit tests for IRM: cv_glmnet_dml2_ATE_1_0
  Error: missing value where TRUE/FALSE needed
  Backtrace:
       █
    1. ├─rlang::eval_tidy(code, args)
    2. └─DoubleML:::dml_irm(...) test-double_ml_irm.R:33:2
    3. └─mlr3::resample(task_m, ml_m, resampling_m, store_models = TRUE) helper-10-dml_irm.R:69:2
    4. └─future.apply::future_lapply(...)
    5. └─future.apply:::future_xapply(...)
    6. ├─future::value(fs)
    7. └─future:::value.list(fs)
    8. ├─future::resolve(...)
    9. └─future:::resolve.list(...)
   10. └─future:::signalConditionsASAP(obj, resignal = FALSE, pos = ii)
   11. └─future:::signalConditions(...)

Minor inconsistency between user guide notation and the code?

I have a question about a potential inconsistency between the notation provided in the user guide and the code. If not an inconsistency, then it must represent my own misunderstanding of the notation in the user guide (and if so, my apologies in advance).

Looking at the documentation to estimate the variance of the estimator, I would describe the expression as J_{0}^{-2} multiplied by the mean of \psi^2, where this latter term is represented by the double sum over folds and observations within each fold. The N^{-1} here serves to calculate the mean over this double sum.

However, in the code here, the quantity above is premultiplied by an additional N^{-1} term.

I suspect the code is correct, and so that's why this seems more like an issue about the notation in the documentation. I looked at Theorem 3.2 in the published paper but I had trouble identifying where the extra N^{-1} term would come from.

Is this a notation problem or am I missing something?

Thanks,
Brett

Messages in DoubleML

Handle messages in DoubleML during instantiation, fitting, tuning etc. of models

[Bug]: the result of Lasso learner is different from others

Describe the bug

Hi, the DML package is really useful for me and I am using it for my master's thesis. I have tried LightGBM/RF/Xgboost/Lasso as learners. The results of LightGBM/RF/Xgboost are similar, but the results of Lasso are rather different. The following is a part of the results. Can you help me with this issue?

Minimum reproducible code snippet

LassoFormula = xnames[1]

for (name in xnames[-1]) {
  LassoFormula = paste0(LassoFormula, '+', name)
}
LassoFormula = paste0('~(', LassoFormula, ")^2")

LassoFormula = formula(LassoFormula) # create the formula
# features_flex = data.frame(model.matrix(LassoFormula, dataS)) # second-order terms
model_data = data.table("y" = dataS[, ynames],
  "d" = dataS[, "IndShareSuccessful"],
  features_flex)

################################ Lasso

DMLLasso = function(yname) {
  set.seed(123)
  lasso = lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min") # set g model
  lasso_class = lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min") # set m model

  data_dml_flex = DoubleMLData$new(model_data,
    y_col = paste0('y.', yname),
    d_cols = 'd.IndShareSuccessful')
  dml_plr_lasso = DoubleMLPLR$new(data_dml_flex,
    ml_g = lasso,
    ml_m = lasso_class,
    n_folds = 3)
  dml_plr_lasso$fit()
  dml_plr_lasso$summary()
}

Expected Result

I think the results of different learners should be similar.

Actual Result

indicators	Lasso	lightGBM	Xgboost	RF
a	-0.105	-13.424***	-13.410***	-13.025***
b	0.001	0.265***	0.275***	0.259***
c	0.003	0.186***	0.187***	0.185***
d	-0.017	20.600***	21.417***	20.165***
e	1.701	13.227***	13.282***	12.853***
f	16.672	10.549*	15.637**	8.339

Versions

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] xtable_1.8-4 mlr3tuning_0.9.0 paradox_0.7.1
[4] mlr3learners.lightgbm_0.0.10.9001 xgboost_1.5.0.2 WRS2_1.1-3
[7] tmcn_0.2-13 forcats_0.5.1 stringr_1.4.0
[10] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3
[13] tibble_3.1.3 tidyverse_1.3.1 stargazer_5.2.2
[16] sm_2.2-5.7 scales_1.1.1 readxl_1.3.1
[19] randomForest_4.6-14 ranger_0.13.1 qdapRegex_0.7.2
[22] plotrix_3.8-2 plotly_4.10.0 ggplot2_3.3.5
[25] plm_2.4-3 np_0.60-11 nnet_7.3-16
[28] nlme_3.1-153 MultiGHQuad_1.2.0 mvtnorm_1.1-3
[31] mlr3_0.13.1 MatchIt_4.3.2 lubridate_1.7.10
[34] lm.beta_1.5-1 lfe_2.8-7.1 lawstat_3.4
[37] knitr_1.33 kSamples_1.2-9 SuppDists_1.1-9.7
[40] kknn_1.3.1 gridExtra_2.3 grf_2.0.2
[43] gmm_1.6-6 glmnet_4.1-3 ggridges_0.5.3
[46] frequentdirections_0.1.0 fixest_0.10.1 expm_0.999-6
[49] Matrix_1.3-4 DoubleML_0.4.1 dgof_1.2
[52] data.table_1.14.2 contextual_0.9.8.4 coda_0.19-4
[55] BTYDplus_1.2.0 BTYD_2.4.3 dplyr_1.0.7
[58] optimx_2021-6.12 hypergeo_1.2-13 broom_0.7.11
[61] bit64_4.0.5 bit_4.0.4 beepr_1.3
[64] AER_1.2-9 survival_3.2-13 sandwich_3.0-1
[67] lmtest_0.9-38 zoo_1.8-9 car_3.0-12
[70] carData_3.0-5 devtools_2.4.3 usethis_2.1.3

loaded via a namespace (and not attached):
[1] SparseM_1.81 ModelMetrics_1.2.2.2 R.methodsS3_1.8.1 maxLik_1.5-2
[5] clusterGeneration_1.3.7 R.utils_2.11.0 rpart_4.1-15 doParallel_1.0.16
[9] generics_0.1.0 callr_3.7.0 future_1.23.0 tzdb_0.1.2
[13] xml2_1.3.2 assertthat_0.2.1 gower_0.2.2 xfun_0.24
[17] hms_1.1.0 fansi_0.5.0 dbplyr_2.1.1 igraph_1.2.6
[21] DBI_1.1.1 htmlwidgets_1.5.3 reshape_0.8.8 stats4_4.1.2
[25] ellipsis_0.3.2 backports_1.2.1 vctrs_0.3.8 remotes_2.4.1
[29] quantreg_5.86 abind_1.4-5 caret_6.0-78 cachem_1.0.5
[33] withr_2.4.2 itertools_0.1-3 mlr3learners_0.5.1 vroom_1.5.4
[37] bdsmatrix_1.3-4 checkmate_2.0.0 prettyunits_1.1.1 cluster_2.1.2
[41] lazyeval_0.2.2 crayon_1.4.1 elliptic_1.4-0 recipes_0.1.17
[45] pkgconfig_2.0.3 pkgload_1.2.3 rlang_0.4.11 globals_0.14.0
[49] lifecycle_1.0.0 MatrixModels_0.5-0 palmerpenguins_0.1.0 modelr_0.1.8
[53] Kendall_2.2 cellranger_1.1.0 rprojroot_2.0.2 matrixStats_0.61.0
[57] mc2d_0.1-21 boot_1.3-28 reprex_2.0.1 base64enc_0.1-3
[61] processx_3.5.2 png_0.1-7 viridisLite_0.4.0 rjson_0.2.21
[65] R.oo_1.24.0 shape_1.4.6 parallelly_1.28.1 jpeg_0.1-9
[69] memoise_2.0.1 magrittr_2.0.1 plyr_1.8.6 audio_0.1-10
[73] compiler_4.1.2 miscTools_0.6-26 RColorBrewer_1.1-2 cli_3.1.0
[77] listenv_0.8.0 ps_1.6.0 htmlTable_2.3.0 Formula_1.2-4
[81] MASS_7.3-54 tidyselect_1.1.1 stringi_1.7.3 latticeExtra_0.6-29
[85] tools_4.1.2 mlr3misc_0.10.0 future.apply_1.8.1 parallel_4.1.2
[89] rstudioapi_0.13 uuid_0.1-4 foreign_0.8-81 foreach_1.5.1
[93] cubature_2.0.4.2 prodlim_2019.11.13 digest_0.6.27 lava_1.6.10
[97] quadprog_1.5-8 Rcpp_1.0.7 R.devices_2.17.0 httr_1.4.2
[101] contfrac_1.1-12 Rdpack_2.1.3 colorspace_2.0-2 rvest_1.0.1
[105] fs_1.5.0 readstata13_0.10.0 splines_4.1.2 lgr_0.4.3
[109] bbotk_0.4.0 conquer_1.2.1 sessioninfo_1.2.1 dreamerr_1.2.3
[113] jsonlite_1.7.2 timeDate_3043.102 testthat_3.1.0 ipred_0.9-12
[117] R6_2.5.1 Hmisc_4.6-0 pillar_1.6.2 htmltools_0.5.1.1
[121] glue_1.4.2 fastmap_1.1.0 deSolve_1.30 class_7.3-19
[125] codetools_0.2-18 pkgbuild_1.2.0 utf8_1.2.2 lattice_0.20-45
[129] numDeriv_2016.8-1.1 curl_4.3.2 desc_1.4.0 munsell_0.5.0
[133] iterators_1.0.13 haven_2.4.3 reshape2_1.4.4 gtable_0.3.0
[137] rbibutils_2.2.7

packageVersion('DoubleML')
[1] ‘0.4.1’
packageVersion('mlr3')
[1] ‘0.13.1’

Note on CRAN: Undeclared package ‘bbotk’ in Rd xrefs

We get a note on CRAN about an undeclared package bbotk in Rd xrefs.
Guess this line in the documentation is causing this:

doubleml-for-r/R/double_ml.R

Line 348 in dec9a4d

    
           #' * `terminator` \cr A [Terminator][bbotk::Terminator] object. Default is `mlr3tuning::trm("evals", n_evals = 20)`.

--> Presumably this can be solved by adding bbotk to the Suggests field of the DESCRIPTION file.

Bug in the aggregation of standard errors from repeated cross-fitting

I think there is a bug in the aggregation of standard errors from repeated cross-fitting.

Description

The aggregation formula stated in Chernozhukov et al. (2018) is
$formula1$
Note that we also state the same here in the user guide: https://docs.doubleml.org/stable/guide/resampling.html#repeated-cross-fitting-with-k-folds-and-m-repetition
For the implementation it is important to point out that attribute(s) se are not equal to sigma_hat but the scaled asymptotic standard error, i.e.,
$formula2$
The same also applies to the standard errors from the repeated splits _all_se, i.e.,
$formula3$
Therefore, the correct formula for aggregating the asymptotic / scaled standard errors is
$formula4$

Implementation

The implementation in the Python package is in line with the above description and should be correct, see https://github.com/DoubleML/doubleml-for-py/blob/ce44e849a45091006d29679ee45e235f2ad0555a/doubleml/double_ml.py#L1203-L1205

The implementation in R must be adapted as described above, see

doubleml-for-r/R/double_ml.R

Lines 1152 to 1154 in f45de1f

    
           self$se = sqrt(apply(
             self$all_se^2 + (self$all_coef - self$coef)^2, 1,
             function(x) median(x, na.rm = TRUE)))
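
The adaptation described above could look roughly like the following. This is only a sketch, not the actual fix: it assumes that all_coef and all_se are matrices with one row per treatment and one column per repetition, and the standalone function aggregate_se and its argument n_obs (the number of observations N) are hypothetical names used for illustration.

```r
# Sketch of the corrected aggregation over repeated cross-fitting.
# Assumptions: all_coef and all_se are (n_treat x n_rep) matrices and
# n_obs is the number of observations N; aggregate_se is a hypothetical name.
aggregate_se = function(all_coef, all_se, n_obs) {
  # Median coefficient per treatment (one value per row).
  coef = apply(all_coef, 1, function(x) median(x, na.rm = TRUE))
  # Rescale to sigma_hat_m^2 = N * se_m^2, add the squared deviation from the
  # median coefficient, take the median over repetitions, then scale back by N.
  sigma2 = apply(
    n_obs * all_se^2 + (all_coef - coef)^2, 1,
    function(x) median(x, na.rm = TRUE))
  sqrt(sigma2 / n_obs)
}
```

With a single repetition the function returns all_se unchanged, which is a quick sanity check for the scaling.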

Unit Tests

We don't seem to have unit tests that are sensitive to the bug fix in the aggregation formula. In my ongoing major update of the unit test framework I will add this extension.
In our R vs. Python package tests we so far didn't have a test case with repeated cross-fitting, and therefore the difference between the implementations didn't become visible: I added such a test case in DoubleML/doubleml-py-vs-r#4. As the tests are now sensitive to the aggregation formula, they also fail in the PR, which will be resolved once the R package gets its bug fix.

Documentation

The user guide (https://docs.doubleml.org/stable/guide/resampling.html#repeated-cross-fitting-with-k-folds-and-m-repetition) is already quite precise in this regard (see screenshot below). However, I would change one small thing: In attribute _all_se we don't store the unscaled standard errors sigma_hat_m but the scaled / asymptotic standard errors sigma_hat_m / sqrt(N). We should adapt this accordingly.

Note on CRAN: 'LazyData' is specified without a 'data' directory

See https://cloud.r-project.org/web/checks/check_results_DoubleML.html

Version: 0.2.1
Check: LazyData
Result: NOTE
     'LazyData' is specified without a 'data' directory
Flavors: r-devel-linux-x86_64-debian-clang, r-devel-linux-x86_64-debian-gcc, r-devel-linux-x86_64-fedora-clang, r-devel-linux-x86_64-fedora-gcc, r-devel-windows-ix86+x86_64, r-devel-windows-x86_64-gcc10-UCRT, r-patched-linux-x86_64

Failing builds on github actions with macOS

See https://github.com/DoubleML/doubleml-for-r/actions/runs/487600657
Difficult to track down what caused it, but it might be related to this change in r-lib/actions: r-lib/actions#229

Unit test for the extraction of predictions fails for non-glmnet learner

If one replaces the learner in the unit test https://github.com/DoubleML/doubleml-for-r/blob/master/tests/testthat/test-double_ml_plr_export_preds.R with anything other than "regr.cv_glmnet", it fails, even for "regr.lm". The cross-validated glmnet seems to pass because it produces constant predictions in each fold.

Inconsistent initialization of `task_type` between PLIV and other models (PLR, IRM, IIVM)

PLR, IRM and IIVM

For PLR, IRM and IIVM, we first initialize the private property task_type to NULL for all necessary nuisance parts and during the call to assert_learner it is filled up with content. See

doubleml-for-r/R/double_ml_plr.R

Lines 150 to 154 in acb9d46

    
           private$task_type = list(
             "ml_g" = NULL,
             "ml_m" = NULL)
           ml_g = private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = FALSE)
           ml_m = private$assert_learner(ml_m, "ml_m", Regr = TRUE, Classif = TRUE)

doubleml-for-r/R/double_ml_irm.R

Lines 192 to 196 in acb9d46

    
           private$task_type = list(
             "ml_g" = NULL,
             "ml_m" = NULL)
           ml_g = private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = TRUE)
           ml_m = private$assert_learner(ml_m, "ml_m", Regr = FALSE, Classif = TRUE)

doubleml-for-r/R/double_ml_iivm.R

Lines 248 to 254 in acb9d46

    
           private$task_type = list(
             "ml_g" = NULL,
             "ml_m" = NULL,
             "ml_r" = NULL)
           ml_g = private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = TRUE)
           ml_m = private$assert_learner(ml_m, "ml_m", Regr = FALSE, Classif = TRUE)
           ml_r = private$assert_learner(ml_r, "ml_r", Regr = FALSE, Classif = TRUE)

PLIV

For PLIV this initialization does not happen, but still it seems to work as expected, see

doubleml-for-r/R/double_ml_pliv.R

Lines 208 to 216 in acb9d46

    
           ml_g = private$assert_learner(ml_g, "ml_g",
             Regr = TRUE,
             Classif = FALSE)
           ml_m = private$assert_learner(ml_m, "ml_m",
             Regr = TRUE,
             Classif = FALSE)
           ml_r = private$assert_learner(ml_r, "ml_r",
             Regr = TRUE,
             Classif = FALSE)

Possible solution

In the base class DoubleML the private property task_type is initialized to an empty list, which in my view suffices.

doubleml-for-r/R/double_ml.R

Line 1153 in acb9d46

task_type = list(),

It is then filled up with meaningful content when assert_learner is being called for the learners assigned for the different nuisance parts. Therefore, I guess we could simplify by removing the additional nuisance-part specific initialization to NULL being done for PLR, IRM and IIVM.

Miscellaneous

I would furthermore suggest adding some sort of assertion in the helper function dml_cv_predict. Basically, I wouldn't accept anything other than "regr" or "classif". The code will fail anyway with any other choice, like NULL, because then the variable resp_name would never be assigned, see

doubleml-for-r/R/helper.R

Lines 162 to 166 in acb9d46

    
           if (task_type == "regr") {
             resp_name = "response"
           } else if (task_type == "classif") {
             resp_name = "prob.1"
           }
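
A minimal sketch of such an assertion, written in base R so that it stands alone. The helper name get_resp_name and the wording of the error message are hypothetical; in the package the check would presumably live inside dml_cv_predict itself.

```r
# Hypothetical helper: map a task type to the prediction column name and
# fail early for anything other than "regr" or "classif" (including NULL).
get_resp_name = function(task_type) {
  valid = c("regr", "classif")
  if (!is.character(task_type) || length(task_type) != 1 ||
    !(task_type %in% valid)) {
    stop("Invalid task_type: must be either 'regr' or 'classif'.")
  }
  if (task_type == "regr") "response" else "prob.1"
}
```

With this check, a NULL task type produces an explicit error instead of a later failure about the missing object resp_name.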

Missing exception handling for infinite / missing predictions

There is no exception handling in place in case a learner produces infinite or missing predictions. The estimates silently become NA without a warning or exception.

See for example:

library(DoubleML)

g = function(x) {
  res = sin(x)^2
  return(res)
}

m = function(x, nu = 0, gamma = 1) {
  xx = sinh(gamma) / (cosh(gamma) - cos(x - nu))
  res = 0.5 / pi * xx
  return(res)
}

dgp1_irmiv = function(theta, N, k) {
  
  b = 1 / (1:k)
  sigma = clusterGeneration::genPositiveDefMat(k, "unifcorrmat")$Sigma
  
  X = mvtnorm::rmvnorm(N, sigma = sigma)
  G = g(as.vector(X %*% b))
  M = m(as.vector(X %*% b))
  
  pr_z = 1 / (1 + exp(-(1) * X[, 1] * b[5] + X[, 2] * b[2] + rnorm(N)))
  z = rbinom(N, 1, pr_z)
  
  U = rnorm(N)
  pr = 1 / (1 + exp(-(1) * (0.5 * z + X[, 1] * (-0.5) + X[, 2] * 0.25 - 0.5 * U + rnorm(N))))
  d = rbinom(N, 1, pr)
  err = rnorm(N)
  
  y = theta * d + G + 4 * U + err
  
  data = data.frame(y, d, z, X)
  
  return(data)
}

set.seed(1282)
df = dgp1_irmiv(0.5, 1000, 20)
Xnames = names(df)[names(df) %in% c("y", "d", "z") == FALSE]
dml_data = double_ml_data_from_data_frame(df,
                                          y_col = "y",
                                          d_cols = "d", x_cols = Xnames, z_col = "z")

ml_g = mlr3::lrn("regr.rpart", cp = 0.01, minsplit = 20)
ml_m = mlr3::lrn("classif.rpart", cp = 0.01, minsplit = 20)
ml_r = mlr3::lrn("classif.rpart", cp = 0.01, minsplit = 20)

set.seed(3141)
double_mliivm_obj = DoubleMLIIVM$new(
  data = dml_data,
  n_folds = 5,
  ml_g = ml_g,
  ml_m = ml_m,
  ml_r = ml_r,
  dml_procedure = "dml2",
  trimming_threshold = 0,
  score = "LATE")
double_mliivm_obj$fit()
print(double_mliivm_obj$coef)
print(double_mliivm_obj$se)

It gets even more confusing if one then calls the method bootstrap(). This results in the exception

double_mliivm_obj$bootstrap()
Error in double_mliivm_obj$bootstrap(): Apply fit() before bootstrap().

which is obviously not the root cause and also the remark to apply fit() will obviously not fix the issue.

I propose to implement a check for finite predictions similar to the check in the Python package: https://github.com/DoubleML/doubleml-for-py/blob/b3cbdb572fce435c18ec67ca323645900fc901b5/doubleml/_utils.py#L204-L208
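
As a rough sketch of what such a check could look like in R: the function name check_finite_predictions, its arguments and the wording of the message are assumptions made for illustration, not existing package code.

```r
# Hypothetical check: stop with an informative message if any prediction is
# NA, NaN or infinite; otherwise return the predictions invisibly.
check_finite_predictions = function(preds, learner_name) {
  if (!all(is.finite(preds))) {
    stop(sprintf(
      "Predictions from learner %s are not finite.", learner_name))
  }
  invisible(preds)
}
```

Calling this right after the nuisance predictions are obtained would surface the problem at its source instead of at bootstrap().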

Avoid overriding learner parameters during tuning

Currently, the parameter values of a learner (<learner>$param_set$values) are overridden when calling dml_tune

doubleml-for-r/R/helper.R

Line 133 in b97814b

ml_learner = initiate_learner(learner, learner_class, params = list())

Meaningful error message if sample splitting was not yet set

Calling

dml_plr_obj = DoubleMLPLR$new(make_plr_CCDDHNR2018(),
  lrn("regr.ranger"), lrn("regr.ranger"),
  draw_sample_splitting = FALSE)
dml_plr_obj$fit()

produces error message

 Error in .__ResamplingCustom__instantiate(self = self, private = private,  : 
  Assertion on 'train_sets' failed: Must be of type 'list', not 'NULL'.

More meaningful would be something along the lines of https://github.com/DoubleML/doubleml-for-py/blob/a574e0afcab0e7cce475925f1344399e75dd4a11/doubleml/double_ml.py#L238-L239.
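
One possible shape for such a check, as a sketch only. The helper name assert_smpls_set and the message text are hypothetical, and the methods named in the message (split_samples() and set_sample_splitting()) are assumed to be the relevant entry points.

```r
# Hypothetical guard to call at the start of fit(): fail with a clear message
# if no sample splitting has been drawn or provided yet.
assert_smpls_set = function(smpls) {
  if (is.null(smpls)) {
    stop(paste(
      "Sample splitting not specified.",
      "Draw samples via split_samples() or set them via",
      "set_sample_splitting()."))
  }
  invisible(TRUE)
}
```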

Classification Learners in PLR

Use of classifiers in PLR in the binary treatment case, cf. DoubleML/doubleml-for-py#86

[Unit Test Extension]: Implement "default setting unit tests"

In the Python package DoubleML, we have unit tests for model defaults, see https://github.com/DoubleML/doubleml-for-py/blob/master/doubleml/tests/test_doubleml_model_defaults.py. The intention behind such "default setting unit tests" is twofold:

It should assert that the defaults are valid / meaningful, i.e., the code runs through successfully with default values for the input parameters.
The unit tests serve as a reminder to update the documentation of defaults in case a default value is being changed.

Such "default setting unit tests" could be done for the initialization of the model classes as well as for the most important methods.

Note: Such "default setting unit tests" would have been sensitive to bugs like #155 & #156

Failing builds on GitHub Actions with development version of mlr3

https://github.com/DoubleML/doubleml-for-r/actions/runs/1657577777

I have to check why the build fails, seems to be related to markdown / vignettes

Add column and row names to the dimensions of the objects

Add column and row names to the dimensions of the objects like psi, all_coef, etc.
Discuss for which objects it is helpful and how to add the names
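
To make the discussion concrete, here is a small sketch of what named dimensions could look like. It assumes all_coef is a matrix with one row per treatment variable and one column per repetition; the values and the names d1, d2 and rep_1, rep_2 are made up for illustration.

```r
# Illustrative only: a 2 x 2 matrix of coefficients with named dimensions
# (rows: treatment variables, columns: repetitions of the sample splitting).
all_coef = matrix(
  c(0.50, 0.40, 0.60, 0.45),
  nrow = 2,
  dimnames = list(c("d1", "d2"), paste0("rep_", 1:2)))

# Entries can then be addressed by name instead of by position.
coef_d2_rep1 = all_coef["d2", "rep_1"]
```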

[Bug]: Tuning with default `tune_settings` fails

Describe the bug

Tuning with default tune_settings fails. Typo in

doubleml-for-r/R/double_ml.R

Line 774 in acb9d46

terminator = mlr3tunin::trm("evals", n_evals = 20),

Minimum reproducible code snippet

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(2)
ml_g = lrn("regr.ranger", num.trees = 10, max.depth = 2)
ml_m = ml_g$clone()
obj_dml_data = make_plr_CCDDHNR2018(alpha = 0.5)
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m)
par_grids = list("ml_g" = paradox::ParamSet$new(list(
    paradox::ParamInt$new("num.trees", lower = 1, upper = 10, default = 5))),
    "ml_m" = paradox::ParamSet$new(list(
    paradox::ParamInt$new("num.trees", lower = 1, upper = 10, default = 5))))

dml_plr_obj$tune(param_set=par_grids)

Expected Result

No exception

Actual Result

Exception

 Error in loadNamespace(name) : there is no package called ‘mlr3tunin’

Versions

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

> packageVersion('DoubleML')
[1] ‘0.4.1’
> packageVersion('mlr3')
[1] ‘0.11.0.9000’

Hyperlinks in R API reference are broken

If one selects a subclass (like DoubleMLPLR) and expands the inherited methods (from DoubleML) the hyperlinks don't work.

See screenshots

It links to https://docs.doubleml.org/r/DoubleML/html/DoubleML.html#method-confint. It is reachable under https://docs.doubleml.org/r/stable/reference/DoubleML.html#method-confint.
--> Could be related to the "stable" subfolder but maybe it's also a specification in pkgdown which is not yet set appropriately.

Support for Categorical D in PLIV

I'm having some trouble with the PLIV on R. It doesn't appear to support binary treatments as it doesn't let you do a classifier for ml_r. Am I doing something wrong here? Thanks

different learners for different treatments in Simultaneous Inference

Hi,
I have an idea to develop the package for simultaneous inference.

When the treatments differ in nature (continuous or binary), it is not possible to run, for example, DoubleMLPLR, because there is only one choice for the argument ml_m to estimate the related nuisance function. To elaborate,
consider two treatments d1 and d2 which are continuous and binary, respectively.
To estimate the nuisance function for causal inference on d1, we must apply a machine learning method for the Gaussian family, while for causal inference on d2 we must apply a machine learning method for logistic regression. Thus, users must define a continuous version of d2 or convert d1 to a binary treatment so that both treatments have the same nature.

However, in some cases, the program automatically detects the nature of the treatments (for example, the regr.gbm learner from the package gbm).

If the argument ml_m could be a list of the same length as d_cols, we could run DoubleMLPLR when the treatments differ in nature.

Thanks for your hot pkg!

Calculating RMSE using D and Y nuisance model residuals

Does the DoubleML package have an option to output the residuals of the nuisance models, for example for computing the RMSE of the predictions of D and Y, in order to compare different methods for estimating them? Maybe there is an existing code example somewhere that I couldn't find.

thank you

Patrick 0.1.0 will have backwards incompatible changes

Hi Double ML team!

Thanks for using patrick for creating parameterized tests.

I am going to start the process of releasing a backwards incompatible change in the package.

In the past, the undocumented test_name parameter could be used in cases data frames and as an argument for naming tests.
I am moving this to a documented argument of with_parameters_test_that(). The argument is also getting the name .test_name in order to distinguish it from test cases passed by a user.

In version 0.1.0, patrick will throw a warning about this change and rename input as appropriate. In the future, this warning will be dropped. Addressing it requires changing your use of test_name to .test_name.

Apologies for any inconvenience that this causes. Please let me know how else I can help.

Best wishes,
Michael

Pass score function for IRM

Follow-up to this #124 (comment) but for the IRM model.

I want to modify the score function for IRM to allow weights. In the example provided in the manual, I only see how to pass g_hat and m_hat for the PLM. However, IRM requires passing g0_hat and g1_hat. How do I do it?

PLM from the manual:

# Here:
# y: dependent variable
# d: treatment variable
# g_hat: predicted values from regression of Y on X's
# m_hat: predicted values from regression of D on X's
# smpls: sample split under consideration, can be ignored in this example

score_manual = function(y, d, g_hat, m_hat, smpls) {
  resid_y = y - g_hat
  resid_d = d - m_hat
  psi_a = -1 * resid_d * resid_d 
  psi_b = resid_d * resid_y 
  psis = list(psi_a = psi_a, psi_b = psi_b)
  return(psis)
}
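
For comparison, an analogous function for the IRM might look like the sketch below. It assumes that a callable score for the IRM receives g0_hat and g1_hat in place of g_hat, i.e. the signature function(y, d, g0_hat, g1_hat, m_hat, smpls); that signature is an assumption here, and the body is the usual unweighted doubly robust ATE score.

```r
# Sketch under the assumed signature function(y, d, g0_hat, g1_hat, m_hat, smpls):
# y: dependent variable
# d: binary treatment variable
# g0_hat, g1_hat: predicted outcomes under d = 0 and d = 1
# m_hat: predicted propensity scores
# smpls: sample split under consideration, ignored in this example
score_manual_irm = function(y, d, g0_hat, g1_hat, m_hat, smpls) {
  u0_hat = y - g0_hat
  u1_hat = y - g1_hat
  psi_b = g1_hat - g0_hat +
    d * u1_hat / m_hat -
    (1 - d) * u0_hat / (1 - m_hat)
  psi_a = rep(-1, length(y))
  psis = list(psi_a = psi_a, psi_b = psi_b)
  return(psis)
}
```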

Deprecation warning when applying pkgdown

When applying pkgdown to build our documentation we get a deprecation warning:

[WARNING] Deprecated: markdown_github. Use gfm instead.

Add a CITATION.cff to the Repo

See https://twitter.com/natfriedman/status/1420122675813441540 & https://github.com/citation-file-format/citation-file-format.

R CMD check Note about doi

Found the following URLs which should use \doi (with the DOI name only):
  File ‘fetch_401k.Rd’:
    https://doi.org/10.1111/ectj.12097
  File ‘fetch_bonus.Rd’:
    https://doi.org/10.1111/ectj.12097
  File ‘make_iivm_data.Rd’:
    http://dx.doi.org/10.2139/ssrn.3619201
  File ‘make_plr_CCDDHNR2018.Rd’:
    https://doi.org/10.1111/ectj.12097

Misleading entries in evaluated score functions & predictions in case of estimation without cross-fitting (`apply_cross_fitting = FALSE`)

Description

When a DoubleML model is estimated with apply_cross_fitting = FALSE and n_folds = 2, there are misleading entries in the evaluated score functions as well as in the exported predictions. For all indices in the test set, the entries are correct and are also used for estimating the causal parameter(s), etc. However, for all indices which are not part of the test set, the predictions are filled with zeros. These zero predictions are then also used later when evaluating the score functions. The resulting entries in psi, psi_a and psi_b are never used but are, in my view, still misleading. In the case at hand, I would propose to fill the predictions and evaluated score function values with NA instead of zeros and non-meaningful values, respectively.

Example

> ml_g = lrn("regr.ranger", num.trees = 10, max.depth = 2)
> ml_m = ml_g$clone()
> obj_dml_data = make_plr_CCDDHNR2018(alpha = 0.5)
> dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m,
+                               n_folds=2, apply_cross_fitting = FALSE)
> dml_plr_obj$fit(store_predictions = TRUE)
> dml_plr_obj$predictions$ml_g[1:10,,]
 [1]  0.0000000  0.5718869  0.7672342  0.6698870  0.0000000  1.5471172  1.1006015  0.0000000  0.0000000
[10] -0.2258972
> dml_plr_obj$psi[1:10]
 [1] -0.5875342  0.8229460 -0.3105735  0.6203550  0.2614734  0.7999844  1.1656477  0.3464782 -0.6397427
[10]  0.7832788
> obj_dml_data$data$y[1:10]*obj_dml_data$data$d[1:10] == dml_plr_obj$psi_b[1:10]
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

Method `set_ml_nuisance_params` overwrites hyperparameters from initialization

Setting hyperparameters via the method set_ml_nuisance_params results in a call to

lrn$param_set$values = params

which, according to the mlr3 documentation (I also tested it), results in replacing all hyperparameters by defaults except the ones in the list params ("Note that this operation replaces all previously set hyperparameter values.").

Is that the intended behavior? I would favor the following, which replaces only the explicitly mentioned ones:

lrn$param_set$values = mlr3misc::insert_named(lrn$param_set$values, params)
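
The difference between the two assignments can be illustrated with plain lists, without loading mlr3. The hyperparameter names and values below are made up, and the insert-by-name step is written in base R to mimic what insert_named does for lists.

```r
# Hyperparameters set at initialization (illustrative values).
values = list(num.trees = 100, max.depth = 2)
# Hyperparameters passed later, e.g. to set_ml_nuisance_params.
params = list(num.trees = 10)

# Plain assignment: everything not mentioned in params is dropped.
replaced = params

# Insert by name: only the explicitly mentioned entries are overwritten.
merged = values
merged[names(params)] = params
```

Here replaced no longer contains max.depth, while merged keeps max.depth = 2 and only updates num.trees.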

Rename of column "row_id" -> "row_ids"

The next mlr3 version will include a refactoring which is breaking your package.
The column "row_id" of as.data.table.Prediction() will be renamed to "row_ids" (c.f. mlr-org/mlr3#547).
It would be great if you could update your package accordingly and implement a workaround in the fashion of the following lines to ease the transition:

tab = as.data.table(prediction)
data.table::setnames(tab, old = "row_id", new = "row_ids", skip_absent = TRUE) # rename col for mlr3 <= 0.10.0

Thanks and let us know if you are missing some getters or converters.

[Bug]: Unable to perform ensemble learners

Describe the bug

Hello, I am very fascinated with this great algorithm for causal machine learning analysis.
But when I was trying to test ensemble learners in R, I faced this error indicating that the learner for ml_g and ml_m must be of Class 'LearnerRegr'.

I first got this error when trying it on Interactive IV Model.
I also tried it with the exact same code described on the User Guide website. I have posted the link to the page below.
https://docs.doubleml.org/stable/examples/R_double_ml_pipeline.html?highlight=ensemble

I tried all kinds of solutions that I could think of, but was unable to get past the error message below.
Please help.

Thanks in advance!

Error in private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = FALSE) :
Invalid learner provided for ml_g: must be of class 'LearnerRegr'

Minimum reproducible code snippet

Initiate new DoubleML object and estimate with graph learner

set.seed(123)
obj_dml_plr_sim_pipe_ensemble = DoubleMLPLR$new(dml_data_sim, ml_g = ensemble_pipe_regr, ml_m = ensemble_pipe_regr)
Error in private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = FALSE) :
Invalid learner provided for ml_g: must be of class 'LearnerRegr'
obj_dml_plr_sim_pipe_ensemble$fit()
Error: object 'obj_dml_plr_sim_pipe_ensemble' not found

Expected Result

Results of the Double ML with ensemble learner

Actual Result

Error in private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = FALSE) :
Invalid learner provided for ml_g: must be of class 'LearnerRegr'

Versions

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages:
[1] splines stats graphics grDevices utils datasets methods base

other attached packages:
[1] mlr3pipelines_0.4.0 data.table_1.14.2 mlr3learners_0.5.2 mlr3_0.13.3 DoubleML_0.4.1 sandwich_3.0-1 lmtest_0.9-39 zoo_1.8-9
[9] MASS_7.3-54 glmnet_4.1-3 Matrix_1.3-4 rpart_4.1-15 fastDummies_1.6.3 np_0.60-11 causalweight_1.0.2 ranger_0.13.1
[17] openxlsx_4.2.5 ivreg_0.6-1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.1.1 tidyr_1.1.4
[25] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1

loaded via a namespace (and not attached):
[1] paradox_0.8.0 cubature_2.0.4.4 colorspace_2.0-2 ellipsis_0.3.2 class_7.3-19 rprojroot_2.0.2 fs_1.5.2
[8] rstudioapi_0.13 proxy_0.4-26 listenv_0.8.0 remotes_2.4.2 MatrixModels_0.5-0 mlr3tuning_0.13.0 fansi_0.5.0
[15] mvtnorm_1.1-3 lubridate_1.8.0 xml2_1.3.3 codetools_0.2-18 knitr_1.37 pkgload_1.2.4 Formula_1.2-4
[22] jsonlite_1.7.2 broom_0.7.11 dbplyr_2.1.1 hdm_0.3.1 compiler_4.1.2 httr_1.4.2 backports_1.4.1
[29] assertthat_0.2.1 fastmap_1.1.0 cli_3.1.0 prettyunits_1.1.1 quantreg_5.88 htmltools_0.5.2 tools_4.1.2
[36] igraph_1.2.11 gtable_0.3.0 glue_1.6.0 clusterGeneration_1.3.7 Rcpp_1.0.7 carData_3.0-5 SuperLearner_2.0-28
[43] cellranger_1.1.0 vctrs_0.3.8 iterators_1.0.14 xfun_0.29 ps_1.6.0 globals_0.14.0 testthat_3.1.1
[50] rvest_1.0.2 lifecycle_1.0.1 future_1.24.0 scales_1.1.1 lgr_0.4.3 hms_1.1.1 parallel_4.1.2
[57] SparseM_1.81 readstata13_0.10.0 curl_4.3.2 yaml_2.2.1 gam_1.20.1 stringi_1.7.6 desc_1.4.0
[64] foreach_1.5.2 e1071_1.7-9 checkmate_2.0.0 palmerpenguins_0.1.0 pkgbuild_1.3.1 boot_1.3-28 zip_2.2.0
[71] shape_1.4.6 rlang_0.4.12 pkgconfig_2.0.3 evaluate_0.14 lattice_0.20-45 processx_3.5.2 tidyselect_1.1.1
[78] parallelly_1.31.0 magrittr_2.0.1 R6_2.5.1 generics_0.1.1 nnls_1.4 DBI_1.1.2 pillar_1.6.4
[85] haven_2.4.3 withr_2.4.3 survival_3.2-13 abind_1.4-5 future.apply_1.8.1 modelr_0.1.8 crayon_1.4.2
[92] car_3.0-12 xgboost_1.5.2.1 uuid_1.0-3 utf8_1.2.2 tzdb_0.2.0 rmarkdown_2.11 grid_4.1.2
[99] readxl_1.3.1 callr_3.7.0 mlr3misc_0.10.0 bbotk_0.5.1 reprex_2.0.1 digest_0.6.29 LARF_1.4
[106] munsell_0.5.0 quadprog_1.5-8

packageVersion('DoubleML')
[1] ‘0.4.1’
packageVersion('mlr3')
[1] ‘0.13.3’

Use active bindings in the R6 OOP implementation

For public fields, R6 active bindings (https://r6.r-lib.org/articles/Introduction.html#active-bindings) are pretty similar to a property (with getter and setter) in Python. I am considering using them in our implementation as well. Currently, we have a lot of public fields, many of which shouldn't actually be settable. You can pretty easily screw things up by setting some of the properties to an invalid value after initialization, say

> dml_plr = DoubleMLPLR$new(dml_data, ml_g, ml_m)
> dml_plr$dml_procedure = 'an_invalid_algo_name'
> dml_plr$fit()
Controls variables do not include other treatment variables
Set treatment variable d to d1.
Error in self$all_coef[private$i_treat, private$i_rep] = value : 
  number of items to replace is not a multiple of replacement length

It does not result in a meaningful error message. In Python we already heavily rely on properties with (or without) setters. So basically we can use this as a basis to move towards active bindings.
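A minimal sketch of what such an active binding could look like. `ToyDML` is a toy class written for illustration and not the DoubleML implementation; the private field name `dml_procedure_` and the error message are assumptions.

```r
library(R6)

# Toy class, not the DoubleML implementation: the value lives in a private
# field and is exposed through an active binding that validates on assignment.
ToyDML = R6Class("ToyDML",
  public = list(
    initialize = function(dml_procedure = "dml2") {
      self$dml_procedure = dml_procedure
    }
  ),
  active = list(
    dml_procedure = function(value) {
      if (missing(value)) {
        return(private$dml_procedure_)
      }
      if (!(is.character(value) && length(value) == 1 &&
        value %in% c("dml1", "dml2"))) {
        stop("dml_procedure must be 'dml1' or 'dml2'")
      }
      private$dml_procedure_ = value
    }
  ),
  private = list(
    dml_procedure_ = NULL
  )
)

obj = ToyDML$new()
obj$dml_procedure = "dml1" # a valid value is accepted
res = try(obj$dml_procedure <- "an_invalid_algo_name", silent = TRUE)
inherits(res, "try-error") # the invalid value is rejected at assignment
```

The invalid value is rejected when it is assigned rather than later inside fit(), and the previously set value stays in place. Dropping the assignment branch (always calling stop() when `value` is supplied) would make the field read-only.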

[Bug]: Tuning with default `tune_settings` fails

Describe the bug

Tuning with default tune_settings fails (even after fixing #155). If tune_settings$measure is set to NULL according to the docu (https://docs.doubleml.org/r/stable/reference/DoubleML.html#method-tune) default measures should be used.

Minimum reproducible code snippet

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(2)
ml_g = lrn("regr.ranger", num.trees = 10, max.depth = 2)
ml_m = ml_g$clone()
obj_dml_data = make_plr_CCDDHNR2018(alpha = 0.5)
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m)
par_grids = list("ml_g" = paradox::ParamSet$new(list(
    paradox::ParamInt$new("num.trees", lower = 5, upper = 6, default = 5))),
    "ml_m" = paradox::ParamSet$new(list(
    paradox::ParamInt$new("num.trees", lower = 5, upper = 6, default = 5))))
tune_settings = list(
    n_folds_tune = 5,
    rsmp_tune = mlr3::rsmp("cv", folds = 5),
    measure = NULL,
    terminator = mlr3tuning::trm("evals", n_evals = 20),
    algorithm = mlr3tuning::tnr("grid_search"),
    resolution = 5)

dml_plr_obj$tune(param_set=par_grids, tune_settings = tune_settings)

Expected Result

No exception

Actual Result

Exception

Error in private$assert_tune_settings(tune_settings) : 
  Assertion on 'tune_settings$measure' failed: Must be of type 'list', not 'NULL'.

Versions

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

> packageVersion('DoubleML')
[1] ‘0.4.1’
> packageVersion('mlr3')
[1] ‘0.11.0.9000’

Logarithmic spacing in grid search during tuning

When tuning LASSO, I didn't find a way to specify the grid with logarithmic spacing, even though it seems natural to me. The default is equal spacing.

library(DoubleML)
library(mlr3)
library(paradox)
library(mlr3tuning)

# set logger to omit messages during tuning and fitting
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

set.seed(3141)
n_obs = 500
n_vars = 100
theta = rep(3, 3)
# generate matrix-like objects and use the corresponding wrapper
X = matrix(stats::rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
y = X[, 1:3, drop = FALSE] %*% theta  + stats::rnorm(n_obs)
df = data.frame(y, X)

doubleml_data = double_ml_data_from_data_frame(df,
                                               y_col = "y",
                                               d_cols = c("X1"),
                                               x_cols = c("X2","X3"))

set.seed(1234)
ml_g = lrn("regr.glmnet")
ml_m = lrn("regr.glmnet")
doubleml_plr = DoubleMLPLR$new(doubleml_data, ml_g, ml_m)

par_grids = list(
  "ml_g" = ParamSet$new(list(
    ParamDbl$new("lambda", lower = 0.0001, upper = 10))),  # I WANT LOGARITHMIC SPACING HERE, eg. 1e-5, 1e-4, 1e-3, etc
  "ml_m" =  ParamSet$new(list(
    ParamDbl$new("lambda", lower = 0.05, upper = 0.1))))

tune_settings = list(terminator = trm("evals", n_evals = 100),
                     algorithm = tnr("grid_search", resolution = 11),
                     rsmp_tune = rsmp("cv", folds = 5),
                     measure = list("ml_g" = msr("regr.mse"),
                                    "ml_m" = msr("regr.mse")))

doubleml_plr$tune(param_set = par_grids, tune_settings = tune_settings)

doubleml_plr$tuning_res

# BUT THE SPACING ON THE GRID IS LINEAR
doubleml_plr$tuning_res$X1$ml_g[[1]]$tuning_result[[1]]$tuning_archive %>% arrange(lambda)
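For reference, the grid being asked for can be written down in base R. This sketch only constructs the values, using the bounds and resolution from the ml_g grid above; it does not show how to pass them to tune().

```r
# Bounds and resolution taken from the ml_g grid above
lower = 1e-4
upper = 10
resolution = 11

# Equally spaced on the log10 scale, then transformed back
lambda_grid = 10^seq(log10(lower), log10(upper), length.out = resolution)
print(signif(lambda_grid, 3))

# Linear spacing for comparison (what the grid search above produces)
linear_grid = seq(lower, upper, length.out = resolution)
print(signif(linear_grid, 3))
```

One possible route is to tune the exponent on a linear scale and map it back with a transformation. paradox parameter sets offer a trafo mechanism for this, but whether tune() passes it through has not been checked here.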

mlr3tuning API change

We will upload a new version of mlr3tuning to CRAN.
This line will no longer work.

doubleml-for-r/R/helper.R

Line 132 in 9b30b42

tuning_archive = tuning_instance$archive$data()

You have to use tuning_instance$archive$data instead, since the data table is now accessible via a public field rather than via a function.
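If both mlr3tuning versions need to be supported during the transition, a small accessor could cover both cases. This is a sketch: `get_archive_data` is a hypothetical helper, and the archives below are mock lists rather than real tuning instances.

```r
# Hypothetical helper: return the archive table regardless of whether
# `data` is a method (older mlr3tuning) or a public field (newer).
get_archive_data = function(archive) {
  if (is.function(archive$data)) archive$data() else archive$data
}

# Mock archives standing in for tuning_instance$archive
tab = data.frame(lambda = c(0.1, 0.2), regr.mse = c(1.3, 1.1))
old_style = list(data = function() tab)
new_style = list(data = tab)

identical(get_archive_data(old_style), get_archive_data(new_style))
```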

`helper*` files are no longer recommended by `testthat`

We use a lot of helper* files in our unit tests. However, helper* files are no longer recommended by testthat, so we may want to check whether setup* is the more appropriate construction.

Support for ensemble multiple learners for ml_g and ml_m

Thanks for developing this great package.

I was wondering whether you support estimating E[Y|X] or E[D|X] with super learners, i.e. can we use multiple learners to cross-fit E[Y|X] and E[D|X] as below, with the weights of each learner estimated from their cross-fitting performance? Alternatively, how could the double ML framework work together with the SuperLearner package?

learner = lrns(c("regr.glm","regr.gam","regr.bart"), k=2)
ml_g = learner$clone()

Many thanks!!!

[Bug]: Tuning fails with non-meaningful error message when `tune_settings$measure` is a proper subset of the nuisance parts

Describe the bug

The tune method allows specifying a nuisance-specific measure. If the list provided contains a name that is not a nuisance part, a meaningful error message is produced, i.e., tune_settings[['measure']] = list(ml_m = "regr.mae", ml_wrong_name = "regr.rmse") results in something like:

Error in private$assert_tune_settings(tune_settings) : 
  Invalid name of measure ml_m, ml_r 
 measure must be a named list with elements named ml_g, ml_m

However, if the list of measures is a proper subset of the nuisance parts (e.g. tune_settings[['measure']] = list(ml_m = "regr.mae")), it fails with a non-meaningful error message:

Error in default_measures(task_type)[[1L]] : subscript out of bounds

Minimum reproducible code snippet

library(DoubleML)
library(mlr3)
library(mlr3learners)
library(data.table)
set.seed(2)
ml_g = lrn("regr.ranger", num.trees = 10, max.depth = 2)
ml_m = ml_g$clone()
obj_dml_data = make_plr_CCDDHNR2018(alpha = 0.5)
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m)
par_grids = list("ml_g" = paradox::ParamSet$new(list(
    paradox::ParamInt$new("num.trees", lower = 5, upper = 6, default = 5))),
    "ml_m" = paradox::ParamSet$new(list(
    paradox::ParamInt$new("num.trees", lower = 5, upper = 6, default = 5))))
default_tune_settings = list(
    n_folds_tune = 5,
    rsmp_tune = mlr3::rsmp("cv", folds = 5),
    measure = NULL,
    terminator = mlr3tuning::trm("evals", n_evals = 20),
    algorithm = mlr3tuning::tnr("grid_search"),
    resolution = 5)

tune_settings = default_tune_settings
tune_settings[['measure']] = list(ml_m = "regr.mae")

dml_plr_obj$tune(param_set=par_grids, tune_settings = tune_settings)

Expected Result

I would expect that the ML methods are tuned successfully. For all nuisance parts where a measure was actively set, I expect it to be used and for all other nuisance parts I would expect that it falls back to the default measure. I expect this behavior, because otherwise it wouldn't make sense to check for subset here

doubleml-for-r/R/double_ml.R

Lines 1316 to 1323 in acb9d46

    
if (!test_names(names(tune_settings$measure),
  subset.of = valid_learner)) {
  stop(paste(
    "Invalid name of measure", paste0(names(tune_settings$measure),
      collapse = ", "),
    "\n measure must be a named list with elements named",
    paste0(valid_learner, collapse = ", ")))
}

Alternative expected behavior: We could instead enforce that a measure is set either for every nuisance part or for none (in which case default measures are used for every nuisance part). If we go for this alternative, we should check for exactly matching list keys instead of checking for a subset. This would then produce a meaningful error message. However, I prefer the solution described above, where we fall back to the default measure for every nuisance part where no measure was actively set (the implementation of this selective fallback would be easy).
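The selective fallback could look roughly like the following sketch. `fill_measures` is a hypothetical helper, and the fixed default "regr.mse" is a stand-in: in the package, the default would depend on the task type of each nuisance part.

```r
# Hypothetical helper: keep user-supplied measures and fall back to a
# default for every nuisance part that has none.
fill_measures = function(measure, valid_learner, default = "regr.mse") {
  if (is.null(measure)) measure = list()
  unknown = setdiff(names(measure), valid_learner)
  if (length(unknown) > 0) {
    stop("Invalid name of measure ", paste0(unknown, collapse = ", "),
      "\n measure must be a named list with elements named ",
      paste0(valid_learner, collapse = ", "))
  }
  out = lapply(valid_learner, function(nm) {
    if (is.null(measure[[nm]])) default else measure[[nm]]
  })
  names(out) = valid_learner
  out
}

# Proper subset: ml_g falls back to the default, ml_m keeps "regr.mae"
fill_measures(list(ml_m = "regr.mae"), c("ml_g", "ml_m"))
# NULL: defaults for every nuisance part
fill_measures(NULL, c("ml_g", "ml_m"))
```

The same helper also covers the case `measure = NULL`, so both situations would go through one code path.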

Actual Result

Error in default_measures(task_type)[[1L]] : subscript out of bounds

Versions

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

> packageVersion('DoubleML')
[1] ‘0.4.1’
> packageVersion('mlr3')
[1] ‘0.11.0.9000’

Score function with weighting

The pdf documentation (link) suggests that a user can specify the "score" parameter (at least for PLM estimator) in the call of DoubleMLPLR$new(), "for example, to adjust the DML estimators in terms of a re-weighting".

This is exactly my situation. How can I pass the weights?

The template for the score function in the example doesn't have a weight parameter. I want to do something like this (the only change I made to the code is the added W):

# Here:
# y: dependent variable
# d: treatment variable
# g_hat: predicted values from regression of Y on X's
# m_hat: predicted values from regression of D on X's
# smpls: sample split under consideration, can be ignored in this example

score_manual = function(y, d, g_hat, m_hat, smpls) {
  resid_y = y - g_hat
  resid_d = d - m_hat
  psi_a = -1 * resid_d * resid_d * W   # HERE
  psi_b = resid_d * resid_y * W      # and HERE
  psis = list(psi_a = psi_a, psi_b = psi_b)
  return(psis)
}
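Since the score function's signature is fixed, one way to get weights into it is a closure. This is a sketch: `make_weighted_score` is a hypothetical factory, and it assumes that the arguments arrive as vectors with one entry per observation, in the same order as the weights.

```r
# Hypothetical factory: returns a score function with the signature used
# above, with the weights w captured in the enclosing environment.
make_weighted_score = function(w) {
  function(y, d, g_hat, m_hat, smpls) {
    stopifnot(length(w) == length(y)) # weights must align with observations
    resid_y = y - g_hat
    resid_d = d - m_hat
    psi_a = -1 * resid_d * resid_d * w
    psi_b = resid_d * resid_y * w
    list(psi_a = psi_a, psi_b = psi_b)
  }
}

# Toy check with made-up numbers
score_w = make_weighted_score(w = c(1, 2, 0.5))
psis = score_w(y = c(1, 2, 3), d = c(0, 1, 1),
  g_hat = c(0.5, 1.5, 2.5), m_hat = c(0.2, 0.6, 0.8), smpls = NULL)
```

The resulting `score_w` could then be supplied as the score argument in place of `score_manual`. Whether the ordering assumption holds should be verified before relying on the estimates.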

Refactor / rewrite the helper functions extract_prediction & extract_prediction_list

Functions are quite complex for the functionality that they should implement
Code duplication due to dependency to mlr3 version
Use data.table functionalities in a more readable (maybe also more efficient) way

Categorical D

Hi,
thanks for developing this!
this might be a silly question, but would it be possible for D to be categorical?
Best,
Hans

DoubleMLData: Add checks for the intersections of y_col, d_cols, x_cols, z_cols

Issue DoubleML/doubleml-for-py#84 is also relevant for R: checks for indices
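Such a check could be sketched in base R as follows. `check_disjoint_cols` is a hypothetical helper and the error wording is illustrative; the argument names follow the title of this issue.

```r
# Hypothetical helper: stop at the first pair of roles that shares a column.
check_disjoint_cols = function(y_col, d_cols, x_cols, z_cols = NULL) {
  roles = list(y_col = y_col, d_cols = d_cols, x_cols = x_cols,
    z_cols = z_cols)
  pairs = utils::combn(names(roles), 2, simplify = FALSE)
  for (p in pairs) {
    shared = intersect(roles[[p[1]]], roles[[p[2]]])
    if (length(shared) > 0) {
      stop("Column(s) ", paste0(shared, collapse = ", "),
        " used in both ", p[1], " and ", p[2])
    }
  }
  invisible(TRUE)
}

check_disjoint_cols("y", "X1", c("X2", "X3")) # passes silently
overlap = try(check_disjoint_cols("y", "X1", c("X1", "X2")), silent = TRUE)
inherits(overlap, "try-error") # X1 is both a treatment and a control
```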

Bootstrap algorithm

After estimation of DoubleML models, we apply a multiplier bootstrap algorithm to obtain valid simultaneous inference (see also the user guide https://docs.doubleml.org/stable/guide/sim_inf.html or https://arxiv.org/abs/2103.09603). The implementation so far is not aligned with that in the case of dml_procedure='dml1' and needs to be slightly adapted.

[API Documentation]: Include documentation for `store_models` option in `DoubleML$fit()`

Describe the issue related to the API documentation

In the current version of the R package documentation, we do not include a description of the store_models option when calling DoubleML$fit() (see #169). Apparently, the reason for this is that the documentation was not rendered in PR #169.

Suggested alternative or fix

Run devtools::document() and upload updated documentation

	self$se = sqrt(apply(
	self$all_se^2 + (self$all_coef - self$coef)^2, 1,
	function(x) median(x, na.rm = TRUE)))

	private$task_type = list(
	"ml_g" = NULL,
	"ml_m" = NULL)
	ml_g = private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = FALSE)
	ml_m = private$assert_learner(ml_m, "ml_m", Regr = TRUE, Classif = TRUE)

	private$task_type = list(
	"ml_g" = NULL,
	"ml_m" = NULL,
	"ml_r" = NULL)
	ml_g = private$assert_learner(ml_g, "ml_g", Regr = TRUE, Classif = TRUE)
	ml_m = private$assert_learner(ml_m, "ml_m", Regr = FALSE, Classif = TRUE)
	ml_r = private$assert_learner(ml_r, "ml_r", Regr = FALSE, Classif = TRUE)

	ml_g = private$assert_learner(ml_g, "ml_g",
	Regr = TRUE,
	Classif = FALSE)
	ml_m = private$assert_learner(ml_m, "ml_m",
	Regr = TRUE,
	Classif = FALSE)
	ml_r = private$assert_learner(ml_r, "ml_r",
	Regr = TRUE,
	Classif = FALSE)

	if (task_type == "regr") {
	resp_name = "response"
	} else if (task_type == "classif") {
	resp_name = "prob.1"
	}

