masurp / specr Goto Github PK

View Code? Open in Web Editor NEW

65.0 4.0 6.0 30.9 MB

Conducting and Visualizing Specification Curve Analyses

Home Page: https://masurp.github.io/specr

License: GNU General Public License v3.0

R 100.00%

r rstats specification-curve multiverse

specr's People

Contributors

Stargazers

Watchers

Forkers

dcosme apoorvalal cbjrobertson tomstafford clarenceke cretecht

specr's Issues

Customizable output statistics

Still room for improvement in this regard.
At the moment, we use tidyr::broom to summarize the model statistics, which is a good starting point. However, it would be great if this would be the standard procedure and run_specs() would further allow to customize the output.

run_specs (and run_spec) respectively should allow for customized model functions

Currently, run_specs only uses glm()to estimate the relevant statistics. It should be possible to run any type of model by providing a "model function", e.g.;

mymodel <- function(df, formula){
glm(formula, data = df, family = "binomial")
}

to run_specs() or even several different model functions.

Calculate models for all combinations of covariates

Hi there,

I am trying to run specifications for all possible combinations of covariates. Currently I only get specifications for each individual covariate and then an additional one that has all covariates. I would like to look at all possible subsets as well.

As a workaround, I have generated all combinations and pasted them into model formula strings using helper code I wrote myself (e.g.: 'cov1 + cov2 + cov3', 'cov1 + cov2', 'cov1 + cov2', etc.). When passing this to the covariates arguments it seems to calculate the respective models. However, there are two issues: 1) It crashes once I use a high number of model specifications (e.g. >4000). 2) In the plot it doesn't correctly show which covariates were used. Instead, it always just colors the "all_covariates" option.

e.g.

covariates <- c("airport_dist", "conservative", 'male', 'age', 'popdens',
                'manufact', 'tourism', 'academics', 'medinc', 'healthcare',
                'y_centroid_county', 'evanrate')


spec_results <- run_specs(df = df_us_slope_prev, 
                     y = c("onset_prev"), 
                     x = c("pers_e"), 
                     model = c("lm"), 
                     controls = covariates)

dim(spec_results)

gives me dimensions of 14x12 but it shoould be 4095x12 if all combinations were considered.

Guidance would be greatly appreciated!

Don't get "no covariates" in fixed effects regression

I'm using fixest::feols to run a fixed effects regression, but the "no covariates" row doesn't show up.

library(dplyr)
library(fixest)
library(specr)

test_formula <- function(formula,data) {
  formula <-as.formula(paste0(formula, "|group1")) # fixed effects for group1
  feols(formula,data)
}
results <- run_specs(
  df = example_data,
  y = c("y1", "y2"),
  x = c("x1", "x2"),
  model = 'test_formula',
  controls = c("c1", "c2")
)

plot_specs(results)

I notice that it works if the regression formula includes a variable before the fixed effects section:

test_formula2 <- function(formula,data) {
  formula <-as.formula(paste0(formula, "+ c3 |group1")) # control for c3 in all specifications
  feols(formula,data)
}
results2 <- run_specs(
  df = dplyr::mutate(example_data, c3 = runif(dplyr::n())),
  y = c("y1", "y2"),
  x = c("x1", "x2"),
  model = 'test_formula2',
  controls = c("c1", "c2")
)

plot_specs(results2)

run_specs following setup_specs

For my analysis, there's a certain control variable which I want to be included in all specifications.
This is very easy to do after the fact by filtering for only rows where it appears:

 specr_results %>% 
  filter(str_detect(string = controls, pattern = fixed("important_control_var")))

When specr_results is small, this is ok. However, when specr_results is large, making it actually estimate the models without this crucial control variable is redundant. I thought to use setup_specs to setup my estimation tibble, filter that and then run_specs on that, thus reducing computation time immensely.

However, it seems that setup_specs is for illustration purposes only(?). There's no way that I see to pass on a setup_specs tibble to run_specs.

If I'm right, I suggest that you add an option to run_specs to get a setup_specs object as input OR to get x,y, controls etc.

P.S

Thanks for a great package!

Problem running PGLM function, even though PLM works fine

Hi!

I work with panel data and have therefore specified a few different custom functions using the PLM package. I input these as the models in run_specs() and it works just fine.

Example:

within.plm.func <- function(formula, data) {
  plm(formula = formula, 
      data = data,
      model = "within")
}

I am, however, interested in running a negative binomial panel data analysis, and for that I need the PGLM package where I can specify the family = negbin, also.

Example: within.pglm.func <- function(formula, data) {
  pglm(formula = formula, 
      data = data,
      model = "within",
      family = negbin)
}

When I use this function in run_specs() I get the following error:

Error: Problem with `mutate()` input `coefs`.
x No tidy method recognized for this list.
i Input `coefs` is `map(.data$res, broom::tidy, conf.int = TRUE, conf.level = conf.level)`.

Run `rlang::last_error()` to see where the error occurred.
In addition: Der var 50 eller flere advarsler (brug warnings() for at se den første 50).

I have tried to install the development package, but the error persists.
If you know what causes it or how to fix it, I would greatly appreciate any input.

Best regards,
Toric

Singleton control set produces duplicates

When the control set is a singleton, run_specs() outputs duplicates.

library(dplyr)

results <- run_specs(
  df = example_data,
  y = c("y1"),
  x = c("x1"),
  model = c("lm"),
  controls = c("c1")
  # controls = c("c1","c2")
)
plot_specs(results,choices=c("x","y","controls"))
distinct(results,across(x:subsets)) # grab distinct rows

My guess is that it's doing the combination category (ie, c1+c1), which is the same as c1 by itself.

Combinations of subsets

If you define multiple subset variables, you should get separate specs for (a) none, (b) individual subsets, and (c) all combinations of subsets.

Error in run_spec with input res

Hi,

Thanks for the wounderful specr package - I really love the plot and its compatibility with custom model functions so I am very much looking forward to using specr in my forthcoming papers!

However, I am currently running into problems when specifying my custom model function. When trying to run a specification analysis with an underlying spatial econometrics model from the splm packe, the run_spec() function returns the following error message:

Error (Test.R#216): Problem with 'mutate()' input 'res'.
x $ operator is invalid for atomic vectors
i Input 'res' is 'map2(.data$model, formula, ~do.call(.x, list(data = df, formula = .y)))'.

Here a reproducible example:

library(plm)
library(splm)
library(specr)
data(Produc, package = "plm")
data(usaww, package = "splm")

spml_specr <- function (formula, data,...){
  spml(formula = formula, data = data, listw=mat2listw(usaww),
             effect = "twoways", model = "within")
}

results <- run_specs(df=Produc,
                          y=c("gsp", "log(gsp)"),
                          x= c("pcap","log(pcap)"),
                          model = "spml_specr",
                          controls = c("log(pc)","log(emp)","unemp"))

In my debugging attempts, I have already specifyied a costum tidy function and I also re-installed the specr package with all dependencies. However, the error persists. I really would like to adjust my costum model function spml_specr() to connect spml() with specr. However, I just can't figure out the reason for the error message. The run_specr() function is impressive, but I am not so familiar with pipe operators and the map() functions to be able to trace the origin of the error.

Can anybody please give me a hint? I would appreciate it very much. Thanks a lot!

How to customise the colours in specr plots

Thank you once again for building this great package. I have read the Alternative way to visualize specification results section, but I want to know how to change the colours in type (i.e. curve, results, and samplesizes) as part of the 'standard' output plot.

For curve and results, I did + scale_color_manual(values = c("#HEXCODE")) which worked great. However, for samplesize, although it looks like a geom_bar, I am not sure why scale_fill_manual (or equally scale_color_manual) isn't working? It just returns the default grey colour.

Added support for `lmer` and `glmer` mixed model estimation

Currently, run_specs does not appear to support specification of models with random effects (i.e. glmer or lmer models). I note that since #3, run_specs supports customisable glms, but would love to see a random effects parameter added to run_specs, so e.g. the following usage would be possible:

my_glmer <- function(formula,data){
    glmer(formula, data = data, family = binomial)
}

results <- run_specs(df = example_data,
                      y = c("y1", "y2"),
                      x = c("x1", "x2"),
                      model = c("lm"),
                      controls = c("c1", "c2"),
                      random_groups = c("group1"),
                      random_variance_components = c("1","x1","c1"),
                      subsets = list(group1 = unique(example_data$group1),
                                    group2 = unique(example_data$group2)))

where random_groups specified the level 2 grouping variables, and random_variance_components specified whether to specify randomly varying intercepts only, or slopes also for listed variables.

Update plotting functions

Create functions that allow to plot each part of the overall plot individually (including full functionality of ggplot2)
Wrapper functions that combines plots (e.g., using cowplot) and allows to further customize the appearance.

Error when using `subsets` argument in `run_specs`

When running the example:

results<- run_specs(df = example_data,
y = c("y1", "y2"),
x = c("x1", "x2"),
model = c("lm", "lm_gauss"),
controls = c("c1", "c2"),
subsets = list(group1 = unique(example_data$group1),
group2 = unique(example_data$group2)))

I get the following error:

Error in get(as.character(..2)) : object '1' not found

This occurs with my own code and with the example in the vignette.

Models are not found when they are elements of a list.

I have a model called PQP

This code:

run_specs(df = db$db1,
          y = c("y"),
          x = c("x_000"),
          model = "PQP"
          ) -> a

runs with no errors.

Now I enlist PQP in the list models.
This code:

run_specs(df = db$db1,
          y = c("y"),
          x = c("x_000"),
          model = "models$PQP"
          ) -> a

gives back this error:

Error in `dplyr::mutate()`:
! Problem while computing `res = map2(.data$model, formula,
  ~do.call(.x, list(data = df, formula = .y)))`.
Caused by error in `models$PQP`:
! could not find function "models$PQP"

UPDATE:

TimTeaFan provided a debug:

get_model <- function(x) {
  x_str <- str2lang(x)
  if (is.name(x_str)) {
    return(x)
    } else if (is.call(x_str)) {
    eval(x_str)
  }
}

Then specr:::run_spec is corrected like this:

specs %>%
  dplyr::mutate(formula = pmap(.,
                               specr:::create_formula)
  ) %>% tidyr::unnest(formula) %>% 
  dplyr::mutate(res = map2(model,
                           formula,
                           ~ do.call(get_model(.x), list(data = df, formula = .y))))

specs using weighted survey data

Is there a way to specify the inclusion of weights in the specifications?

I'm replying on cross-national survey data (European Social Survey) where combined post-stratifcation and country population weight is required to produce representative results.

Bug: `plot_specs()` plot is uninterpretable if there are many covariates.

Thank you for your work on this package and sharing it with everyone.

I'm running in to an issue where I have a long list of control variables, which causes the output of plot_specs() to be cut off and uninterpretable. See reprex below, but imagine that there at least 20 more covariates after the 3 listed. Would it be possible to rename this long label for controls as simply "all covariates", to correspond with the label "no covariates"?

library(tidyverse)
library(specr)

results_test <- 
  run_specs(iris,
            x = c("Sepal.Length"),
            y = c("Sepal.Width"),
            controls = c("Petal.Length", "Petal.Width", "Species"),
            model = "lm")

plot_specs(results_test)

^{Created on 2022-08-15 by the reprex package (v2.0.1)}

Request: Perform joint test across specification curves

Hi there,

some authors suggest to run an additional inference statistical test on the distribution of effect sizes and test statistics across specification curves. E.g. mean/median effect size different from zero, share of significant results higher than what would be expected under H0, average test statistic different from what would be expected under H0.

Source: Simonsohn, Uri, Joseph P. Simmons, and Leif D. Nelson. “Specification Curve Analysis.” Nature Human Behaviour 4, no. 11 (November 2020): 1208–14. https://doi.org/10.1038/s41562-020-0912-z.

What is your opinion on this? Would it make sense to implement (some of) these tests in specr? Can you suggest any current workarounds? Your response would be greatly appreciated.

Best wishes,
Heinrich

Parallelization of the functions using `furrr`

run_specs()currently uses only one core. In the long run, we should aim at making it as fast as possible (given the large numbers of specifications that one usually wants to estimate).

Request: allow weighting and clustering

I want to use weights and clustering using fixest::feols, but this uses a comma in the formula, which seems to break specr.

library(dplyr)
library(fixest)

example_data <- example_data %>%
  mutate(weight = runif(n()))

feols(y1~x1+x2 | group2, weights=~weight, data=example_data)
feols(y1~x1+x2 | group2, cluster=~group1, data=example_data)
feols(y1~x1+x2 | group2, cluster=~group1, weights=~weight, data=example_data) # these work

fe <- function(formula,data){
  formula <- as.formula(paste0(formula, "|group2"))
  feols(formula,data)
}

cluster_weight <- function(formula,data) {
  formula <- as.formula(paste0(formula, "|group2,cluster=~group1,weights=~weight"))
  feols(formula,data)
}

results_fe <- run_specs(df = example_data, 
                     y = c("y1", "y2"), 
                     x = c("x1", "x2"), 
                     model = c("fe"),
                     controls = c("c1", "c2"))
plot_specs(results_fe) # this works

results_cw <- run_specs(df = example_data, 
                     y = c("y1", "y2"), 
                     x = c("x1", "x2"), 
                     model = c("cluster_weight"),
                     controls = c("c1", "c2"))

Running the clustering/weighting model, I get this error:

np Package errors

Hello,

I'm running specr with the np package and encountering an issue.
Here's the code

kde.reg <- function(formula, data){
  bw <- npregbw(formula=formula,
                data = data,
                nmulti = 5,
                ckertype = "epanechnikov", ukertype = "liracine",
                regtype="lc", bwmethod="cv.aic")
  npreg(bw)
}

test_specs <- setup(data = test_df,
                    y = c("y1", "y2"),
                    x = c("x1", "x2", "x3"),
                    model = "kde.reg"
                    )

summary(test_specs, rows = 50)
test_results <- specr(test_specs)

I've attached a snippet of the code error below for context. I'm familiar with the error message and have solved it before by selecting the data for the explanatory variables. However, in specr, selecting inside of the model function causes errors of its own. Would this mean that specr isn't compatible with np or is there a way around this issue?

UPDATE: I've narrowed down the issue. In spec.R, at line 182, when it calls kde.reg the formula it passed in (formula = structure("y1 ~ x2 + 1", class = c("glue", "character"))) seems to be causing an error. I recreated the error by calling the npregbw with this formula format. When I replaced it with y1 ~ x2 + 1 as a standalone formula, the error was resolved. Even when including the structure(list...)) argument that spec.R creates. Although, I'm still unclear on how to resolve this issue.

UPDATE AGAIN: I was able to resolve the error by defining formula like formula=as.formula(formula), since specr defines it as a character when passed into the function. However, I'm running into a tidy error now. Working with tidy on that, but won't delete this for now.

Interactions question

Hello,

Incredible tool -- thank you very much.

A question about interactions. I am having trouble with specifying an interaction of a control with an x variable. If I include the "x_var:control1" in the controls specification, it removes either that specific x_var or the specific controls (even the non-interactive "control1"). It looks like I can include it in the add to formula, but then it includes it across all specifications. Is there some way to specify this in the specs()?

Thank you. If this is not clear, I can make try to make a reproducible example.

Progress bars

The run_specs() function should include some sort of progress visualization. Otherwise, one does not know whether the function is working or not.... ;)

Show choices together with options on the left?

Thanks for this great package! I understand that the charts are based on ggplot2 defaults - but would it be possible to shift the choices to be displayed together with their levels on the left? Given how closely they belong together, that would seem to make it much more readable, as in the example below (unfortunately created with base R plots, code here?

How does specr handle dummy coded categorical variables

This is probably a dumb question but how does specr handle dummy coded variable for regression analyses. For example, let's say I dummy coded ethnicity into several new variables and the biggest group as the reference category (coded as 0). How does specr treat this when treating it as a covariate.

df <- df %>%
mutate(
ethnicity_white_european_heritage = ifelse(ethnicg == 1, 1, 0),
ethnicity_black_caribbean_heritage = ifelse(ethnicg == 2, 1, 0),
ethnicity_black_african_heritage = ifelse(ethnicg == 3, 1, 0),
ethnicity_any_other_ethnic_minority = ifelse(ethnicg == 4, 1, 0),
ethnicity_indian = ifelse(ethnicg == 5, 1, 0),
ethnicity_pakistani = ifelse(ethnicg == 6, 1, 0),
ethnicity_bangladeshi = ifelse(ethnicg == 7, 1, 0),
ethnicity_mixed_race = ifelse(ethnicg == 8, 1, 0)
)

There's no scenario when you would only consider 1/2/3/4 of the n number of new ethnicity variables on their own as covariates. All the dummy coded variables would have to be input together right?

Problem treating x´s as pairs

Hi!

I am interested in running a specification curve based on a multilevel model in which I use the within centered x variable as well as the cluster mean. Therefore I would like for specr to treat my two variables: within_centered and cluster_mean as x´s that are always present together.

Looking at the response to issue 11 I figured that the below code might do the trick.

results <- run_specs(df = centered,
                      y = c("y"),
                      x = c("within_centered + cluster_mean"), 
                      model = c("lmer_ri_1"),
                      controls = c(var1, var2, var3),
                      keep.results = T)

The results of this code is however an empty dataframe.

Do you know why? And is it possible to do what I am intending?

Best regards,
Toric

Request: order by specification

speccurve for Stata allows you to order the plot by specification, instead of coefficient size:

This would be a nice feature to have.

Error message in Specr

Hi there,

When I try to run through the analysis in the specr package using the example_data or my own data, everything is fine until I try the run_specs command, in which I get the following error message:

> results <- run_specs(df = example_data, 
+                      y = c("y1", "y2"), 
+                      x = c("x1", "x2"), 
+                      model = c("lm", "lm_gauss"), 
+                      controls = c("c1", "c2"), 
+                      subsets = list(group1 = unique(example_data$group1),
+                                     group2 = unique(example_data$group2)))
Error: Problem with `mutate()` input `..1`.
✖ Column `obs` not found in `.data`
ℹ Input `..1` is `.data$obs`.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/dplyr_error>
Problem with `mutate()` input `..1`.
✖ Column `obs` not found in `.data`
ℹ Input `..1` is `.data$obs`.
Backtrace:
  1. specr::run_specs(...)
 33. dplyr:::stop_error_data_pronoun_not_found(...)
 34. dplyr:::stop_dplyr(index, dots, fn, problem = msg, .dot_data = TRUE)
>

If you have any idea why this error is coming up or how to solve the issue that would be greatly appreciated. Thanks!

Erin

Feature request: sets of control variables

When I'm considering which control variables to add, I'm often considering one set or another, not one variable at a time. Would you consider a syntax that supports sets of controls?

Another approach would be to consider every possible subset of controls, but the number of possible combinations increases pretty fast.

Here's an example:

library(specr)

# Currently, this code considers models with c1, c2, and c3 individually,
# no controls, and all three together.
results <- run_specs(
  df = dplyr::mutate(example_data, c3=runif(dplyr::n())), 
  y = c("y1", "y2"), 
  x = c("x1", "x2"), 
  model = c("lm"), 
  controls = c("c1", "c2", "c3")
)
dplyr::distinct(results, controls)
#> # A tibble: 5 x 1
#>   controls     
#>   <chr>        
#> 1 c1 + c2 + c3 
#> 2 c1           
#> 3 c2           
#> 4 c3           
#> 5 no covariates

# But what if these models with only one control variable are
# not in my set of "all plausible specifications"?
# Maybe I would consider c1 + c2 or c1 + c3.
results2 <- run_specs(
  df = dplyr::mutate(example_data, c3=runif(dplyr::n())), 
  y = c("y1", "y2"), 
  x = c("x1", "x2"), 
  model = c("lm"), 
  controls = list(c("c1", "c2"), c("c1", "c3"))
)
#> Error in model.frame.default(formula = "y1 ~ x1 + c(\"c1\", \"c2\") + c(\"c1\", \"c3\")", : variable lengths differ (found for 'c("c1", "c2")')

Variance decomposition

As an alternative visualization/analysis, provide variance components for the specs and the variability of a predefined parameter.

Run a list of predefined specifications

Hi there,

I am trying to run a specification curve analysis on a predefined list of specifications that I generate using my own code. This list is a subset of the combinations of a very large set of variables. Theoretically, I could run analyses for my list of specifications by running all combinations and then subsetting the output to include only the relevant specifications, but I think that this is computationally prohibitive. Ideally, I would like to use a dataset of the same format as the output of the setup_specs() function as input for the run_specs() function. This would provide more control with respect to the list of specifications that are run. Is this currently possible or should I fork and modify the code?

I am thinking about something like this:

df_specs <- my_own_setup_specs(...)
run_specs(df_specs)

Guidance would be much appreciated!

Categorical independet variable

Not sure if this is truly an issue or just a limitation. I can't seem to run a specr analysis with independent variables that are categorical. Tried using as.factor and as.character, none works. I tried two different datasets to be sure. Is it a limitation or am I missing something?

Testing the effect of an interaction term

Hello, love the package so far - thanks for your work on it!

I want to add an interaction term to my specification curve analysis. How do I do this?

My current set-up is as follows.

setup(data = dat,
           x = c("lonely_mostly_or_not", "hinctnta"),
           y = c("mean_p_eff_cgg_dpr_z", "mean_p_eff_dpr_z"),
           model = c("lmer"),
           controls = c("agea", "icgndra", "jbexpnt", "health_problems"),
           add_to_formula = "(1 | nuts1)")

I want to test the interaction between lonely_mostly_or_not and hinctnta. I have tried specifying it as both an x variable (lonely_mostly_or_not * hnctnta) and I've tried to add it to the add_to_formula argument (lonely_mostly_or_not * hnctnta + (1 | nuts1)). Neither seemed to work, and I didn't see anything about this in the documentation (but I am probably missing something obvious!).

Thanks for your help!

masurp / specr Goto Github PK

specr's People

Contributors

Stargazers

Watchers

Forkers

specr's Issues

Recommend Projects

Recommend Topics

Recommend Org