tidymodels / infer Goto Github PK

View Code? Open in Web Editor NEW

718.0 718.0 80.0 177.64 MB

An R package for tidyverse-friendly statistical inference

Home Page: https://infer.tidymodels.org

License: Other

R 98.30% TeX 1.70%

infer's Introduction

tidymodels

Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
dials has tools to create and manage values of tuning parameters.
dplyr contains a grammar for data manipulation.
ggplot2 implements a grammar of graphics.
infer is a modern approach to statistical inference.
parsnip is a tidy, unified interface to creating models.
purrr is a functional programming toolkit.
recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other help tools.
rsample has infrastructure for resampling data so that models can be assessed and empirically validated.
tibble has a modern re-imagining of the data frame.
tune contains the functions to optimize model hyper-parameters.
workflows has methods to combine pre-processing steps and models into a single object.
yardstick contains tools for evaluating models (e.g. accuracy, RMSE, etc.).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
Most issues will likely belong on the GitHub repo of an individual package. If you think you have encountered a bug with the tidymodels metapackage itself, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.

infer's People

Contributors

Stargazers

Watchers

Forkers

beanumber nicksolomon mason-datamaterials laderast maryannet mine-cetinkaya-rundel ismayc thiyangt zoudj nfultz hongzhonglu guhjy ktaranov frankfarach jkaupp echasnovski perlatex topepo jrnold gridl blacng corinne-riddell jimtyhurst pirategrunt libardo1 naren1991 sherifsameh jefflieng statunizaga cattystats chengjingfeng richierocks gravitytrope biov jayleetx jorgebs94 kennethban dorgol romainfrancois wjhopper milan-chicago torockel parkvillegeek standardgalactic rpruim jbourak kmdupr33 lanetk laszlokiraly anhnguyendepocen brendanhcullen arrendi xiaochi-liu jimsforks nusmed-health-informatics mgaldame owain-s pawanramamali lukerobert minireference gcabebe alanshore15 nickeopti petzi53 liubao210 imarcello munozkz bnolan7 mustafa77 cram070 nalanda carlganz nusmed-health-informatics wmorgan485 ggpinto drhmoosavi

infer's Issues

Print methods

I'd like people to be able to break the inferential pipeline at anypoint and have the output make sense. The default tibble/data.frames that we're working in are pretty good, but there are other attributes related to hypothesize() that would be nice to print (similar to how "Groups: cyl" will be printed at the top of the output of group_by(mtcars, cyl).

Seems like all we'd need to do to do this is add an infer class to the dataframe then make a print method that will supercede that of tbl and data.frame. I was hoping to just tweak the existing print methods for that but...I've been having a tough time tracking them down. They appear to be in tibble, but I haven't gotten much more than learning about format() and trunc_mat().

Has anyone been able to find them?

bug for one prop tests

It appears that if you're working w a two-level categorical variable, it's totally fine to assign just one null value of a proportion.

mtcars %>%
  specify(response = am) %>% # alt: am ~ NULL (or am ~ 1)
  hypothesize(null = "point", p = c("1" = .25)) %>% 
  generate(reps = 100, type = "simulate") %>% 
  calculate(stat = "prop")

We're currently throwing an error message, however, when you try to do a one-prop test on a variable that has many levels. If you convert the GoF example:

> mtcars %>%
+   specify(cyl ~ NULL) %>% # alt: response = cyl
+   hypothesize(null = "point", p = c("4" = .5)) %>%
+   generate(reps = 100, type = "simulate") %>%
+   calculate(stat = "prop", success = "4")
Error in parse_params(dots, x) : 
  The factor variable that you have specified has 3 levels and you've only assigned null probabilities to 1 levels.

Handling of NAs

Currently, if the input data has an NA, no warnings are shown but the resulting simulated statistics vector can have some NAs since calculate doesn't specify to handle NAs when taking mean, median, etc. One option is to specify that in calculate. The other option is to make the user get rid of NAs first in the exploratory stage, and then feed the data into the infer steps. Former is a bit more user friendly, but hides an important consideration. Latter can be tricky for students when doing inference for two variables since they would need to remove pairs of NAs from both variables. Thoughts?

Add string to factor coercion to hypothesize

Add website to GitHub repo page

@andrewpbray: Indulge @ismayc and my inner pedants and add the pkgdown https://infer.netlify.com/index.html site to the GitHub repo front page by editing here:

query about cool figure

I had a question about the cool figure (above). Shouldn't the approximation arrow point to "calculate statistic" since that's where the approximation picks up?

Just my $0.02, Nick

Hypothesize function

Proposed input: dataframe/tibble that has already been select()ed down to variables of interest.
Proposted output: tibble with a flag for the null hypothesis selected, similar to group_by()

I like the idea of having students specify a null argument here, probably in the form of a character string. Deciding how to describe each of the null's we want will be difficult. I took a stab at them in the examples in the README, but I have lots of questions:

Should parameters show up in these?
Should there be any reference to the columns in the data frame that comes in or should we be doing checks of types within the function?
Will we rerun into problems when going back and forth between the computational and approximation versions of a particular test? That is, are there cases where the null hypotheses aren't exactly the same?
Should we add an argument for alternative? My sense right now is no - that the goal of this is the visualize()ation of the sampling distribution. We could always write a p-value() function that has an alternate argument.

generate() chokes when factor levels are integers?

mtcars %>%
   specify(response = am) %>%
   hypothesize(null = "point", p = c("0" = 0.25, "1" = 0.75)) %>%
   generate(reps = 100, type = "simulate")

 Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument

S3 classes?

Instead of setting an attribute, should hypothesis() return an object of class proptest or whatever? That way calculate() will already know what to do.

Calculate should not use `mean` as the var name in output

This is in the summarise step of calculate. Since mean is a function it's not good practice to also use it as variable name, same with median etc. Alternatives:

Generically call all of them stat so the output looks the same regardless of what statistic is used, this means user has to remember which it was, which is ok I think since in the function call they specify it.
- If we go with this we can add a note to the output saying "stat is mean" or "stat is diff in means" something like that, but at least the variable name stays the same which is nice for subsequent calculations or visualizations.
Use names like "boot_mean", "boot_median", etc. Or some other prefix.

My vote is for Option 1.

Why is Med capitalized?

Was there a specific purpose for this? If not, I suggest using lowercase, as needing to capitalize might trip up students. Happy to do a small PR for this but didn't want to touch it before checking to make sure the capitalization isn't serving some other purpose.

Error in sample.int(length(x), size, replace, prob) : NA in probability vector

It's late...what am I doing wrong?

outcomes <- data_frame(candidate = c("clinton", "trump"))
outcomes %>%
  specify(response = candidate) %>%
  hypothesize(null = "point", p = c(0.5, 0.5)) %>%
  generate(reps = 1000, type = "simulate") %>%
  calculate(stat = "prop") %>%
  visualize()

Error in sample.int(length(x), size, replace, prob) : 
  NA in probability vector

Some feedback

ran goodpractice::gp() only thing of note for later is may want to stick to 80 line width across all code, doc lines, and test files
i believe Title in description file needs to be sentence case
license: CC0 is valid license now on CRAN - but perhaps consider a more software specific license like MIT or GPL for example
http://github.com/andrewpbray/infer to https
add a code of conduct?
did you check build_win() yet?
some more parameter input checking would be useful, e.g., generate(reps = "foobar") doesn't error well
lots of warnings in the mtcars_examples vignette, maybe suppress those
vignettes do make sense to me, though I don't do stats much anymore, sooooooo
The exported functions (particularly calculate) lean towards being kinda big for my taste, could be DRYed out a bit if possible
no examples in the specify man file?
as a newb, i wasn't really sure if there was a order that you use the functions or not? some guidance on that would be good. the functions error nicely if specify is not used first, but seems like docs should also say that. the pkg level man file would be a good place for that.
error behavior for going from data.frame to specify, to generate (skipping specify) wasn't too good, but maybe users won't do that

assertion to add

More informative error message needed out of specify():

load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5694/datasets/gss_sampled.RData"))
gss2016 <- filter(gss_sampled, year == 2016)
ggplot(gss2016, aes(x = cappun)) +
  geom_bar()
gss2016 %>%
  specify(response = cappun, success = "FAVOR")

Adding missing grouping variables: year

Discrepancy between README and calculate() examples of diff in proportion

Running ?calculate, the example works:

library(infer)
library(tidyverse)
# Example
  mtcars %>%
    dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
    specify(am ~ vs, success = "1") %>%
    hypothesize(null = "independence") %>%
    generate(reps = 100, type = "permute") %>%
    calculate(stat = "diff in props", order = c("1", "0"))

However the README example doesn't

mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props")

as we need specify(am ~ vs, success = "1") and calculate(stat = "diff in props", order = c("1", "0")). Perhaps set default values for specify(success) and calculate(order)?

support for approximation methods

Did we close the door on this? It seems like

mtcars %>%
  specify(mpg ~ hp) %>%
  hypothesize(null = "independence") %>% 
  calculate(stat = "slope") %>%
  visualize()

would still make sense. It would return something like:

mod <- lm(mpg ~ hp, data = mtcars)
ggplot(data = filter(broom::tidy(mod), term == "hp"), aes(x = estimate)) + 
  stat_function(fun = dt, args = list(df = mod$df.residual)) + 
  scale_x_continuous(limits = c(-4, 4)) + 
  geom_vline(aes(xintercept = estimate), color = "red")

Update keynote diagrams

We will need to update the keynote diagram that sits in the README for hypothesis tests and also add the one for confidence intervals. I rarely use Keynote so I suspect one of you can do this a lot faster and nicer than my attempts.

Add warnings for all the mistakes one can make

After implementation

assertive package is not auto-installed

When trying to run

library(infer)
library(tidyverse)
mtcars <- as.data.frame(mtcars) %>%
  mutate(cyl = factor(cyl),
          vs = factor(vs),
          am = factor(am),
          gear = factor(gear),
          carb = factor(carb))
mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props")

I get Error in loadNamespace(name) : there is no package called ‘assertive’

Here is my sessionInfo

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] infer_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14     knitr_1.17       bindr_0.1        magrittr_1.5     munsell_0.4.3    colorspace_1.3-2
 [7] R6_2.2.2         rlang_0.1.6      plyr_1.8.4       stringr_1.2.0    dplyr_0.7.4      tools_3.4.1     
[13] grid_3.4.1       gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.1   yaml_2.1.14      rprojroot_1.2   
[19] digest_0.6.13    assertthat_0.2.0 tibble_1.3.4     bookdown_0.5     bindrcpp_0.2     ggplot2_2.2.1   
[25] glue_1.2.0       evaluate_0.10.1  rmarkdown_1.8    stringi_1.1.6    compiler_3.4.1   scales_0.5.0    
[31] backports_1.1.1  pkgconfig_2.0.1

Generate function

Proposed input: a tibble with some meta-info from hypothesize()
Proposed output: a much taller tibble that is the input repeat()ed n times along with an indexing column.

This function is pretty much the same as Mine's rep_sample_n() and it encapsulates the idea that the null hypothesis should encode some mechanism by which you can generate additional data sets under that null hypothesis. Some thoughts/questions:

Right now, the only argument I'm thinking of is repeat() for the number of iterations. The meta-info that's appended with hypothesize() should be sufficient to know whether this is a permutation or simulation situation.
This function could also be called simulate().
If we want this to package to be scalable to larger data sets and still run fairly fast, we might want to write some backup functionality here. As in: if the data set is really big, hold off on spooling out 1000 copies of it in memory. We can probably be more efficient about that if we don't need to show this repeated data set as output here.

Use tidyeval functionality

This is relevant to calculate but maybe others too, if any of those use underscore verbs from dplyr, which is now deprecated functionality.

Prototype for calculating for stat == "mean"

  if (stat == "mean") {
    col <- setdiff(names(x), "replicate")
    x %>%
      dplyr::group_by(replicate) %>%
      dplyr::summarize(stat = mean(!!sym(col)))
  }

Note that this requires importing the sym function from the rlang package, which dplyr currently does not import automatically.

Is someone up for updating the calculate function with this, or should I? We also need to resolve #15 and #16 to and all of the updates can be done at once.

open or closed system?

I've gone at least quickly through all the issues now. It seems like you are trying to create a very closed system -- users will only be able to conduct hypothesis tests that you have imagined in advance they will want to do to and for which they know the magic character strings that name the test. This is only marginally different from having a separate function for each test.

A true grammar of hypothesis testing would be a more open system where a new hypothesis test could be conducted without needing to add to the package.

For starters, in calculate(), why isn't stat a function (or a string that names a function). That would allow users to use any test statistic by providing a function of the proper arity for the data. It would also allow users to directly apply the stat function to the data set in question to get the observed value of the test statistic.

Here are some test cases for your "grammar":

how do I test that two medians are the same using the difference in medians as a test statistic?
ditto for other quantiles
how do I test that 2 groups have the same variance using the difference in standard deviations as a test statistic?
how do I test that 3 groups have the same variance using the variance of the variances as a test statistic?

Calculate function

Proposed input: dataframe/tibble that has generated data in a form to calculate statistics across replicates
Proposed output: tibble containing the replicated statistics

Where is the grammar?

Just looking over this briefly, I don't see the "grammar". Do you have a grammar in mind? What are its elements (nouns and verbs)? How are they defined? This seems to be motivated primarily by the desire to use %>% and not by a well-defined grammar. Am I missing something?

Perhaps a wiki page could be created to describe the grammar.

Hypothesis testing diagram

The following diagram was inspired by Alan Downey's "There is only one test" and can be found in keynote form in the figs directory.

Some possible desiderata of this diagram:

sufficiently general to describe most/all classical hypothesis tests
specific enough to deconstruct the inferential black box into the key conceptual steps
the pipeline should be able to be broken at any step and the output should make sense
adaptive to both computational (permutation/simulation) and mathematical approximation approaches.

Any suggestions at this most general level?

name of select replacement

specify()
formulate()?

Have any summary function work in calculate

I really like using The Lady Tasting Tea to motivate hypothesis testing. The following code does the trick:

library(tidyverse)
library(infer)
lady_tasting_tea <- data_frame(
  first = as.factor(c(rep("milk",4), rep("tea",4)))
  )

lady_tasting_tea %>%
  specify(response = first) %>% # alt: am ~ NULL (or am ~ 1)
  hypothesize(null = "point", p = c("milk" = .5, "tea" = .5)) %>% 
  generate(reps = 1000, type = "simulate") %>% 
  calculate(stat = "prop")

However, it would be great if the final stat could be "sum". That way there is one less layer of abstraction between the experiment and the null distribution (students can read off count directly, instead of reading off proportions)

In a more general setting, any many-to-one summary function would be great, for example all those that work with dplyr::summarize()

add test statistics ref line to visualize()?

visualize() shows the sampling distribution. Should we add a call to + geom_vline() that illustrates where in the sampling distribution the test statistic is? Do we still have that information by the time we get to visualize()?

Do NAs get resampled?

The calculate function takes the na.rm argument for the statistic it's calculating. But does this mean at the generate stage, say if we're doing a bootstrap interval, NAs from the original sample get resampled? If so, I believe this is not a good idea. Because chances are when we look at the sample size for the original sample those NAs don't factor into it. And a bootstrap sample should be the same size as the original sample. If indeed the NAs from the original sample don't make it into the bootstrap samples, then isn't the na.rm argument in calculate unnecessary?

Success should always be defined

In working on a solution for #53 I realized that if success isn't defined by the user we make an assumption and use the first level as success. I don't think this is a good approach because it makes the function "just work" without the student properly formulating the inference chain. I propose removing https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L60 and https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L61 and instead throwing an error saying success must be defined when the response variable is categorical. I can put that in the same PR as the order solution.

Also, as far as I can tell currently the "diff in props" example isn't working since success functionality isn't built into the difference in proportions case (only the single proportion).

Couldn't catch 'em all

@andrewpbray It appears we called both vignettes "flights example" and the second one should be "mtcars example": https://cran.r-project.org/web/packages/infer/index.html

calculating statistic from random samples drawn from a population

@andrewpbray @mine-cetinkaya-rundel @beanumber @ismayc @nicksolomon

Another thing I do in my course is to create a sampling distribution of slopes taken from a huge population. I do this in chapter 1 before I've done any inference. The idea is just to visualize how the slopes change from sample to sample outside the context of making specific hypotheses.

Again, would appreciate your vote on which of these options seems best / most consistent with what we are doing.

First:

explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory*2 + rnorm(1000,0,10)
popdata <- data.frame(explanatory,response)

(A)
Uses rep_sample_n to get many samples. Two plots here: one is superimposed lines on a scatterplot, one is a histogram.

manysamples <- popdata %>%
rep_sample_n(size=50, reps=100)

ggplot(manysamples, aes(x=explanatory, y=response, group=replicate)) + 
geom_point() + 
geom_smooth(method="lm", se=FALSE) 

manylms <- manysamples %>% 
group_by(replicate) %>% 
do(tidy(lm(response ~ explanatory, data=.))) %>%
filter(term=="explanatory")

ggplot(manylms, aes(x=estimate)) + geom_histogram()

(B)
This code doesn't exist yet... we could write an additional option to generate in infer... maybe use the argument sample?

manylms <- popdata %>%
  specify(response ~ explanatory) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "sample", size=50) %>%
  calculate(stat = "slope") 

# I don't know how to use the above to make the slope plot in ggplot (which I believe to be very powerful).  
# I'd need both the slope and the intercept to draw the lines... so maybe this is a moot point??

ggplot(manylms, aes(x=estimate)) + geom_histogram()

Only have monochromatic bins in null distribution + p-value histograms?

In the Randomization approach to t-statistic example:

library(nycflights13)
library(dplyr)
library(stringr)
library(infer)
set.seed(2017)

# Create data frame
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)

# Null
obs_t <- fli_small %>% 
  t_stat(formula = arr_delay ~ half_year)
t_null_distn <- fli_small %>%
  specify(arr_delay ~ half_year) %>% # alt: response = arr_delay, explanatory = half_year
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t")
t_null_distn %>% 
  visualize(obs_stat = obs_t, direction = "two_sided")

I think having the bin that straddles the test-statistic value (and its negative) have two colors will definitely trip novices out. Is there anyway to have two bin boundaries be the test statistic value (and its negative) by default, so that we have monochromatic bins? And only if people specify a particular bin/binwidth, resort to this two-color binning scheme?

Fixing travis

@andrewpbray @beanumber Not quite sure why Travis isn't working anymore. Any thoughts?

https://travis-ci.org/andrewpbray/infer

https://github.com/andrewpbray/infer/blob/master/.travis.yml

Calculate currently doesn't return anything

Is this right? That's my experience with it.

What should it return, as in, what should the object be called? Options:

dist
sim_dist or perm_dist or boot_dist depending on generate type
boot_mean or perm_mean or perm_median etc. depending on both generate type and stat?
???

Still think the parameters for a multi-level factor should be a vector

Especially since they are going to get passed to the prob argument of sample(), which is already a vector.

prob	
A vector of probability weights for obtaining the elements of the vector being sampled.

Student/teaching experience with generate type

I'm using infer in my in-class teaching this semester, and I can't yet definitively say how learning is going, but in general based on informal observation, seems ok.

However one type of question I've been getting repeatedly is "How do I know when to use which type for generating samples?". This is a good question of course, and the answer leads us to discussing differences. But what I'm observing is that students are trying to make if this then that type lists (like if I have categorical variable and point null then simulate). And what's worrisome about it is that it feels a bit like the cheatsheets students put together in courses that teach theoretical inference with type of tests and SE formulas. One of the main motivations for coming up with consistent syntax was to move away from such thinking.

Have others using this in in-class teaching experienced this? Any thoughts? @andrewpbray @beanumber @hardin47

Note: The one alternative I can think of is to have the package decide the type of generate based on the info that comes before it. This is similar to how it worked in my inference function. But it hides what I think is an important detail. Unfortunately my limited experience in showing that detail hasn't been super positive, but I'm willing to accept that's the way to go, just wanted to hear from others what they think after having students use these functions.

How do you plan to support tests in the context of lm() and glm()?

As @nicholasjorton suggested, it will be important to be able to move easily beyond the traditional Intro Stats tests to this more flexible space -- else you are just a dead end.

Perhaps it would be better to start from that end than from the 1- and 2-sample end. Most of the 1- and 2-sample things can be thought of as special cases of this more general setting anyway.

lazy eval issue?

* checking R code for possible problems ... NOTE
calculate: no visible binding for global variable ‘xbar’
calculate: no visible binding for global variable ‘xtilde’
calculate: no visible binding for global variable ‘prop’
visualize: no visible binding for global variable ‘stat’

This used to be fixed by using mutate_() with a formula, but now we are supposed to use rlang?? I think @ismayc knows how to fix this...

calculating the observed statistic

@mine-cetinkaya-rundel @andrewpbray @beanumber @nicksolomon

As I finish up the permutation test, I realize that calculating the linear model test statistic (slope) is not consistent with the structure of the infer package. Please vote on (A), (B), (C), or (D) and let @ismayc (or @andrewpbray ??) know if / how the infer package should be updated.

First:

library(infer)
twins <- read.csv("http://andrewpbray.github.io/data/twins.csv")
twin.slps <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "slope")

(A)
Very tidy, but I would have to explain do.

obs.slp <- twins %>% do(tidy(lm(Foster ~ Biological, data = .))) %>%
    filter(term == "Biological") %>%
    dplyr::select(estimate)

sum(abs(obs.slp) > = abs(twin.slps) ) / 100

(B)
Mostly tidy, but I'm not piping in the twins dataset.

obs.slp <- tidy(lm(Foster ~ Biological, data = twins)) %>% 
     filter(term == "Biological") %>%
     dplyr::select(estimate)

sum(abs(obs.slp) > = abs(twin.slps) ) / 100

obs.slp <- lm(Foster ~ Biological, data = twins)$coef[2]

sum(abs(obs.slp) > = abs(twin.slps) ) / 100

(D)
Create a type="identity" option in generate. This doesn't work yet. Someone would have to write it. and is identity the right word? What about observed?

obs.slp <- twins %>%
  specify(Foster ~ Biological) %>%
  hypothesize(null = "independence") %>%
  generate(type = "identity") %>%
  calculate(stat = "slope") 

sum(abs(obs.slp) > = abs(twin.slps) ) / 100

Including transformed statistics like t and z

It seems that if we are planning to also include approximation methods that use the t and standard normal distribution that we should also include something like

Two categorical (2 level) variables

mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "z")

One numerical (one mean)

mtcars %>%
  specify(response = mpg) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "t")

I think this is especially needed since if we'd like to plot both the computation and approximation methods on the same plot. Would something like this make more sense in that case for displaying both?

mtcars %>%
  specify(am ~ vs) %>% # alt: response = am, explanatory = vs
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "z") %>%
  visualize(method = "both", tail = "left", extreme_as = test_stat)

But this brings up the issue of the student needing to calculate the observed (transformed) test statistic by entering the formula for z and also doing a group_by. Hmm... What do you all think?

logic in hypothesize()

Note:

> mtcars %>% 
    select(cyl) %>%
    hypothesize(null = "point", p1 = .25, p2 = .25, p3 = .50)

Error in hypothesize(., null = "point", p1 = 0.25, p2 = 0.25, p3 = 0.5) : 
  The number of parameters exceeds the number of columns.

The code is checking to see whether the number of parameters is greater than the number of columns in the matrix. But here we want to check whether it is the same as the number of factor levels in the one variable. Does it create a problem anywhere else if I change this?

failed to import pipe?

> library(infer)
> mtcars %>% hypothesize(null = "independence")
Error in mtcars %>% hypothesize(null = "independence") : 
  could not find function "%>%"

Should we just move dplyr to Depends?

Where should the group_by go?

Currently we have the df coming out of generate() getting grouped by replicate. In calculate, we'll need to redo the grouping for the difference case, but for the single variable case, we don't need those group_by()s in there, so we?

More generally, does it make more sense to group by in generate() or calculate(). I kinda like having it generate because that output doesn't really make any sense in an ungrouped setting.

Diff in means/props/medians

Currently I believe the subtraction is being done in the alphabetical order of the levels if the categorical variable is character, and in the levels order if variable is factor. I think we need an order argument for calculate for "diff in ..."situations. This is how I had solved this problem in the inference function. Happy to do a PR but wanted to see (1) am I interpreting current status correctly and (2) is everyone on board with a new order argument that will be something like order = c("level1" - "level2") which translates to mean(y[x == "level1"]) - mean(y[x == "level2"]).

Fix how specify() deals w multi-level factors

> library(infer)
> library(tidyverse)
> load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5694/datasets/gss_sampled.RData"))
> gss2016 <- dplyr::filter(gss_sampled, year == 2016)
> table(gss2016$cappun)

   IAP  FAVOR OPPOSE     DK     NA 
     0     85     65      0      0 
> gss2016 %>%
+   specify(response = cappun, success = "FAVOR") %>%
+   hypothesize(null = "point", p = .5)
Adding missing grouping variables: `year`
Error in names(dots$p) <- c(attr(x, "success"), missing_lev) : 
  'names' attribute [5] must be the same length as the vector [2]

I'd like to make this code work in a situation like this when there are old levels with no observations. I'll add re-releveling within hypothesize().

naming of functions

Instead of generate() with different types, would it be better to have simulate(), bootstrap() and permute() just be separate functions? They are "verbs" after all... Wouldn't this make it even clearer what is happening?

Leverage modelr more heavily

I still wonder whether leveraging a supported tool like modelr is the way to go. Note that permutation functions may soon be coming to modelr:

tidyverse/modelr#47

error messages

The following cases have arisen that lead to very unhelpful error messages.

Problem: beach_mountain has more than two levels.

library(tidyverse)
# library(devtools)
# install_github("andrewpbray/oilabs")
library(oilabs)
library(infer)
data(survey141)

survey141 %>%
  specify(response = beach_mountain) %>%
  hypothesize(null = "point", p = c("Beach" = .5, "Mountains" = .5)) %>%
  generate(reps = 300, type = "simulate")

Problem: didn't specify an explanatory variable in a permutation test.

CI for slope

@andrewpbray

I'm doing CIs for the linear model case. Can that option also be implemented in infer?

Also, I don't know if you were asking for advice for the linear model hypothesize function, But I prefer "slope=0". That way, the ideas are easier to generalize to multivariate models. I'm also not opposed to having both options be possible as hypothesize arguments.