tidymodels / infer Goto Github PK
View Code? Open in Web Editor NEWAn R package for tidyverse-friendly statistical inference
Home Page: https://infer.tidymodels.org
License: Other
An R package for tidyverse-friendly statistical inference
Home Page: https://infer.tidymodels.org
License: Other
We will need to update the keynote diagram that sits in the README for hypothesis tests and also add the one for confidence intervals. I rarely use Keynote so I suspect one of you can do this a lot faster and nicer than my attempts.
I still wonder whether leveraging a supported tool like modelr
is the way to go. Note that permutation functions may soon be coming to modelr
:
Currently we have the df coming out of generate()
getting grouped by replicate
. In calculate, we'll need to redo the grouping for the difference case, but for the single variable case, we don't need those group_by()
s in there, so we?
More generally, does it make more sense to group by in generate()
or calculate()
. I kinda like having it generate because that output doesn't really make any sense in an ungrouped setting.
@andrewpbray @beanumber Not quite sure why Travis isn't working anymore. Any thoughts?
https://travis-ci.org/andrewpbray/infer
https://github.com/andrewpbray/infer/blob/master/.travis.yml
After implementation
Running ?calculate
, the example works:
library(infer)
library(tidyverse)
# Example
mtcars %>%
dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
specify(am ~ vs, success = "1") %>%
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in props", order = c("1", "0"))
However the README example doesn't
mtcars %>%
specify(am ~ vs) %>% # alt: response = am, explanatory = vs
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in props")
as we need specify(am ~ vs, success = "1")
and calculate(stat = "diff in props", order = c("1", "0"))
. Perhaps set default values for specify(success)
and calculate(order)
?
It appears that if you're working w a two-level categorical variable, it's totally fine to assign just one null value of a proportion.
mtcars %>%
specify(response = am) %>% # alt: am ~ NULL (or am ~ 1)
hypothesize(null = "point", p = c("1" = .25)) %>%
generate(reps = 100, type = "simulate") %>%
calculate(stat = "prop")
We're currently throwing an error message, however, when you try to do a one-prop test on a variable that has many levels. If you convert the GoF example:
> mtcars %>%
+ specify(cyl ~ NULL) %>% # alt: response = cyl
+ hypothesize(null = "point", p = c("4" = .5)) %>%
+ generate(reps = 100, type = "simulate") %>%
+ calculate(stat = "prop", success = "4")
Error in parse_params(dots, x) :
The factor variable that you have specified has 3 levels and you've only assigned null probabilities to 1 levels.
This is relevant to calculate
but maybe others too, if any of those use underscore verbs from dplyr, which is now deprecated functionality.
Prototype for calculating for stat == "mean"
if (stat == "mean") {
col <- setdiff(names(x), "replicate")
x %>%
dplyr::group_by(replicate) %>%
dplyr::summarize(stat = mean(!!sym(col)))
}
Note that this requires importing the sym
function from the rlang package, which dplyr currently does not import automatically.
Is someone up for updating the calculate function with this, or should I? We also need to resolve #15 and #16 to and all of the updates can be done at once.
visualize()
shows the sampling distribution. Should we add a call to + geom_vline()
that illustrates where in the sampling distribution the test statistic is? Do we still have that information by the time we get to visualize()
?
goodpractice::gp()
only thing of note for later is may want to stick to 80 line width across all code, doc lines, and test filesbuild_win()
yet?generate(reps = "foobar")
doesn't error wellmtcars_examples
vignette, maybe suppress thosecalculate
) lean towards being kinda big for my taste, could be DRYed out a bit if possiblespecify
man file?Did we close the door on this? It seems like
mtcars %>%
specify(mpg ~ hp) %>%
hypothesize(null = "independence") %>%
calculate(stat = "slope") %>%
visualize()
would still make sense. It would return something like:
mod <- lm(mpg ~ hp, data = mtcars)
ggplot(data = filter(broom::tidy(mod), term == "hp"), aes(x = estimate)) +
stat_function(fun = dt, args = list(df = mod$df.residual)) +
scale_x_continuous(limits = c(-4, 4)) +
geom_vline(aes(xintercept = estimate), color = "red")
In the Randomization approach to t-statistic example:
library(nycflights13)
library(dplyr)
library(stringr)
library(infer)
set.seed(2017)
# Create data frame
fli_small <- flights %>%
sample_n(size = 500) %>%
mutate(half_year = case_when(
between(month, 1, 6) ~ "h1",
between(month, 7, 12) ~ "h2"
)) %>%
mutate(day_hour = case_when(
between(hour, 1, 12) ~ "morning",
between(hour, 13, 24) ~ "not morning"
)) %>%
select(arr_delay, dep_delay, half_year,
day_hour, origin, carrier)
# Null
obs_t <- fli_small %>%
t_stat(formula = arr_delay ~ half_year)
t_null_distn <- fli_small %>%
specify(arr_delay ~ half_year) %>% # alt: response = arr_delay, explanatory = half_year
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "t")
t_null_distn %>%
visualize(obs_stat = obs_t, direction = "two_sided")
I think having the bin that straddles the test-statistic value (and its negative) have two colors will definitely trip novices out. Is there anyway to have two bin boundaries be the test statistic value (and its negative) by default, so that we have monochromatic bins? And only if people specify a particular bin
/binwidth
, resort to this two-color binning scheme?
Currently I believe the subtraction is being done in the alphabetical order of the levels if the categorical variable is character, and in the levels
order if variable is factor. I think we need an order argument for calculate for "diff in ..."
situations. This is how I had solved this problem in the inference
function. Happy to do a PR but wanted to see (1) am I interpreting current status correctly and (2) is everyone on board with a new order
argument that will be something like order = c("level1" - "level2")
which translates to mean(y[x == "level1"]) - mean(y[x == "level2"])
.
Proposed input: dataframe/tibble that has already been select()
ed down to variables of interest.
Proposted output: tibble with a flag for the null hypothesis selected, similar to group_by()
I like the idea of having students specify a null
argument here, probably in the form of a character string. Deciding how to describe each of the null's we want will be difficult. I took a stab at them in the examples in the README, but I have lots of questions:
alternative
? My sense right now is no - that the goal of this is the visualize()
ation of the sampling distribution. We could always write a p-value()
function that has an alternate
argument.When trying to run
library(infer)
library(tidyverse)
mtcars <- as.data.frame(mtcars) %>%
mutate(cyl = factor(cyl),
vs = factor(vs),
am = factor(am),
gear = factor(gear),
carb = factor(carb))
mtcars %>%
specify(am ~ vs) %>% # alt: response = am, explanatory = vs
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in props")
I get Error in loadNamespace(name) : there is no package called ‘assertive’
Here is my sessionInfo
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] infer_0.1.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 knitr_1.17 bindr_0.1 magrittr_1.5 munsell_0.4.3 colorspace_1.3-2
[7] R6_2.2.2 rlang_0.1.6 plyr_1.8.4 stringr_1.2.0 dplyr_0.7.4 tools_3.4.1
[13] grid_3.4.1 gtable_0.2.0 htmltools_0.3.6 lazyeval_0.2.1 yaml_2.1.14 rprojroot_1.2
[19] digest_0.6.13 assertthat_0.2.0 tibble_1.3.4 bookdown_0.5 bindrcpp_0.2 ggplot2_2.2.1
[25] glue_1.2.0 evaluate_0.10.1 rmarkdown_1.8 stringi_1.1.6 compiler_3.4.1 scales_0.5.0
[31] backports_1.1.1 pkgconfig_2.0.1
The calculate
function takes the na.rm
argument for the statistic it's calculating. But does this mean at the generate stage, say if we're doing a bootstrap interval, NA
s from the original sample get resampled? If so, I believe this is not a good idea. Because chances are when we look at the sample size for the original sample those NAs don't factor into it. And a bootstrap sample should be the same size as the original sample. If indeed the NA
s from the original sample don't make it into the bootstrap samples, then isn't the na.rm
argument in calculate unnecessary?
Instead of setting an attribute, should hypothesis()
return an object of class proptest
or whatever? That way calculate()
will already know what to do.
Instead of generate()
with different types, would it be better to have simulate()
, bootstrap()
and permute()
just be separate functions? They are "verbs" after all... Wouldn't this make it even clearer what is happening?
It's late...what am I doing wrong?
outcomes <- data_frame(candidate = c("clinton", "trump"))
outcomes %>%
specify(response = candidate) %>%
hypothesize(null = "point", p = c(0.5, 0.5)) %>%
generate(reps = 1000, type = "simulate") %>%
calculate(stat = "prop") %>%
visualize()
Error in sample.int(length(x), size, replace, prob) :
NA in probability vector
Is this right? That's my experience with it.
What should it return, as in, what should the object be called? Options:
dist
sim_dist
or perm_dist
or boot_dist
depending on generate typeboot_mean
or perm_mean
or perm_median
etc. depending on both generate type and stat?Was there a specific purpose for this? If not, I suggest using lowercase, as needing to capitalize might trip up students. Happy to do a small PR for this but didn't want to touch it before checking to make sure the capitalization isn't serving some other purpose.
This is in the summarise step of calculate. Since mean
is a function it's not good practice to also use it as variable name, same with median etc. Alternatives:
stat
so the output looks the same regardless of what statistic is used, this means user has to remember which it was, which is ok I think since in the function call they specify it.
My vote is for Option 1.
> library(infer)
> mtcars %>% hypothesize(null = "independence")
Error in mtcars %>% hypothesize(null = "independence") :
could not find function "%>%"
Should we just move dplyr
to Depends?
mtcars %>%
specify(response = am) %>%
hypothesize(null = "point", p = c("0" = 0.25, "1" = 0.75)) %>%
generate(reps = 100, type = "simulate")
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Proposed input: a tibble with some meta-info from hypothesize()
Proposed output: a much taller tibble that is the input repeat()
ed n times along with an indexing column.
This function is pretty much the same as Mine's rep_sample_n()
and it encapsulates the idea that the null hypothesis should encode some mechanism by which you can generate additional data sets under that null hypothesis. Some thoughts/questions:
repeat()
for the number of iterations. The meta-info that's appended with hypothesize()
should be sufficient to know whether this is a permutation or simulation situation.simulate()
.Note:
> mtcars %>%
select(cyl) %>%
hypothesize(null = "point", p1 = .25, p2 = .25, p3 = .50)
Error in hypothesize(., null = "point", p1 = 0.25, p2 = 0.25, p3 = 0.5) :
The number of parameters exceeds the number of columns.
The code is checking to see whether the number of parameters is greater than the number of columns in the matrix. But here we want to check whether it is the same as the number of factor levels in the one variable. Does it create a problem anywhere else if I change this?
@andrewpbray @mine-cetinkaya-rundel @beanumber @ismayc @nicksolomon
Another thing I do in my course is to create a sampling distribution of slopes taken from a huge population. I do this in chapter 1 before I've done any inference. The idea is just to visualize how the slopes change from sample to sample outside the context of making specific hypotheses.
Again, would appreciate your vote on which of these options seems best / most consistent with what we are doing.
First:
explanatory <- rnorm(1000, 0, 5)
response <- 40 + explanatory*2 + rnorm(1000,0,10)
popdata <- data.frame(explanatory,response)
(A)
Uses rep_sample_n
to get many samples. Two plots here: one is superimposed lines on a scatterplot, one is a histogram.
manysamples <- popdata %>%
rep_sample_n(size=50, reps=100)
ggplot(manysamples, aes(x=explanatory, y=response, group=replicate)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
manylms <- manysamples %>%
group_by(replicate) %>%
do(tidy(lm(response ~ explanatory, data=.))) %>%
filter(term=="explanatory")
ggplot(manylms, aes(x=estimate)) + geom_histogram()
(B)
This code doesn't exist yet... we could write an additional option to generate
in infer
... maybe use the argument sample
?
manylms <- popdata %>%
specify(response ~ explanatory) %>%
hypothesize(null = "independence") %>%
generate(reps = 100, type = "sample", size=50) %>%
calculate(stat = "slope")
# I don't know how to use the above to make the slope plot in ggplot (which I believe to be very powerful).
# I'd need both the slope and the intercept to draw the lines... so maybe this is a moot point??
ggplot(manylms, aes(x=estimate)) + geom_histogram()
* checking R code for possible problems ... NOTE
calculate: no visible binding for global variable ‘xbar’
calculate: no visible binding for global variable ‘xtilde’
calculate: no visible binding for global variable ‘prop’
visualize: no visible binding for global variable ‘stat’
This used to be fixed by using mutate_()
with a formula, but now we are supposed to use rlang
?? I think @ismayc knows how to fix this...
@andrewpbray: Indulge @ismayc and my inner pedants and add the pkgdown
https://infer.netlify.com/index.html site to the GitHub repo front page by editing here:
I'd like people to be able to break the inferential pipeline at anypoint and have the output make sense. The default tibble/data.frames that we're working in are pretty good, but there are other attributes related to hypothesize()
that would be nice to print (similar to how "Groups: cyl" will be printed at the top of the output of group_by(mtcars, cyl)
.
Seems like all we'd need to do to do this is add an infer
class to the dataframe then make a print method that will supercede that of tbl
and data.frame
. I was hoping to just tweak the existing print methods for that but...I've been having a tough time tracking them down. They appear to be in tibble
, but I haven't gotten much more than learning about format()
and trunc_mat()
.
Has anyone been able to find them?
> library(infer)
> library(tidyverse)
> load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5694/datasets/gss_sampled.RData"))
> gss2016 <- dplyr::filter(gss_sampled, year == 2016)
> table(gss2016$cappun)
IAP FAVOR OPPOSE DK NA
0 85 65 0 0
> gss2016 %>%
+ specify(response = cappun, success = "FAVOR") %>%
+ hypothesize(null = "point", p = .5)
Adding missing grouping variables: `year`
Error in names(dots$p) <- c(attr(x, "success"), missing_lev) :
'names' attribute [5] must be the same length as the vector [2]
I'd like to make this code work in a situation like this when there are old levels with no observations. I'll add re-releveling within hypothesize()
.
It seems that if we are planning to also include approximation methods that use the t and standard normal distribution that we should also include something like
Two categorical (2 level) variables
mtcars %>%
specify(am ~ vs) %>% # alt: response = am, explanatory = vs
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "z")
or
One numerical (one mean)
mtcars %>%
specify(response = mpg) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "t")
I think this is especially needed since if we'd like to plot both the computation and approximation methods on the same plot. Would something like this make more sense in that case for displaying both?
mtcars %>%
specify(am ~ vs) %>% # alt: response = am, explanatory = vs
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "z") %>%
visualize(method = "both", tail = "left", extreme_as = test_stat)
But this brings up the issue of the student needing to calculate the observed (transformed) test statistic by entering the formula for z and also doing a group_by
. Hmm... What do you all think?
Currently, if the input data has an NA, no warnings are shown but the resulting simulated statistics vector can have some NAs since calculate
doesn't specify to handle NAs when taking mean, median, etc. One option is to specify that in calculate. The other option is to make the user get rid of NAs first in the exploratory stage, and then feed the data into the infer
steps. Former is a bit more user friendly, but hides an important consideration. Latter can be tricky for students when doing inference for two variables since they would need to remove pairs of NAs from both variables. Thoughts?
@mine-cetinkaya-rundel @andrewpbray @beanumber @nicksolomon
As I finish up the permutation test, I realize that calculating the linear model test statistic (slope) is not consistent with the structure of the infer
package. Please vote on (A), (B), (C), or (D) and let @ismayc (or @andrewpbray ??) know if / how the infer
package should be updated.
First:
library(infer)
twins <- read.csv("http://andrewpbray.github.io/data/twins.csv")
twin.slps <- twins %>%
specify(Foster ~ Biological) %>%
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "slope")
(A)
Very tidy, but I would have to explain do
.
obs.slp <- twins %>% do(tidy(lm(Foster ~ Biological, data = .))) %>%
filter(term == "Biological") %>%
dplyr::select(estimate)
sum(abs(obs.slp) > = abs(twin.slps) ) / 100
(B)
Mostly tidy, but I'm not piping in the twins dataset.
obs.slp <- tidy(lm(Foster ~ Biological, data = twins)) %>%
filter(term == "Biological") %>%
dplyr::select(estimate)
sum(abs(obs.slp) > = abs(twin.slps) ) / 100
(C)
Not tidy.
obs.slp <- lm(Foster ~ Biological, data = twins)$coef[2]
sum(abs(obs.slp) > = abs(twin.slps) ) / 100
(D)
Create a type="identity"
option in generate
. This doesn't work yet. Someone would have to write it. and is identity
the right word? What about observed
?
obs.slp <- twins %>%
specify(Foster ~ Biological) %>%
hypothesize(null = "independence") %>%
generate(type = "identity") %>%
calculate(stat = "slope")
sum(abs(obs.slp) > = abs(twin.slps) ) / 100
The following diagram was inspired by Alan Downey's "There is only one test" and can be found in keynote form in the figs directory.
Some possible desiderata of this diagram:
Any suggestions at this most general level?
Just looking over this briefly, I don't see the "grammar". Do you have a grammar in mind? What are its elements (nouns and verbs)? How are they defined? This seems to be motivated primarily by the desire to use %>%
and not by a well-defined grammar. Am I missing something?
Perhaps a wiki page could be created to describe the grammar.
I'm using infer
in my in-class teaching this semester, and I can't yet definitively say how learning is going, but in general based on informal observation, seems ok.
However one type of question I've been getting repeatedly is "How do I know when to use which type for generating samples?". This is a good question of course, and the answer leads us to discussing differences. But what I'm observing is that students are trying to make if this then that type lists (like if I have categorical variable and point null then simulate). And what's worrisome about it is that it feels a bit like the cheatsheets students put together in courses that teach theoretical inference with type of tests and SE formulas. One of the main motivations for coming up with consistent syntax was to move away from such thinking.
Have others using this in in-class teaching experienced this? Any thoughts? @andrewpbray @beanumber @hardin47
Note: The one alternative I can think of is to have the package decide the type of generate based on the info that comes before it. This is similar to how it worked in my inference
function. But it hides what I think is an important detail. Unfortunately my limited experience in showing that detail hasn't been super positive, but I'm willing to accept that's the way to go, just wanted to hear from others what they think after having students use these functions.
More informative error message needed out of specify()
:
load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5694/datasets/gss_sampled.RData"))
gss2016 <- filter(gss_sampled, year == 2016)
ggplot(gss2016, aes(x = cappun)) +
geom_bar()
gss2016 %>%
specify(response = cappun, success = "FAVOR")
Adding missing grouping variables: year
I'm doing CIs for the linear model case. Can that option also be implemented in infer
?
Also, I don't know if you were asking for advice for the linear model hypothesize function, But I prefer "slope=0". That way, the ideas are easier to generalize to multivariate models. I'm also not opposed to having both options be possible as hypothesize arguments.
In working on a solution for #53 I realized that if success
isn't defined by the user we make an assumption and use the first level as success. I don't think this is a good approach because it makes the function "just work" without the student properly formulating the inference chain. I propose removing https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L60 and https://github.com/andrewpbray/infer/blob/f034a8629dd42c83d4fbb33dd1b80324262dbb12/R/calculate.R#L61 and instead throwing an error saying success must be defined when the response variable is categorical. I can put that in the same PR as the order
solution.
Also, as far as I can tell currently the "diff in props"
example isn't working since success
functionality isn't built into the difference in proportions case (only the single proportion).
The following cases have arisen that lead to very unhelpful error messages.
beach_mountain
has more than two levels.library(tidyverse)
# library(devtools)
# install_github("andrewpbray/oilabs")
library(oilabs)
library(infer)
data(survey141)
survey141 %>%
specify(response = beach_mountain) %>%
hypothesize(null = "point", p = c("Beach" = .5, "Mountains" = .5)) %>%
generate(reps = 300, type = "simulate")
Especially since they are going to get passed to the prob
argument of sample()
, which is already a vector.
prob
A vector of probability weights for obtaining the elements of the vector being sampled.
Proposed input: dataframe/tibble that has generated data in a form to calculate statistics across replicates
Proposed output: tibble containing the replicated statistics
As @nicholasjorton suggested, it will be important to be able to move easily beyond the traditional Intro Stats tests to this more flexible space -- else you are just a dead end.
Perhaps it would be better to start from that end than from the 1- and 2-sample end. Most of the 1- and 2-sample things can be thought of as special cases of this more general setting anyway.
I've gone at least quickly through all the issues now. It seems like you are trying to create a very closed system -- users will only be able to conduct hypothesis tests that you have imagined in advance they will want to do to and for which they know the magic character strings that name the test. This is only marginally different from having a separate function for each test.
A true grammar of hypothesis testing would be a more open system where a new hypothesis test could be conducted without needing to add to the package.
For starters, in calculate()
, why isn't stat
a function (or a string that names a function). That would allow users to use any test statistic by providing a function of the proper arity for the data. It would also allow users to directly apply the stat function to the data set in question to get the observed value of the test statistic.
Here are some test cases for your "grammar":
@andrewpbray It appears we called both vignettes "flights example" and the second one should be "mtcars example": https://cran.r-project.org/web/packages/infer/index.html
I really like using The Lady Tasting Tea to motivate hypothesis testing. The following code does the trick:
library(tidyverse)
library(infer)
lady_tasting_tea <- data_frame(
first = as.factor(c(rep("milk",4), rep("tea",4)))
)
lady_tasting_tea %>%
specify(response = first) %>% # alt: am ~ NULL (or am ~ 1)
hypothesize(null = "point", p = c("milk" = .5, "tea" = .5)) %>%
generate(reps = 1000, type = "simulate") %>%
calculate(stat = "prop")
However, it would be great if the final stat could be "sum". That way there is one less layer of abstraction between the experiment and the null distribution (students can read off count directly, instead of reading off proportions)
In a more general setting, any many-to-one summary function would be great, for example all those that work with dplyr::summarize()
specify()
formulate()
?A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.