Comments (13)
from infer.
The NAs from the original sample are resampled in generate(). I think it's better for the students/users to handle NAs in their original data first instead of this package handling it for them.
When you say "original sample", right now it is just using the number of rows in the data frame resulting from specify(), which may include columns with complete data and also columns like arr_delay below that are not complete. The na.rm argument in calculate() is necessary to remove the NAs that have been brought forward in a generate() like that below.
library(nycflights13)
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>%
  sample_n(size = 500) %>%
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>%
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>%
  select(arr_delay, dep_delay, half_year,
         day_hour, origin, carrier)
# Determine number of missing arrival delay values
sum(is.na(fli_small$arr_delay))
#> [1] 15
# Bootstrap uses similar code to oilabs::rep_sample_n()
boots <- fli_small %>%
  specify(response = arr_delay) %>%
  generate(reps = 100, type = "bootstrap")
boots %>%
  group_by(replicate) %>%
  summarize(num_na = sum(is.na(arr_delay)))
#> # A tibble: 100 x 2
#>    replicate num_na
#>        <int>  <int>
#>  1         1     17
#>  2         2     26
#>  3         3     16
#>  4         4     14
#>  5         5     19
#>  6         6     12
#>  7         7     10
#>  8         8     14
#>  9         9      8
#> 10        10     12
#> # ... with 90 more rows
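To illustrate the na.rm point above, here is a minimal sketch on made-up toy data (not the flights sample). It assumes the installed version of infer forwards extra arguments given to calculate() on to the underlying summary function. Newer releases may already drop missing rows at specify(), in which case na.rm = TRUE is simply redundant.

```r
library(dplyr)
library(infer)

set.seed(2017)

# Toy data standing in for fli_small; the values are invented for illustration.
toy <- data.frame(arr_delay = c(5, NA, -3, 12, NA, 0, 7, 21))

# na.rm = TRUE is passed through calculate() so that any NAs carried into
# the bootstrap replicates do not turn the replicate means into NA.
boot_means <- toy %>%
  specify(response = arr_delay) %>%
  generate(reps = 20, type = "bootstrap") %>%
  calculate(stat = "mean", na.rm = TRUE)

boot_means
```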
If users take care of the NAs at the data creation stage, before entering the infer pipeline, this shouldn't be a problem. We should probably at the very least add a warning message that NAs are potentially resampled though, right? What other suggestions do you have for dealing with this? Maybe error out if NAs are present in the column being used, asking the user to go back and handle them before getting into the infer pipeline?
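A minimal sketch of handling the NAs at the data creation stage, using made-up toy data in place of fli_small. Dropping the incomplete rows is only one possible choice; other strategies (for example imputation) may suit a given analysis better.

```r
library(dplyr)
library(infer)

set.seed(2017)

# Toy data standing in for fli_small; the values are invented for illustration.
toy <- data.frame(arr_delay = c(5, NA, -3, 12, NA, 0))

# Drop rows with a missing response before any resampling happens.
toy_complete <- toy %>%
  filter(!is.na(arr_delay))

boots_complete <- toy_complete %>%
  specify(response = arr_delay) %>%
  generate(reps = 10, type = "bootstrap")

# No NAs can be resampled because none were left in the input.
sum(is.na(boots_complete$arr_delay))
```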
from infer.
I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in the column that is being resampled (or the two columns if specify() has two variables) as opposed to the nrow of the data frame. What do you think? Though I guess if we give an error when there are NAs in the column to be resampled, we don't have to make this decision in infer.
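The difference between the two candidate sample sizes can be seen on a small made-up data frame. This is only an illustration in base R of the counts being discussed, not how infer computes anything internally.

```r
# Toy data; the values are invented for illustration.
toy <- data.frame(
  arr_delay = c(5, NA, -3, 12, NA, 0),
  dep_delay = c(2, 1, NA, 9, NA, 4)
)

# Sample size if every row of the data frame is counted.
n_rows <- nrow(toy)

# Sample size if only complete cases of the single resampled column count.
n_complete_one <- sum(!is.na(toy$arr_delay))

# Sample size if two variables are specified and both must be present.
n_complete_two <- sum(complete.cases(toy[c("arr_delay", "dep_delay")]))

c(n_rows = n_rows, one_var = n_complete_one, two_vars = n_complete_two)
```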
from infer.
I’m inclined to just give an error so that we don’t have to deal with this in multiple scenarios. Maybe even do this at the specify() stage so that we can avoid other problems going forward?
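One way such an early error could look is sketched below. check_no_missing() is a hypothetical helper written for this illustration. It is not part of infer and is not necessarily how the package ended up handling this.

```r
# Hypothetical helper, not part of infer: stop early if any of the columns
# that would be passed to specify() contain missing values.
check_no_missing <- function(data, cols) {
  n_missing <- vapply(data[cols], function(x) sum(is.na(x)), integer(1))
  bad <- n_missing[n_missing > 0]
  if (length(bad) > 0) {
    stop(
      "Missing values found in: ",
      paste0(names(bad), " (", bad, ")", collapse = ", "),
      ". Please handle NAs before calling specify().",
      call. = FALSE
    )
  }
  invisible(data)
}

# Toy data; the values are invented for illustration.
toy <- data.frame(arr_delay = c(5, NA, -3), dep_delay = c(1, 2, 3))

# A complete column passes through unchanged.
ok <- check_no_missing(toy, "dep_delay")

# A column with NAs triggers the error; capture the message to inspect it.
msg <- tryCatch(
  check_no_missing(toy, "arr_delay"),
  error = function(e) conditionMessage(e)
)
msg
```

Failing before any resampling means the sample-size question never has to be decided inside the pipeline.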
from infer.
@ismayc that sounds good to me! this might make the na.rm option in calculate obsolete, but i don't see a harm in leaving the ... in there. especially if we'll have true function recognition in there down the line and ... could be doing so much more than just NA handling.
from infer.
Sounds good! I’ll tag the commit here when I have this implemented.
from infer.
Just saw the still pondering label on this. Pondering on implementation or whether the change should be made? If the latter, the answer is yes. Let me know if I can help to implement this change.
from infer.
@mine-cetinkaya-rundel Just pondering on implementation. If you have ideas, please do go for it!
from infer.
@mine-cetinkaya-rundel I don't believe we implemented this yet, did we?
from infer.
No, I didn't. I just got a chance to start looking at some of the to-dos here, so I can work on it in the next couple of days, but feel free to go ahead if you have ideas now.
from infer.
Looks like there are a few other NA-related decisions. It would be good to make consistent decisions about them?
from infer.
Agreed. I will take a look at summarizing the issues into one common issue this afternoon.
from infer.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
from infer.