Do NAs get resampled? about infer HOT 13 CLOSED

tidymodels commented on July 24, 2024

Do NAs get resampled?

from infer.

Comments (13)

nicholasjhorton commented on July 24, 2024

I agree that this is a problematic area, since one probably wants to condition on the observed sample size. Throwing a warning at the least, or potentially an error, would make sense to me.

On Oct 29, 2017, at 9:27 AM, Mine Cetinkaya-Rundel wrote: I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in that column that is being resampled (or the two columns if specify has two variables) as opposed to the nrow of the data frame, what do you think? Though I guess if we give an error if there are NAs in the column to be resampled, we don't have to make this decision in infer.



ismayc commented on July 24, 2024

The NAs from the original sample are resampled in generate(). I think it's better for students/users to handle NAs in their original data first rather than having this package handle them.

When you say "original sample": right now the bootstrap just uses the number of rows in the data frame resulting from specify(), which may include columns with complete data and also columns like arr_delay below that are not complete. The na.rm argument in calculate() is necessary to remove the NAs that are carried forward by a generate() call like the one below.

library(nycflights13)
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)

# Determine number of missing arrival delay values
sum(is.na(fli_small$arr_delay))
#> [1] 15

# Bootstrap uses similar code to oilabs::rep_sample_n()
boots <- fli_small %>% 
  specify(response = arr_delay) %>% 
  generate(reps = 100, type = "bootstrap")
boots %>% 
  group_by(replicate) %>% 
  summarize(num_na = sum(is.na(arr_delay)))
#> # A tibble: 100 x 2
#>    replicate num_na
#>        <int>  <int>
#>  1         1     17
#>  2         2     26
#>  3         3     16
#>  4         4     14
#>  5         5     19
#>  6         6     12
#>  7         7     10
#>  8         8     14
#>  9         9      8
#> 10        10     12
#> # ... with 90 more rows

If users take care of the NAs at the data creation stage, before entering the infer pipeline, this shouldn't be a problem. We should probably at the very least add a warning message that NAs are potentially resampled though, right? What other suggestions do you have for dealing with this? Maybe error out if NAs are present in the column being used, asking the user to go back and handle them before getting into the infer pipeline?
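As a minimal sketch of handling the NAs up front: the toy data frame below is invented for illustration (it stands in for fli_small), and the resampling is a hand-rolled bootstrap rather than infer's generate(), so it only shows the idea that once missing values are dropped there is nothing left to resample.

```r
library(dplyr)

# Toy stand-in for fli_small; the values are invented for illustration
toy <- data.frame(arr_delay = c(5, NA, -3, 12, NA, 0))

# Handle missing values explicitly before any resampling
toy_complete <- toy %>%
  filter(!is.na(arr_delay))

set.seed(2017)

# Hand-rolled bootstrap (not infer's generate()):
# resample row indices with replacement and count NAs per replicate
boot_na_counts <- replicate(100, {
  idx <- sample(nrow(toy_complete), replace = TRUE)
  sum(is.na(toy_complete$arr_delay[idx]))
})

# TRUE when no replicate contains a missing value
all(boot_na_counts == 0)
```

Dropping rows is only one possible decision; imputation or a different filter would slot into the same place before the pipeline starts.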


mine-cetinkaya-rundel commented on July 24, 2024

I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in the column being resampled (or in the two columns, if specify() has two variables) rather than the nrow of the data frame. What do you think? Though I guess if we give an error when there are NAs in the column to be resampled, we don't have to make this decision in infer.
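To make the difference between the candidate sample sizes concrete, here is a small base R sketch. The data frame and its values are made up for illustration, and the variable names n_rows, n_complete_one, and n_complete_two are not from infer.

```r
# Toy data with missing values in both columns (values are invented)
toy <- data.frame(
  arr_delay = c(5, NA, -3, 12, NA, 0),
  dep_delay = c(2, 1, NA, 10, NA, 4)
)

# Sample size if every row of the data frame is counted
n_rows <- nrow(toy)

# Sample size if only complete cases of the single resampled column count
n_complete_one <- sum(!is.na(toy$arr_delay))

# Sample size if two variables are specified and both must be observed
n_complete_two <- sum(complete.cases(toy[, c("arr_delay", "dep_delay")]))

c(rows = n_rows, one_variable = n_complete_one, two_variables = n_complete_two)
```

The three counts differ here (6, 4, and 3), which is the decision that erroring on NAs would let infer avoid.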


ismayc commented on July 24, 2024

I’m inclined to just give an error so that we don’t have to deal with this in multiple scenarios. Maybe even do this at the specify() stage so that we can avoid other problems going forward?
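A rough sketch of what such an early check could look like. The helper stop_if_na() is hypothetical and is not part of infer; it only illustrates failing fast on the specified columns before any resampling happens.

```r
# Hypothetical helper (not an infer function): error if any of the
# named columns contains a missing value
stop_if_na <- function(data, cols) {
  has_na <- vapply(data[cols], anyNA, logical(1))
  if (any(has_na)) {
    stop(
      "Missing values found in: ", paste(cols[has_na], collapse = ", "),
      ". Please handle NAs before resampling.",
      call. = FALSE
    )
  }
  invisible(data)
}

# Toy data (values are invented for illustration)
toy <- data.frame(
  arr_delay = c(5, NA, -3),
  origin = c("JFK", "LGA", "EWR")
)

# A complete column passes silently
stop_if_na(toy, "origin")

# A column with NAs raises an error; capture the message to inspect it
msg <- tryCatch(
  stop_if_na(toy, "arr_delay"),
  error = function(e) conditionMessage(e)
)
msg
```

Checking only the specified columns means unrelated incomplete columns in the data frame would not block the analysis.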


mine-cetinkaya-rundel commented on July 24, 2024

@ismayc That sounds good to me! This might make the na.rm option in calculate() obsolete, but I don't see any harm in leaving the ... in there, especially if we'll have true function recognition in there down the line and ... could be doing so much more than just NA handling.


ismayc commented on July 24, 2024

Sounds good! I’ll tag the commit here when I have this implemented.


mine-cetinkaya-rundel commented on July 24, 2024

Just saw the still pondering label on this. Pondering on implementation or whether the change should be made? If the latter, the answer is yes. Let me know if I can help to implement this change.


ismayc commented on July 24, 2024

@mine-cetinkaya-rundel Just pondering on implementation. If you have ideas, please do go for it!


ismayc commented on July 24, 2024

@mine-cetinkaya-rundel I don't believe we implemented this yet, did we?


mine-cetinkaya-rundel commented on July 24, 2024

No, I didn't. I just got a chance to start looking at some of the to-dos here, so I can work on this in the next couple of days, but feel free to go ahead if you have ideas now.


mine-cetinkaya-rundel commented on July 24, 2024

Looks like there are a few other NA-related decisions; it would be good to make consistent decisions about them.


ismayc commented on July 24, 2024

Agreed. I will take a look at summarizing the issues into one common issue this afternoon.


github-actions commented on July 24, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
