Do NAs get resampled? (infer issue, closed)

tidymodels commented on July 24, 2024
Do NAs get resampled?

from infer.

Comments (13)

nicholasjhorton commented on July 24, 2024

ismayc commented on July 24, 2024

The NAs from the original sample are resampled in generate(). I think it's better for students/users to handle NAs in their original data first, rather than having this package handle them.

When you say "original sample": right now the resampling just uses the number of rows in the data frame resulting from specify(), which may include columns with complete data as well as columns like arr_delay below that are not complete. The na.rm argument in calculate() is then needed to remove the NAs that generate() has carried forward, as in the example below.

library(nycflights13)
suppressPackageStartupMessages(library(dplyr))
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>% 
  sample_n(size = 500) %>% 
  mutate(half_year = case_when(
    between(month, 1, 6) ~ "h1",
    between(month, 7, 12) ~ "h2"
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  select(arr_delay, dep_delay, half_year, 
         day_hour, origin, carrier)

# Determine number of missing arrival delay values
sum(is.na(fli_small$arr_delay))
#> [1] 15

# Bootstrap uses similar code to oilabs::rep_sample_n()
boots <- fli_small %>% 
  specify(response = arr_delay) %>% 
  generate(reps = 100, type = "bootstrap")
boots %>% 
  group_by(replicate) %>% 
  summarize(num_na = sum(is.na(arr_delay)))
#> # A tibble: 100 x 2
#>    replicate num_na
#>        <int>  <int>
#>  1         1     17
#>  2         2     26
#>  3         3     16
#>  4         4     14
#>  5         5     19
#>  6         6     12
#>  7         7     10
#>  8         8     14
#>  9         9      8
#> 10        10     12
#> # ... with 90 more rows

If users take care of the NAs at the data creation stage, before entering the infer pipeline, this shouldn't be a problem. At the very least, we should probably add a warning message that NAs are potentially resampled, right? What other suggestions do you have for dealing with this? Maybe error out if NAs are present in the column being used, asking the user to go back and handle them before getting into the infer pipeline?
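
For illustration, a minimal base R sketch of handling the NAs before resampling. The toy data frame `toy` and its values are made up here (it is not the flights sample above), and the resampling is done by hand with sample.int() rather than through infer, so this only shows the idea and not what generate() does internally:

```r
set.seed(2017)

# Toy data (hypothetical): 20 delays, 3 of them missing
toy <- data.frame(arr_delay = c(rnorm(17, mean = 5, sd = 10), NA, NA, NA))

# Drop incomplete rows first, as the user would before entering the pipeline
toy_complete <- toy[!is.na(toy$arr_delay), , drop = FALSE]

# One bootstrap replicate: resample the remaining rows with replacement
boot_rows <- sample.int(nrow(toy_complete), replace = TRUE)
boot <- toy_complete[boot_rows, , drop = FALSE]

# No NAs can be carried forward, so na.rm is not needed for the statistic
sum(is.na(boot$arr_delay))
#> [1] 0
mean(boot$arr_delay)
```

With the NAs removed up front, every replicate has the same number of usable observations, which is not the case in the output above.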

mine-cetinkaya-rundel commented on July 24, 2024

I completely agree that users should be taking care of their NAs and not us, as there may be different decisions that need to be made. So an error or warning is warranted for sure! That being said, I think the bootstrap sample size should be the number of complete cases in the column that is being resampled (or in the two columns, if specify() has two variables) as opposed to the nrow of the data frame. What do you think? Though I guess if we give an error when there are NAs in the column to be resampled, we don't have to make this decision in infer.
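
A rough base R sketch of what "sample size = number of complete cases in the specified columns" could look like. The data, the `vars` vector, and the hand-rolled resampling are assumptions for illustration only; nothing here is infer's actual implementation:

```r
set.seed(2017)

# Toy data (hypothetical): NAs in both columns, in different rows
toy <- data.frame(
  arr_delay = c(3, NA, 12, -4, 7, NA, 0, 25),
  dep_delay = c(1, 5, NA, -2, 6, 9, 2, 20)
)

# Columns that would be named in specify(); an assumption for this sketch
vars <- c("arr_delay", "dep_delay")

# Rows that are complete in every specified column
keep <- complete.cases(toy[, vars, drop = FALSE])
n_complete <- sum(keep)
n_complete
#> [1] 5

# Bootstrap replicate sized by the complete cases, not by nrow(toy)
complete_rows <- which(keep)
boot_rows <- complete_rows[sample.int(n_complete, replace = TRUE)]
boot <- toy[boot_rows, vars, drop = FALSE]
nrow(boot)
#> [1] 5
```

Here nrow(toy) is 8, but only 5 rows are complete in both columns, so each replicate has 5 rows and contains no NAs.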

ismayc commented on July 24, 2024

I’m inclined to just give an error so that we don’t have to deal with this in multiple scenarios. Maybe even do this at the specify() stage so that we can avoid other problems going forward?
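
One possible shape for such a check, as a sketch only. The helper name `check_no_missing()` and its message are hypothetical and not part of infer; the idea is just to fail early, before any resampling, when a specified column contains NAs:

```r
# Hypothetical helper, not part of infer: stop if any specified column has NAs
check_no_missing <- function(data, vars) {
  # Count missing values per specified column
  n_missing <- vapply(data[vars], function(col) sum(is.na(col)), integer(1))
  bad <- n_missing[n_missing > 0]
  if (length(bad) > 0) {
    stop(
      "Missing values found in: ",
      paste0(names(bad), " (", bad, ")", collapse = ", "),
      ". Please handle NAs before resampling.",
      call. = FALSE
    )
  }
  invisible(data)
}

toy <- data.frame(arr_delay = c(3, NA, 12), origin = c("JFK", "LGA", "EWR"))

# A complete column passes silently and returns the data unchanged
check_no_missing(toy, "origin")

# A column with NAs raises an error; capture the message to inspect it
result <- tryCatch(
  check_no_missing(toy, "arr_delay"),
  error = function(e) conditionMessage(e)
)
result
```

Checking only the specified columns means NAs elsewhere in the data frame do not block the analysis.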

mine-cetinkaya-rundel commented on July 24, 2024

@ismayc That sounds good to me! This might make the na.rm option in calculate() obsolete, but I don't see any harm in leaving the ... in there, especially if we'll have true function recognition in there down the line and ... could be doing so much more than just NA handling.
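
To make the point about ... concrete, a small sketch of argument forwarding in plain R. `calc_stat()` is a hypothetical stand-in for a calculate()-like step, not infer's code; it only shows that ... can carry na.rm as well as other arguments to the summary function:

```r
# Hypothetical stand-in for a calculate()-like step: extra arguments in `...`
# are forwarded unchanged to the summary function
calc_stat <- function(x, stat = mean, ...) {
  stat(x, ...)
}

x <- c(3, NA, 12, -4, 7)

# Without na.rm the missing value propagates
calc_stat(x)
#> [1] NA

# na.rm travels through `...` to mean()
calc_stat(x, na.rm = TRUE)
#> [1] 4.5

# `...` can carry more than NA handling, e.g. `probs` for quantile()
calc_stat(x, stat = quantile, probs = 0.5, na.rm = TRUE)
```

If NAs are rejected earlier in the pipeline, the na.rm case becomes unnecessary, but the forwarding itself stays useful for other arguments.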

ismayc commented on July 24, 2024

Sounds good! I’ll tag the commit here when I have this implemented.

mine-cetinkaya-rundel commented on July 24, 2024

Just saw the still pondering label on this. Pondering on implementation or whether the change should be made? If the latter, the answer is yes. Let me know if I can help to implement this change.

ismayc commented on July 24, 2024

@mine-cetinkaya-rundel Just pondering on implementation. If you have ideas, please do go for it!

ismayc commented on July 24, 2024

@mine-cetinkaya-rundel I don't believe we implemented this yet, did we?

mine-cetinkaya-rundel commented on July 24, 2024

No, I didn't. I just got a chance to start looking at some of the to-dos here, so I can work on it in the next couple of days, but feel free to go ahead if you have ideas now.

mine-cetinkaya-rundel commented on July 24, 2024

Looks like there are a few other NA-related decisions. It would be good to make consistent decisions about them.

ismayc commented on July 24, 2024

Agreed. I will take a look at summarizing the issues into one common issue this afternoon.

github-actions commented on July 24, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
