
rep_col_shuffle()? (infer, open issue)

andrewpbray commented on August 28, 2024
rep_col_shuffle()?

from infer.

Comments (3)

simonpcouch commented on August 28, 2024

I dig it! If folks would find this pedagogically useful, I think this is surely within scope and would have a low maintenance burden. :)


mine-cetinkaya-rundel commented on August 28, 2024

I think I can see the value, but I'm having a rough time picturing what procedures would look like based on @andrewpbray's description.

@andrewpbray -- Could you write up a couple of examples as though rep_col_shuffle() existed? Also, I think the name would need to be something else -- shuffle is to sample as col is to slice here (though obviously row would have been better).

  • rep_slice_sample() vs. rep_col_shuffle()
  • rep_slice_sample() vs. rep_mutate_shuffle() -- I don't love this at all, but it seems closer to parity


andrewpbray commented on August 28, 2024

Here's an example of a permutation test using a difference in means, starting with the existing implementation from the full pipeline examples docs.

library(infer)

# existing implementation
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
  
# new approach (to get through the generate step)
gss %>%
  rep_col_shuffle(age, reps = 1000)

where the output of the second pipeline would be a data frame with nrow(gss) * reps rows and ncol(gss) + 1 columns, the new column being replicate. In that data frame, age would be replaced by sample(age) within each replicate, while every other column keeps its original row order.
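To make that output shape concrete, here is a minimal base R sketch of what such a function might do. rep_col_shuffle() does not exist in infer; the name rep_col_shuffle_sketch(), its arguments, and the toy data frame are all assumptions for illustration. Unlike the proposal, it takes the column name as a string to avoid tidy evaluation.

```r
# Hypothetical sketch only: not part of infer.
rep_col_shuffle_sketch <- function(data, col, reps = 1) {
  stopifnot(is.data.frame(data), col %in% names(data), reps >= 1)
  replicates <- lapply(seq_len(reps), function(i) {
    shuffled <- data
    # Permute only the chosen column; the other columns keep their row order.
    shuffled[[col]] <- data[[col]][sample.int(nrow(data))]
    # Prepend the replicate index as a new column.
    cbind(replicate = i, shuffled)
  })
  # Stack the replicates into one long data frame.
  do.call(rbind, replicates)
}

set.seed(1)
# Toy stand-in for gss (made-up values).
toy <- data.frame(
  age = c(23, 35, 41, 52, 60, 29),
  college = rep(c("degree", "no degree"), times = 3)
)

shuffled <- rep_col_shuffle_sketch(toy, "age", reps = 3)
dim(shuffled)  # nrow(toy) * reps rows, ncol(toy) + 1 columns
```

With 6 rows and reps = 3, the result has 18 rows and 3 columns (replicate, age, college).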

The syntax would be the same for a permutation test for a difference in proportions, the coefficient of a linear model, etc.

If we did a close port of rep_slice_sample(), then that output data frame wouldn't have any of the metadata normally appended by specify() and hypothesize() that is used by calculate(), so the user would have to use dplyr to group_by(replicate) and calculate their statistics. I think that's ok.
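As a rough illustration of that dplyr step, the sketch below builds a stand-in for the shuffled output and then computes a difference in means per replicate. The toy data, the reps value, and the two-sided p-value at the end are assumptions added for illustration, not something the proposal specifies.

```r
library(dplyr)

set.seed(1)
# Toy stand-in for gss (made-up values).
toy <- data.frame(
  age = c(23, 35, 41, 52, 60, 29),
  college = rep(c("degree", "no degree"), times = 3)
)
reps <- 200

# Stand-in for the proposed output: stack `reps` copies of the data
# and permute age within each replicate.
shuffled <- bind_rows(replicate(reps, toy, simplify = FALSE), .id = "replicate") %>%
  group_by(replicate) %>%
  mutate(age = sample(age)) %>%
  ungroup()

# The step the user would write themselves: one statistic per replicate.
null_dist <- shuffled %>%
  group_by(replicate) %>%
  summarise(
    stat = mean(age[college == "degree"]) - mean(age[college == "no degree"])
  )

# Observed statistic on the unshuffled data, and a two-sided p-value.
obs_stat <- mean(toy$age[toy$college == "degree"]) -
  mean(toy$age[toy$college == "no degree"])
p_value <- mean(abs(null_dist$stat) >= abs(obs_stat))
```

Because the statistic is written out by hand, the order of subtraction is explicit in the summarise() call rather than passed through an order argument as in calculate().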
