Only have monochromatic bins in null distribution + p-value histograms? about infer HOT 15 CLOSED

tidymodels commented on July 4, 2024

Only have monochromatic bins in null distribution + p-value histograms?

from infer.

Comments (15)

ismayc commented on July 4, 2024

@rudeboybert I'd love to figure out a good way to implement this. It seems like a really tricky task though. I think in order to program it, you'd need to first check to make sure that the obs_stat doesn't fall outside of the range of the stat. (This is just a simple if clause.) If it does, you just do the usual binning as you wish.

(The tricky part) If it doesn't, how could we set up the bins so that they:

start at the minimum value of stat,
have an edge at (a) -obs_stat (or obs_stat itself if obs_stat is negative),
fit the appropriate number of bins between (a) and (b) obs_stat (or -obs_stat if obs_stat is negative), and
still extend up to the maximum value of stat?
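
One way to sketch the binning described above, in base R: pick an equal bin width that divides the distance between -|obs_stat| and +|obs_stat| exactly, then extend that grid until it covers the range of stat. The helper name aligned_breaks() and the simulated data are hypothetical, introduced only for illustration; this is not code from infer or ggplot2.

```r
# Hypothetical helper (not part of infer or ggplot2): equal-width breaks
# chosen so that -|obs_stat| and +|obs_stat| both land on bin edges.
aligned_breaks <- function(stat, obs_stat, bins = 15) {
  rng <- range(stat)
  target_width <- diff(rng) / bins
  cut <- abs(obs_stat)

  # Observed statistic outside the simulated range (or exactly zero):
  # fall back to ordinary equal-width binning.
  if (cut == 0 || obs_stat < rng[1] || obs_stat > rng[2]) {
    return(seq(rng[1], rng[2], length.out = bins + 1))
  }

  # Number of bins between -cut and +cut, then the width that makes it exact.
  k <- max(1, round(2 * cut / target_width))
  width <- 2 * cut / k

  # Extend the grid left and right until it covers the whole range of stat.
  lo <- floor((rng[1] + cut) / width)
  hi <- ceiling((rng[2] + cut) / width)
  -cut + width * (lo:hi)
}

set.seed(1)
stat <- rnorm(1000)                            # stand-in for a simulated null
breaks <- aligned_breaks(stat, obs_stat = 1.3)
```

The resulting vector can be passed to hist(stat, breaks = breaks) or to the breaks argument of ggplot2's geom_histogram(). The number of bins is only approximately the requested one, and the edges match ±obs_stat only up to floating-point error.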

ismayc commented on July 4, 2024

It seems like ggplot2's internal function bin_breaks_bins() provides some guidance, but it still seems really, really tricky:

bin_breaks_bins <- function(x_range, bins = 30, center = NULL,
                            boundary = NULL, closed = c("right", "left")) {
  stopifnot(length(x_range) == 2)
  
  bins <- as.integer(bins)
  if (bins < 1) {
    stop("Need at least one bin.", call. = FALSE)
  } else if (bins == 1) {
    width <- diff(x_range)
    boundary <- x_range[1]
  } else {
    width <- (x_range[2] - x_range[1]) / (bins - 1)
  }
  
  bin_breaks_width(x_range, width, boundary = boundary, center = center,
                   closed = closed)
}

mine-cetinkaya-rundel commented on July 4, 2024

Another option would be to do something like this:

Obviously based on the simulated null as opposed to theoretical (so underneath there is a histogram instead of normal curve) but we just shade beyond the observed over the whole plot instead of in the bins. Perhaps not as pretty, but a simpler solution.
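
A minimal ggplot2 sketch of this idea, assuming a right-tailed test: draw the histogram of the simulated statistics in a single color, then lay a semi-transparent rectangle over everything at or beyond the observed statistic. The data, the value of obs_stat, and the colors are illustrative assumptions, and this is not how visualize() is implemented.

```r
library(ggplot2)

set.seed(2)
null_dist <- data.frame(stat = rnorm(1000))  # stand-in for a simulated null
obs_stat <- 1.8                              # illustrative observed statistic

p <- ggplot(null_dist, aes(x = stat)) +
  geom_histogram(bins = 20, fill = "grey70", color = "white") +
  # Semi-transparent band over the whole region at or beyond obs_stat
  annotate("rect", xmin = obs_stat, xmax = Inf, ymin = 0, ymax = Inf,
           fill = "red", alpha = 0.25) +
  # Vertical line marking the observed statistic
  geom_vline(xintercept = obs_stat)

# The p-value is computed from the simulated statistics, not from the bins
p_value <- mean(null_dist$stat >= obs_stat)
```

Printing p draws the plot. Because the bins stay one color, the bin containing obs_stat is crossed by the band's edge rather than being shaded in two colors.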

ismayc commented on July 4, 2024

@rudeboybert: Do you think having a bin cut in half at the observed statistic would be OK for students? I think this is much better than having some weird shading on the bin in which the observed stat falls.

@mine-cetinkaya-rundel I like this option much, much better than the current shading scheme! Much easier to decipher and really makes it clear what the p-value corresponds to I think.

@andrewpbray @beanumber @hardin47 Any opposition to this shading scheme?

ismayc commented on July 4, 2024

Any thoughts on colors? I prefer green to red when working with p-values since I want to try to have students think about green as being "GO!" in looking at the p-value, but I'm open to using red as Mine did above. Here are some examples of what the shading looks like for three types of hypothesis test directions with the density overlay:

mine-cetinkaya-rundel commented on July 4, 2024

Do we want/need the density overlay? I think no for simulation based inference since we directly calculate the p-value from the histogram.

For the last figure, I purposefully had the bold line on one side only to indicate the location of the observed sample stat. I'm not wedded to that idea, but I thought it's important to show the other side is just a mirror image, not based on an observed sample stat.

As for the line cutting through the bin -- I guess it's a bit confusing, though less so than the bin itself being multiple colored.

ismayc commented on July 4, 2024

We don't have to have the density overlay, but it is an option in visualize() and that's what I was currently testing there. I do agree that I like to have ONLY the observed stat come through as a bolded vertical line. Here is an example without the density overlay (using -obs_stat from the previous plots as the observed statistic).

Agreed that the line cutting through the bin is far easier to explain and this solution cuts down on the number of lines of code in the visualize() function by a lot. I think the tradeoff is well worth it here.

rpruim commented on July 4, 2024

This is a little bit tricky to do "right" -- and you may have to decide which elements of "right" matter most to you.

I wrote code in mosaic to do multi-colored bins in lattice histograms, but in general it isn't really a very good idea since it can violate the "area = probability" maxim that defines histograms and density curves. Here's why: Suppose you have a bin over the interval [a, b] and a cutpoint k in that interval. It is not necessarily the case that P(a <= X <= k) / P(a <= X <= b) = (k-a) / (b-a).

So you really should make sure that the test statistic falls on the edge of a bin to maintain the area = probability identity.
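
A small numerical check of this point, using an illustrative standard normal sample rather than anything from the thread: for a bin [1, 2] cut at its midpoint, shading half the bar suggests that half of the bin's values lie below the cut, but the simulated values say otherwise.

```r
# Illustrative simulated null distribution (not data from the thread)
set.seed(3)
stat <- rnorm(10000)

a <- 1; b <- 2; k <- 1.5   # one bin [a, b] with a cutpoint k at its midpoint

in_bin    <- stat >= a & stat <= b
left_of_k <- stat >= a & stat <= k

share_by_width <- (k - a) / (b - a)              # what shading by width implies: 0.5
share_by_count <- sum(left_of_k) / sum(in_bin)   # what the simulated values show

# Exact value for a standard normal, about 0.68
share_exact <- (pnorm(k) - pnorm(a)) / (pnorm(b) - pnorm(a))
```

The density falls across this bin, so roughly two thirds of its values sit in the left half. Coloring the bar in proportion to width would overstate the tail area to the right of k.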

On the other hand, if you like "nice" bins, and want all bins to be the same width, this can be a problem. I think I would suggest creating a custom bar plot that splits the bar(s) containing the test statistic (and its negative) and calculates separate heights for each of the two "halves". The result will be a true histogram that accurately represents probability with area, but it will have 2 (4) bins that are narrower than the others.

I also prefer density histograms, because if you start splitting bins (or having unequal widths for any other reason), then "count" doesn't really make sense for the y-scale anymore. If you want to stick with counts, then you really should have all bins the same width.
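
The split-bin density histogram can be sketched in base R by adding ±obs_stat to a set of "nice" breaks and letting hist() compute a height for each resulting bin. The data, the value of obs_stat, and the 0.5 bin width are illustrative assumptions.

```r
set.seed(4)
stat <- rnorm(2000)   # illustrative simulated null distribution
obs_stat <- 1.37      # illustrative observed statistic

# Equal-width "nice" breaks, plus extra cuts at +/- obs_stat
nice <- seq(floor(min(stat)), ceiling(max(stat)), by = 0.5)
breaks <- sort(unique(c(nice, -abs(obs_stat), abs(obs_stat))))

# With plot = FALSE, hist() returns counts and densities for each bin
h <- hist(stat, breaks = breaks, plot = FALSE)

# density = count / (n * bin width), so each bar's area is the bin's proportion
areas <- h$density * diff(h$breaks)

# Total area of the bins lying entirely beyond +/- obs_stat
tail_area <- sum(areas[abs(h$mids) > abs(obs_stat)])
```

Here tail_area agrees with mean(abs(stat) >= abs(obs_stat)), the two-sided simulation p-value, because the cuts fall exactly on bin edges. Calling plot(h) draws the result on the density scale, since the bins are no longer of equal width.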

PS. I haven't mentioned the sticky issues of floating point arithmetic, fuzzy computation, and data resolution.

PPS. I've also used the simple trick of overlaying a semi-transparent rectangle. It is easy to implement (see mosaic::statTally()), but has some of the same issues as the histogram and can slightly distort the representation of the p-value if test statistic values are not uniformly distributed within histogram bins.

library(mosaic)  # provides statTally()

x <- c(10, 18, 9, 15)   # counts in four cells
rdata <- rmultinom(999, sum(x), prob = rep(.25, 4))  # 999 tables simulated with equal cell probabilities
statTally(x, rdata, fun = max, binwidth = 1)  # unusual test statistic

Now I have to decide whether I want to go back and fix these issues in statTally(). (Thanks ;-)

PPPS. I think the vertical line should stop at the x-axis like the bins and the rectangle do.

rudeboybert commented on July 4, 2024

I definitely prefer the vertical splitting of bars over the horizontal splitting of bars as a short-term, minimally viable solution, since a little supplementary explanation will get the correct point across. But in the longer term, once the bigger fish are fried, having the appropriate binning scheme will yield graphics that are stand-alone and self-explanatory.

mine-cetinkaya-rundel commented on July 4, 2024

@rudeboybert Would this require having bins of varying widths? If so, I can see it being even more confusing. Alternatively it would require somewhat non-standard binwidths/ticks, correct? Not sure if either of these is a huge improvement but we could always prototype and see. I guess I'm just having difficulty picturing what they would look like.

mine-cetinkaya-rundel commented on July 4, 2024

Also on the issue of colors -- I am completely impartial, but previously I had used red for p-values (red -> reject) and green for confidence intervals (green -> contain). I think we can go with either though.

beanumber commented on July 4, 2024

Not red and green on the same plot though, right?

https://en.wikipedia.org/wiki/Color_blindness#Red.E2.80.93green_color_blindness

ismayc commented on July 4, 2024

For sure! I will be nice to any color-blind folks in any vignettes/examples.

mine-cetinkaya-rundel commented on July 4, 2024

@beanumber No, not on the same plot. I didn't think we would show p-value and confidence interval on the same plot anyway.

github-actions commented on July 4, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
