
psych10-book's Issues

web traffic example

raised by Jack van Horn: your example of website visits over a week might not be the best illustration of the discrete categories and sampling requirements needed for contingency tables and chi-squared testing.

comments on chapter 12

  1. It would be helpful to add a label to the y-axis in fig:chisqDist. I wasn't totally sure what the label should be -- students may be a bit confused between the y-axis in fig:chisqDist and fig:chisqSim.
  2. The sentence in line 116 is incomplete, "Figure @ref(fig:chisqSim) shows that the theoretical distribution matches closely with the results of a simulation that repeatedly added"
  3. Just FYI, the count() function is a cleaner way to get grouped counts; it replaces group_by(x, y) %>% summarize(n = n()).
  4. When using chisq.test() it's not necessary to create a table first. You can just pass the variables from the dataset (e.g., chisq.test(NHANES$Depressed, NHANES$SleepTrouble)). Should this be clarified in the text? It's an easier way for students to use the function.
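A sketch of points 3 and 4 combined (this assumes the NHANES package is installed and the column names are as used in the book):

```r
library(NHANES)
library(dplyr)

# count() replaces group_by(Depressed, SleepTrouble) %>% summarize(n = n())
NHANES %>% count(Depressed, SleepTrouble)

# chisq.test() can take the raw variables directly; no table needed
chisq.test(NHANES$Depressed, NHANES$SleepTrouble)
```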

comments on chapter 11

  1. In line 61, you write "it is telling us about the likelihood of the data given some parameter, whereas what we really want to know is the parameter value." It would be helpful to reference the example in parentheses next to the words "parameter" and "parameter value".
  2. In line 94, you describe, "Let's say that the security staff runs the bag through their testing apparatus 10 times, and it gives a positive reading on 6 of the 10 tests", but then in the code below you have "nPositives <- 9". I updated to nPositives <- 6.
  3. You mention "Bernoulli trial" in line 104. I imagine many students won't know what this is. Replace with "binomial"?
  4. In line 151 you write, " However, in other cases we want to use Bayesian estimation to estimate the value of a parameter." I think students may be confused about what the difference is between "estimating the value of a parameter" and the various values you just estimated using the airport example.
  5. Missing a word or typo in this sentence? Line 176 "We can compute the likelihood of the data under the any particular value of the effectiveness parameter using the dbinom() function in R"
  6. This is abstract; Line 201: "This marginal likelihood is used as a normalizing constant to ensure that the posterior values are true probabilities."
  7. Line 440, clarify what "this" is: "For example, if we had used a binomial distribution based on one head out of two coin flips, this would have been centered around 0.5 but fairly flat, biasing the posterior only slightly." This sentence is a bit awkward, but I'm afraid of changing the meaning if I revise.
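For what it's worth, the grid-based estimation that points 4-6 refer to can be condensed into a few lines (illustrative numbers, not necessarily the book's exact values):

```r
# grid of candidate values for the effectiveness parameter
theta <- seq(0, 1, by = 0.01)
prior <- rep(1 / length(theta), length(theta))     # flat prior over the grid

# likelihood of 6 positives in 10 tests under each candidate value
likelihood <- dbinom(6, size = 10, prob = theta)

# marginal likelihood: the normalizing constant mentioned in line 201
marginal <- sum(likelihood * prior)

posterior <- likelihood * prior / marginal
sum(posterior)  # posterior values are true probabilities: they sum to 1
```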

permutation issue

raised by @tansey: glanced at the hypothesis testing chapter and do have one gripe. You really should probably mention that randomization in the presence of covariates is generally not valid. Biologists seem to love the idea of simply being able to randomize a column in their covariate matrix and report back a p-value, but permutation tests assume that all covariates are independent.

comments on chapter 8

  1. You write "Using this function, we can generate random numbers from a uniform distribution, and then map those into the distribution of interest via its quantile function." I was a bit confused by what "this" (runif()?) refers to and what the last clause means.
  2. I think it could be helpful to set the chunk showing the uniform and normal distributions to echo=TRUE, given that you've just explained these functions. I tidied the chunk in case you want to do this.
  3. You then write, "By default, R will generate a different set of random numbers every time you run it." I'm not sure what "it" refers to (runif()?).

Typo: 11.3.2

Hi there,

Awesome book and thanks so much for making it open!

A small typo on section 11.3.2: nPositives <- 9 should instead be 6 (unless the typo is in the written section).
11.3.4's written section says 9, 11.3.5 says 6.

figure 5.5

from Felipe Ortega @jfelipe on twitter - Many examples are great, but I find fig. 5.5 confusing. Data in the last panel cannot be fitted to a straight line, but it can be fitted to a 2nd-order polynomial, which is a linear model as well.

Error when trying to create epub format ebook

When I run the following command to create an epub format ebook, I get errors. Has anyone tried to create an epub format before? Thanks!

bookdown::render_book("index.Rmd", "bookdown::epub_book")

Console output with errors:

label: sleepHist (with options)
List of 5
 $ echo      : logi FALSE
 $ fig.cap   : chr "Left: Histogram showing the number (left) and proportion (right) of people reporting each possible value of the"| __truncated__
 $ fig.width : num 8
 $ fig.height: num 4
 $ out.height: chr "33%"

  |........                                                         |  12%
  ordinary text without R code


label: unnamed-chunk-18

  ordinary text without R code


label: sleepAbsCumulRelFreq (with options)
List of 5
 $ echo      : logi FALSE
 $ fig.cap   : chr "A plot of the relative (red) and cumulative relative (blue) values for frequency (left) and proportion (right) "| __truncated__
 $ fig.width : num 8
 $ fig.height: num 4
 $ out.height: chr "33%"


  ordinary text without R code


label: ageHist (with options)
List of 5
 $ echo      : logi FALSE
 $ fig.cap   : chr "A histogram of the Age (left) and Height (right) variables in NHANES."
 $ fig.width : num 8
 $ fig.height: num 4
 $ out.height: chr "33%"

Quitting from lines 1265-1280 (StatsThinking21.Rmd)
Error in select(., Height) : unused argument (Height)
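A guess (not verified against the repo): this error typically appears when another attached package, such as MASS, masks dplyr::select, so the pipe calls the wrong select(). Namespace-qualifying the call is the usual workaround:

```r
library(dplyr)

# toy data frame standing in for the NHANES data at that point in the build
heightDf <- tibble(Height = c(160, 175, 182), Age = c(30, 40, 50))

# qualifying the call avoids any masking by other attached packages
heightDf %>% dplyr::select(Height)
```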

The system behind "named" tests

The book looks excellent; in particular the tight coupling of text and code! I have dreamed about the perfect stats book/course for a while, and your book seems close.

I would suggest adding a condensed overview of the popular statistical tests as linear models and mentioning non-parametric tests. I've made my first attempt here: https://rpubs.com/lindeloev/tests_as_linear (still WIP, but close to finished).

If the infographic or other parts of this could be useful, you are welcome to steal it :-)

Disposition-wise, I like starting with a simple regression (y = a*x+b) because that is what they've learned in high-school. And only then show how dummy coding of x can be exploited to make this work for categorical differences (t-tests).
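The equivalence can be sketched in a few lines of R (toy data; var.equal = TRUE makes t.test() match the regression exactly):

```r
# two groups with different means
set.seed(1)
y <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))
group <- rep(c(0, 1), each = 20)   # dummy coding of the categorical predictor

# the regression slope is the difference between group means,
# and its p-value equals the equal-variance t-test's p-value
summary(lm(y ~ group))$coefficients
t.test(y ~ group, var.equal = TRUE)
```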

Typos in Chapter 5.2 and Chapter 8.1

Chapter 5.2: In the line of

The mean (ofted denoted by a bar over the ...

I guess ofted is a typo here.

Chapter 8.1: Line 6

In in a casino game, numbers ...

Here I think the in is a typo.

typo on p. 21 (Chapter 2)

There is a typo on p. 21 (Chapter 2). The caption of Figure 2.1 says "valdity" instead of "validity".

fisher's exact test

raised by Jack van Horn: In your chapter on contingency tables, given that you mention Sir Ronald Fisher in an earlier chapter, you might also wish to discuss Fisher's Exact Test on 2x2 tables.

comments on chapter 14

  1. Figure 14.3 is not rendering well on the website; it looks distorted/stretched vertically. It could also be helpful to add labels to the columns in this figure.
  2. Line 236, would it be helpful to label which beta is which, or explain this in the text below? Not sure students will follow the code completely.
  3. Not sure if it will get confusing that we use different terms to refer to the relationship between x and y; i.e., "regression slope" (line 360), "effect of x on y" (line 425), "fit line" (line 477), "regression line" (line 485).
  4. Line 526 -- it seems like there needs to be a bit more explanation of what we're doing. The code for the function is pretty complex and I don't think students will follow.

Comments on 3.7

In the P(cancer|test) equation at the end of 3.7, both "cancer" and "disease" are used. I'm assuming these refer to the same thing (B), so it may be clearer to use only one of the words.

chapter 10 citation issue

In line 320 you have, "for example, the now discredited claims by wake:1999..." I think perhaps this is an issue with a citation manager?

Typos in 3.8 and 3.10

Chapter 3.8 in the second line of the last paragraph (page 37):

while the prior on the right side (P(B)) tells us how likely

I believe "prior" should be "part".

Chapter 3.10 in the second line of the second paragraph (page 38):

If were to ask you “How likely is it that the US will return to the moon by 2026”

I think a pronoun is missing between "If" and "were".

PS: I'm loving the book so far.

error in section 3.2.3

In section 3.2.3, it says that the result of P(Roll1throw1 U Roll1throw2) is 1/6. I think it should be 11/36. (In fact, 11/36 is the result given on the slides of the lecture on probability).
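A quick check with the addition rule supports 11/36:

```r
# P(one on throw 1 OR one on throw 2) = P(A) + P(B) - P(A and B)
1/6 + 1/6 - (1/6) * (1/6)  # 11/36, about 0.306
```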

comments on chapter 9

Great chapter!

  1. In line 85, we have "we formulate a prediction based on our hypothesis." We haven't actually stated the hypothesis yet. Should it be: "We hypothesize that more physical exercise is associated with higher BMI"? Just concerned students will get confused by the difference between a hypothesis and a prediction (I'm a bit confused myself).
  2. In line 288, I find this wording a bit confusing "so we have to add the probability that the observed value is as extreme in the other direction". I'm not totally sure how to improve. I think maybe walk through this more slowly instead of in one long sentence.

typo on p. 59 (Chapter 5)

There is a typo in the first line of p. 59. It says: "which is evident in the fact that the all of the points are very close to the line." The "the" in bold should not be there.

typo in 3.9

In section 3.9, just before the formula for prior odds, there is a typo in the word "positively". It says "positvely" instead of "positively".

chapter 3: typo

When finding the conditional probability of having diabetes if inactive, this line states:

the probability of someone having diabetes given that they are physically active is 0.141

I believe active should be inactive.

comments on chapter 4 (summarizing)

  1. Should some of the text below be moved to Chapter 3, as you use the summation symbol in that chapter?

"You may not be familiar with the $\sum$ symbol, which we call the summation symbol. It basically means that you should loop through all of the values of the index variable (j in this case, which goes from 1 to N) and add up all of the values. In the case of the PhysActive variable, N is equal to two (because there are two possible values)."

  2. In the Histogram Bins section, you begin by explaining that height is measured to the first decimal place. It could be helpful to add a code snippet showing a slice of 5-10 height rows from the dataset, e.g.:
    NHANES_adult %>%
    select(Height) %>%
    slice(50:55)

  3. I'm not sure the section on the "Freedman-Diaconis" rule is necessary... it ends up making your code a bit more complicated too.

  4. In the code chunk under "Skewness," I'm not sure why this is there:
    names(waittimes) <- c("waittime")

comments on chapter 5

  1. I anticipate students being confused about the difference between average error and root mean squared error -- i.e., why are we calculating both, how do they tell us different things.
  2. I updated some of your code to use the "add_predictions()" from the modelr package. This is a great package for building intuitions about predicted, observed, and residual values. There is also "add_residuals()". Really easy to use.
  3. There is a pretty big jump from using the mean as the model to a regression equation. I wonder if you could build up the regression equation more slowly. For example, explaining that given the relationship with age, perhaps age is a better predictor of height. Students may also be confused about whether/why the mean is still included in this model.
  4. You define "sample" in reference to "population." Should we also define population?
  5. I think students will wonder why we are introducing another new error term with SSE, i.e., why are we talking about error in so many different ways, and how do these different ways of looking at error communicate different things?
  6. I'm not sure if students will be able to helpfully interpret the figure showing the median cutting the cumulative distribution.
  7. When you explain variance and that it is sometimes referred to as "mean squared error," would it be useful to compare it to the "root mean squared error" that you have been calculating above? In other words, I think this is another place where students may get befuddled by referring to different ways of quantifying error.
  8. I'd leave the interquartile range until later.
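In case it's useful, the modelr workflow from point 2 looks like this (toy data and hypothetical column names, not the book's):

```r
library(modelr)
library(dplyr)

# toy age/height data
set.seed(42)
df <- tibble(age = 10:19, height = 100 + 5 * (10:19) + rnorm(10))

fit <- lm(height ~ age, data = df)

df %>%
  add_predictions(fit) %>%  # adds a "pred" column of fitted values
  add_residuals(fit)        # adds a "resid" column of residuals
```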

comments on chapter 13

  1. I was finding the Lorenz curves a bit confusing to understand at first because I was confused by the axis labels. The example on wikipedia had x = "Cumulative share of population" and y = "Cumulative share of income earned", which made more sense to me.
  2. Figure 13.2 is not rendering well on the website for the book. Hard to see the point labels and the DC label is cut off.
  3. I was initially a bit confused by Figure 13.4 because I expected you to show Figure 13.2 without DC. Could you clarify that this is just an example with a toy dataset? Or would it be helpful to also show 13.2 without DC?
  4. In line 307 for the Spearman example, why are we computing a correlation "on the hate crime data" using the dfOutlier data? Isn't this just a random sample of 10 numbers we used for Figure 13.4? I may be missing something.

typo on p. 91 (Chapter 7)

There's a typo on p. 91. In the paragraph just below "7.1 How do we sample", it says "indivdual" instead of "individual".

comments on chapter 3 (probability)

  1. I think there are typos in this sentence that are making it hard to understand: "Similarly, based on the fact that he reasoned that the since the probability of a double-six in throws of dice is 1/36, then the probability of at least one double-six on 24 rolls of two dice would be $24*\frac{1}{36}=\frac{2}{3}$. "

  2. I think there is a typo here: "He then used the fact that the complement of no sixes in four rolls is the complement of at least one six in four rolls". Should this instead read "He then used the fact that the complement of no sixes in four rolls is the probability of at least one six in four rolls"?

  3. In the code chunk under "cumulative probability distributions," I'm not sure why this is there: pFreeThrows=dbinom(seq(0,4),4,0.91)

  4. When you begin discussing conditional probability, some students may confuse this with the conjoint events you described above. You write "So far we have limited ourselves to simple probabilities - that is, the probability of a single event." Perhaps you can add a sentence clarifying this distinction?
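Both of de Méré's problems mentioned in points 1 and 2 can be checked numerically with the complement rule the chapter uses:

```r
# P(at least one six in four rolls of one die)
1 - (5/6)^4     # about 0.518

# P(at least one double-six in 24 rolls of two dice)
1 - (35/36)^24  # about 0.491, not the 2/3 from the naive 24 * 1/36
```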

typo on p. 95 (Chapter 7)

On p. 95, before "7.4 The Central Limit Theorem", it says: "In Section 9.3.6statistical-power) [...]".

Typo Chapter 6.1

Original:
In particular, they could have show a figure like that shown in Figure 6.2,...

Suggested Change:
In particular, they could have shown a figure like that shown in Figure 6.2,...
The first "show" should be "shown", but the sentence also seems a bit redundant; an alternative:

  • In particular, they could have shown a figure like the one displayed in Figure 6.2,...

English text in equations

I was flipping through a few of the chapters (really enjoying it so far!) and noticed that there are cases where you use English words in math environments, for example $P(Jefferson)=0.014$ on this line. This is of course totally fine, but the way LaTeX typesets characters in math is very different than the way it typesets text, which can lead to kind of weird rendering, e.g., the "ff" in the text uses a ligature whereas the f's are typeset separately in math mode:
[screenshot of the rendered ligature difference]

There are a variety of ways to remedy this.
Personally, I just use the \text{} command from within math mode, e.g., $P(\text{Jefferson})=0.014$.
This is a pretty minor issue and likely not worth the time to retrospectively correct (similar to #6), but perhaps the sort of thing you'd want to consider as you add or revise equations.

I'm excited to continue reading through the book and think it's fantastic that you've released it into the wild!

comments on chapter 7

  1. When describing how to quantify SEM you write, "In general we have to be careful about doing this with smaller samples (less than about 30). Because we have many samples from the NHANES population and we actually know the population parameter, we can confirm that this works correctly by comparing the SEM estimated using the population parameter with the actual standard deviation of the samples that we took from the NHANES dataset." I'm not totally sure what you mean by "doing this" and "we can confirm that this works". Perhaps replace each "this" with exactly what you're referring to?
  2. You write, "The formula for the standard error of the mean says that the quality of our measurement involves two quantities:..." This is the first time you've mentioned "quality of measurement". I wonder if this should be mentioned right at the start of the chapter when you introduce sampling error? You could also make it explicit there, in 1-2 sentences, that more sampling variability is bad and why. Something like: "Sampling error and sampling variability are associated with the quality of our measurement of the population. Clearly, the larger the difference between our estimate and the population parameter, the worse our estimate is. Further, when there is greater variability in our estimates across samples we cannot know which estimate (if any) reflects the population parameter."
  3. For the alcDist50 figure, I would set echo=FALSE as they won't know what the function means.
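The check described in point 1 can be sketched like this (a toy population rather than NHANES, so the numbers are illustrative):

```r
set.seed(123)

# toy "population" whose parameters we know exactly
population <- rnorm(10000, mean = 170, sd = 10)

# take many samples of size 50 and record each sample mean
sampMeans <- replicate(5000, mean(sample(population, 50)))

# SEM from the formula vs. the actual SD of the sample means
sd(population) / sqrt(50)  # formula-based SEM, about 1.41
sd(sampMeans)              # empirical value, close to the formula's
```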

Epub output

Hi Prof Poldrack,

Thank you so much for making this awesome textbook open source. Do you have any plans to make this textbook available in file formats for e-readers, like epub? It would be great if we could read this book in a more small-screen-friendly way.

typo on p. 53

There is a typo in the last paragraph of p. 53. It says: "unless we are looking at the same number of of observations". There is one extra "of".

fig:PureDeathSatFat

For the first plot in the Introduction, perhaps add geom_point() to help clarify the sentence, "This plot is based on ten numbers."

comments on chapter 2 (data)

  1. Is code visible to students whenever there is no "echo=FALSE" (e.g., the recoding in the third chunk)? If so, I think it would be better to have everything using tidyverse and to make sure style is consistent throughout the book. I'm happy to work on this if that makes sense to you.

  2. When defining nominal scale, could be helpful to loop back to your qualitative data coding example so they understand you are talking about the same thing.

  3. I wasn't totally clear on this "A nominal variable can only be compared for equality; that is, do two observations on that variable have the same value?" -- perhaps add an example with the fruit or the political parties?

  4. I found this somewhat confusing: "I could create a highly reliable measurement by simply giving the same answer each time regardless of the data." I wasn't sure what "regardless of the data" meant.

(Maybe) typo in section 11.3.4

In the text you write that

In this case, let’s say that we know that the specifity of the test is 0.9, such that the likelihood of a positive result when there is no explosive is 0.1.

The formula of marginal_likelihood is:
marginal_likelihood <- dbinom(nPositives,nTests,0.99)*prior + dbinom(nPositives,nTests,.1)*(1-prior)

Maybe I missed something, but I think that the correct formula should be:
marginal_likelihood <- dbinom(nPositives,nTests,0.9)*prior + dbinom(nPositives,nTests,.1)*(1-prior)

Or with respect to previous paragraph where you say:

Let’s say that we know that the sensitivity of the test is 0.99 – that is, when a device is present, it will detect it 99% of the time.

Then the marginal_likelihood formula should be:
marginal_likelihood <- dbinom(nPositives,nTests,0.99)*prior + dbinom(nPositives,nTests,.01)*(1-prior)

PS: Great book, thank you!
