
freq_cogsci's Introduction

Freq_CogSci

Linear mixed models in Linguistics and Psychology: A Comprehensive Introduction

freq_cogsci's People

Contributors

anthesevenants, audreyburki, vasishth


freq_cogsci's Issues

Compiling book fails: data/powerbeta1mean.Rda missing

Compiling the book fails because the file data/powerbeta1mean.Rda is missing, see here.

There is a line that creates this file on disk but it's commented out, see here.

It is not clear why the file is created and then immediately loaded; the detour via disk may not be necessary.

Same problem here.

ch 3 REML vs ML

There is absolutely nothing at the moment about REML vs. ML. A broader issue is that we need some explanation of how the parameters are estimated. This is explained for the simple linear model in ch 5, but not for the LMM, even though it is not so hard if one knows matrix algebra at an elementary level.
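
A minimal sketch of the kind of demonstration such a section could build on (using lme4's built-in sleepstudy data as a stand-in, not one of the book's datasets):

library(lme4)
## same model fit by REML and by ML; sleepstudy ships with lme4
m_reml <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = TRUE)
m_ml   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)
## the fixed-effect estimates are essentially the same, but the variance
## components differ: REML adjusts for the degrees of freedom used up by
## the fixed effects, ML does not
fixef(m_reml); fixef(m_ml)
VarCorr(m_reml)
VarCorr(m_ml)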

Add measurement error simulation code into book

Source: https://statmodeling.stat.columbia.edu/2024/04/14/simulation-to-understand-measurement-error-in-regression/#comment-2359176

library(tidyverse)
set.seed(123)
n <- 1000
a <- 0.2
b <- 0.3
sigma <- 0.5
## NOTE: the measurement-error SDs and the line creating `fake` were garbled
## in the comment; the values and the distribution of x below are assumptions
## reconstructed from the rest of the code.
sigma_x <- 0.5
sigma_y <- 0.5

fake <- tibble(x = runif(n, 0, 10),
               y = a + b * x + rnorm(n, 0, sigma)) %>%
  mutate(y_star = rnorm(n, y, sigma_y),
         x_star = rnorm(n, x, sigma_x))

bind_rows(
  tibble(x = fake$x, y = fake$y, name = "No measurement error"),
  tibble(x = fake$x, y = fake$y_star, name = "Measurement error on y"),
  tibble(x = fake$x_star, y = fake$y, name = "Measurement error on x"),
  tibble(x = fake$x_star, y = fake$y_star, name = "Measurement error on x and y")
) %>%
  mutate(name = fct_inorder(name)) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", fullrange = TRUE) +
  facet_wrap(~name)

Nested contrasts chapter

Show a quick example illustrating this point:

"Note that in cases such as these, where $A_{B1}$ vs. $A_{B2}$ are nested within levels of $B$, it is necessary to include the effect of $B$ (part of speech) in the model, even if one is only interested in the effect of $A$ (word frequency) within levels of $B$ (part of speech). Leaving out factor $B$ in this case can lead to biases in parameter estimation in the case the data are not fully balanced."

ch 3 explain log normal

"The exponentiated values are medians, not means. We use the median here because the mean in the log-transformed data depends on the standard deviation."

This needs to be explained in detail in a box.
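
A small numerical illustration that such a box could build on (the log-scale parameter values below are arbitrary):

set.seed(42)
mu <- 6; sigma <- 0.5   # arbitrary values on the log (latent) scale
y <- rlnorm(1e6, mu, sigma)
exp(mu)                 # close to median(y): the back-transformed mean of log(y)
exp(mu + sigma^2 / 2)   # close to mean(y): depends on sigma as well
c(median = median(y), mean = mean(y))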

BLUEs and BLUPs

  • p. 126: Possibly add some content here: discuss the difference between BLUEs and BLUPs. Which estimate is "more correct"? What is the reason we want BLUPs? I.e., regression to the mean.
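
One possible minimal demonstration of the shrinkage / regression-to-the-mean point (again using lme4's built-in sleepstudy data as a stand-in for the book's example):

library(lme4)
m <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
## BLUPs: by-subject slope estimates, shrunk towards the fixed-effect slope
blup_slopes <- coef(m)$Subject$Days
## no-pooling comparison: a separate regression for each subject
nopool_slopes <- sapply(split(sleepstudy, sleepstudy$Subject),
                        function(d) coef(lm(Reaction ~ Days, d))["Days"])
## the BLUPs are pulled towards the population mean slope
round(cbind(blup = blup_slopes, nopool = nopool_slopes), 1)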

Improve figure

Figure 3.2: again, remove the data points; they don't contribute anything, do they? It's hard to see the different lines; this may be better once the points are removed. Also: is it possible to give the slopes as numbers? That might be easier to judge.


Daniel's objection to overfitting in the no-pooling model

  • p. 108, Figure: what do the points represent? The raw data? There seems to be serious overfitting.

SV: Those are the data points from the RC experiment. Sure, there is overfitting, but that's what the repeated measures regression model would require us to do. What is your objection here?

Simulation chapter: missing data

Show through simulation that the LMM's Type I error properties are hardly affected by missing data. This is a consequence of shrinkage.
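
A rough sketch of the kind of simulation this could be based on (the design, sample sizes, and all variance components below are assumptions):

library(lme4)
set.seed(1)
nsim <- 200; nsubj <- 30; nitem <- 16
sim_once <- function(prop_missing) {
  d <- expand.grid(subj = factor(1:nsubj), item = factor(1:nitem))
  d$x <- ifelse(as.integer(d$item) %% 2 == 0, 0.5, -0.5)  # between-item factor
  ## null effect of x; by-subject and by-item intercepts plus residual noise
  d$y <- 400 + rnorm(nsubj, 0, 30)[d$subj] + rnorm(nitem, 0, 20)[d$item] +
    rnorm(nrow(d), 0, 50)
  d <- d[runif(nrow(d)) > prop_missing, ]   # delete data completely at random
  m <- lmer(y ~ x + (1 | subj) + (1 | item), d)
  abs(coef(summary(m))["x", "t value"]) > 2 # approximate alpha = 0.05 criterion
}
mean(replicate(nsim, sim_once(0.3)))        # should remain close to 0.05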

F1 formant data not appropriate?

p. 72: why are female and male data points paired? I don't understand this. Isn't gender fixed for each person, and aren't the data from different people? Is this averaged across male versus female subjects? And is it paired because these are responses to the same vowel in the same language?
Yes, the last point you mention.
I.e., in an item-based analysis, gender is a dependent variable? That is not a very intuitive concept for psychologists, who may not even know about item-based statistics; many psych people do not need these. This needs to be explained, or an example with subject-based statistics should be used.
I'm not seeing the problem, but maybe we can talk about it later and change the example. I opened an issue.

[Typo?] Unnecessary 'was' in an object relative clause

In the following passage, which explains object relative clauses, the first *was* following the relative clause marker *who* is unnecessary.

Subject relative clauses are sentences like *The man who was standing near the doorway laughed*. Here, the phrase (called a relative clause) *who was standing near the doorway* modifies the noun phrase *man*; it is called a subject relative because the noun phrase *man* is the subject of the relative clause. By contrast, object relative clauses are sentences like *The man who was the woman was talking to near the doorway laughed*; here, the *man* is the grammatical object of the relative clause *who was the woman was talking to near the doorway*.

🆖 The man who (*was) the woman was talking to near the doorway laughed
🆗 The man who the woman was talking to near the doorway laughed

Fig caption missing

  • p. 107: add Figure number + caption (also missing for some other figures, e.g., p. 108)

Compiling book fails: missing image file

Error message:

label: lk13E1 (with options) 
List of 4
 $ fig.cap  : chr "(ref:lk13E1)"
 $ out.width: chr "99%"
 $ echo     : logi FALSE
 $ fig.align: chr "center"

Quitting from lines 2948-2949 (Freq_CogSci.Rmd) 
Error in knitr::include_graphics("figures/lk13E1.png", dpi = 1000) : 
  Cannot find the file(s): "figures/lk13E1.png"
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> <Anonymous>

The code block in question is:

knitr::include_graphics("figures/lk13E1.png", dpi = 1000)

However, the image figures/lk13E1.png does not exist and is not generated either (at least not under that name, as far as I can see).

Compiling book fails: 'x' must be an array of at least two dimensions

This error is generated here. Full error message:

label: unnamed-chunk-385
Quitting from lines 9514-9521 (Freq_CogSci.Rmd) 
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions
Calls: <Anonymous> ... eval -> eval -> table -> rowSums -> rowSums -> <Anonymous>

ch 2 Audrey suggested changes

  • Audrey said:

Is there already a paragraph or two in the book about hypothesis testing, and about the fact that these statistical tests only make sense with a priori hypotheses? I assume there will be a discussion of inference vs. exploratory analyses later on, but it would do no harm to mention here already that these tests test one a priori defined hypothesis.

Also, I would add somewhere in this chapter an explanation of why this approach uses null hypothesis testing (i.e., the only hypothesis for which we have some information).

PS I don't understand the last sentence from Audrey.

  • Add a box on a one-sided t-test (a minimal sketch follows this list).
  • Relocate the funnel plot to the beginning of the discussion of Type M and S errors.
  • Explain the Levy and Keller design (2.7.1) in more detail; the word adjunct was not clear to Audrey.
  • When showing the formant data (Apache, etc.), show more than one vowel.
  • Explain what degrees of freedom are.
  • Add a section on why aggregation is bad.
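
For the one-sided t-test box, a minimal sketch (the sample and effect size below are made up for illustration):

set.seed(3)
x <- rnorm(20, mean = 0.3, sd = 1)
## one-sided test: is the population mean greater than zero?
t.test(x, mu = 0, alternative = "greater")
## default two-sided test, for comparison
t.test(x, mu = 0)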

Simulation chapter needs a section on the dangers of aggregation in LMMs

We need to show through simulation that aggregating data by items will hide a lot of potentially important variation, leading to possible Type I error inflation (of course, this depends on the particular situation being simulated; maybe also show a case where this does not happen).
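
A rough sketch of such a simulation, under one reading of the issue (each subject's data are averaged over items before a paired t-test; all parameter values are assumptions):

library(lme4)
set.seed(2)
nsim <- 200; nsubj <- 30; nitem <- 16
one_run <- function() {
  d <- expand.grid(subj = factor(1:nsubj), item = factor(1:nitem))
  d$x <- ifelse(as.integer(d$item) %% 2 == 0, 0.5, -0.5)   # between-item factor
  ## null fixed effect of x, but substantial by-item intercept variation
  d$y <- 400 + rnorm(nsubj, 0, 30)[d$subj] + rnorm(nitem, 0, 40)[d$item] +
    rnorm(nrow(d), 0, 50)
  m <- lmer(y ~ x + (1 | subj) + (1 | item), d)
  lmm_rej <- abs(coef(summary(m))["x", "t value"]) > 2
  ## aggregate each subject's data over items: the item variation is hidden
  cellmeans <- with(d, tapply(y, list(subj, x), mean))
  t_rej <- t.test(cellmeans[, 1], cellmeans[, 2], paired = TRUE)$p.value < 0.05
  c(lmm = lmm_rej, aggregated_t = t_rej)
}
## the aggregated analysis shows inflated Type I error; the LMM stays near 0.05
rowMeans(replicate(nsim, one_run()))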

Add an explanation of terms

Maybe write a brief intro to central math concepts at the beginning, such as: What is an expectation? What is i.i.d.? The symbol ∀, matrix inversion, … (or provide a footnote with an explanation when a concept is first encountered).
