Since we can't compare the new language-detection module to the old one, we should at least show that the new one works properly (and thus account for the minor changes we see in the figures; see wehlutyk/brainscopypaste@2b8dcb8 and wehlutyk/brainscopypaste@9163bdf).
That is, if we're using H0 to look at WN results, the destination word should be drawn from the WN pool; if it's for FA results, it should be drawn from the FA pool.
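A minimal sketch of this pool-matching rule (the pool contents and the h0_destination helper are made-up illustrations, not the actual analysis code):

```python
import random

# Toy candidate pools standing in for the real WordNet (WN) and Free
# Association (FA) vocabularies; the contents are purely illustrative.
POOLS = {
    'WN': ['happy', 'glad', 'joyful', 'content'],
    'FA': ['happy', 'smile', 'birthday', 'face'],
}

def h0_destination(feature_source, rng=None):
    """Draw a null-hypothesis destination word from the pool matching the
    feature's source: WN-coded features draw from the WN pool, FA-coded
    features from the FA pool."""
    rng = rng or random.Random()
    return rng.choice(POOLS[feature_source])
```

The point is only that the null model's candidate pool must match the feature family being tested, so the H0 comparison is apples-to-apples.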
Changing the language-detection module gave us a few more quotes. My first reproduction of the analysis gives "Stored 1188 of 6586 mined substitutions" (instead of 1051 out of 6172).
Semantic similarity: how do substitutions affect quotes, informally, and with Lauf et al.'s typification?
Introduce H00: we take similarity into account by looking at synonyms
We also looked at finer-grained similarity measures. "Semantic similarity" gives scores that are hard to interpret on their own. It does show that substitutions go to lower-similarity words rather than higher-similarity ones. Distance travelled on the FA network shows that substitutions are not between immediate neighbours. We haven't compared this to an H0 (to look for a bias), but it does rule out predicting the exact destination word from similarity alone.
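What the FA-distance measure boils down to can be sketched as a breadth-first search over a toy association graph (the graph data and the fa_distance helper are illustrative, not the real distance.ipynb code):

```python
from collections import deque

# Toy free-association graph: cue -> set of responses (made-up data).
FA = {
    'dog': {'cat', 'bone'},
    'cat': {'mouse'},
    'mouse': {'cheese'},
}

def fa_distance(source, target, graph=FA):
    """Breadth-first number of association hops from source to target;
    None if target is unreachable."""
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        word, depth = frontier.popleft()
        for nxt in graph.get(word, ()):
            if nxt == target:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None
```

On this toy graph, dog → cheese takes 3 hops, so a destination word at that distance is clearly not an immediate neighbour of the source.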
Feature selection: we did better feature selection, and good news: we can keep most of our features. Explain this and merge into the domain-knowledge feature-selection part. In particular, no need for predictive model.
Rewrite the flow of the argumentation. See storytelling for that.
Exclude POS analysis, mentioning we remove stopwords in the analysis (which preempts the closed/open class analysis)
Other questions
Effect of context: if a sentence contains only high-feature words, are they replaced with words closer to those values?
Rethink susceptibility to substitution (there's a real problem: if low-frequency words have a 1/3 probability of being substituted, how do you interpret that value when a sentence contains 4 low-frequency words? Also, analyse the degenerate cases of the estimators).
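The degeneracy can be made concrete with a bit of arithmetic (a sketch; h0_share is a hypothetical helper, not the paper's estimator):

```python
from fractions import Fraction

# Each sentence contributes exactly one substitution. Reading a per-word
# "substitution probability" of 1/3 for low-frequency words is therefore
# inconsistent: a sentence with 4 such words would total 4/3 > 1.
naive_total = 4 * Fraction(1, 3)

def h0_share(n_feature_words, sentence_length):
    """Expected share of substitutions hitting the feature's words when
    the target is picked uniformly among the sentence's words (H0)."""
    return Fraction(n_feature_words, sentence_length)
```

A raw 1/3 rate is then only meaningful relative to such an H0 baseline (e.g. 4 low-frequency words out of 12 already give 1/3 under uniform picking), not as a per-word probability.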
Effect of POS (categorical feature) on variation and on substitution rate
Cross-feature effects and prediction
For susceptibility
Regressing which word is substituted (with POS, and after feature selection, showing the role of each feature). It does not work well, probably because of the one-substitution-per-sentence constraint, which can't be factored into a simple model and leads accuracy to plummet when recall goes up, and vice versa. We could try to predict in which bin (or quantile) of a sentence a substitution falls (an unconstrained problem), but that would be a prediction based on sentence features, not word features (otherwise, which words?), which is outside the scope of our paper.
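The precision/recall trade-off here can be illustrated with plain threshold classification over made-up per-word scores (all data and names below are hypothetical, not the actual regression):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of 'this word gets substituted' predictions
    obtained by thresholding per-word scores."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up per-word scores; exactly one positive label per sentence makes
# positives rare, so lowering the threshold to raise recall floods the
# predictions with false positives.
scores = [0.9, 0.4, 0.3, 0.8, 0.2, 0.1, 0.7, 0.6]
labels = [True, False, False, False, False, False, True, False]
strict = precision_recall(scores, labels, 0.85)  # high precision, low recall
loose = precision_recall(scores, labels, 0.35)   # high recall, low precision
```

Without the one-per-sentence constraint in the model, there is no threshold that keeps both quantities high at once.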
We compute PCA on the substituted words, and it mostly catches the correlations between the features.
For variation
Are substitutions mainly among synonyms or neighbours in the FA graph? It seems not (see distance.ipynb), which is consistent with the fact that this situation is much more complex than what happens in random or chosen lists of words. So we really don't try to predict the appearing word among the synonyms of the source word.
Regressing the new word's features (again after feature selection, showing the role of each feature): predict the value of one feature based on the source features
We compute PCA on the variations, and show the evolution of the meta-features, but the number of words included is greatly reduced (because it's only the words that have all features defined), which weakens the case for PCA.
Add a reference "This work has also been partially supported by the French National Agency of Research (ANR) through the grant Algopol (ANR-12-CORD-0018)"
Find a better title (Gureckis: "if we copied/pasted this study couldn't exist!")
Fix first sentence of the introduction. I can't find a better formulation.
Rewrite abstract
Check the whole text/flow/definitions for clarity against Gureckis' edited pdf version
Don't use in vivo / in vitro, or properly define it
Rename "orthographical" to "orthographic" in all figures and notebooks
Cite software colophon used, at the end: Python, numpy/scipy, pandas, statsmodels, jupyter, etc.
Decide whether or not to have Supplementary Information. No.
The review from Cognitive Science is synthesized into 6 main points in the Cog. Sci. Review wiki page. That page also contains links to the issues tracking each of the 6 points, and reviewer-by-reviewer syntheses with more details.
Things that have changed
Here is the list of things that have changed, for use in the cover letter.
A number of values in the paper have changed (the tracking issue for that is #12)
Clustering coefficient values, since FA link weights are now taken into account in their computation (so it's computed on the undirected weighted graph)
An update in the language detection module has changed the cluster filtering a little, giving us a few more clusters and quotes than before. As a result of this, the number of words coded by Word Frequency has also changed (since frequency of words is computed on the filtered data set). (Details in #12.)
The discovery of three bugs, and the improvement of substitution filtering, led us to gain many more substitutions than previously (again, details in #12). All in all, the code is now much more reliable, as it has unit tests covering nearly all of it. Language, cluster, and substitution filtering are also controlled by precision/recall analyses.
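For illustration, the weighted clustering-coefficient computation looks like this in networkx (toy graph, made-up weights; passing weight='weight' selects the weighted coefficient, which is what changed the reported values):

```python
import networkx as nx

# Toy undirected weighted graph standing in for the FA network.
g = nx.Graph()
g.add_weighted_edges_from([('a', 'b', 0.5), ('b', 'c', 0.2),
                           ('a', 'c', 0.9), ('c', 'd', 0.1)])

# Unweighted clustering only counts triangles; the weighted variant
# discounts triangles with weak links, so the values shift.
unweighted = nx.clustering(g)
weighted = nx.clustering(g, weight='weight')
```

Node 'a' sits in one triangle, so its unweighted coefficient is 1.0, while its weighted coefficient drops below that because the b–c link is weak.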
Introduction has been rewritten from scratch to better explain our goals (→ synthesis point 1). It should be clearer and easier to follow thanks to more examples.
Related work has also been mostly rewritten to incorporate the literature we had missed in the first submission (→ synthesis point 2). In particular, work in psycholinguistics on lists of words has been thoroughly reviewed, and work on iterated learning experiments (Kirby) has been integrated into the whole discussion.
The overall writing and phrasing in the whole paper has received a lot of attention (→ synthesis point 1, and criticisms of bad writing)
The initial parts of Methods have been expanded to better explain some choices which seemed arbitrary (→ synthesis points 4 and 6)
The set of features used has been expanded (with orthographic and phonological neighbourhood densities, and number of letters), and the way features are selected has been greatly improved and rationalized (→ synthesis point 3)
The demand for word-word metrics (rev. 2) is partially met by the addition (further down) of H00, and a short discussion of distances travelled by substitutions
The section on Substitution model has been expanded to better explain the work done, and show the robustness of results (→ synthesis point 4)
The possible bias from focusing on single-substitutions has been addressed by extending substitution models to the two-substitution case. The results are unchanged (and available in the code repository). (→ synthesis point 6)
Susceptibility has been much better defined, with respect to a null hypothesis. It is indeed not a probability of substitution (as was questioned by rev. 1), and now reflects a bias with respect to random picking of targets. As a result, our conclusions for that measure have also been updated. A section analysing POS susceptibilities was also added (→ synthesis point 3)
Variation is now compared to an additional null hypothesis, H00, based on random selection in synonyms of the disappearing word. The interaction between features is also (partly) addressed with an all-feature regression. (→ synthesis point 3)
Both susceptibility and variation have been extended to analyse Sentence context (→ synthesis point 3)
A whole Discussion section has been added to recontextualise the results (→ synthesis points 1 and 2)
The (indeed exaggerated) claims about "convergence" have been revised (→ synthesis point 5)
Things we did not do
Cross-feature interactions are combinatorially explosive, and not the goal of our work. We explored many directions to little avail, and what works is shown in the paper. In particular:
PCA (with or without reconstruction of missing values) gives hard-to-interpret results
ANOVA explodes combinatorially (between global feature values, sentence-relative feature values, and all their interactions), and there is no guiding question to reduce the dimensionality
Regression of susceptibility gives very unreliable results (because the constraints of the problem don't fit in the model)
Regression of variation does give some insight, and is what we show in the paper
We didn't try to do word-based exact predictions (i.e. without features). This could have been (a) predicting which word is substituted, or (b) predicting which word appears instead. (a) follows from the association strength between the words of the initial sentence and the word predicted by (b), but (b) is a research program in itself:
Our data set is not well suited to computing LSA/LDA, because it contains groups of very similar documents: the associations extracted will most likely reflect this, i.e. they will be between words of the same quotation families. That's not informative for substitutions (we want associations from other families to inform the family we look at).
Even in controlled settings and on lists of random words (i.e. lists not designed to trigger intrusions as in the Deese-Roediger-McDermott paradigm, but still with no syntax involved), the state of the art does not predict the new word (Zaromb et al. 2006); instead, it predicts the list from which the new word comes. Now (b) means predicting the new word in real-world sentences, so it's two big jumps from what exists.
The data is again badly structured for prediction, since there are only a few measurements on many varied cases (each case, i.e. each source sentence, has one prediction, and there are only a few measurements per source sentence), instead of many measurements on a few cases; this makes prediction prone to errors. This is explained in the paper.