
brainscopypaste-paper's People

Contributors

camilleroth, wehlutyk


brainscopypaste-paper's Issues

Prediction → result path

Introduce key concepts, interpreted with predictions

  • Dynamical systems (+literature)
  • Cultural attractors (+literature)
    • careful with "reformulation processes"
  • Representation transformation process(es) (+literature)
  • In vivo vs. in vitro (+literature)

Then

  • Narrow them down to the quote test case, making clear predictions
  • Clearly relate to the surrounding literature
  • Make clear that we could predict things, but there are so many things one could predict that we don't want to do that. We want to describe.

Robustness of results

The reader must be convinced that none of the following are left to arbitrary choices, or if they are, that the results hold if those choices change:

  • Initial quote selection (see the MemeTracker description, and Nifty). We have no way to control for this, as the authors give no details of their selection procedure.
  • Spam- and language-filtering (precision/recall)
  • Substitution detection rules (models)
    • script the notebook generation to test with all the other models; check that results don't change
    • explain all the models
  • Feature selection
    • rework the discussion about selecting features
    • add #letters, which makes more sense than #phonemes for text
    • cut AoA at 14 (like in the first submission)
  • Graph binnings
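The precision/recall checks mentioned for the filtering steps can be sketched on a hand-annotated sample. This is a minimal illustration with hypothetical labels, not the paper's actual annotations:

```python
# Minimal sketch of a precision/recall check on a hand-annotated sample of
# filtered items (hypothetical labels, not the paper's annotations).
# predicted: items the filter kept; relevant: items a human judged valid.
predicted = {1, 2, 3, 5, 8}
relevant = {2, 3, 4, 5}

true_positives = predicted & relevant
precision = len(true_positives) / len(predicted)
recall = len(true_positives) / len(relevant)

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# → precision = 0.60, recall = 0.75
```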

Improve features, their selection, and their usage

Missing word features

  • Number of letters
  • Phonological neighbourhood density (cite clearpond)
  • All features relative to sentence

Graphs

  • Add a ν − ν₀₀ graph grouping all features' variations
  • Add confidence intervals to the ν − ν₀₀ graphs, reduced to selected features
  • Add streamplots (note we're not going to use these in fact, as explained in storytelling)

Clarifications in text

  • Semantic similarity: how do substitutions affect quotes, informally and with Lauf et al.'s typification?
    • Introduce H₀₀: we take similarity into account by looking at synonyms
    • We also looked at finer-grained similarity measures. "Semantic similarity" yields scores that are hard to interpret on their own. It does show that substitutions favour lower-similarity words over higher-similarity ones. Distance travelled on the FA network shows that substitutes are not immediate neighbours. We haven't compared these measures to an H₀ (to look for a bias), but they do exclude the possibility of predicting the exact destination word from similarity alone.
  • Feature selection: we did better feature selection, and good news: we can keep most of our features. Explain this and merge into the domain-knowledge feature-selection part. In particular, no need for predictive model.
  • Rewrite the flow of the argumentation. See storytelling for that.
  • Exclude POS analysis, mentioning we remove stopwords in the analysis (which preempts the closed/open class analysis)
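The point that raw similarity scores mean little without a null baseline can be sketched as follows. Everything here is synthetic (random vectors stand in for real word embeddings); the comparison against a random-pair null distribution is the idea, not the paper's actual computation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embedding space (random vectors stand in for word vectors).
vocab = rng.normal(size=(1000, 50))

# Similarity of one observed substitution pair (here: two arbitrary vectors).
observed = cosine(vocab[0], vocab[1])

# Null baseline: similarities of randomly paired words.
null_sims = [cosine(vocab[i], vocab[j])
             for i, j in rng.integers(0, 1000, size=(500, 2))]

# A raw score means little on its own; compare it to the null distribution.
percentile = np.mean([s < observed for s in null_sims])
print(f"observed similarity: {observed:.3f}, null percentile: {percentile:.2f}")
```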

Other questions

  • Effect of context: if a sentence contains only high-feature words, are substitutes drawn from words closer to those values?
  • Rethink susceptibility to substitution (there is a real problem: if low-frequency words have a 1/3 probability of being substituted, how do you interpret that value when a sentence contains 4 low-frequency words? Also, analyse the degenerate cases of the estimators)
  • Effect of POS (categorical feature) on variation and on substitution rate
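The interpretation problem raised above can be illustrated with a small simulation. The 3:1 weighting is a hypothetical stand-in, not a measured value; the point is that under a one-substitution-per-sentence constraint, the per-word rate of low-frequency words drops as more of them compete:

```python
import random

random.seed(0)

# Hypothetical weights: low-frequency words are favoured 3:1 over others.
LOW_WEIGHT, HIGH_WEIGHT = 3.0, 1.0

def simulate(n_low, n_high, trials=100_000):
    """Per-word substitution rate of one low-frequency word, given that
    exactly one substitution happens per sentence."""
    words = ["low"] * n_low + ["high"] * n_high
    weights = [LOW_WEIGHT if w == "low" else HIGH_WEIGHT for w in words]
    # Count trials where the single substitution hits any low-frequency word.
    hits = sum(random.choices(range(len(words)), weights)[0] < n_low
               for _ in range(trials))
    return hits / trials / n_low

print(simulate(1, 9))  # one low-freq word among ten: rate near 3/12
print(simulate(4, 6))  # four low-freq words among ten: per-word rate near 1/6
```

With one low-frequency word among ten the per-word rate is 3/12; with four it drops to 12/18 shared over four words, i.e. 1/6, so the same word-level value cannot be read as a fixed probability.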

Cross-feature effects and prediction

  • For susceptibility
    • Regressing which word is substituted (w/ POS, and after feature selection, showing the role of each feature). It does not work well, probably because of the one-substitution-per-sentence constraint that can't be factored into a simple model, which leads accuracy to plummet when recall goes up, and vice versa. We could try to predict in which bin (or quantile) of a sentence a substitution falls (unconstrained problem), but that would be a prediction based on sentence features, not word features (otherwise, which words?), which is outside the scope of our paper.
    • We compute PCA on the substituted words, and it mostly catches the correlations between the features.
  • For variation
    • Are substitutions mainly among synonyms or neighbours in the FA graph? It seems not (see distance.ipynb), which is coherent with the fact that this situation is much more complex than what happens in random or chosen lists of words. So we really don't try to predict the appearing word among synonyms of the source word.
    • Regressing the new word's features (again after feature selection, showing the role of each feature): predict the value of one feature based on the source features
    • We compute PCA on the variations, and show the evolution of the meta-features, but the number of words included is greatly reduced (because it's only the words that have all features defined), which weakens the case for PCA.
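The variation regression described above — predicting one feature of the appearing word from the features of the disappearing word — can be sketched with ordinary least squares. The data here is synthetic; the feature names and coefficients are illustrative, not the paper's results:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the real feature matrices (e.g. frequency, AoA,
# #letters, ...): n substitutions, p features of the disappearing word.
n, p = 500, 4
source_features = rng.normal(size=(n, p))

# Hypothetical ground-truth relation generating the target-word feature.
true_coefs = np.array([0.6, -0.2, 0.0, 0.1])
target_feature = source_features @ true_coefs + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), source_features])
coefs, *_ = np.linalg.lstsq(X, target_feature, rcond=None)

print("intercept:", round(coefs[0], 2))
print("feature coefficients:", np.round(coefs[1:], 2))
```

The fitted coefficients recover the generating ones up to noise, which is the sense in which such a regression "shows the role of each feature".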

Misc. details

  • Add the acknowledgement "This work has also been partially supported by the French National Agency of Research (ANR) through the grant Algopol (ANR-12-CORD-0018)"
  • Find a better title (Gureckis: "if we copied/pasted this study couldn't exist!")
  • Fix first sentence of the introduction. I can't find a better formulation.
  • Rewrite abstract
  • Check the whole text/flow/definitions for clarity against Gureckis' edited pdf version
  • Don't use in vivo / in vitro, or properly define it
  • Rename "orthographical" to "orthographic" in all figures and notebooks
  • Cite software colophon used, at the end: Python, numpy/scipy, pandas, statsmodels, jupyter, etc.
  • Decide whether or not to have Supplementary Information. No.
  • Write Cover Letter

Supplementary Material

In the annex:

  • Maybe the schematics for the 4 substitution models we show in the main text right now, if the reviewers say so

In the Supplementary Material:

  • Susceptibility (all combinations and binnings), variations (absolute, relative, all binnings), and biases (ν − ν₀₀) for all features
  • The same, plus POS, for a few other notable substitution models
  • Scatter plots for all feature correlations
  • Schematics for all substitution models
  • Link to brainscopypaste repository for additional graphics (e.g. two-substitution models)

Relate to missed literature

  • Sentence recall (Potter & Lombardi)
  • False memories (Deese)
  • Subjective organization (Tulving, Zaromb)
  • Working memory and attention (Jefferies)
  • Iterated learning (Kirby)

Cover letter

The review from Cognitive Science is synthesized into 6 main points on the Cog. Sci. Review wiki page. That page also contains links to the issues tracking each of the 6 points, and reviewer-by-reviewer syntheses with more details.

Things that have changed

Here is the list of things that have changed, for use in the cover letter.

  • A number of values in the paper have changed (the tracking issue for that is #12)
    • Clustering coefficient values, since FA link weights are now taken into account in their computation (so it's computed on the undirected weighted graph)
    • An update in the language detection module has changed the cluster filtering a little, giving us a few more clusters and quotes than before. As a result of this, the number of words coded by Word Frequency has also changed (since frequency of words is computed on the filtered data set). (Details in #12.)
    • The discovery of three bugs, and the improvement of substitution filtering, led us to many more substitutions than previously (again, details in #12). All in all, the code is now far more reliable, as unit tests cover nearly all of it. Language, cluster, and substitution filtering are also controlled by precision/recall analyses.
  • Introduction has been rewritten from scratch to better explain our goals (→ synthesis point 1). It should be clearer and easier to follow thanks to more examples.
  • Related work has also been mostly rewritten to incorporate the literature we had missed in the first submission (→ synthesis point 2). In particular, psycholinguistic work on lists of words has been thoroughly reviewed, and work on iterated learning experiments (Kirby) integrated into the whole discussion.
  • The overall writing and phrasing in the whole paper has received a lot of attention (→ synthesis point 1, and criticisms of bad writing)
  • The initial parts of Methods have been expanded to better explain some choices which seemed arbitrary (→ synthesis points 4 and 6)
  • The set of features used has been expanded (with orthographic and phonological neighbourhood densities, and number of letters), and the way features are selected has been greatly improved and rationalized (→ synthesis point 3)
  • The demand for word-word metrics (rev. 2) is partially met by the addition (further down) of H₀₀ and a short discussion of the distances travelled by substitutions
  • The section on Substitution model has been expanded to better explain the work done, and show the robustness of results (→ synthesis point 4)
  • The possible bias from focusing on single-substitutions has been addressed by extending substitution models to the two-substitution case. The results are unchanged (and available in the code repository). (→ synthesis point 6)
  • Susceptibility has been much better defined, with respect to a null hypothesis. It is indeed not a probability of substitution (as rev. 1 questioned), and now reflects a bias w.r.t. random picking of targets. As a result, our conclusions for that measure have also been updated. A section analysing POS susceptibilities was also added (→ synthesis point 3)
  • Variation is now compared to an additional null hypothesis, H₀₀, based on random selection among synonyms of the disappearing word. The interaction between features is also (partly) addressed with an all-feature regression. (→ synthesis point 3)
  • Both susceptibility and variation have been extended to analyse Sentence context (→ synthesis point 3)
  • A whole Discussion section has been added to recontextualise the results (→ synthesis points 1 and 2)
  • The (indeed exaggerated) claims about "convergence" have been revised (→ synthesis point 5)

Things we did not do

  • Cross-feature interactions are combinatorially explosive, and not the goal of our work. We explored many directions to little avail, and what works is shown in the paper. In particular:
    • PCA (with or without reconstitution of missing values) gives hard-to-interpret results
    • ANOVA explodes combinatorially (between global feature values, sentence-relative feature values, and all their interactions), and there is no directing question to reduce the dimensions
    • Regression of susceptibility gives very unreliable results (because the constraints of the problem don't fit in the model)
    • Regression of variation does give some insight, and is what we show in the paper
  • We didn't try to make word-based exact predictions (i.e. without features). These could have been (a) which word is substituted, and (b) which word appears instead. (a) derives from the association strengths between the words in the initial sentence and the word predicted by (b), but (b) is a research programme in itself:
    • Our data set is not adapted to computing LSA/LDA because it has groups of very similar documents, so the associations extracted will most likely reflect this, i.e. they will be between words in the same quotation families. That's not informative for substitutions (we want associations from other families to inform the family we look at).
    • Even in controlled settings and on lists of random words (i.e. lists not designed to trigger intrusions as in the Deese-Roediger-McDermott paradigm, but still with no syntax involved), the state of the art does not predict the new word (Zaromb et al. 2006); instead, it predicts the list from which the new word comes. (b) means predicting the new word in sentences from the real world, so it is two big jumps beyond what exists.
    • The data is again badly structured for prediction, since there are only a few measurements on many varied cases (each case, i.e. each source sentence, has one prediction and only a few measurements), instead of many measurements on a few cases, making predictions prone to error. This is explained in the paper.
