
subtitle-word-frequencies's Issues

Evaluate on behavioural data

There should be a script to evaluate the frequency table on the behavioural data. We expect word frequency to correlate positively with reading comprehension. This can be tested either as a direct correlation or by using the new frequencies as part of the LINT pipeline. I think the first option is preferable: besides being easier, it also avoids a bias due to the LINT formula being fitted to an existing frequency table.
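As a rough illustration of the direct-correlation option, here is a minimal sketch using only the standard library. It assumes the behavioural data can be reduced to one comprehension score per text, and that each text is summarised by its mean log frequency; both are assumptions, as are all function names and the `floor` value for unseen words. Spearman's rank correlation is used here because it only assumes a monotonic relationship, but that choice is also an assumption.

```python
import math


def mean_log_frequency(tokens, freq_table, floor=1):
    """Mean log10 frequency of a text's tokens.

    `freq_table` maps word -> count. Words missing from the table get
    the count `floor` (a hypothetical smoothing choice).
    """
    if not tokens:
        raise ValueError("text has no tokens")
    total = sum(math.log10(freq_table.get(tok, floor)) for tok in tokens)
    return total / len(tokens)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))
```

In use, one would compute `mean_log_frequency` for every text in the behavioural data, collect the matching comprehension scores in the same order, and pass both lists to `spearman`. A clearly positive coefficient would support the expectation stated above.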

Lemmatisation

  • Enable lemmatisation in preprocessing
  • Generate alternative frequency table with lemmatisation enabled

Strip accents

Roughly speaking, the data includes 3 types of diacritics:

  1. lexical diacritics (café, übermensch)
  2. emphatic diacritics (ík, échte), which may or may not match spelling standards.
  3. mistakes (ǹpo, éoscarssowhite)

As for handling these:

  1. Should be kept in. Preserving these takes priority over any handling of (2) and (3).
  2. Should probably be stripped, since they are prosody markers. That said, if t-scan does not strip accents, it will treat "echte" and "échte" as different lemmas with different frequencies. This would stand out more if "échte" is absent from the frequency table.
  3. Should be removed. However, note that for the examples above, the correct version would be "npo" (accent stripped) and "#oscarssowhite" / "oscarssowhite" (the first one is not really feasible to automate, the second one removes the entire character). Due to their nature, these cases should be quite rare.

My initial suggestion was that this could be handled by lemmatising the word and then cross-referencing with a vocabulary like the ANW. If the word appears in the ANW with an accent, it is a lexical accent that should be preserved.
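The cross-referencing idea can be sketched roughly as follows. This is only an illustration: `lexical_vocab` is a hypothetical in-memory set standing in for a lookup in a vocabulary such as the ANW, the lemmatisation step is left out entirely, and the function names are made up for this example.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks (accents, diaereses) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_vocab(words):
    """Normalise vocabulary entries to NFC so lookups compare like with like."""
    return {unicodedata.normalize("NFC", w) for w in words}


def normalise_word(word, lexical_vocab):
    """Keep accents that the vocabulary confirms as lexical; strip all others.

    `lexical_vocab` is a hypothetical stand-in for a dictionary lookup.
    """
    word = unicodedata.normalize("NFC", word)
    if word in lexical_vocab:
        return word  # type 1: lexical diacritic, preserved
    return strip_accents(word)  # types 2 and 3: emphatic accents and mistakes


# Toy vocabulary containing only the lexical examples from the list above.
vocab = build_vocab(["café", "übermensch"])
```

With this toy vocabulary, `normalise_word("café", vocab)` keeps its accent, while "échte" and "ǹpo" come out as "echte" and "npo". The mistake "éoscarssowhite" would become "eoscarssowhite", which matches neither of the corrected forms mentioned above, so cases like that are not repaired by this approach.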

However, this only really works if you are also tagging named entities, since those a) do not appear in a dictionary, b) are especially likely to contain accents, and c) should have their accents preserved.

If you do this, it would be best to factor it out into a separate service that t-scan can also use, in order to avoid the discrepancy for emphatic accents.

All in all, this method is not impossible, but probably not worth the effort.
