Issues from the subtitle-word-frequencies repository (uudigitalhumanitieslab)

Evaluate on behavioural data

There should be a script to evaluate the frequency table on the behavioural data. We expect word frequency to correlate positively with reading comprehension. This can be tested either as a direct correlation or by using the new frequencies as part of the LINT pipeline. I think the first option is preferable: besides being easier, it also avoids the bias that arises because the LINT formula was fitted to an existing frequency table.
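As a rough illustration of the direct-correlation option, the sketch below computes a mean log frequency per text and correlates it with a comprehension score per text. The data layout (texts as token lists, one score per text), the function names, and the choice to skip words missing from the table are all assumptions made for the example; the actual behavioural data may be structured differently.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_log_frequency(tokens, frequencies):
    """Mean log10 frequency of the tokens found in the table.

    Assumption: tokens missing from the table are skipped rather than smoothed.
    """
    logs = [math.log10(frequencies[t]) for t in tokens if frequencies.get(t, 0) > 0]
    if not logs:
        raise ValueError("no token of this text occurs in the frequency table")
    return sum(logs) / len(logs)


def evaluate(texts, scores, frequencies):
    """Correlate per-text mean log frequency with per-text comprehension scores."""
    predictors = [mean_log_frequency(tokens, frequencies) for tokens in texts]
    return pearson(predictors, scores)


if __name__ == "__main__":
    # Toy data, invented purely to show the call shape.
    toy_frequencies = {"de": 1000, "kat": 100, "fiets": 10}
    toy_texts = [["de", "kat"], ["de", "fiets"], ["fiets"]]
    toy_scores = [0.9, 0.7, 0.4]
    print(evaluate(toy_texts, toy_scores, toy_frequencies))
```

A rank-based correlation (Spearman) may be a better fit if the relation is monotonic but not linear; Pearson is used here only to keep the sketch free of dependencies.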

Lemmatisation

Enable lemmatisation in preprocessing
Generate alternative frequency table with lemmatisation enabled
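One way to produce both tables from the same code path is to make the lemmatiser an optional, pluggable step in counting. The sketch below assumes tokens are already extracted and that `lemmatise` is any callable mapping a token to its lemma; the name, the signature, and the toy lookup table are hypothetical and stand in for whatever lemmatiser the preprocessing ends up using.

```python
from collections import Counter


def count_frequencies(tokens, lemmatise=None):
    """Count tokens, optionally mapping each one to its lemma first.

    `lemmatise` is a hypothetical hook: any callable str -> str.
    When it is None, surface forms are counted as they are.
    """
    counts = Counter()
    for token in tokens:
        key = lemmatise(token) if lemmatise is not None else token
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Toy lemma lookup, invented for the example only.
    toy_lemmas = {"loopt": "lopen", "liep": "lopen"}
    tokens = ["loopt", "liep", "fiets"]

    surface_table = count_frequencies(tokens)
    lemma_table = count_frequencies(tokens, lambda t: toy_lemmas.get(t, t))
    print(surface_table)
    print(lemma_table)
```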

Strip accents

Roughly speaking, the data includes 3 types of diacritics:

1. lexical diacritics (café, übermensch)
2. emphatic diacritics (ík, échte), which may or may not match spelling standards
3. mistakes (ǹpo, éoscarssowhite)

As for handling these:

1. Lexical diacritics should be kept in. Preserving these takes priority over any handling of (2) and (3).
2. Emphatic diacritics should probably be stripped, since they are prosody markers. That said, if t-scan does not strip accents, it will treat "echte" and "échte" as different lemmas with different frequencies. This would stand out more if "échte" is absent from the frequency table.
3. Mistakes should be removed. However, note that for the examples above, the correct versions would be "npo" (accent stripped) and "#oscarssowhite" or "oscarssowhite". The first correction of the latter is not really feasible to automate; the second removes the entire character. Due to their nature, these cases should be quite rare.

My initial suggestion was that this could be handled by lemmatising the word and then cross-referencing with a vocabulary like the ANW. If the word appears in the ANW with an accent, it is a lexical accent that should be preserved.
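A minimal sketch of that cross-referencing idea, under stated assumptions: `lexicon` is a plain set of accented forms standing in for a vocabulary such as the ANW (no real ANW interface is used here), the lemmatisation step is left out, and anything not found in the lexicon has its combining marks removed via Unicode decomposition.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks (accents, diaereses) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalise_accents(word, lexicon):
    """Keep accents that the lexicon confirms as lexical; strip all others.

    `lexicon` is a hypothetical stand-in for a dictionary lookup:
    a set of accented word forms in NFC normalisation.
    """
    composed = unicodedata.normalize("NFC", word)
    if composed in lexicon:
        return composed  # lexical accent: preserve
    return strip_accents(composed)  # emphatic accent or mistake: strip


if __name__ == "__main__":
    toy_lexicon = {"café", "übermensch"}  # invented sample entries
    for word in ["café", "échte", "ík", "ǹpo"]:
        print(word, "->", normalise_accents(word, toy_lexicon))
```

This only strips accents; it does not repair cases such as "éoscarssowhite", where the stray character would need to be dropped or replaced entirely.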

However, this only really works if you are also tagging named entities, since those a) do not appear in a dictionary, b) are especially likely to contain accents, and c) should have their accents preserved.

If you do this, it would be best to factor it out into a separate service that can also be used by t-scan, in order to avoid the discrepancy for emphatic accents.

All in all, this method is not impossible, but probably not worth the effort.
