This repository contains the scripts that Wan-Ting created for her PhD thesis and future publications.
This script counts the frequency of the lexical words in a single CSV file.
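Counting word frequencies in a CSV file could look roughly like the sketch below. It is not the repository's script: the column name `text`, the in-memory sample data, and the function name `count_word_frequencies` are assumptions made for illustration, and a simple regular expression stands in for proper tokenisation. It also counts every word; restricting the count to lexical words would need part-of-speech information, which the sketch does not attempt.

```python
import csv
import io
import re
from collections import Counter


def count_word_frequencies(csv_file, text_column):
    """Count how often each lower-cased word occurs in one column of a CSV file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Regex tokeniser used as a stand-in; the real script may tokenise differently.
        counts.update(re.findall(r"[a-z']+", row[text_column].lower()))
    return counts


# Hypothetical in-memory CSV so the sketch runs without a file on disk.
sample = io.StringIO("id,text\n1,The cat sat on the mat\n2,The dog sat\n")
frequencies = count_word_frequencies(sample, "text")
print(frequencies.most_common(3))
```

To read a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`.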
This script has two steps:
- cleaning the text: tokenisation, lemmatisation, and manually identifying the words to be replaced or removed
- unique words: output the distinct words in the text and count the number of distinct words

This script is based on the spaCy module.
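The two steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the spaCy-based script itself: the regular expression stands in for spaCy tokenisation, lemmatisation is not implemented (in spaCy, a token's lemma is available as `token.lemma_`), and the `REPLACEMENTS` and `REMOVALS` lists are hypothetical examples of manually chosen words.

```python
import re

# Hypothetical manual lists; in practice these are chosen by hand for each text.
REPLACEMENTS = {"colour": "color", "mice": "mouse"}
REMOVALS = {"um", "uh"}


def clean_text(text):
    """Tokenise, drop manually removed words, and apply manual replacements."""
    # Regex stand-in for spaCy tokenisation (assumption, not the real pipeline).
    tokens = re.findall(r"[a-z']+", text.lower())
    cleaned = []
    for token in tokens:
        if token in REMOVALS:
            continue
        cleaned.append(REPLACEMENTS.get(token, token))
    return cleaned


def unique_words(tokens):
    """Return the distinct words in first-seen order."""
    return list(dict.fromkeys(tokens))


tokens = clean_text("Um, the mice like the colour red")
distinct = unique_words(tokens)
print(distinct, len(distinct))
```

Keeping the replacement and removal lists as plain data makes the manual decisions easy to review and to reuse across texts.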
These two scripts create a simulated environment that controls the sample size of each corpus, using two lexical analysis approaches: Zipf's law and Heaps' law.
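As a rough illustration of the idea, the sketch below draws equally sized samples from two corpora and computes the data behind each law: rank against frequency for Zipf's law (frequency falls roughly in inverse proportion to rank) and tokens against vocabulary size for Heaps' law (vocabulary grows roughly as a power of corpus size). The toy corpora, function names, and the random-sampling scheme are all assumptions; the actual scripts may control sample size differently.

```python
import random
from collections import Counter


def fixed_size_sample(tokens, size, seed=0):
    """Draw `size` tokens without replacement so corpora of different sizes are comparable."""
    rng = random.Random(seed)
    return rng.sample(tokens, size)


def zipf_table(tokens):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, freq) for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def heaps_curve(tokens, step):
    """Return (tokens read, vocabulary size) after every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Hypothetical toy corpora of different sizes.
corpus_a = "the cat sat on the mat and the dog sat on the rug".split()
corpus_b = "a bird sang while the cat slept and the dog barked at the bird all day".split()

size = min(len(corpus_a), len(corpus_b))
for corpus in (corpus_a, corpus_b):
    sample = fixed_size_sample(corpus, size)
    print(zipf_table(sample)[:3], heaps_curve(sample, step=4))
```

Fixing the seed keeps the samples reproducible, which matters when the same sample size is compared across corpora.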