
Subtitle word frequencies


This repository contains Python scripts to extract word frequency data from a collection of subtitle files.

Notable features:

  • Frequency lists can be converted to the format used by T-scan.
  • Total data per genre can be summarised based on an accompanying metadata file.
  • Text can be lemmatised using Frog or spaCy.

The purpose of this repository is to provide transparency in our data processing and to make it easier to repeat the frequency analysis on newer data in the future. It is not developed for general use, but we include a licence for reuse (see below).


Data

The scripts are designed for a collection of subtitles from the NPO (the Dutch public broadcaster). This dataset is not provided in this repository and is not publicly available due to copyright restrictions. The Research Software Lab works on this data in agreement with the NPO, but we cannot share the data with others.

Our data encodes subtitles as WebVTT (.vtt) files, with accompanying metadata in an .xlsx file.

Scripts

Scripts are written in Python and are structured into the following modules:

  • analysis for counting and lemmatising extracted text
  • metadata for parsing the metadata file to see the distribution of genres
  • tscan for converting frequency data to the format used by T-scan
  • vtt for extracting plain-text data from .vtt files

Requirements

Install the required Python packages with

pip install -r requirements.txt

Lemmatisers

To perform lemmatisation, you'll also need to download the language data for spaCy and/or Frog.

After installing the requirements, run:

python -m spacy download nl_core_news_sm
python -c "import frog; frog.installdata()"

Usage

The following commands are supported.

Summary of genres

You can create a CSV file that lists the genres in a metadata spreadsheet, along with the number of files and the total runtime per genre. Running

python -m metadata.summary

creates a summary of the metadata file located in /data, which works as long as the data folder contains a single .xlsx file.

You can also specify the input and output locations explicitly:

python -m metadata.summary path/to/metadata.xlsx path/to/output.csv
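
The exact column names and values below are illustrative, not real output; the actual shape depends on the metadata spreadsheet, but the summary lists one row per genre with a file count and total runtime:

genre,file_count,total_runtime
documentaire,120,95:12:30
nieuws,80,41:05:00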

Export plain text of VTT files

This command takes a directory containing .vtt files as input and converts their contents to plain text files.

python -m vtt.convert_to_plain path/to/data

For each *.vtt file in the provided directory, the script will save a file next to it named *.plain.txt. This file contains the text of the subtitles, with one line per segment.

The script filters out some common non-utterances that appear in captions, e.g. (muziek), APPLAUS EN GEJUICH, 888.
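
The exact filtering rules live in the vtt module; the sketch below only illustrates the kind of patterns involved, and the patterns shown are assumptions rather than the actual implementation:

import re

# Illustrative patterns only; the real rules in the vtt module may differ.
NON_UTTERANCES = [
    re.compile(r"^\(.*\)$"),     # sound descriptions such as (muziek)
    re.compile(r"^[A-Z\s]+$"),   # all-caps captions such as APPLAUS EN GEJUICH
    re.compile(r"^888$"),        # the teletext subtitle page number
]

def is_utterance(line: str) -> bool:
    line = line.strip()
    return bool(line) and not any(p.match(line) for p in NON_UTTERANCES)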

Lemmatise plain text exports

After generating plain text files as above, you can generate a lemmatised version using either Frog or spaCy.

python -m analysis.lemmatize path/to/data [--frog|--spacy]

The data directory is the same directory on which you ran vtt.convert_to_plain: it should contain the *.plain.txt files generated by that script. For each file, the lemmatisation script will generate a *.lemmas.txt file containing the lemmatised text.

Use the --frog or --spacy flag to select the lemmatiser. Frog is the default: it is also used in T-scan, so results are more likely to match. However, at the time of writing, spaCy is much faster than Frog.
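
For reference, the spaCy route boils down to something like the sketch below, using the nl_core_news_sm model installed earlier; the actual analysis.lemmatize module may batch files and handle tokens differently.

import spacy

# Assumes the Dutch model downloaded above; a sketch, not the actual script.
nlp = spacy.load("nl_core_news_sm")

def lemmatise_line(line: str) -> str:
    """Replace each token with its lemma, keeping one line per segment."""
    return " ".join(token.lemma_ for token in nlp(line))

print(lemmatise_line("De ondertitels werden omgezet naar lemma's."))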

Count token frequencies

You can count token frequencies in the cleaned files (generated by vtt.convert_to_plain or analysis.lemmatize) and export them to a CSV file with:

python -m analysis.collect_counts path/to/data

Use the option --level lemma to count in the lemmatised files. You can also specify the input directory and the output location:

python -m analysis.collect_counts path/to/data --output path/to/output.csv --level lemma

The resulting CSV file lists the frequency of each word or lemma.
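
Conceptually, the counting step amounts to the sketch below; the tokenisation, casing, and CSV output of the actual analysis.collect_counts module may differ.

from collections import Counter
from pathlib import Path

# Sketch only: count whitespace-separated tokens in the plain-text exports.
# The actual module also supports --level lemma and writes a CSV file.
counts = Counter()
for path in Path("path/to/data").glob("*.plain.txt"):
    for line in path.read_text(encoding="utf-8").splitlines():
        counts.update(line.lower().split())

for token, frequency in counts.most_common(10):
    print(token, frequency)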

Convert frequencies to T-scan format

You can convert the output of the previous step into a file formatted for T-scan.

python -m tscan --input path/to/input.csv --output path/to/output

The output is a tab-separated file without headers. Each row represents a term. Rows are sorted from most to least frequent and list:

  • the term
  • the absolute frequency
  • the cumulative absolute frequency
  • the cumulative percentile frequency
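
A minimal sketch of how those cumulative columns can be derived from raw counts is shown below; the actual tscan module reads the CSV produced in the previous step, and the exact definition of the percentile column is an assumption here.

def tscan_rows(counts):
    """Sort terms by frequency and add cumulative columns (sketch only)."""
    total = sum(counts.values())
    rows, running = [], 0
    for term, freq in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += freq
        # Assumption: the percentile column is the running total as a percentage.
        rows.append((term, freq, running, 100 * running / total))
    return rows

for row in tscan_rows({"de": 50, "het": 30, "een": 20}):
    print("\t".join(str(value) for value in row))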

Developing

Unit tests

Run unit tests with

pytest

To add new Python packages, add them to requirements.in and run

pip-compile requirements.in --output-file requirements.txt

Licence

This repository is shared under a BSD 3-Clause licence.


Issues

Evaluate on behavioural data

There should be a script to evaluate the frequency table on the behavioural data. We expect word frequency to positively correlate with reading comprehension. This can either be tested as a direct correlation, or by using the new frequencies as part of the LINT pipeline. I think the first option is preferable: besides being easier, it also avoids a bias due to the LINT formula being fitted to an existing frequency table.
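
A sketch of the direct-correlation option could look like the following; the file and column names are hypothetical, since the behavioural dataset is not part of this repository.

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical file and column names, for illustration only.
behaviour = pd.read_csv("behavioural_data.csv")   # columns: word, comprehension_score
frequencies = pd.read_csv("frequencies.csv")      # columns: word, frequency
merged = behaviour.merge(frequencies, on="word")

rho, p = spearmanr(merged["frequency"], merged["comprehension_score"])
print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")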

Strip accents

Roughly speaking, the data includes 3 types of diacritics:

  1. lexical diacritics (café, übermensch)
  2. emphatic diacritics (ík, échte), which may or may not match spelling standards.
  3. mistakes (ǹpo, éoscarssowhite)

As for handling these:

  1. Should be kept in. Preserving these takes priority over any handling of (2) and (3).
  2. Should probably be stripped, since they are prosody markers. That said, if T-scan does not strip accents, it will treat "echte" and "échte" as different lemmas with different frequencies. This would stand out more if "échte" is absent from the frequency table.
  3. Should be removed. However, note that for the examples above, the correct version would be "npo" (accent stripped) and "#oscarssowhite" / "oscarssowhite" (the first one is not really feasible to automate, the second one removes the entire character). Due to their nature, these cases should be quite rare.

My initial suggestion was that this could be handled by lemmatising the word and then cross-referencing with a vocabulary like the ANW. If the word appears in the ANW with an accent, it is a lexical accent that should be preserved.

However, this only really works if you are also tagging named entities, since those a) do not appear in a dictionary, b) are especially likely to contain accents, and c) should have their accents preserved.

If you do this, it would be best if it is factored out into a separate service that can also be used by T-scan, in order to avoid the discrepancy for emphatic accents.

All in all, this method is not impossible, but probably not worth the effort.
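
For reference, a naive accent-stripping step, which would hit all three categories indiscriminately and would therefore still need the vocabulary or named-entity check discussed above, could look like this:

import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining diacritics from decomposed characters (naive approach)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("échte"))  # -> echte
print(strip_accents("café"))   # -> cafe (a lexical accent is also lost)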

Lemmatisation

  • Enable lemmatisation in preprocessing
  • Generate alternative frequency table with lemmatisation enabled
