thealtres / song-title-detection

Automatic detection of song titles in French theater, focusing on vaudeville.

License: GNU General Public License v3.0

Python 100.00%
digital-humanities french-theater humanites-numeriques song-titles vaudevillle computational-drama-analysis

Automatic detection of song titles in a French corpus

Part of the Thealtres project: https://thealtres.pages.unistra.fr/


Song titles are identified with detection_airs.py, the main program. Its options rely on the following modules:

  • character_list_regex.py writes a list of characters for any given id_work from the manual annotation in annotations_fr-characters.csv (not used in evaluation mode).
  • semantic_search.py allows for the optional semantic search of proper song titles.
  • encoding.py encodes the play in XML, with the identified song titles marked up.

Data:

  • liste_des_noms_d-airs_standards.txt is used as a reference so that the program can suggest an already existing title.
  • annotations_fr-characters.csv is used to fetch the character names within a play.

How to use the program

In config.py, change the paths so that they match your directory, then run:

python detection_airs.py [id_work/'all' (only in mode auto)] [--mode/-m {extract, eval, auto} default = extract] [--sem_search/-s] [--characters/-c] [--encode/-e] [--nb {[int], all} default = 1]

id_work is a unique id for each play that has an OCR text: the program looks for the Tesseract OCR and falls back to the original text if no Tesseract OCR is present. It creates the directory named by the variable {dossier_sortie}, containing the output file [id_work]{suffix_doc_sortie}.csv.
Both variables are found in config.py and can be adapted to the corpus used. --mode defaults to the extraction mode, which follows these steps:
The program shows a candidate line with its line number: input ";" to reject it; any other key adds the line to the list.
Tip: also add lines that announce a song without containing its title, such as "AIR :" or "CHOEUR."; in that case the program selects the next line as the title. This works better with the --characters option, because character names are then ruled out as potential song titles.
The program then suggests titles from a reference document containing well-known song titles. By default, only string-matching suggestions are made; the semantic search option provides a third suggestion. Selecting ";" makes it possible to type in the correct title.
After all the lines have been filtered, the program asks for validation: input "n" to select the candidates again.
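The suggestion step can be illustrated with Python's standard difflib module. This is only a sketch: the function name suggest_titles, the cutoff value and the three-title reference list are assumptions made for illustration, not the code of detection_airs.py, which also offers a fuzzy-matching suggestion and an optional semantic search.

```python
import difflib


def suggest_titles(candidate, reference_titles, n=1, cutoff=0.6):
    """Return up to n reference titles that resemble the candidate title.

    Hypothetical helper: the name, signature and cutoff are assumptions.
    """
    return difflib.get_close_matches(candidate, reference_titles, n=n, cutoff=cutoff)


# Tiny stand-in for liste_des_noms_d-airs_standards.txt
# (one title per entry is an assumption about the file layout).
reference = [
    "Air de M. Boullard.",
    "AIR : Le beau Lycas.",
    "Air de la Parisienne.",
]

best = suggest_titles("Air de M. Marius Boullard.", reference)
print(best)
```

An empty list means that no reference title is close enough, which corresponds to the case where the title has to be typed in by hand.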
Eval mode writes the precision, recall and F1 score to [id_work]_stats.csv, and writes the same statistics for the entire corpus processed so far to the file stats.csv in the output directory.
--sem_search enables semantic search during the selection of a proper title.
--characters is only available for plays that have been annotated manually, as it looks for the character names in a separate document. It creates a document called [id_work]_characters.txt.
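How character names can be kept out of the candidates may be sketched as follows. The regular expressions, the marker pattern and the function name are assumptions made for illustration; the patterns actually written to [id_work]_characters.txt and the logic in detection_airs.py may differ.

```python
import re

# Assumed examples of character names written as regular expressions.
character_patterns = [
    re.compile(r"^\s*MARIUS\b", re.IGNORECASE),
    re.compile(r"^\s*LA BARONNE\b", re.IGNORECASE),
]

# Assumed pattern for lines that announce a song without giving its title.
marker_pattern = re.compile(r"^\s*(AIR\s*:|CH(?:Œ|OE)UR\s*\.?)\s*$", re.IGNORECASE)


def title_after_marker(lines, index):
    """Return the line after a marker line, unless it is a character name."""
    if not marker_pattern.match(lines[index]) or index + 1 >= len(lines):
        return None
    next_line = lines[index + 1].strip()
    if any(p.match(next_line) for p in character_patterns):
        return None  # a speaker name, not a song title
    return next_line


lines = ["CHŒUR.", "Air de M. Marius Boullard.", "CHŒUR.", "MARIUS."]
print(title_after_marker(lines, 0))
print(title_after_marker(lines, 2))
```

The second call returns None because the line after the marker matches one of the assumed character patterns.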
--encode encodes, in a new XML document, the stage elements of type "tune" from the available OCR.
The automatic mode accepts all instead of an id number and then automatically extracts all song titles for the entire corpus found at the path dossier in config.py. Automatic mode has one more optional argument, -nb, followed by an integer: it sets how many of the best standard titles are added to the output.
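The command-line interface described above could be declared with argparse roughly as follows. This is a sketch reconstructed from the usage line, not the parser of detection_airs.py; the build_parser name and the help strings are assumptions. The usage line writes the option as --nb while the text and the examples use -nb, so the sketch accepts both spellings.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="detection_airs.py")
    parser.add_argument("id_work", help="unique id of a play, or 'all' (auto mode only)")
    parser.add_argument("-m", "--mode", choices=["extract", "eval", "auto"],
                        default="extract")
    parser.add_argument("-s", "--sem_search", action="store_true")
    parser.add_argument("-c", "--characters", action="store_true")
    parser.add_argument("-e", "--encode", action="store_true")
    # Kept as a string because the documented values are an integer or 'all'.
    parser.add_argument("-nb", "--nb", default="1", help="an integer or 'all'")
    return parser


args = build_parser().parse_args(["-m", "auto", "001", "-nb", "3", "-s", "-c"])
print(args.mode, args.id_work, args.nb, args.sem_search, args.characters, args.encode)
```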

Output

In the directory corpus/id_work, the following documents are created :

  • id_work_characters.txt : contains the characters extracted from the manual annotation (annotations_fr-characters.csv), written as regular expressions.

  • id_work_airs.csv : contains one row per line, with the fields id_work;idAir;title;line;best-candidate-title;isAir :

      - id_work : unique id of the play
      - isAir : boolean, 1 if the line contains a valid air
      - idAir : incremented for each air within a play
      - title : title as it appears in the raw OCR text
      - line : line number in the text document
      - best-candidate-title : the best candidate from the reference list in "liste_des_noms_d-airs_standards.txt", found by string matching and/or semantic search
    
  • id_work_airs-encodes.xml : same as the OCR text file, but with the stage elements of type "tune" encoded.

  • id_work_stats.csv :

      - Airs candidats: candidate airs
      - Airs manuellement filtrés : manually filtered airs
      - Airs réelement présents : airs actually present
      - Précision: precision
      - Rappel: recall
      - Mesure-F1: F1 score
      - Non attrapé : titles missed by the program
      - Supplémentaire : titles found by the program but not annotated in the evaluation corpus.
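The evaluation figures follow the usual definitions of precision, recall and F1. The sketch below assumes that airs are compared as sets of line numbers, which is an assumption made for illustration; the function name and the sample values are hypothetical and are not results from the corpus.

```python
def evaluate(found, annotated):
    """Compare the lines found by the program with the annotated lines."""
    found, annotated = set(found), set(annotated)
    correct = found & annotated
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(annotated) if annotated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missed": sorted(annotated - found),  # "Non attrapé"
        "extra": sorted(found - annotated),   # "Supplémentaire"
    }


# Hypothetical line numbers, for illustration only.
stats = evaluate(found=[10, 20, 30, 40], annotated=[10, 20, 40, 50])
print(stats)
```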
    

Usage in Extract mode (default)

python detection_airs.py 001 -s -c
CHŒUR.:378:
input = y
Titre candidat: Air de M. Marius Boullard.
[1]String matching fuzzy: ('Air de M. Boullard.', 95)
[2]String matching difflib: Air de M. Boullard.
[3]Semantic search : Air de M. Boullard.
[;]Autre
input = 1
output = 001;1;CHŒUR.=Air de M. Marius Boullard.;378;Air de M. Boullard.;1
...
CHŒUR.:2863:
input = ;
output = 001;;CHŒUR.2863;;0
...
CHŒUR, *:3546:
input = y
Candidat: Air: De la belle Polonaise.
[1]String matching fuzzy :('AIR : Adieu, je vous fuis, bois charmant.', 86)
[2]String matching difflib :[]
[3]Semantic search :AIR : Le beau Lycas.
[;]Autre

Selectionner option : ;
Meilleurs candidats
AIR : Adieu, je vous fuis, bois charmant.
AIR : Ah ! ah ! ah ! ah ! c'est désolant.
AIR : Ah ! Mon ami, c'est un rayon d'espoir.
AIR : Ah ! ah ! ah ! ah ! ah ! ah ! ah ! ah !
AIR : Ah ! de plaisir notre âme est enivrée.
AIR : De la Marseillaise.
Air : De la galoppe.
Air de la Nacelle. (de Panseron.)
Air de la Parisienne.
Air de la Parisienne.
AIR : Le beau Lycas.
Air : de la Fiancée.
Air : Ô filii.
AIR : Enfans de Polymnie.
Air : C'est charmant.

C/C le meilleur candidat de la liste ou entrez le nom de l'air manuellement:
input : Air: De la belle Polonaise.
output : 132;1;12;CHŒUR, *=Air: De la belle Polonaise.;3546;Air: De la belle Polonaise.

(In the second case: CHOEUR. was not selected as a valid candidate as it appears only 8 lines after a song title Air : de M. Marius Boullard.)
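The output lines in the session above are semicolon-separated. A minimal sketch for reading such lines, assuming the column order id_work;idAir;title;line;best-candidate-title;isAir given in the Output section; the AirRow and read_rows names are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class AirRow(NamedTuple):
    id_work: str
    id_air: str
    title: str
    line: str
    best_candidate_title: str
    is_air: str


def read_rows(text):
    """Parse semicolon-separated rows; rows without exactly six fields are skipped."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quoting=csv.QUOTE_NONE)
    return [AirRow(*row) for row in reader if len(row) == 6]


sample = "001;1;CHŒUR.=Air de M. Marius Boullard.;378;Air de M. Boullard.;1\n"
rows = read_rows(sample)
print(rows[0].title, rows[0].line)
```

All fields are kept as strings so that empty values, as in the rejected candidate of the session, do not need special handling.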

Usage in Automatic mode

python detection_airs.py -m auto 001 -nb 3 -s -c
output (first 3 lines):
id_work;isAir;idAir;title;line;best-candidate-title;best-candidate-title;best-candidate-title;
001;1;1;CHŒUR.=Air de M. Marius Boullard.;378;AIR : Adieu, je vous fuis, bois charmant.;Air de M. Boullard.;AIR : Ah ! ah ! ah ! ah ! c'est désolant.
001;1;2;Air de M. Marius Boullard.;842;AIR : Adieu, je vous fuis, bois charmant.;Air de M. Boullard.;AIR : Ah ! ah ! ah ! ah ! c'est désolant.

The option -nb 3 means that the 3 best string-matching standard titles are added to the output, -s means that semantic search is activated, and -c means that the list of characters is used to exclude lines.
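How -nb could widen the output can be sketched with the standard library: the candidate title is compared with every reference title, and the nb best-scoring titles are appended as extra columns. Ranking by difflib.SequenceMatcher ratio, the function names, the fixed isAir value and the short reference list are all assumptions; the program's own string matching may rank titles differently.

```python
import difflib


def best_titles(candidate, reference_titles, nb):
    """Rank reference titles by similarity to the candidate and keep the top nb."""
    ranked = sorted(
        reference_titles,
        key=lambda title: difflib.SequenceMatcher(None, candidate, title).ratio(),
        reverse=True,
    )
    return ranked[:nb]


def auto_row(id_work, id_air, title, line, reference_titles, nb=1):
    """Build one semicolon-separated row with nb best-candidate-title columns."""
    # isAir is set to "1" here on the assumption that only accepted lines are written.
    fields = [id_work, "1", str(id_air), title, str(line)]
    fields += best_titles(title, reference_titles, nb)
    return ";".join(fields)


reference = [
    "Air de M. Boullard.",
    "AIR : Le beau Lycas.",
    "Air de la Parisienne.",
]
row = auto_row("001", 2, "Air de M. Marius Boullard.", 842, reference, nb=2)
print(row)
```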

How to cite

The main developer is Alexia Schneider, with contributions on the literary aspects by Lara Nugues and supervision by Pablo Ruiz Fabo.

Schneider, Alexia & Nugues, Lara & Ruiz Fabo, Pablo (2023). Thealtres : Comparing Theater in Alsatian with the Dramatic Traditions at its Source. https://thealtres.pages.unistra.fr/
