TEITransformations

Scripts to:

transform multiple TEI XML files into single TEI XML files (root: TEI)
transform multiple TEI XML files into a single TEI XML file (root: teiCorpus)

Both scripts expect TEI XML files that were created with the ExportFromTranskribus project and that follow specific patterns.

Getting started

Prerequisites

tei2tei.py expects the <title> element to follow the model "{Title}, {number}, page {page_number} - Transcription".
tei2teicorpus.py expects the <title> element to follow the model "{Title}, {number} - Transcription".

Here {number} refers to a volume, because the scripts were initially designed to transform newspaper transcriptions.
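
The two title models above can be checked before running the scripts. The following is a minimal sketch, assuming the separators are exactly ", " and " - " as written in the models. The names PAGE_TITLE, VOLUME_TITLE, parse_page_title and parse_volume_title are hypothetical and are not taken from tei2tei.py or tei2teicorpus.py.

```python
import re

# Hypothetical patterns mirroring the two title models described above.
# They are illustrations only, not the expressions used by the scripts.
PAGE_TITLE = re.compile(
    r"^(?P<title>.+), (?P<number>[^,]+), page (?P<page>[^,]+) - Transcription$"
)
VOLUME_TITLE = re.compile(
    r"^(?P<title>.+), (?P<number>[^,]+) - Transcription$"
)


def parse_page_title(text):
    """Return (title, number, page) for the tei2tei.py model, else None."""
    match = PAGE_TITLE.match(text.strip())
    if match is None:
        return None
    return match.group("title"), match.group("number"), match.group("page")


def parse_volume_title(text):
    """Return (title, number) for the tei2teicorpus.py model, else None."""
    match = VOLUME_TITLE.match(text.strip())
    if match is None:
        return None
    return match.group("title"), match.group("number")
```

A title that returns None from the relevant function probably needs to be corrected in the source file before the transformation is run.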

Input formats

tei2tei.py expects directories containing TEI XML files in input/. Each directory forms a bundle that is merged into one output TEI XML file.
tei2teicorpus.py expects TEI XML files in input/. All files in the directory will be merged into the output TEI XML file.
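
To preview what each script would pick up, the two input layouts can be listed with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes that the TEI files carry a .xml extension, which the text above does not state explicitly.

```python
from pathlib import Path


def list_bundles(input_dir):
    """Layout for tei2tei.py: map each sub-directory to its XML file names."""
    bundles = {}
    for entry in sorted(Path(input_dir).iterdir()):
        if entry.is_dir():
            # Assumption: TEI files are recognised by their .xml extension.
            bundles[entry.name] = sorted(p.name for p in entry.glob("*.xml"))
    return bundles


def list_corpus_files(input_dir):
    """Layout for tei2teicorpus.py: XML files placed directly in input_dir."""
    return sorted(p.name for p in Path(input_dir).glob("*.xml"))
```

An empty result from either function suggests that the files are not where the corresponding script expects them.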

Installing

Use a virtual environment with Python 3 and the requirements installed.
Create input/ and output/ directories in the TEI2TEI/ and TEI2TEICORPUS/ directories to store and retrieve your data, or use the --input/--output options to specify other directories.

Running

TEI2TEI

TEI2TEI can transform groups of XML files according to different features.

INPUT and OUTPUT options

By default, tei2tei.py merges the XML files contained in each subdirectory of a directory named input/ and places the result in a directory named output/. Both input/ and output/ are expected to be in the same directory as tei2tei.py.

(.venv)~$ python3 tei2tei.py

You can specify the directory containing the groups of files to merge and/or the directory where the result of the transformations should be placed using --input/-i and/or --output/-o options:

(.venv)~$ python3 tei2tei.py --input directory/name --output directory/name

Additional options:

`--nofacs/-n`

Will cause the script to ignore facsimile elements and related attributes:

(.venv)~$ python3 tei2tei.py --nofacs

`--volumes/-v`

Will cause the script to use volume numbers, rather than complete titles, to build the xml:id values of facsimile elements and related attributes.

It is designed for files that are part of the same series and that are numbered accordingly, as unique volumes in the series.
Use this option if your titles are formed according to the following pattern: "{title}, {unique_number}, page {page_number} - Transcription".
Do not use this option if the same volume number occurs more than once in your series!
Note that this option will have no effect if the --nofacs/-n option is activated.

(.venv)~$ python3 tei2tei.py --volumes

TEI2TEICORPUS

TEI2TEICORPUS can merge several TEI XML files into a single teiCorpus file. The user is required to provide an output filename when running the script. This filename should not include any extension.

INPUT and OUTPUT options

By default, tei2teicorpus.py will merge xml files contained in a directory named input/ and will place the result in a directory named output/, both expected to be in the same directory as tei2teicorpus.py.

(.venv)~$ python3 tei2teicorpus.py

You can specify the directory containing the files to merge and/or the directory where the result of the transformation should be placed using the --input/-i and/or --output/-o options:

(.venv)~$ python3 tei2teicorpus.py --input directory/name --output directory/name

By default, the TEI elements in the final teiCorpus will be ordered randomly. You can use additional options to better control the ordering of the TEI elements.

Additional sorting options:

`--volumes/-v`

Will cause the script to use volume numbers in ascending order to sort the TEI elements.

It is designed for files that are part of the same series and that are numbered accordingly, as unique volumes in the series.
Use this option if your titles are formed according to the following pattern: "{title}, {unique_number}, page {page_number} - Transcription".
Do not use this option if the same volume number occurs more than once in your series, or the duplicates will be overwritten!

(.venv)~$ python3 tei2teicorpus.py --volumes
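
Because repeated volume numbers are overwritten, it can be worth checking the titles before using --volumes. The following sketch assumes you have already collected the <title> strings of your files and that they follow the "{Title}, {number} - Transcription" model. The function duplicated_volumes is hypothetical and is not part of tei2teicorpus.py.

```python
import re
from collections import Counter

# Hypothetical pattern for the "{Title}, {number} - Transcription" model.
VOLUME_TITLE = re.compile(r"^.+, (?P<number>[^,]+) - Transcription$")


def duplicated_volumes(titles):
    """Return the volume numbers that occur in more than one title."""
    counts = Counter()
    for title in titles:
        match = VOLUME_TITLE.match(title.strip())
        if match:  # titles that do not follow the model are ignored here
            counts[match.group("number").strip()] += 1
    return sorted(number for number, count in counts.items() if count > 1)
```

If the returned list is not empty, --volumes is not a safe choice for that set of files.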

`--sort/-s`

Will cause the script to use file names in ascending order to sort TEI elements.

WARNING: this option requires file names to be numbers, which you must assign manually.

(.venv)~$ python3 tei2teicorpus.py --sort
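
The text above does not say whether --sort compares file names as integers or as strings, and the two orders differ as soon as the numbers have different lengths. The sketch below shows a numeric ordering and rejects names that are not numbers. The function numeric_order is hypothetical and only illustrates the idea.

```python
from pathlib import Path


def numeric_order(filenames):
    """Sort file names by the integer value of their stem.

    Raises ValueError if a stem is not made of decimal digits only.
    """
    def key(name):
        stem = Path(name).stem
        if not stem.isdecimal():
            raise ValueError(f"file name is not a number: {name}")
        return int(stem)

    return sorted(filenames, key=key)
```

Zero-padding the names (001.xml, 002.xml, 010.xml) gives the same order under both numeric and alphabetical comparison, which avoids relying on either behaviour.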

--sort/-s and --volumes/-v are NOT compatible

Refining the structure of <body>

The Transkribus export is not very good on this point, and in my opinion it would be good practice to have <div> elements inside <body>, possibly with a @type attribute that could then be filled in on the fly (depending on the type of document: "rapport" here, for example, but "jugement" for labour court (prud'hommes) documents, "compte-rendu", or others).

Ideally, the <p> elements should also be typed, at least the headers and the content.

Example for 9M5

<div type="rapport">
  <pb facs="#facs_24" n="24"/>
  <p facs="#facs_24_r1" type="headerRapport">
    <lb facs="#facs_24_r1l1" n="N001"/>Du 19 Xb 1894
    <lb facs="#facs_24_r1l2" n="N002"/>Préfecture du Rhône
    <lb facs="#facs_24_r1l3" n="N003"/>Commissariat spécial
    <lb facs="#facs_24_r1l4" n="N004"/>Mouvement des tisseurs
    <lb facs="#facs_24_r1l5" n="N005"/>Tissage mécanique
    <lb facs="#facs_24_r1l6" n="N006"/>Réunion privée des ouvriers
    <lb facs="#facs_24_r1l7" n="N007"/>et ouvrières
  </p>
  <p facs="#facs_24_r2" type="contenu">
    <head><lb facs="#facs_24_r2l1" n="N001"/>Rapport </head>
    <lb facs="#facs_24_r2l2" n="N002"/>Une réunion privée organisée par
    <lb facs="#facs_24_r2l3" n="N003"/>la Commission du relèvement des salaires et par
    <lb facs="#facs_24_r2l4" n="N004"/>le syndicat des <work>ouvrières</work> et <work>ouvriers</work> du tisage
    <lb facs="#facs_24_r2l5" n="N005"/>mécanique, a eu lieu hier 18 . <placeName rend="multiline">Boulevard de la Croix </placeName>
    <lb facs="#facs_24_r2l6" n="N006"/>
  </p>
</div>
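
The <div> wrapping proposed above could be applied after the merge with the standard library. This is a minimal sketch, not a feature of the existing scripts: the function wrap_body_in_div is hypothetical, and it assumes that the files declare the usual TEI namespace (http://www.tei-c.org/ns/1.0).

```python
import xml.etree.ElementTree as ET

# Assumption: the exported files use the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI elements without a prefix


def wrap_body_in_div(root, div_type):
    """Move all children of the first <body> into a new <div type=...>."""
    body = root.find(f".//{{{TEI_NS}}}body")
    if body is None:
        raise ValueError("no <body> element found")
    div = ET.Element(f"{{{TEI_NS}}}div", {"type": div_type})
    # Carry over any text placed directly at the start of <body>.
    div.text, body.text = body.text, None
    # Move the existing children into the <div>, keeping their order.
    for child in list(body):
        body.remove(child)
        div.append(child)
    body.append(div)
    return div
```

The div_type argument would take values such as "rapport", "jugement" or "compte-rendu", chosen per document type. Typing the <p> elements would need a separate rule for telling headers from content, which is not defined here.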
