
teitransformations's Introduction

TEITransformations

Scripts to:

  • transform multiple TEI XML files into single TEI XML files (root: TEI)
  • transform multiple TEI XML files into a single TEI XML file (root: teiCorpus)

Both scripts expect TEI XML files created with the ExportFromTranskribus project, with titles that follow specific patterns.

Getting started

Prerequisites

  • tei2tei.py expects, in <title>, a title with the following model: "{Title}, {number}, page {page_number} - Transcription";
  • tei2teicorpus.py expects, in <title>, a title with the following model: "{Title}, {number} - Transcription".

Here {number} refers to a volume, because the scripts were initially designed to transform newspaper transcriptions.
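The two title models above could be checked with regular expressions along these lines. This is a minimal sketch, not the scripts' actual code: the pattern and function names are hypothetical, and it assumes that {number} and {page_number} are written as plain digits.

```python
import re

# Hypothetical pattern for the tei2tei.py title model:
# "{Title}, {number}, page {page_number} - Transcription"
PAGE_TITLE = re.compile(
    r"^(?P<title>.+), (?P<number>\d+), page (?P<page>\d+) - Transcription$"
)

# Hypothetical pattern for the tei2teicorpus.py title model:
# "{Title}, {number} - Transcription"
VOLUME_TITLE = re.compile(r"^(?P<title>.+), (?P<number>\d+) - Transcription$")


def parse_page_title(text):
    """Return (title, volume, page), or None if the model does not match."""
    match = PAGE_TITLE.match(text.strip())
    if match is None:
        return None
    return match["title"], int(match["number"]), int(match["page"])


def parse_volume_title(text):
    """Return (title, volume), or None if the model does not match."""
    match = VOLUME_TITLE.match(text.strip())
    if match is None:
        return None
    return match["title"], int(match["number"])


# "Gazette" is a made-up title used only for illustration.
print(parse_page_title("Gazette, 12, page 3 - Transcription"))
print(parse_volume_title("Gazette, 12 - Transcription"))
```

Returning None rather than raising lets a caller report every file whose title does not fit the model before any merging starts.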

Input formats

  • tei2tei.py expects directories containing TEI XML files in input/. Each directory forms a bundle that is merged into one output TEI XML file.
  • tei2teicorpus.py expects TEI XML files in input/. All files in the directory will be merged into the output TEI XML file.
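The bundle layout expected by tei2tei.py can be pictured with a short directory walk. This is only a sketch under assumptions: `find_bundles` is a hypothetical helper, and it assumes the TEI files use the `.xml` extension.

```python
from pathlib import Path


def find_bundles(input_dir):
    """Map each sub-directory name to its sorted list of XML files.

    Files placed directly in input_dir are ignored, since a bundle
    is a directory.
    """
    bundles = {}
    for entry in sorted(Path(input_dir).iterdir()):
        if entry.is_dir():
            bundles[entry.name] = sorted(entry.glob("*.xml"))
    return bundles
```

Each key of the returned mapping would correspond to one merged output file.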

Installing

  • use a virtual environment with Python 3 and the requirements installed
  • create input/ and output/ directories in TEI2TEI/ and TEI2TEICORPUS/ directories to store and retrieve your data, or use --input/--output options to specify other directories.

Running

TEI2TEI

TEI2TEI can transform groups of XML files according to different features.

INPUT and OUTPUT options

By default, tei2tei.py will merge xml files contained in a directory placed in a directory named input/ and will place the result in a directory named output/. Both input/ and output/ are expected to be in the same directory as tei2tei.py.

(.venv)~$ python3 tei2tei.py

You can specify the directory containing the groups of files to merge and/or the directory where the result of the transformations should be placed using --input/-i and/or --output/-o options:

(.venv)~$ python3 tei2tei.py --input directory/name --output directory/name
Additional options:
--nofacs/-n

Will cause the script to ignore facsimile elements and related attributes:

(.venv)~$ python3 tei2tei.py --nofacs
--volumes/-v

Will cause the script to use volume numbers, rather than the complete title, to build the xml:id values of facsimile elements and related elements or attributes.

  • It is designed for files that are part of the same series and that are numbered accordingly, as unique volumes in the series.
  • Use this option if your titles are formed according to the following pattern: "{title}, {unique_number}, page {page_number} - Transcription".
  • Do not use this option if the same volume number appears more than once in your series!
  • Note that this option will have no effect if the --nofacs/-n option is activated.
(.venv)~$ python3 tei2tei.py --volumes

TEI2TEICORPUS

TEI2TEICORPUS can merge several TEI XML files into a single teiCorpus file. You must provide an output filename when running the script. This filename should not include any extension.
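As an illustration of what such a merge amounts to, the sketch below wraps parsed TEI roots in one teiCorpus element using the standard library. It is not the script's code: `build_corpus` is a hypothetical name, it assumes the input files use the TEI namespace, and it leaves out the corpus-level teiHeader that a valid teiCorpus also needs.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def build_corpus(tei_documents):
    """Wrap already-parsed TEI root elements in a single teiCorpus element."""
    corpus = ET.Element(f"{{{TEI_NS}}}teiCorpus")
    for root in tei_documents:
        if root.tag != f"{{{TEI_NS}}}TEI":
            raise ValueError(f"expected a TEI root element, got {root.tag}")
        corpus.append(root)
    return corpus
```

The order of `tei_documents` decides the order of the TEI elements in the result, which is why sorting matters.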

INPUT and OUTPUT options

By default, tei2teicorpus.py will merge xml files contained in a directory named input/ and will place the result in a directory named output/, both expected to be in the same directory as tei2teicorpus.py.

(.venv)~$ python3 tei2teicorpus.py

You can specify the directory containing the files to merge and/or the directory where the result of the transformation should be placed using --input/-i and/or --output/-o options:

(.venv)~$ python3 tei2teicorpus.py --input directory/name --output directory/name

By default, the TEI elements in the final teiCorpus will be ordered randomly. You can use additional options to better control the ordering of the TEI elements.

Additional sorting options:
--volumes/-v

Will cause the script to use volume numbers in ascending order to sort the TEI elements.

  • It is designed for files that are part of the same series and that are numbered accordingly, as unique volumes in the series.
  • Use this option if your titles are formed according to the following pattern: "{title}, {unique_number}, page {page_number} - Transcription".
  • Do not use this option if the same volume number appears more than once in your series, or the duplicates will be overwritten!
(.venv)~$ python3 tei2teicorpus.py --volumes
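One way to picture both the ordering and the overwriting warning is an index keyed by volume number. This is a sketch under assumptions, not the script's implementation: the function names and the regular expression are hypothetical, and whether the script actually stores documents this way is a guess.

```python
import re

# Hypothetical pattern: the volume number is the integer that precedes
# an optional ", page N" and the final " - Transcription".
VOLUME = re.compile(r", (\d+)(?:, page \d+)? - Transcription$")


def index_by_volume(titled_docs):
    """Index (title, document) pairs by volume number.

    A later document with the same volume number replaces the earlier one.
    """
    by_volume = {}
    for title, doc in titled_docs:
        match = VOLUME.search(title)
        if match is None:
            raise ValueError(f"no volume number in title: {title!r}")
        by_volume[int(match.group(1))] = doc
    return by_volume


def sort_by_volume(titled_docs):
    """Return the documents in ascending volume order."""
    by_volume = index_by_volume(titled_docs)
    return [by_volume[number] for number in sorted(by_volume)]
```

With two titles carrying volume 2, only the second document survives in the index, which is the kind of loss the warning describes.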
--sort/-s

Will cause the script to use file names in ascending order to sort TEI elements.

  • WARNING: this option requires filenames to be numbers that you added manually.
(.venv)~$ python3 tei2teicorpus.py --sort
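The requirement for numeric filenames makes sense once numeric and alphabetical ordering are compared. A minimal sketch, with `sort_numerically` as a hypothetical helper:

```python
from pathlib import Path


def sort_numerically(paths):
    """Sort paths by the integer value of their file stem ("10.xml" -> 10).

    Raises ValueError if a stem is not a number.
    """
    return sorted(paths, key=lambda path: int(Path(path).stem))


names = ["10.xml", "2.xml", "1.xml"]
print(sorted(names))            # alphabetical: 10.xml comes before 2.xml
print(sort_numerically(names))  # numeric: 1.xml, 2.xml, 10.xml
```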

--sort/-s and --volumes/-v are NOT compatible
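A command-line parser can enforce this incompatibility itself. The sketch below uses argparse's mutually exclusive groups with the option names documented above. The defaults and help strings are assumptions, and nothing here is taken from the script's source.

```python
import argparse


def build_parser():
    """Build a parser in which --volumes and --sort cannot be combined."""
    parser = argparse.ArgumentParser(prog="tei2teicorpus.py")
    parser.add_argument("-i", "--input", default="input",
                        help="directory containing the files to merge")
    parser.add_argument("-o", "--output", default="output",
                        help="directory receiving the result")
    ordering = parser.add_mutually_exclusive_group()
    ordering.add_argument("-v", "--volumes", action="store_true",
                          help="sort TEI elements by volume number")
    ordering.add_argument("-s", "--sort", action="store_true",
                          help="sort TEI elements by numeric file name")
    return parser
```

Passing both flags makes argparse print a usage error and exit before any file is read.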

teitransformations's People

Contributors

alix-tz, charlesriondet

teitransformations's Issues

Refine the structure of <body>

The Transkribus export is not great on this point, and in my opinion it would be good practice to have <div> elements inside <body>, possibly with a @type attribute that could then be filled in on the fly (depending on the document type: "rapport" here, for example, but "jugement" for labor court (prud'hommes) documents, "compte-rendu", or others).

Ideally, the <p> elements should also be typed, at least the headers and the content.

Example for 9M5 (the proposed additions are the <div type="rapport"> wrapper and the type attributes on <p>):

<div type="rapport">
  <pb facs="#facs_24" n="24"/>
  <p facs="#facs_24_r1" type="headerRapport">
    <lb facs="#facs_24_r1l1" n="N001"/>Du 19 Xb 1894
    <lb facs="#facs_24_r1l2" n="N002"/>Préfecture du Rhône
    <lb facs="#facs_24_r1l3" n="N003"/>Commissariat spécial
    <lb facs="#facs_24_r1l4" n="N004"/>Mouvement des tisseurs
    <lb facs="#facs_24_r1l5" n="N005"/>Tissage mécanique
    <lb facs="#facs_24_r1l6" n="N006"/>Réunion privée des ouvriers
    <lb facs="#facs_24_r1l7" n="N007"/>et ouvrières
  </p>
  <p facs="#facs_24_r2" type="contenu">
    <head><lb facs="#facs_24_r2l1" n="N001"/>Rapport </head>
    <lb facs="#facs_24_r2l2" n="N002"/>Une réunion privée organisée par
    <lb facs="#facs_24_r2l3" n="N003"/>la Commission du relèvement des salaires et par
    <lb facs="#facs_24_r2l4" n="N004"/>le syndicat des <work>ouvrières</work> et <work>ouvriers</work> du tisage
    <lb facs="#facs_24_r2l5" n="N005"/>mécanique, a eu lieu hier 18 . <placeName rend="multiline">Boulevard de la Croix </placeName>
    <lb facs="#facs_24_r2l6" n="N006"/>
  </p>
</div>
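The wrapping requested in this issue could be sketched as follows with the standard library. This is an assumption-laden illustration, not a proposed patch: `wrap_body_in_div` is a hypothetical helper, and it assumes the exported elements are in the TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_body_in_div(body, div_type):
    """Move every child of <body> into a new <div type=...>.

    The new div becomes the only child of body and is returned.
    """
    div = ET.Element(f"{{{TEI_NS}}}div", {"type": div_type})
    for child in list(body):  # copy: body is modified while looping
        body.remove(child)
        div.append(child)
    # Text located before the first child now belongs to the div.
    div.text, body.text = body.text, None
    body.append(div)
    return div
```

Typing the <p> elements would need a separate rule for telling headers from content, which this sketch does not attempt.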

Fix the --nofacs mode

Properly implement the removal of facs attributes when the program is run with the --nofacs option: for now they remain in the output files, even when the facsimile elements have been removed.
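A minimal sketch of the behavior this issue asks for, written with the standard library. `strip_facsimiles` is a hypothetical function, not the project's code, and it assumes the facsimile elements are in the TEI namespace and that the attribute is a plain `facs`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def strip_facsimiles(root):
    """Remove <facsimile> elements and every facs attribute, in place."""
    # Snapshot the tree first, since elements are removed while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{TEI_NS}}}facsimile":
                parent.remove(child)
    # Drop the now dangling references from the remaining elements.
    for element in root.iter():
        element.attrib.pop("facs", None)
    return root
```

Doing both steps in one function keeps the output free of facs references that point at removed elements.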

Rework the computation of facs ids

When several TEI XML files containing facsimiles are merged, the ids of the facsimile elements and their sub-elements are duplicated.

Non-ASCII characters must also be removed from the computed ids.
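One possible approach to both points is to prefix each local id with an ASCII-only slug derived from something unique to the source file, such as its title. The sketch below is an assumption, not the project's method: `ascii_id` and `prefixed_id` are hypothetical names, and the sample title is made up.

```python
import re
import unicodedata


def ascii_id(text):
    """Reduce text to ASCII letters, digits, and underscores."""
    # Split accented letters into base letter + accent, then drop
    # everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_")
    # An xml:id must not start with a digit, so prefix when needed.
    return slug if slug[:1].isalpha() else f"id_{slug}"


def prefixed_id(prefix, local_id):
    """Make a per-file id unique across merged files by prefixing it."""
    return f"{ascii_id(prefix)}_{local_id}"


print(prefixed_id("Gazette de l'été, 2", "facs_24_r1"))
```

Every facs attribute that points at a renamed id (for example facs="#facs_24_r1") would have to be rewritten with the same prefix.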

Pipe TEI2TEI and TEI2TEICORPUS

It would be useful to chain the two commands:
1. merge files into single TEI files (group the pages of a single document);
2. gather all the produced files in a teiCorpus (all the documents on the same subject, for instance).

In my use case: rebuild the documents of 9M5, then create a teiCorpus grouping all of them.
