ocr-d / gt-repo-template

A template for creating a ground truth repo with various functions and features, such as metadata creation, data analysis and presentation.

License: Creative Commons Attribution Share Alike 4.0 International

Topics: ground-truth, ocr-d, pagexml, repository, template

gt-repo-template's Introduction

🔑 What must they do?

A template for the creation of a ground truth repo with the following functions and features:

  • Publication of the Ground Truth data
  • Documentation and archiving of the Ground Truth
    • Assistance with the creation of metadata for the Ground Truth Repo
  • Specifications for the uniform storage and organization of the Ground Truth Repo
  • Automatic functions run by a GitHub Actions workflow

👷 👷‍♀️ How to use the template

Step 1

  • Create a repository for your Ground Truth data publication by clicking the Use this Template button.
  • Save your data to the repository. Your data should be stored in the data directory. See Organization of directories and files in the GT-Repo.
  • You do not need to create a README.md file yourself: it is first created automatically and can be expanded manually in a subsequent step.
  • The LICENSE.md file should match the license of your data. Use "Choose an open source license" to select a suitable license.

Step 2

  • Create metadata for your ground truth dataset.
  • Metadata is necessary to ensure that your repository is correctly documented. Use the metadata form to record it correctly.

Step 3

  • The template contains tools that automatically create specific web pages from the stored metadata and ground truth data. You can publish these as GitHub Pages. To do this:
    1. Start the analysis by pushing a version tag; see How to start the automatic functions?
    2. Adjust the GitHub Pages setting: select the gh-pages branch.

Step 4

  • After creating the repository, saving and pushing the data, and running the automatic analysis with the GitHub workflow, you can customize the README.md file.
  • The README.md file is also created during the analysis. It contains the metadata, data about the corpus, and an extent section that you can customize.
  • To customize the README.md file, make your additions in the <div id="extent"> section.
  • The old version of the README.md file is kept in the readme_old directory. The current version is in the main branch.

🗉 METS File

The gt-repo-template can generate METS files for GT data by analyzing both the data structure and the PAGE files. Although this automated function is available, consider creating a custom METS file.

A custom METS file can contain various elements, including bibliographic and provenance data. It is important that it respects the OCR-D METS specification.

Use the following file group (fileGrp) in the METS file to reference the images.

<mets:fileGrp USE="OCR-D-IMG">

It's important to note that referencing PAGE files using URLs/URIs is not permitted. PAGE files should be stored in the repository and referenced within the METS file as follows:

<mets:FLocat xlink:href="GT-PAGE/[optional directory]/[PAGE-File.xml]" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>

The image files should either be referenced via a URL/URI in the METS file or, if the image files are stored in the repository, specified as a file reference in the METS file.

  • URL/URI:
<mets:FLocat xlink:href="https://opendata.uni-halle.de/retrieve/0775684d-82e9-4cb0-8e03-02f34c97949a/00000412.jpg" LOCTYPE="URL"/>
  • File Reference:
<mets:FLocat xlink:href="GT-PAGE/[optional directory]/[image directory optional]/00000412.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
  • File Reference and file group (fileGrp) example
<mets:fileGrp USE="OCR-D-IMG">
         <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG_0001" GROUPID="OCR-D-IMG_0001">
            <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="jpg/rudolstadt_weiber_1683_0005.jpg"/>
         </mets:file>
</mets:fileGrp>
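
The referencing rules above can be checked mechanically before pushing. The following is a minimal sketch, not part of the template: check_mets_references is a hypothetical helper, and treating every file group other than OCR-D-IMG as a PAGE group (where URLs are not permitted) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def check_mets_references(mets_xml: str) -> list[str]:
    """Return human-readable problems found in a METS document (hypothetical helper)."""
    root = ET.fromstring(mets_xml)
    problems = []
    groups = root.findall(f".//{{{METS_NS}}}fileGrp")

    # The images are expected in a file group with USE="OCR-D-IMG".
    if not any(group.get("USE") == "OCR-D-IMG" for group in groups):
        problems.append('no fileGrp with USE="OCR-D-IMG"')

    for group in groups:
        if group.get("USE") == "OCR-D-IMG":
            continue  # images may be URLs or file references
        # Assumption: all other groups hold PAGE files, which must not be URLs.
        for flocat in group.findall(f"{{{METS_NS}}}file/{{{METS_NS}}}FLocat"):
            href = flocat.get(f"{{{XLINK_NS}}}href", "")
            is_url = href.startswith(("http://", "https://"))
            if flocat.get("LOCTYPE") == "URL" or is_url:
                problems.append(f"non-image file referenced by URL: {href}")
    return problems
```

An empty result means that none of the checked rules were violated; it does not validate the file against the full OCR-D METS specification.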

🗀 Organization of directories and files in the GT-Repo

The structure of the repo is the following:

├── METADATA.yml
├── LICENSE.md
└── data
    └── document_title or identifier
        ├── GT-PAGE
        └── mets.xml
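
A local check of this layout can catch mistakes before the workflow runs. The sketch below is an illustration only: find_layout_problems and its require_mets parameter are hypothetical names, and the check assumes the lowercase data directory shown in the tree. Because the template can generate a METS file itself, a missing mets.xml is only reported on request.

```python
from pathlib import Path


def find_layout_problems(repo_root: Path, require_mets: bool = False) -> list[str]:
    """List deviations from the expected GT repo layout (hypothetical helper)."""
    problems = []
    for name in ("METADATA.yml", "LICENSE.md"):
        if not (repo_root / name).is_file():
            problems.append(f"missing {name}")

    data_dir = repo_root / "data"
    if not data_dir.is_dir():
        problems.append("missing data directory")
        return problems

    # Every document directory needs GT-PAGE directly beneath it.
    for doc_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        if not (doc_dir / "GT-PAGE").is_dir():
            problems.append(f"{doc_dir.name}: missing GT-PAGE directory")
        if require_mets and not (doc_dir / "mets.xml").is_file():
            problems.append(f"{doc_dir.name}: missing mets.xml")
    return problems
```

Calling find_layout_problems(Path(".")) from the repository root returns an empty list when the layout matches the tree above.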

Cached image files can be stored in one of two places:

  • In a separate directory.
  • In the same directory as the text transcription (inside the GT-PAGE folder).

If you use your own METS file, the images must be referenced in it.

  • They can be referenced in the METS file as a URL/URI or as a file reference. Example (file reference):
<mets:fileGrp USE="OCR-D-IMG">
         <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG_0001" GROUPID="OCR-D-IMG_0001">
            <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="jpg/rudolstadt_weiber_1683_0005.jpg"/>
         </mets:file>
</mets:fileGrp>

Linked image files in the PAGE file as directory/file name or URL/URI:

  • The image may be referenced in a Transkribus PAGE file, an eScriptorium PAGE file or a plain PAGE file as directory/file name or URL/URI. Examples:

Transkribus

<TranskribusMetadata docId="1256538" pageId="50892347" pageNr="1" tsid="105748322" status="GT" userId="48446" imgUrl="https://files.transkribus.eu/Get?id=SFNIJNJBHWZPNRYZCAIWBJIA&amp;fileType=view" xmlUrl="https://files.transkribus.eu/Get?id=TWZJHYTDEPJDGTXDWJQAXHXH" imageId="27308940"/>

eScriptorium

<Metadata externalRef="https://images.sub.uni-goettingen.de/iiif/image/gdz:PPN643815198:00000008/full/full/0/default.jpg">

Plain PAGE file (Aletheia)

<Page imageFilename="../jpg/brockes_vergnuegen07_1743_0004.jpg" imageWidth="2848" imageHeight="4288" type="content">

A directory/file name reference to the image file must always be relative to the PAGE file. In this case, the image files must be saved in the repo or referenced in a METS file.
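
To illustrate what "relative to the PAGE file" means, here is a small sketch that resolves the imageFilename attribute against the location of the PAGE file. The function name resolve_image_reference and the example repository path are hypothetical. It only covers the plain PAGE case, not the Transkribus imgUrl or eScriptorium externalRef attributes.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def resolve_image_reference(page_xml: str, page_path: str) -> str:
    """Return the image URL, or the image path relative to the repo root.

    page_path is the PAGE file's own path relative to the repo root.
    """
    root = ET.fromstring(page_xml)
    # Match the Page element by local name so any PAGE schema version works.
    page = next(
        (el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Page"), None
    )
    if page is None or not page.get("imageFilename"):
        raise ValueError("no Page element with an imageFilename attribute")

    ref = page.get("imageFilename")
    if urlparse(ref).scheme in ("http", "https"):
        return ref  # URL/URI reference: nothing to resolve
    # File reference: interpret it relative to the PAGE file's directory.
    return posixpath.normpath(posixpath.join(posixpath.dirname(page_path), ref))
```

For a PAGE file stored at data/brockes/GT-PAGE/brockes_vergnuegen07_1743_0004.xml (a made-up location), the reference ../jpg/brockes_vergnuegen07_1743_0004.jpg resolves to data/brockes/jpg/brockes_vergnuegen07_1743_0004.jpg.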

🤖 How to start the automatic functions?

The github-action-workflow is triggered when a version tag (e.g. v1.8.11) is pushed. The version tag consists of the lowercase letter v (for version) followed by a three-part number code, e.g. 1.8.11. The parts have the following meaning:

  • the first number indicates the version number (1)
  • the second number indicates the feature (8)
  • the third number indicates fixes, patches, ... (11)
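
The tag format described above can be expressed as a short check. This is a sketch under assumptions: the regular expression is derived from the description here, not copied from the workflow file, and parse_version_tag is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Lowercase "v" followed by three dot-separated numbers, e.g. v1.8.11.
VERSION_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_version_tag(tag: str) -> Optional[Tuple[int, int, int]]:
    """Split a tag into (version, feature, fix), or return None if it does not match."""
    match = VERSION_TAG.fullmatch(tag)
    if match is None:
        return None
    version, feature, fix = (int(part) for part in match.groups())
    return version, feature, fix
```

For example, parse_version_tag("v1.8.11") returns (1, 8, 11), while "1.8.11" and "V1.8.11" return None because the lowercase v is missing.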

📓 GT repo metadata

You can find metadata about the GT Repo in the following files.

  • mets.xml
  • metadata.json
  • metadata.yml
  • CITATION.cff

The metadata files have the same content; only the formats differ. You can find the file at:

gt-repo-template's People

Contributors

bertsky, kba, markusweigelt, stweil, tboenig

gt-repo-template's Issues

make all created repos findable by adding a Github topic in CD

It would be very useful if all GT repos which (successfully) used this template could be easily identified as such on Github.

The best way to achieve this IMO would be to have the gtrepo workflow add a step which uses the Github API (and Github Action Token) to add the topic automatically.

For the concrete string I would propose ocr-d-gt.

The next best thing at the moment is a Github search for the strings in the generated Readme.
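
As a rough sketch of what such a workflow step could send: the GitHub REST API replaces a repository's full topic list with a PUT to /repos/{owner}/{repo}/topics, so the existing topics have to be merged with the new one first. The helper names below are hypothetical, the request is only built and not sent, and whether the default Actions token is allowed to change topics is not confirmed here.

```python
import json
import urllib.request


def merged_topics(existing: list[str], new_topic: str = "ocr-d-gt") -> list[str]:
    """Add new_topic to the existing topics without creating a duplicate."""
    return list(existing) if new_topic in existing else [*existing, new_topic]


def build_topics_request(
    owner: str, repo: str, token: str, topics: list[str]
) -> urllib.request.Request:
    """Build (but do not send) the request that replaces all repository topics."""
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/topics",
        data=json.dumps({"names": topics}).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned request to urllib.request.urlopen would perform the update; the current topics would first have to be fetched with a GET on the same URL.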

"File not found" error in gtrepo action

In the analyse_and_makebagit part of the gtrepo action we encountered a missing file error during the DHd workshop.

The problem was resolved by substituting the METADATA.yml for another one known to be working. Below are the error message and the (probably faulty) METADATA.yml (as txt, yml wasn't allowed), which was generated through the suggested form at https://tboenig.github.io/gt-metadata/document-your-gt.html .

See action under: https://github.com/maxte0/ocr_trainingsdaten/actions/runs/8067064928/job/22036618726
Attachments: image (screenshot of the error message), METADATA.txt

Alert invalid data layout

Description

It is required to hold data in a directory layout with the subfolder GT-PAGE straight beneath the directory which contains the mets.xml.

Currently, if this layout is not matched, the workflow creates a rather empty bagit-file without any data and an empty mets.xml.
Instead of behaving like this, it would be better to signal the invalid layout and make the workflow yield the error.

Requirements regarding directory naming

Description

Currently, it is not clear to the user whether the name GT-PAGE for the directory containing the ground truth data is a hard and fixed requirement. It seems to be; therefore, using a different label should yield an error instead of letting the action run green.

Extension: Rename Images to fit GT-File by name

Description

Currently, when image files are referenced in mets.xml in a file group, they get downloaded and pushed to a directory.
This way, the naming similarity between image and GT data is lost. But this similarity is a key requirement for tools like Transkribus or LAREX to match images with GT data for further corrections or extensions.

Even worse, because our data consists of an overall sample of 40,000+ prints, it includes, for example, several images named "00000008.jpg", which could overwrite each other.

use caching for speedup

We should heavily use action caching to save time and bandwidth:

  • installation of required tools
  • cloning of required repos (mind that we can still pull from the repo once it's in the cache)
  • analysis and transforms: as long as the respective input files did not change (so the cache keys must match exactly the path names)
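
For the third point, a cache key has to change whenever an input file's path or content changes. The sketch below shows one way to derive such a key; cache_key, its prefix argument and the default glob pattern are assumptions for illustration, not something the workflow currently contains.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, root: Path, pattern: str = "**/*.xml") -> str:
    """Derive a key that changes when any matching file's path or content changes."""
    digest = hashlib.sha256()
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        # Hash the relative path as well, so renames invalidate the cache too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()}"
```

Sorting the paths keeps the key stable across runs, independent of the order in which the file system lists the files.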

Missing GT-files

Description

The current action workflow fails, and I can't figure out why; see gt-test.

I used the same structure as the one we tried together, only with more items (16) and some meaningful MODS metadata for each image.

Whereas the images with their absolute URLs are present, the make bagit job seems to miss the GT files, which are relative to the project:

 10:50:55.547 INFO ocrd.workspace_bagger - Handling OcrdFile <OcrdFile fileGrp=OCR-D-IMG ID=IMG_MAX_1290695, mimetype=image/jpeg, url=https://opendata.uni-halle.de/retrieve/aca4b878-c92b-47b7-92e7-99da69c00846/00000442.jpg, local_filename=---]/> 
10:50:56.949 INFO ocrd.workspace_bagger - Handling OcrdFile <OcrdFile fileGrp=FULLTEXT ID=FULLTEXT_1, mimetype=application/alto+xml, url=---, local_filename=---]/> 
Traceback (most recent call last):
  File "/home/runner/.local/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/runner/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/runner/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/.local/lib/python3.10/site-packages/ocrd/cli/zip.py", line 54, in bag
    workspace_bagger.bag(
  File "/home/runner/.local/lib/python3.10/site-packages/ocrd/workspace_bagger.py", line 175, in bag
    total_bytes, total_files = self._bag_mets_files(workspace, bagdir, ocrd_mets, processes)
  File "/home/runner/.local/lib/python3.10/site-packages/ocrd/workspace_bagger.py", line 80, in _bag_mets_files
    self.resolver.download_to_directory(file_grp_dir, f.url, basename=f.basename)
  File "/home/runner/.local/lib/python3.10/site-packages/ocrd/resolver.py", line 63, in download_to_directory
    raise ValueError(f"'url' must be a non-empty string, not '{url}'") # actually Path also ok
ValueError: 'url' must be a non-empty string, not ''
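
The traceback shows that the bagger received an empty URL for the file FULLTEXT_1. A pre-check along the following lines could list such entries before bagging. It is a sketch only: files_without_location is a hypothetical helper that inspects the METS file directly, and it does not claim to identify the actual cause of this failure.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def files_without_location(mets_xml: str) -> list[str]:
    """Return the IDs of mets:file entries that have no non-empty xlink:href."""
    root = ET.fromstring(mets_xml)
    missing = []
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        hrefs = [
            flocat.get(f"{{{XLINK_NS}}}href", "").strip()
            for flocat in file_el.findall(f"{{{METS_NS}}}FLocat")
        ]
        # No FLocat at all, or only empty hrefs: nothing to download or copy.
        if not any(hrefs):
            missing.append(file_el.get("ID", "(no ID)"))
    return missing
```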
