GTCheck

Overview

Check changes in your OCR Ground Truth

If your Ground Truth data is version-controlled in a git repository, you can use GTCheck to validate and commit your modifications. For each line, GTCheck displays the original text, the modified version, and the corresponding image. A virtual keyboard supports character replacements and the transcription of missing text.

Installation

This installation has been tested on Ubuntu and is expected to work in similar environments.

1. Requirements

  • Python > 3.6

2. Copy this repository

git clone https://github.com/UB-Mannheim/GTCheck.git
cd GTCheck

3. Installation into a Python Virtual Environment

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools
pip install -r requirements.txt
python setup.py install

Process steps

Start the server

gtcheck run-server {parameters..}

Add repo

gtcheck add-repo path/to/git-repo {parameters..}

Start single instance

gtcheck run-single path/to/git-repo {parameters..}

Working with page-xml files

Extract page-xml to gt-linepairs

The working directory is either the path to a mets.xml file or the folder containing the page-xml files. If a mets.xml already exists, you can pass it; otherwise a temporary mets file will be created.

gtcheck extract-page path/to/working-directory {-m path/to/mets.xml} -I {INPUTGROUPNAME} -O {OUTPUTGROUPNAME} {parameters...}

Add to server and check! (see above)

Update page-xml files

gtcheck update-page path/to/repo  -I {./ GROUPNAME} {parameters...}

Setup page

On the first page you can set up your git credentials and either select an existing branch or create a new one for committing the modifications.

GTCheck page

On this page you can see and edit the modifications and the original text.

Each modification can be committed (with a commit message), skipped (if it is unclear what to do), added to the staging area so that the staged changes can later be committed all at once, or stashed (keeping the original version).

A virtual keyboard supports character replacements and the transcription of missing text.

FAQ

TIFF images:
Not all browsers support TIFF images.
The current workaround is to install a browser extension.
E.g. for Firefox you can find a TIFF viewer at: https://addons.mozilla.org/de/firefox/addon/

UTF-8 folder names:
For folder names containing UTF-8 (non-ASCII) characters, disable git's path quoting:
git config core.quotepath off
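
To illustrate what this setting changes: with core.quotepath at its default, git prints path names that contain non-ASCII bytes in double quotes with octal escapes (for example "M\303\274ller" for Müller). The following is a minimal sketch for turning such quoted output back into a readable name. The function name unquote_git_path is hypothetical and not part of GTCheck, and the sketch assumes the quoted text contains only ASCII characters and escapes, as it does with the default setting.

```python
def unquote_git_path(quoted):
    """Decode a path as printed by git with core.quotepath enabled.

    Unquoted paths are returned unchanged. Assumes the quoted body holds
    only ASCII characters plus C-style escapes (octal for non-ASCII bytes).
    """
    if len(quoted) < 2 or not (quoted.startswith('"') and quoted.endswith('"')):
        return quoted
    body = quoted[1:-1]
    # Resolve the escapes: each octal escape becomes one code point 0..255.
    resolved = body.encode("latin-1").decode("unicode_escape")
    # Map those code points back to raw bytes and decode them as UTF-8.
    return resolved.encode("latin-1").decode("utf-8")


print(unquote_git_path('"M\\303\\274ller/gtlines"'))
```

With core.quotepath off, git prints such names verbatim, so this decoding step is not needed.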

Spellchecking:
This app uses the browser's spellchecking.
E.g. Firefox:
https://support.mozilla.org/en-US/kb/how-do-i-use-firefox-spell-checker
https://addons.mozilla.org/de/firefox/language-tools/

Copyright and License

Copyright (c) 2020 Universitätsbibliothek Mannheim

Author:

GTCheck is Free Software. You may use it under the terms of the Apache 2.0 License. See LICENSE for details.

Contributors

jkamlah, stweil, zuphilip


gtcheck's Issues

Error when stashing file on split set

If I stash a line on a split gt set, I get the following error (adding, committing and skipping work fine):

git.exc.GitCommandError
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git rm -f /home/thschmidt/GTCheck/gtcheck/static/subrepo/c92654aac0d3a6d8e15ca965a9caafb4203f44fecb8b36f62235c3158b8b63da/duplicate_02_part_01/gtlines/001_0001.gt.txt
  stderr: 'fatal: pathspec '/home/thschmidt/GTCheck/gtcheck/static/subrepo/c92654aac0d3a6d8e15ca965a9caafb4203f44fecb8b36f62235c3158b8b63da/duplicate_02_part_01/gtlines/001_0001.gt.txt' did not match any files'

Traceback (most recent call last):
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/flask/app.py", line 2088, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.handle_exception(e)
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/thschmidt/GTCheck/gtcheck/app.py", line 369, in edit
    repo.git.rm('-f', str(fname))
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/git/cmd.py", line 584, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/git/cmd.py", line 1120, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/home/thschmidt/GTCheck/venv/lib/python3.7/site-packages/git/cmd.py", line 924, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git rm -f /home/thschmidt/GTCheck/gtcheck/static/subrepo/c92654aac0d3a6d8e15ca965a9caafb4203f44fecb8b36f62235c3158b8b63da/duplicate_02_part_01/gtlines/001_0001.gt.txt
  stderr: 'fatal: pathspec '/home/thschmidt/GTCheck/gtcheck/static/subrepo/c92654aac0d3a6d8e15ca965a9caafb4203f44fecb8b36f62235c3158b8b63da/duplicate_02_part_01/gtlines/001_0001.gt.txt' did not match any files'

I checked the mentioned folder and file (/home/thschmidt/GTCheck/gtcheck/static/subrepo/c92654aac0d3a6d8e15ca965a9caafb4203f44fecb8b36f62235c3158b8b63da/duplicate_02_part_01/gtlines/001_0001.gt.txt); both exist.
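
One possible cause, which this report does not confirm, is that the file exists on disk but is not tracked by git in the split copy; git rm reports "did not match any files" for paths it does not track. The following is a minimal sketch of a guard around the removal, assuming that an untracked file can simply be deleted from disk. The names run_git and discard_file are hypothetical and are not GTCheck's code.

```python
import subprocess
from pathlib import Path


def run_git(repo_dir, args):
    """Run git inside repo_dir and return its exit code."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), *args],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def discard_file(repo_dir, fname, git=run_git):
    """Remove fname, using `git rm` only if git tracks it.

    Returns "git-rm" or "unlinked" to show which branch was taken.
    The `git` parameter exists so the git call can be replaced in tests.
    """
    fname = Path(fname)
    # `git ls-files --error-unmatch` exits non-zero for untracked paths.
    tracked = git(repo_dir, ["ls-files", "--error-unmatch", "--", str(fname)]) == 0
    if tracked:
        git(repo_dir, ["rm", "-f", "--", str(fname)])
        return "git-rm"
    if fname.exists():
        fname.unlink()  # untracked file: plain filesystem removal
    return "unlinked"
```

Checking first with ls-files avoids the exit code 128 for untracked files. Whether deleting the file is the right fallback depends on what stashing is meant to do in a split set.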

'NoneType' object has no attribute 'data_stream'

Issue description

Opening a folder with tif and txt files fails on Ubuntu 20.04 using the default Python 3.8.5 and Firefox with the TIFFViewer add-on.

Steps to reproduce the issue

  1. Setup folder with about 300 pairs with images + text
  2. Only the images are already committed; all text files are untracked due to future corrections.
  3. Clone repository, setup venv, install dependencies via pip
  4. Start: gtcheck <path-to-repo>
  5. Review start page: looks fine
  6. Click green button "Start"

What's the expected result?

  • Open View with Images and Text

What's the actual result?

  • Werkzeug Error screen

Additional details / screenshot

127.0.0.1 - - [11/Jun/2021 10:13:50] "POST /gtcheckinit HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/venv/lib/python3.8/site-packages/flask/app.py", line 2088, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/venv/lib/python3.8/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.handle_exception(e)
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/venv/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/venv/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/venv/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/venv/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/gtcheck/app.py", line 341, in gtcheckinit
    return gtcheck()
  File "/home/hartwig/Projekte/work/mlu/ulb/github-gtcheck/gtcheck/app.py", line 172, in gtcheck
    origtext = item.a_blob.data_stream.read().decode('utf-8').lstrip(" ")
AttributeError: 'NoneType' object has no attribute 'data_stream'

Reproduce

Maybe an issue with Python Version
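
Another possible explanation, independent of the Python version: the traceback ends at item.a_blob.data_stream, and in GitPython the a_blob of a diff entry is None when the file has no earlier version, for example a newly added file. That would match step 2 above, where all text files are still untracked. The following is a minimal sketch of a guard, not GTCheck's actual code. The names blob_text and FakeBlob are hypothetical, and FakeBlob only imitates the data_stream attribute of a GitPython blob.

```python
import io


def blob_text(blob, encoding="utf-8"):
    """Return the decoded text of a blob-like object, or "" if there is none.

    Assumption: blob is either None or exposes data_stream.read()
    returning bytes, as GitPython blob objects do.
    """
    if blob is None:
        # No earlier version exists (e.g. the file is new), so there is
        # no original text to show.
        return ""
    return blob.data_stream.read().decode(encoding).lstrip(" ")


class FakeBlob:
    """Hypothetical stand-in for a GitPython blob, for demonstration only."""

    def __init__(self, data):
        self.data_stream = io.BytesIO(data)


print(repr(blob_text(None)))
print(repr(blob_text(FakeBlob("  Zeile eins\n".encode("utf-8")))))
```

With such a guard, the line in app.py could fall back to an empty original text for new files instead of raising AttributeError.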

Splitting set without duplicating it too produces no result

If you try to split a gt set while leaving "duplicate set for multiple keying" = 0, nothing will happen and GTCheck will not split the gt set.

It would be more user-friendly to set "duplicate set for multiple keying" automatically when a user chooses to split a set for multiuser keying by the chosen number of split parts.
