Giter Site home page Giter Site logo

zuphilip / ocr-fileformat Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ub-mannheim/ocr-fileformat

0.0 3.0 0.0 755 KB

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)

Home Page: http://digi.bib.uni-mannheim.de/ocr-fileformat/

License: MIT License

Makefile 11.34% Shell 26.21% PHP 13.11% HTML 29.27% CSS 1.59% JavaScript 15.12% XSLT 3.35%

ocr-fileformat's Introduction

ocr-fileformat

Build Status GitHub release

Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)

Screenshot GUI

Installation

To install system-wide to /usr/local:

sudo make install

To install without sudo to your home directory:

make install PREFIX=$HOME/.local

If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin $PATH"

The web application has a PHP backed. You can deploy it on any PHP-capable server by copying the web folder somewhere below the document root of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:

sudo -u www-data cp -r web /var/www/html/ocr-fileformat

In this example the GUI would be available under http://localhost/ocr-fileformat/.

Usage

The project offers two functionalities, which can be accessd via a command line script (CLI), using a web interface (GUI) or in you own tools (API)

CLI

  • ocr-transform: Transformation of OCR output between OCR formats
  • ocr-validate: Validation of OCR output against OCR format schemas

API

Transformation

Transformation CLI

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

For example, you can transform an ALTO XML to a hOCR file with:

ocr-transform alto hocr sample.xml sample.hocr

Or convert from ALTO XML (version 2.1) to hOCR with:

ocr-transform alto2.1 hocr sample.alto sample.hocr

You can also pass arguments directly to the Saxon CLI by passing them after a double dash (--). For example, to set the foo parameter to bar:

ocr-transform alto hocr sample.xml sample.hocr -- foo=bar

Try ocr-transform -h to get an overview:

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
Input formats:
- 'alto'
- 'hocr'
Output formats:
- 'alto2.0'
- 'alto2.1'
- 'hocr'
Saxon-HE 9.7.0.4J from Saxonica
Java version 1.7.0_95
Usage: see http://www.saxonica.com/html/documentation/using-xsl/commandline.html
Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -l -license -m -nogo -now -o -opt -or -outval -p -pack -quit -r -repeat -s -sa -scmin -strip -t -T -threads -TJ -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -xsltversion -y
Use -XYZ:? for details of option XYZ
Params: 
param=value           Set stylesheet string parameter
+param=filename       Set stylesheet document parameter
?param=expression     Set stylesheet parameter using XPath
!param=value          Set serialization parameter

Transformation GUI

Select the Transform menu option. Choose a URL, an input and an output format. Click Transform.

Transformation API

The stylesheets are installed in $PREFIX/share/ocr-fileformat/xslt and can be used directly in your scripts and software. You will need to use an XSLT 2.0 capable stylesheet transformer.

Supported Transformations

From ╲ To hOCR ALTO PAGEXML FineReader Plain Text
hOCR - - -
ALTO - - -
PAGE - - - - -
FineReader - - - - -

Validation

Usage: ocr-validate [-dh] <schema> <file>

Validation CLI

For example, to validate an XML file againt the ALTO 3.1 schema:

ocr-validate alto-3-1 myFile.alto

Validation GUI

Select the Validate menu option. Choose a URL and an schema. Click Validate.

Validation API

The XSD files are installed under $PREFIX/share/ocr-fileformat/xsd

Supported Validation Formats

hOCR ALTO PAGEXML FineReader
Validation ✖️

License

This is free software. You may use it under the terms of the MIT License.

During the installation process several projects are included (in ./vendor). These projects have different licenses:

ocr-fileformat's People

Contributors

kba avatar zuphilip avatar stweil avatar cneud avatar

Watchers

James Cloos avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.