This project forked from ocr-d/core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

License: Apache License 2.0

OCR-D/core

Python modules implementing OCR-D specs and related tools

Introduction

This repository contains the python packages that form the base for tools within the OCR-D ecosphere.

All packages are also published to PyPI.

Installation

NOTE Unless you want to contribute to OCR-D/core, we recommend installation as part of ocrd_all which installs a complete stack of OCR-D-related software.

The easiest way to install is via pip:

pip install ocrd

# or just the functionality you need, e.g.

pip install ocrd_modelfactory

All python software released by OCR-D requires Python 3.7 or higher.

NOTE Some OCR-D tools (or even test cases) might behave unexpectedly if your environment has specific modifications, such as:

  • using a custom build of ImageMagick whose format delegates differ from what OCR-D expects
  • custom Python logging configurations in your personal account

Command line tools

NOTE: All OCR-D CLI tools support a --help flag which shows usage and supported flags, options and arguments.

ocrd CLI

ocrd-dummy CLI

A minimal OCR-D processor that copies from -I/--input-file-grp to -O/--output-file-grp.

Configuration

Almost all behaviour of the OCR-D/core software is configured via CLI options and flags, which can be listed with the --help flag that all CLIs support.

Some parts of the software are configured via environment variables:

  • OCRD_METS_CACHING: If set to true, OcrdMets data structures are cached in memory, which speeds up searching and modifying the METS file during processing or workspace operations.

  • OCRD_PROFILE: Configures the built-in CPU and memory profiling. If empty, no profiling is done. Otherwise, it is expected to contain any of the following tokens:

    • CPU: Enable CPU profiling of processor runs
    • RSS: Enable RSS memory profiling
    • PSS: Enable proportionate memory profiling
  • OCRD_PROFILE_FILE: If set, the CPU profile is written to this file for later inspection with analysis tools like snakeviz.

  • PATH: Search path for processor executables (affects ocrd process and ocrd resmgr).

  • HOME: Directory to look for ocrd_logging.conf, fallback for unset XDG variables (see below).

  • XDG_CONFIG_HOME: Directory to look for ./ocrd/resources.yml (i.e. ocrd resmgr user database) – defaults to $HOME/.config.

  • XDG_DATA_HOME: Directory to look for ./ocrd-resources/* (i.e. ocrd resmgr data location) – defaults to $HOME/.local/share.

  • OCRD_DOWNLOAD_RETRIES: Number of times to retry failed attempts for downloads of workspace files.

  • OCRD_DOWNLOAD_TIMEOUT: Timeout in seconds for connecting or reading (comma-separated) when downloading.

  • OCRD_MAX_PROCESSOR_CACHE: Maximum number of processor instances (for each set of parameters) to be kept in memory (including loaded models) for processing workers or processor servers.

  • OCRD_NETWORK_SERVER_ADDR_PROCESSING: Default address of Processing Server to connect to (for ocrd network client processing).

  • OCRD_NETWORK_SERVER_ADDR_WORKFLOW: Default address of Workflow Server to connect to (for ocrd network client workflow).

  • OCRD_NETWORK_SERVER_ADDR_WORKSPACE: Default address of Workspace Server to connect to (for ocrd network client workspace).

  • OCRD_NETWORK_WORKER_QUEUE_CONNECT_ATTEMPTS: Number of attempts for a worker to create its queue. Helpful if the rabbitmq-server needs time to start up fully.
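
As a rough illustration of the variables above, the following shell sketch sets a few of them for one session and resolves the documented XDG fallbacks. The concrete values (CPU RSS, 10,30, the profile path, the retry count) are assumptions chosen for demonstration. In particular, separating the profiling tokens with a space is a guess, since only the token names are specified above.

```shell
#!/usr/bin/env bash
# Illustrative values only; adjust to your setup.
export OCRD_METS_CACHING=true
export OCRD_PROFILE="CPU RSS"                      # assumed separator: a space
export OCRD_PROFILE_FILE="/tmp/ocrd-profile.prof"  # hypothetical path
export OCRD_DOWNLOAD_RETRIES=3
export OCRD_DOWNLOAD_TIMEOUT="10,30"               # assumed order: connect,read (seconds)

# Resolve the locations described above, applying the documented defaults.
resources_yml="${XDG_CONFIG_HOME:-$HOME/.config}/ocrd/resources.yml"
resources_dir="${XDG_DATA_HOME:-$HOME/.local/share}/ocrd-resources"

# Split the comma-separated timeout into its two parts.
IFS=, read -r connect_timeout read_timeout <<< "$OCRD_DOWNLOAD_TIMEOUT"

echo "resmgr user database: $resources_yml"
echo "resmgr data location: $resources_dir"
echo "connect timeout: ${connect_timeout}s, read timeout: ${read_timeout}s"
```

Exporting the variables in the calling shell makes them visible to any ocrd command started from it afterwards.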

Packages

ocrd_utils

Contains utilities and constants, e.g. for logging, path normalization, coordinate calculation etc.

See README for ocrd_utils for further information.

ocrd_models

Contains file format wrappers for PAGE-XML, METS, EXIF metadata etc.

See README for ocrd_models for further information.

ocrd_modelfactory

Code to instantiate models from existing data.

See README for ocrd_modelfactory for further information.

ocrd_validators

Schemas and routines for validating BagIt, ocrd-tool.json, workspaces, METS, page, CLI parameters etc.

See README for ocrd_validators for further information.

ocrd_network

Components related to the OCR-D Web API.

See README for ocrd_network for further information.

ocrd

Depends on all of the above, also contains decorators and classes for creating OCR-D processors and CLIs.

Also contains the command line tool ocrd.

See README for ocrd for further information.

bash library

Builds a bash script that can be sourced by other bash scripts to create OCR-D-compliant CLIs.

For example:

source `ocrd bashlib filename`
declare -A NAMESPACES MIMETYPES
eval NAMESPACES=( `ocrd bashlib constants NAMESPACES` )
echo ${NAMESPACES[page]}
eval MIMETYPE_PAGE=( `ocrd bashlib constants MIMETYPE_PAGE` )
echo $MIMETYPE_PAGE
eval MIMETYPES=( `ocrd bashlib constants EXT_TO_MIME` )
echo ${MIMETYPES[.jpg]}

bashlib CLI

See CLI usage

bashlib API

ocrd__raise

Raise an error and exit.

ocrd__log

Delegate logging to ocrd log.

ocrd__minversion

Ensure minimum version.
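
For readers unfamiliar with version checks in bash, here is a minimal sketch of how such a comparison can be done. The helper version_at_least is hypothetical and is not the bashlib implementation; it relies on the version sort (sort -V) of GNU coreutils, and the version numbers are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    local have="$1" need="$2"
    # After version-sorting both, the required version must come first
    # (or be equal) for the check to pass.
    [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# Illustrative version numbers only.
if version_at_least "2.58.1" "2.13.0"; then
    echo "version ok"
else
    echo "version too old" >&2
fi
```

A plain string comparison would rank 2.9.0 above 2.13.0, which is why a version-aware sort is used here.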

ocrd__dumpjson

Output ocrd-tool.json content verbatim.

Requires $OCRD_TOOL_JSON and $OCRD_TOOL_NAME to be set:

export OCRD_TOOL_JSON=/path/to/ocrd-tool.json
export OCRD_TOOL_NAME=ocrd-foo-bar

(Both are set automatically by ocrd__wrap.)

ocrd__show_resource

Output given resource file's content.

ocrd__list_resources

Output all resource files' names.

ocrd__usage

Print help on CLI usage.

ocrd__parse_argv

Parses arguments according to the OCR-D CLI. In doing so, it may, depending on the values passed to it, delegate to …

Expects an associative array ("hash"/"dict") ocrd__argv to be predefined:

declare -A ocrd__argv=()

This will be filled by the parser along the following keys:

  • overwrite: whether --overwrite is enabled
  • profile: whether --profile is enabled
  • profile_file: the argument of --profile-file
  • log_level: the argument of --log-level
  • mets_file: absolute path of the --mets argument
  • working_dir: absolute path of the --working-dir argument or the parent of mets_file
  • page_id: the argument of --page-id
  • input_file_grp: the argument of --input-file-grp
  • output_file_grp: the argument of --output-file-grp

Moreover, there will be an associative array params with the fully expanded runtime values of the ocrd-tool.json parameters.
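
To make the resulting data structure concrete, here is a minimal sketch of how a script might read ocrd__argv after parsing. The function demo_fill_argv is a hypothetical stand-in that handles only three options and sets an assumed default log level; it is not the bashlib parser, which covers the full OCR-D CLI.

```shell
#!/usr/bin/env bash
declare -A ocrd__argv=()

# Hypothetical stand-in for ocrd__parse_argv; handles a small subset only.
demo_fill_argv() {
    ocrd__argv[overwrite]=false
    ocrd__argv[log_level]=INFO   # assumed default for this demo
    while (( $# > 0 )); do
        case "$1" in
            -I|--input-file-grp)  ocrd__argv[input_file_grp]="$2"; shift ;;
            -O|--output-file-grp) ocrd__argv[output_file_grp]="$2"; shift ;;
            --overwrite)          ocrd__argv[overwrite]=true ;;
            *) echo "unsupported option in this sketch: $1" >&2; return 1 ;;
        esac
        shift
    done
}

demo_fill_argv -I OCR-D-IMG -O OCR-D-BIN --overwrite

# Values are looked up by key, exactly as with the real array.
echo "input:  ${ocrd__argv[input_file_grp]}"
echo "output: ${ocrd__argv[output_file_grp]}"
if [[ "${ocrd__argv[overwrite]}" == true ]]; then
    echo "existing output may be overwritten"
fi
```

Associative arrays require bash 4 or later, which is why the array has to be declared with declare -A before the parser fills it.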

ocrd__wrap

Parses an ocrd-tool.json for a specific tool (i.e. processor executable).

Delegates to …

Usage: ocrd__wrap PATH/TO/OCRD-TOOL.JSON EXECUTABLE ARGS

For example:

ocrd__wrap $SHAREDIR/ocrd-tool.json ocrd-olena-binarize "$@"
...

ocrd__input_file

(Requires ocrd__wrap to have been run first.)

Access information on the input files according to the parsed CLI arguments:

  • their file url (or local file path)
  • their file ID
  • their mimetype
  • their pageId
  • their proposed corresponding outputFileId (generated from ${ocrd__argv[output_file_grp]} and input file ID)

Usage: ocrd__input_file NR KEY

For example:

pageId=`ocrd__input_file 3 pageId`

To be used in a loop over all selected pages:

# Assumes ocrd__wrap has run and that this loop sits inside a function (it uses `local`).
for ((n=0; n<${#ocrd__files[*]}; n++)); do
    # Arrays, so that multi-valued file groups yield one element per fileGrp
    local in_fpath=($(ocrd__input_file $n url))
    local in_id=($(ocrd__input_file $n ID))
    local in_mimetype=($(ocrd__input_file $n mimetype))
    local in_pageId=($(ocrd__input_file $n pageId))
    local out_id=$(ocrd__input_file $n outputFileId)
    local out_fpath="${ocrd__argv[output_file_grp]}/${out_id}.xml"

    # process $in_fpath to $out_fpath ...

    # Register the result file in the workspace
    declare -a options
    if [ -n "$in_pageId" ]; then
        options=( -g $in_pageId )
    else
        options=()
    fi
    if [[ "${ocrd__argv[overwrite]}" == true ]]; then
        options+=( --force )
    fi
    options+=( -G ${ocrd__argv[output_file_grp]}
               -m $MIMETYPE_PAGE -i "$out_id"
               "$out_fpath" )
    ocrd -l ${ocrd__argv[log_level]} workspace -d ${ocrd__argv[working_dir]} add "${options[@]}"
done

Note: If the --input-file-grp is multi-valued (N fileGrps separated by commas), then usage is similar:

  • The function ocrd__input_file can be used, but its results will be lists (delimited by whitespace and surrounded by single quotes), e.g. [url]='file1.xml file2.xml' [ID]='id_file1 id_file2' [mimetype]='application/vnd.prima.page+xml image/tiff' ....
  • Therefore its results should be encapsulated in a (non-associative) array variable and without extra quotes, e.g. in_file=($(ocrd__input_file 3 url)), or as shown above.
  • This will yield the first fileGrp's results on index 0, which in bash will always be the same as if you referenced the array without index (so code does not need to be changed much), e.g. test -f $in_file which equals test -f ${in_file[0]}.
  • Additional fileGrps will have to be fetched from higher indexes, e.g. test -f ${in_file[1]}.
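
The indexing behaviour described in this note can be tried without a workspace. In the sketch below, fake_input_file is a hypothetical stub that imitates the whitespace-delimited output for two fileGrps; the file names and IDs are invented.

```shell
#!/usr/bin/env bash
# Hypothetical stub standing in for ocrd__input_file with two fileGrps.
fake_input_file() {
    case "$2" in
        url) echo "file1.xml file2.tif" ;;
        ID)  echo "id_file1 id_file2" ;;
    esac
}

# Unquoted command substitution inside ( ) splits on whitespace,
# giving one array element per fileGrp.
in_file=($(fake_input_file 3 url))
in_id=($(fake_input_file 3 ID))

echo "number of fileGrps: ${#in_file[@]}"
echo "first (implicit index): $in_file"
echo "first (explicit index): ${in_file[0]}"
echo "second: ${in_file[1]} (ID ${in_id[1]})"
```

Because the split happens on whitespace, this pattern only works as long as paths and IDs contain no spaces.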

Testing

  • Download assets: make assets
  • Test with local files: make test
  • Test with remote assets: make test OCRD_BASEURL='https://github.com/OCR-D/assets/raw/master/data/'
