ocr-d / ocrd_kraken

Wrapper for the kraken OCR engine

License: Apache License 2.0

Python 93.62% Makefile 3.73% Dockerfile 2.65%

ocrd_kraken's People

wrznr kba mmreza79 mikegerber bertsky stweil

ocrd_kraken's Issues

Binarization: Created files have no file:GROUPID

ocrd-kraken-binarize --version
Version 0.0.1, ocrd/core 0.3.1

ocrd workspace find --file-grp OCR-D-KRAKEN-BIN --output-field ID
OCR-D-KRAKEN-BIN_0001
OCR-D-KRAKEN-BIN_0002

ocrd workspace find --file-grp OCR-D-KRAKEN-BIN --output-field groupId
None
None

Binarization creates 2 source files in target workspace

While binarizing an image into a new workspace, two TIFF files are created.
The filename of the image is not the filename from the given mets.xml!
It seems to be the filename of the METS file in the cache directory!?

(Source) files are stored in the root directory of the workspace and look like this:
file.path.to.old.workspace.filename
and
file.path.to.new.workspace.file.path.to.old.workspace.filename

The original file (OCR-D-IMG/filename) is missing in the new workspace!
(Inside METS is a reference to the first file mentioned above!)
Steps:

ocrd workspace validate
ocrd workspace clone -a -m mets.xml
cd /tmp/pyocrd-'xyz'
ocrd-kraken-binarize -w /new/target/dir

Binarization doesn't respect output group

Rename "kraken-segment" to "kraken-segment-line"

It is in fact only line segmentation!

ocrd-kraken-segment creates negative coordinates (=invalid PAGE)

Hi,

I have an example, where ocrd-kraken-segment creates negative coordinates (=invalid PAGE).
I just used:

ocrd resmgr download ocrd-kraken-segment blla.mlmodel
ocrd-kraken-segment -I <inputFileGrp> -O <outputFileGrp>

example.zip
As a result I see:

<pc:TextRegion id="region_line_36">
            <pc:Coords points="3040,382 3040,-2 3219,-2 3219,382 3216,575 3037,569"/>
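One possible stopgap for the negative coordinates shown above is to clamp every point to the page frame before writing the PAGE output. The following is a minimal sketch, not the processor's actual code: the helper names and the page size of 4000x5000 pixels are assumptions made for illustration, and plain clamping can distort a polygon (a proper fix would intersect it with the page rectangle).

```python
from typing import List, Tuple

Point = Tuple[int, int]


def parse_points(points: str) -> List[Point]:
    """Parse a PAGE-XML points string such as "3040,382 3040,-2"."""
    result = []
    for pair in points.split():
        x, y = pair.split(",")
        result.append((int(x), int(y)))
    return result


def clamp_points(polygon: List[Point], width: int, height: int) -> List[Point]:
    """Move every point into the range [0, width] x [0, height]."""
    return [(min(max(x, 0), width), min(max(y, 0), height)) for x, y in polygon]


def format_points(polygon: List[Point]) -> str:
    """Serialize points back into the PAGE-XML string form."""
    return " ".join(f"{x},{y}" for x, y in polygon)


# Coordinates taken from the report; the page size is a hypothetical value.
reported = "3040,382 3040,-2 3219,-2 3219,382 3216,575 3037,569"
fixed = format_points(clamp_points(parse_points(reported), width=4000, height=5000))
print(fixed)
```

With the assumed page size, only the two points with y = -2 change; they are moved to y = 0.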

kraken-binarize produces two identical output files per input file

When calling kraken binarization on a mets.xml like
ocrd process -m mets.xml kraken-binarize -I DEFAULT -O OCR-D-IMG-BIN
I obtain 2 identical binarized images per single input image, one residing in the CWD and another one in a directory OCR-D-IMG-BIN.

Possibly related to #10, #16?

recognize: word coordinates are often invalid

Currently, the _make_word approach (creating a Word with dummy coordinates first, then adding points glyph by glyph

ocrd_kraken/ocrd_kraken/recognize.py

Line 99 in 802c6b0

current_word.get_Coords().points += ' ' + points_from_polygon(poly)

and finally reducing them to a bbox when the next word starts,

ocrd_kraken/ocrd_kraken/recognize.py

Line 94 in 802c6b0

    
           current_word.get_Coords().points = points_from_bbox(*bbox_from_polygon(polygon_from_points(current_word.get_Coords().points.strip())))

IIUC) creates polygons which are semantically unsound, e.g. 141,1263 141,1343 141,1343 141,1263 (notice the same points repeating, so we actually have only 2 instead of 4 here).
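An alternative to appending points to a dummy coordinate string would be to collect the glyph polygons in a list and derive the word box only once the word is complete. The sketch below assumes exactly that; `rect_points` and `word_points` are hypothetical local helpers (not the `ocrd_utils` functions quoted above), and the glyph coordinates are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Polygon = List[Point]


def bbox_from_polygons(polygons: List[Polygon]) -> Tuple[int, int, int, int]:
    """Return (minx, miny, maxx, maxy) over all points of all polygons."""
    xs = [x for poly in polygons for x, _ in poly]
    ys = [y for poly in polygons for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def rect_points(minx: int, miny: int, maxx: int, maxy: int) -> str:
    """Serialize a box as four corner points in PAGE-XML string form."""
    return f"{minx},{miny} {maxx},{miny} {maxx},{maxy} {minx},{maxy}"


def word_points(glyph_polygons: List[Polygon]) -> str:
    """Build the word outline from its glyphs, rejecting degenerate boxes."""
    if not glyph_polygons:
        raise ValueError("word has no glyphs")
    minx, miny, maxx, maxy = bbox_from_polygons(glyph_polygons)
    if minx == maxx or miny == maxy:
        # This is the failure mode from the report: a box with zero width
        # or height collapses to two distinct points.
        raise ValueError("degenerate word box")
    return rect_points(minx, miny, maxx, maxy)


# Two hypothetical glyph outlines belonging to one word.
glyphs = [
    [(141, 1263), (141, 1343), (170, 1343), (170, 1263)],
    [(172, 1265), (172, 1340), (200, 1340), (200, 1265)],
]
print(word_points(glyphs))
```

Because no placeholder coordinates ever enter the computation, the resulting box depends only on real glyph points, and a zero-area result is reported instead of being written to PAGE.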

Restrict segmentation to print space

If a print space is set via the corresponding element it has to be respected for (block) segmentation.

documentation: README completeness, debug ocrd-tool.json

Please debug your ocrd-tool.json file.
I found some errors:

<report valid="false">
  <error>[tools.ocrd-kraken-binarize.input_file_grp] 'OCR-D-IMG' is not of type 'array'</error>
  <error>[tools.ocrd-kraken-binarize.output_file_grp] 'OCR-D-IMG-BIN' is not of type 'array'</error>
  <error>[tools.ocrd-kraken-binarize.parameters.level-of-operation] 'description' is a required property</error>
  <error>[tools.ocrd-kraken-segment] 'input_file_grp' is a required property</error>
  <error>[tools.ocrd-kraken-segment] 'output_file_grp' is a required property</error>
  <error>[tools.ocrd-kraken-segment.parameters.maxcolseps] 'description' is a required property</error>
  <error>[tools.ocrd-kraken-segment.parameters.scale] 'description' is a required property</error>
  <error>[tools.ocrd-kraken-segment.parameters.black_colseps] 'description' is a required property</error>
  <error>[tools.ocrd-kraken-segment.parameters.white_colseps] 'description' is a required property</error>
  <error>[tools.ocrd-kraken-ocr] 'input_file_grp' is a required property</error>
  <error>[tools.ocrd-kraken-ocr] 'output_file_grp' is a required property</error>
  <error>[tools.ocrd-kraken-ocr.parameters.lines-json.required] 'true' is not of type 'boolean'</error>
</report>

You can find the ocrd-tool.json documentation: https://ocr-d.github.io/ocrd_tool
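To illustrate the kinds of errors in the report, here is a small sketch of what a corrected tool entry might look like, together with a toy check that mirrors only the error types listed above. It is not the real `ocrd validate-ocrd-tool` schema validation. The parameter `type`, `default` and `description` values are placeholders assumed for the example.

```python
from typing import Any, Dict, List

# Hypothetical corrected entry: file groups are arrays, every parameter
# has a description, and "required" is a real boolean.
binarize_tool: Dict[str, Any] = {
    "executable": "ocrd-kraken-binarize",
    "input_file_grp": ["OCR-D-IMG"],
    "output_file_grp": ["OCR-D-IMG-BIN"],
    "parameters": {
        "level-of-operation": {
            "type": "string",
            "default": "page",
            "description": "placeholder description of the operation level",
        }
    },
}


def check_tool(tool: Dict[str, Any]) -> List[str]:
    """Return messages for the error types seen in the report (toy check)."""
    errors = []
    for key in ("input_file_grp", "output_file_grp"):
        if key not in tool:
            errors.append(f"'{key}' is a required property")
        elif not isinstance(tool[key], list):
            errors.append(f"[{key}] '{tool[key]}' is not of type 'array'")
    for name, spec in tool.get("parameters", {}).items():
        if "description" not in spec:
            errors.append(f"[parameters.{name}] 'description' is a required property")
        if "required" in spec and not isinstance(spec["required"], bool):
            errors.append(f"[parameters.{name}.required] is not of type 'boolean'")
    return errors


print(check_tool(binarize_tool))
```

An entry such as `{"input_file_grp": "OCR-D-IMG", "parameters": {"lines-json": {"required": "true"}}}` would trigger all four message types in this toy check.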

Please check your README file and complete it. An ideal README file looks like this:

# Name of application


## Introduction
...

## Installation
...

## Usage
...

## Testing
...

Thank you very much.

use multi-model recognition

Kraken offers "multi-script" (actually multi-model) prediction in one pass, so instead of a fixed model, we could run with multiple models and use the annotated language and script mappings to select per-segment (as in ocrd-tesserocr-recognize with xpath_model).

IIUC, that would entail using mm_rpred (instead of rpred) and passing lang/script to bounds['boxes'][...]['tags'] (or bounds['lines'][...]['tags'] with baseline segmentation) and a dict from lang/script to model names as the first arg.

Experiment with kraken's trainable reading order detection

License template not completely filled out

The LICENSE does not contain the year and name of the copyright owner:

   Copyright [yyyy] [name of copyright owner]

segmentation: add optional level=region (line segmentation only)

According to the OCR-D functional model, line segmentation takes place after block segmentation. It should therefore check for existing text blocks and restrict its area of operation to them.
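If line segmentation were run on a crop of each existing text block, the detected line coordinates would be relative to that crop and would need to be shifted back into page coordinates. The sketch below shows only this bookkeeping, under the assumption that the crop is the region's bounding box; the function names and all coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Polygon = List[Point]


def crop_box(region: Polygon) -> Tuple[int, int, int, int]:
    """Bounding box (minx, miny, maxx, maxy) used to crop the page image."""
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return min(xs), min(ys), max(xs), max(ys)


def to_page_coords(line: Polygon, region: Polygon) -> Polygon:
    """Shift a line polygon from crop-relative to page-relative coordinates."""
    x0, y0, _, _ = crop_box(region)
    return [(x + x0, y + y0) for x, y in line]


# Hypothetical text block on the page and a line found inside its crop.
region = [(100, 200), (900, 200), (900, 600), (100, 600)]
line_in_crop = [(0, 10), (800, 10), (800, 50), (0, 50)]
print(to_page_coords(line_in_crop, region))
```

Lines produced this way stay inside the block's bounding box by construction, which is the restriction the issue asks for.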

Default input_file_grp / output_file_grp

should be OCR-D-IMG and OCR-D-IMG-BIN-KRAKEN resp.

fallback to CPU if no GPU

It's unfortunate that Kraken itself requires the computing device for PyTorch to be selected in advance.

For practical purposes, workflows should try to use CUDA if available. That's why ocrd_detectron2 falls back to cpu.

This should be implemented (and then documented) here as well.
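A fallback of this kind could look roughly like the sketch below. The function name `select_device` and the device strings are assumptions for illustration; the only PyTorch call used is `torch.cuda.is_available()`, and the import is guarded so the helper also works where PyTorch is not installed.

```python
def select_device(requested: str = "cuda:0") -> str:
    """Return the requested device, or "cpu" if CUDA cannot be used."""
    if not requested.startswith("cuda"):
        # Nothing to check for CPU (or other non-CUDA) requests.
        return requested
    try:
        import torch
    except ImportError:
        # Without PyTorch there is certainly no usable CUDA device.
        return "cpu"
    return requested if torch.cuda.is_available() else "cpu"


print(select_device("cuda:0"))
```

The returned string could then be passed on wherever the device is currently configured, so that the same workflow runs on machines with and without a GPU.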

ocrd validate-ocrd-tool ocrd-tool.json for KRAKEN returns: ...
is not of type 'object'

Complete implementation of OCR with kraken

Text recognition is e.g. referenced in ocrd-tool.json but not in setup.py.

ocrd-kraken-ocr process call seems to be broken

Traceback (most recent call last):
  File "/home/j23d/.local/share/virtualenvs/ocrd_butler-o_KhKE38/lib/python3.6/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/j23d/projects/ocrd_butler/ocrd_butler/celery_utils.py", line 21, in __call__
    return TaskBase.__call__(self, *args, **kwargs)
  File "/home/j23d/.local/share/virtualenvs/ocrd_butler-o_KhKE38/lib/python3.6/site-packages/celery/app/trace.py", line 648, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/j23d/projects/ocrd_butler/ocrd_butler/execution/tasks.py", line 82, in create_task
    **kwargs)
  File "/home/j23d/.local/share/virtualenvs/ocrd_butler-o_KhKE38/lib/python3.6/site-packages/ocrd/processor/base.py", line 56, in run_processor
    processor.process()
  File "/home/j23d/.local/share/virtualenvs/ocrd_butler-o_KhKE38/lib/python3.6/site-packages/ocrd_kraken/ocr.py", line 44, in process
    content=bin_image_bytes.getvalue())
  File "/home/j23d/.local/share/virtualenvs/ocrd_butler-o_KhKE38/lib/python3.6/site-packages/ocrd/workspace.py", line 162, in add_file
    raise Exception("'content' was set but no 'local_filename'")
Exception: 'content' was set but no 'local_filename'

I suspect that the call is not up to date with the current version of OCR-D core.

Pass on pageID

ocrd-kraken-binarize is not passing on the pageID