ocrd_cis's Introduction

ocrd_cis

CIS OCR-D command line tools for the automatic post-correction of OCR-results.

Introduction

ocrd_cis contains different tools for the automatic post-correction of OCR results. It contains tools for the training, evaluation and execution of the post-correction. Most of the tools follow the OCR-D CLI conventions.

Additionally, there is a helper tool to align multiple OCR results, as well as an improved version of Ocropy that works with Python 3 and is also wrapped for OCR-D.

Installation

There are 2 ways to install the ocrd_cis tools:

  • normal packaging:
make install # or equally: pip install -U pip .

(Installs ocrd_cis including its Python dependencies from the current directory to the Python package directory.)

  • editable mode:
make install-devel # or equally: pip install -U pip -e .

(Installs ocrd_cis in editable mode, including its Python dependencies: the package is linked to the current directory, so changes to the source take effect without reinstalling.)

It is possible (and recommended) to install ocrd_cis in a custom user directory (instead of system-wide) by using virtualenv (or venv):

 # create venv:
 python3 -m venv venv-dir # where "venv-dir" could be any path name
 # enter venv in current shell:
 source venv-dir/bin/activate
 # install ocrd_cis:
 make install # or any other way (see above)
 # use ocrd_cis:
 ocrd-cis-ocropy-binarize ...
 # finally, leave venv:
 deactivate

Profiler

The post-correction depends on the language profiler and its language configurations to generate corrections for suspicious words. In order to use the post-correction, a profiler and the corresponding language configurations have to be present on the system. You can refer to our manuals and our lexical resources for more information.

If you use Docker, you can use the preinstalled profiler from within the Docker container. The profiler is installed to /apps/profiler and the language configurations are located in /etc/profiler/languages in the container image.

Usage

Most tools follow the OCR-D specifications (which makes them OCR-D processors), i.e. they accept the command-line options --input-file-grp, --output-file-grp, --page-id, --parameter, --mets and --log-level (each with an argument). Invoke a tool with --help to get its self-documentation.

Some of the processors (most notably the alignment tool) expect a comma-separated list of multiple input file groups or multiple output file groups.

The ocrd-tool.json contains a formal description of all the processors along with the parameter config file accepted by their --parameter argument.
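As an illustration of how this description can be read programmatically, the following minimal sketch extracts the parameter defaults of one processor. It assumes the usual OCR-D layout of a top-level "tools" object whose entries carry a "parameters" object. The embedded JSON is a hypothetical excerpt (its defaults are copied from the deskew parameter list in this document), not the real file.

```python
import json

# Hypothetical excerpt shaped like an ocrd-tool.json file; the real file
# ships with ocrd_cis and describes every processor.
SAMPLE_OCRD_TOOL = """
{
  "tools": {
    "ocrd-cis-ocropy-deskew": {
      "parameters": {
        "maxskew": {"type": "number", "default": 5.0},
        "level-of-operation": {"type": "string", "default": "region"}
      }
    }
  }
}
"""


def parameter_defaults(ocrd_tool, executable):
    """Return {parameter name: default} for one processor.

    Parameters without a "default" entry are omitted.
    """
    params = ocrd_tool["tools"][executable].get("parameters", {})
    return {name: spec["default"] for name, spec in params.items() if "default" in spec}


ocrd_tool = json.loads(SAMPLE_OCRD_TOOL)
defaults = parameter_defaults(ocrd_tool, "ocrd-cis-ocropy-deskew")
print(defaults)
```

To inspect the installed description instead, load the actual ocrd-tool.json with json.load and pass the result to the same function.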

ocrd-cis-postcorrect

This processor runs the post-correction using a pre-trained model. If additional support OCRs are to be used, models for these OCR steps are required, and the OCR steps must be executed and aligned beforehand (see the test script for an example).

There is a basic model trained on the OCR-D ground truth. It is installed alongside this module. You can get the model's install path using ocrd-cis-data -model (see below for a description of ocrd-cis-data). To use this model (or any other model), the model parameter in the configuration file must be set to the path of the model to use. Be aware that the models are trained with a specific maximum number of OCRs (usually 2) and that it is not possible to use more OCRs than the number used for training (it is possible to use fewer, though).
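The two rules in this paragraph can be sketched as follows: setting the model parameter in a configuration, and checking the number of OCRs against the number used for training. The helper names and the example path are hypothetical. Only the "model" key comes from the text above; any other keys a real configuration needs are not shown.

```python
import json


def with_model(parameters, model_path):
    """Return a copy of the configuration with "model" set to model_path."""
    updated = dict(parameters)
    updated["model"] = model_path
    return updated


def ocr_count_is_supported(n_used, n_trained=2):
    """A model trained with n_trained OCRs accepts at most that many (fewer is fine)."""
    return 1 <= n_used <= n_trained


# Hypothetical path; in practice use the output of `ocrd-cis-data -model`.
config = with_model({}, "/path/to/model.zip")
print(json.dumps(config, indent=2))
print(ocr_count_is_supported(1), ocr_count_is_supported(3))
```

The resulting JSON can be written to a file and passed to the processor via --parameter.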

Arguments:

  • --parameter path to configuration file
  • --input-file-grp name of the master-OCR file group
  • --output-file-grp name of the post-correction file group
  • --log-level set log level
  • --mets path to METS file in workspace

As mentioned above, in order to use the post-correction with input from multiple OCRs, some preprocessing steps are needed: first, the additional OCR recognition has to be done, and second, the multiple OCRs have to be aligned (you can also take a look at the function ocrd_cis_align in the tests). Assuming an original recognition as file group OCR1 on the segmented document of file group SEG, the following commands can be used:

ocrd-ocropus-recognize -I SEG -O OCR2 ... # additional OCR
ocrd-cis-align -I OCR1,OCR2 -O ALGN ... # align OCR1 and OCR2
ocrd-cis-postcorrect -I ALGN -O PC ... # post correction

ocrd-cis-align

Aligns tokens of multiple input file groups to one output file group. This processor is used to align the master OCR with any additional support OCRs. It accepts a comma-separated list of input file groups, which it aligns in order.

Arguments:

  • --parameter path to configuration file
  • --input-file-grp comma-separated list of the input file groups; the first input file group is the master OCR; if there is a ground truth (for evaluation), it must be the last file group in the list
  • --output-file-grp name of the file group for the aligned result
  • --log-level set log level
  • --mets path to METS file in workspace
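
The ordering rule for --input-file-grp (master OCR first, ground truth last) is easy to get wrong when the list is assembled by hand. A minimal sketch of a helper that builds the argument value follows; the function name is hypothetical and not part of ocrd_cis.

```python
def align_input_file_groups(master, supports=(), ground_truth=None):
    """Build the comma-separated value for --input-file-grp.

    The master OCR comes first, support OCRs follow in the given order,
    and the ground truth (if any) is always placed last.
    """
    groups = [master, *supports]
    if ground_truth is not None:
        groups.append(ground_truth)
    for group in groups:
        if not group or "," in group:
            raise ValueError(f"invalid file group name: {group!r}")
    return ",".join(groups)


input_arg = align_input_file_groups("OCR1", ["OCR2"], ground_truth="GT")
print(input_arg)
```

The returned string can be passed directly as the value of -I when invoking ocrd-cis-align.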

ocrd-cis-data

Helper tool to get the path of the installed data files. Usage: ocrd-cis-data [-h|-jar|-3gs|-model|-config]. The options print the path of the jar library (-jar), the default 3-grams language model file (-3gs), the pre-trained post-correction model (-model) or the default training configuration file (-config). This tool does not follow the OCR-D conventions.

Training

There is no dedicated training script provided. Models are trained using the Java implementation directly (check out the training test script for an example). Training a model requires a workspace containing one or more file groups consisting of aligned OCR and ground-truth documents (the last file group has to be the ground truth).

java -jar $(ocrd-cis-data -jar) \
  -c train \
  --input-file-grp OCR1,OCR2,GT \
  --log-level DEBUG \
  -m mets.xml \
  --parameter $(ocrd-cis-data -config)

Arguments:

  • --parameter path to configuration file
  • --input-file-grp name of the input file group to profile
  • --output-file-grp name of the output file group where the profile is stored
  • --log-level set log level
  • --mets path to METS file in the workspace

ocrd-cis-ocropy-train

The ocropy-train tool can be used to train LSTM models. It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages. Then a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.

ocrd-cis-ocropy-clip

The clip processor can be used to remove intrusions of neighbouring segments in regions / lines of a page. It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via AlternativeImage). (Use this to suppress separators and neighbouring text.)

ocrd-cis-ocropy-clip \
  -I OCR-D-SEG-REGION \
  -O OCR-D-SEG-REGION-CLIP \
  -p '{"level-of-operation": "region"}'

Available parameters are:

   "level-of-operation" [string - "region"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["region", "line"]
   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "min_fraction" [number - 0.7]
    share of foreground pixels that must be retained by the largest label

ocrd-cis-ocropy-resegment

The resegment processor can be used to remove overlap between neighbouring lines of a page. It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE. (Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)

ocrd-cis-ocropy-resegment \
  -I OCR-D-SEG-LINE \
  -O OCR-D-SEG-LINE-RES \
  -p '{"extend_margins": 3}'

Available parameters are:

   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level to segment textlines in ('region' abides by
    existing text region boundaries, 'page' optimises lines in the whole
    page once)
    Possible values: ["page", "region"]
   "method" [string - "lineest"]
    source for new line polygon candidates ('lineest' for line
    estimation, i.e. how Ocropy would have segmented text lines;
    'baseline' tries to re-polygonize from the baseline annotation;
    'ccomps' avoids crossing connected components by majority rule)
    Possible values: ["lineest", "baseline", "ccomps"]
   "dpi" [number - 0]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when zero or negative
   "min_fraction" [number - 0.75]
    share of foreground pixels that must be retained by the output
    polygons
   "extend_margins" [number - 3]
    number of pixels to extend the input polygons in all directions

ocrd-cis-ocropy-segment

The segment processor can be used to segment (pages or) regions of a page into (regions and) lines. It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE. (Does not detect tables.)

ocrd-cis-ocropy-segment \
  -I OCR-D-SEG-BLOCK \
  -O OCR-D-SEG-LINE \
  -p '{"level-of-operation": "page", "gap_height": 0.015}'

Available parameters are:

   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative; when disabled and no meta-data is
    found, 300 is assumed
   "level-of-operation" [string - "region"]
    PAGE XML hierarchy level to read images from and add elements to
    Possible values: ["page", "table", "region"]
   "maxcolseps" [number - 20]
    (when operating on the page/table level) maximum number of
    white/background column separators to detect, counted piece-wise
   "maxseps" [number - 20]
    (when operating on the page/table level) number of black/foreground
    column separators to detect (and suppress), counted piece-wise
   "maximages" [number - 10]
    (when operating on the page level) maximum number of black/foreground
    very large components to detect (and suppress), counted piece-wise
   "csminheight" [number - 4]
    (when operating on the page/table level) minimum height of
    white/background or black/foreground column separators in multiples
    of scale/capheight, counted piece-wise
   "hlminwidth" [number - 10]
    (when operating on the page/table level) minimum width of
    black/foreground horizontal separators in multiples of
    scale/capheight, counted piece-wise
   "gap_height" [number - 0.01]
    (when operating on the page/table level) largest minimum pixel
    average in the horizontal or vertical profiles (across the binarized
    image) to still be regarded as a gap during recursive X-Y cut from
    lines to regions; needs to be larger when more foreground noise is
    present, reduce to avoid mistaking text for noise
   "gap_width" [number - 1.5]
    (when operating on the page/table level) smallest width in multiples
    of scale/capheight of a valley in the horizontal or vertical
    profiles (across the binarized image) to still be regarded as a gap
    during recursive X-Y cut from lines to regions; needs to be smaller
    when more foreground noise is present, increase to avoid mistaking
    inter-line as paragraph gaps and inter-word as inter-column gaps
   "overwrite_order" [boolean - true]
    (when operating on the page/table level) remove any references for
    existing TextRegion elements within the top (page/table) reading
    order; otherwise append
   "overwrite_separators" [boolean - true]
    (when operating on the page/table level) remove any existing
    SeparatorRegion elements; otherwise append
   "overwrite_regions" [boolean - true]
    (when operating on the page/table level) remove any existing
    TextRegion elements; otherwise append
   "overwrite_lines" [boolean - true]
    (when operating on the region level) remove any existing TextLine
    elements; otherwise append
   "spread" [number - 2.4]
    distance in points (pt) from the foreground to project text line (or
    text region) labels into the background for polygonal contours; if
    zero, project half a scale/capheight

ocrd-cis-ocropy-deskew

The deskew processor can be used to deskew pages / regions of a page. It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE. (Does not include orientation detection.)

ocrd-cis-ocropy-deskew \
  -I OCR-D-SEG-LINE \
  -O OCR-D-SEG-LINE-DES \
  -p '{"level-of-operation": "page", "maxskew": 10}'

Available parameters are:

   "maxskew" [number - 5.0]
    modulus of maximum skewing angle to detect (larger will be slower, 0
    will deactivate deskewing)
   "level-of-operation" [string - "region"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["page", "region"]

ocrd-cis-ocropy-denoise

The denoise processor can be used to despeckle pages / regions / lines of a page. It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).

ocrd-cis-ocropy-denoise \
  -I OCR-D-SEG-LINE-DES \
  -O OCR-D-SEG-LINE-DEN \
  -p '{"noise_maxsize": 2}'

Available parameters are:

   "noise_maxsize" [number - 3.0]
    maximum size in points (pt) for connected components to regard as
    noise (0 will deactivate denoising)
   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["page", "region", "line"]
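
Several parameters in this document are given in points (pt) rather than pixels, for example noise_maxsize here and spread in the segment processor. Since one point is 1/72 inch, the pixel equivalent depends on the pixel density. The helper below is a plain unit conversion for orientation, not a copy of the processor's code.

```python
def points_to_pixels(points, dpi):
    """Convert a length in points (1 pt = 1/72 inch) to pixels at the given DPI."""
    return points * dpi / 72.0


# Default noise_maxsize of 3.0 pt at two pixel densities
for dpi in (300, 600):
    print(dpi, points_to_pixels(3.0, dpi))
```

This also shows why a correct DPI value matters: at a wrongly assumed density, the same setting in points corresponds to a different size in pixels.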

ocrd-cis-ocropy-binarize

The binarize processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page. It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.

ocrd-cis-ocropy-binarize \
  -I OCR-D-SEG-LINE-DES \
  -O OCR-D-SEG-LINE-BIN \
  -p '{"level-of-operation": "page", "threshold": 0.7}'

Available parameters are:

   "method" [string - "ocropy"]
    binarization method to use (only 'ocropy' will include deskewing and
    denoising)
    Possible values: ["none", "global", "otsu", "gauss-otsu", "ocropy"]
   "threshold" [number - 0.5]
    for the 'ocropy' and 'global' methods, black/white threshold to apply
    on the whitelevel normalized image (the larger the more/heavier
    foreground)
   "grayscale" [boolean - false]
    for the 'ocropy' method, produce grayscale-normalized instead of
    thresholded image
   "maxskew" [number - 0.0]
    modulus of maximum skewing angle (in degrees) to detect (larger will
    be slower, 0 will deactivate deskewing)
   "noise_maxsize" [number - 0]
    maximum pixel number for connected components to regard as noise (0
    will deactivate denoising)
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["page", "region", "line"]
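
To make the effect of "threshold" concrete, here is a toy sketch of a global threshold on an already whitelevel-normalized image (values between 0 for black and 1 for white). It only illustrates why a larger threshold yields more foreground; it does not implement the "nlbin" normalization itself, and the comparison direction is an assumption consistent with the parameter description.

```python
def apply_threshold(normalized, threshold=0.5):
    """Return rows of 0 (black foreground) and 1 (white background).

    Pixels brighter than the threshold become background; everything else
    counts as foreground.
    """
    return [[1 if value > threshold else 0 for value in row] for row in normalized]


def foreground_count(binary):
    """Count black (0) pixels."""
    return sum(row.count(0) for row in binary)


image = [
    [0.2, 0.6, 0.9],
    [0.4, 0.65, 1.0],
]
low = foreground_count(apply_threshold(image, 0.5))
high = foreground_count(apply_threshold(image, 0.7))
print(low, high)
```

With the default 0.5 only the two darkest pixels are foreground; at 0.7 the mid-gray pixels are added, which corresponds to heavier strokes in a real page.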

ocrd-cis-ocropy-dewarp

The dewarp processor can be used to vertically dewarp text lines of a page. It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).

ocrd-cis-ocropy-dewarp \
  -I OCR-D-SEG-LINE-BIN \
  -O OCR-D-SEG-LINE-DEW \
  -p '{"range": 5}'

Available parameters are:

   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "range" [number - 4.0]
    maximum vertical disposition or maximum margin (will be multiplied by
    mean centerline deltas to yield pixels)
   "max_neighbour" [number - 0.05]
    maximum rate of foreground pixels intruding from neighbouring lines
    (line will not be processed above that)

ocrd-cis-ocropy-recognize

The recognize processor can be used to recognize the lines / words / glyphs of a page. It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.

ocrd-cis-ocropy-recognize \
  -I OCR-D-SEG-LINE-DEW \
  -O OCR-D-OCR-OCRO \
  -p '{"textequiv_level": "word", "model": "fraktur-jze.pyrnn"}'

Available parameters are:

   "textequiv_level" [string - "line"]
    PAGE XML hierarchy level granularity to add the TextEquiv results to
    Possible values: ["line", "word", "glyph"]
   "model" [string]
    ocropy model to apply (e.g. fraktur.pyrnn)

Tesserocr

Install essential system packages for Tesserocr

sudo apt-get install python3-tk \
  tesseract-ocr libtesseract-dev libleptonica-dev \
  libimage-exiftool-perl libxml2-utils

Then install Tesserocr from: https://github.com/OCR-D/ocrd_tesserocr

pip install -r requirements.txt
pip install .

Download the Tesseract models from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files (or use your own models) and place them into /usr/share/tesseract-ocr/4.00/tessdata.

Workflow configuration

A decent pipeline might look like this:

  1. image normalization/optimization
  2. page-level binarization
  3. page-level cropping
  4. (page-level binarization)
  5. (page-level despeckling)
  6. page-level deskewing
  7. (page-level dewarping)
  8. region segmentation, possibly subdivided into
    1. text/non-text separation
    2. text region segmentation (and classification)
    3. reading order detection
    4. non-text region classification
  9. region-level clipping
  10. (region-level deskewing)
  11. line segmentation
  12. (line-level clipping or resegmentation)
  13. line-level dewarping
  14. line-level recognition
  15. (line-level alignment and post-correction)

If GT is used, then cropping/segmentation steps can be omitted.

If a segmentation is used which does not produce overlapping segments, then clipping/resegmentation can be omitted.
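
A pipeline like the one above is a chain in which every step reads the file group written by the previous step. The following sketch builds such a chain of command lines for a subset of the ocrd_cis processors. The output file group names and the selection of steps are hypothetical, and the commands are only printed, not executed.

```python
import json
import shlex


def chain_processors(first_input, steps):
    """Build one argv list per step, feeding each output group into the next step."""
    commands = []
    current_group = first_input
    for executable, output_group, params in steps:
        command = [executable, "-I", current_group, "-O", output_group]
        if params:
            command += ["-p", json.dumps(params)]
        commands.append(command)
        current_group = output_group
    return commands


# Hypothetical selection of steps and file group names
STEPS = [
    ("ocrd-cis-ocropy-binarize", "BIN", {"level-of-operation": "page"}),
    ("ocrd-cis-ocropy-deskew", "DESKEW", {"level-of-operation": "page"}),
    ("ocrd-cis-ocropy-segment", "SEG", {"level-of-operation": "page"}),
    ("ocrd-cis-ocropy-dewarp", "DEWARP", None),
    ("ocrd-cis-ocropy-recognize", "OCR", {"model": "fraktur.pyrnn"}),
]

commands = chain_processors("OCR-D-IMG", STEPS)
for command in commands:
    print(shlex.join(command))
```

Each printed line can be run in a workspace directory; passing the lists to subprocess.run would execute them in order.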

Testing

To run a few basic tests, type make test (ocrd_cis has to be installed in order to run any tests).

Miscellaneous

OCR-D workspace

  • Create a new (empty) workspace: ocrd workspace -d workspace-dir init
  • cd into workspace-dir
  • Add new file to workspace: ocrd workspace add file -G group -i id -m mimetype -g pageId

OCR-D links

ocrd_cis's People

Contributors

bertsky, chris-j-weber, cneud, finkf, kba, stweil, sulzbals, tenglmeier

ocrd_cis's Issues

Segmentation takes hours for a single newspaper page

While running QuiVer benchmark tests, the segmentation of a single newspaper page takes several hours.
It is still unfinished after 3:30 hours.

Benchmark protocol:

Launching `/app/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/selected_pages_ocr.txt.nf` [nice_kare] DSL2 - revision: 8ad3dbf42c
[...]
executor >  local (6)
[6c/1653c5] process > ocrd_cis_ocropy_binarize_0 [100%] 1 of 1 ✔
[51/f7d705] process > ocrd_tesserocr_crop_1      [100%] 1 of 1 ✔
[0e/64d9d0] process > ocrd_skimage_binarize_2    [100%] 1 of 1 ✔
[d1/127ed7] process > ocrd_skimage_denoise_3     [100%] 1 of 1 ✔
[80/9b6f02] process > ocrd_tesserocr_deskew_4    [100%] 1 of 1 ✔
[8f/17f8eb] process > ocrd_cis_ocropy_segment_5  [  0%] 0 of 1
[-        ] process > ocrd_cis_ocropy_dewarp_6   -
[-        ] process > ocrd_calamari_recognize_7  -

Task protocol:

04:04:15.301 INFO processor.OcropySegment - INPUT FILE 0 / P_1879_45_0344
04:04:17.330 INFO processor.OcropySegment - computing line segmentation for page "OCR-D-BIN-DENOISE-DESKEW_1879_45_0344"
04:04:17.330 ERROR processor.OcropySegment - Cannot line-segment page "OCR-D-BIN-DENOISE-DESKEW_1879_45_0344": image too wide for a page image (7086, 10777)
04:04:17.335 INFO processor.OcropySegment - created file ID: OCR-D-SEG_1879_45_0344, file_grp: OCR-D-SEG, path: OCR-D-SEG/OCR-D-SEG_1879_45_0344.xml
04:04:17.335 INFO processor.OcropySegment - INPUT FILE 1 / P_1885_5_0055
04:04:19.555 INFO processor.OcropySegment - computing line segmentation for page "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
[...]
07:42:36.641 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.321 WARNING processor.OcropySegment - Label 188 contour 1 is too small (131/19460) in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.395 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.772 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.773 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:39.257 WARNING processor.OcropyResegment - baseline part component crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:39.356 INFO processor.OcropySegment - Added region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055_region0500" with 34 lines for page "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"

Document dependency on jq

jq is used for JSON parsing in the shell scripts. README should explain how to install (sudo apt-get install jq) and those scripts that depend on it should fail early if jq isn't in $PATH.
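
The fail-early behaviour requested here amounts to checking for the executable before doing any work. The sketch below shows the idea in Python with shutil.which; the affected scripts are shell scripts, so they would need an equivalent check in shell. The function names are hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def require_tools(names):
    """Abort with a clear message before any work is done if a tool is missing."""
    missing = missing_tools(names)
    if missing:
        raise SystemExit("missing required tools: " + ", ".join(missing))


# Typical use at the very start of a script:
#     require_tools(["jq"])
print(missing_tools(["surely-not-an-installed-tool-xyz"]))
```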

wrap scale estimation as separate processor for DPI estimation

It would be useful to have a dedicated processor for DPI estimation in OCR-D. That's because we cannot rely on DPI metadata, although we need to. (Most Ocropy segmentation steps now zoom in on the annotated DPI value in order to forego the 300 DPI assumption. This situation is likely similar with other modules.)

Tesseract already has such a functionality, which is based on its internal line segmentation: first the average scale gets estimated, then it gets multiplied by a constant to yield the DPI. This is based on the assumption that xheight is more or less homogeneous across the page. (Which it is not!) But Tesseract's API does not export that estimation, and does not give access to the TO_BLOCK_LIST which holds the average line_size.

So it's probably best to use ocrolib.psegutils.estimate_scale for this in the same fashion.

But since we know that pages can have widely varying font sizes, we should look at scales more locally, and then find a better statistic than just median to give us the mean xheight of a 12pt text line.

This could be achieved as follows: in estimate_scale, we add an option to look at the np.histogram of blob sizes (square root of box areas for connected components), trying to filter out both the tiny boxes originating from noise and the huge boxes from headings and drop-caps. Then we use that in a dedicated processor ocrd-cis-ocropy-estimate-density, multiplying the estimated scale with a configurable constant factor (which defaults e.g. to 10) to yield the DPI estimation. We annotate this in PAGE-XML under PcGts/Page/@imageXResolution and PcGts/Page/@imageYResolution with PcGts/Page/@imageResolutionUnit="PPI". A future OcrdExif in core can then use that information to override the EXIF data found in the binary image.
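
A rough sketch of the proposed estimation follows. It is not the existing estimate_scale: instead of a histogram, it uses a simpler window around the median blob size to drop noise and headings, then averages what remains and multiplies by the constant factor. The window bounds (0.5 and 2.0) and the function name are assumptions made for illustration; the default factor of 10 is the example value from the text.

```python
import math
import statistics


def estimate_dpi(boxes, factor=10.0, min_ratio=0.5, max_ratio=2.0):
    """Estimate DPI from connected-component bounding boxes given as (width, height).

    Blob size is the square root of the box area. Sizes far below the median
    (noise) or far above it (headings, drop-caps) are discarded before averaging.
    """
    sizes = [math.sqrt(w * h) for w, h in boxes if w > 0 and h > 0]
    if not sizes:
        raise ValueError("no usable boxes")
    median = statistics.median(sizes)
    kept = [s for s in sizes if min_ratio * median <= s <= max_ratio * median]
    scale = statistics.fmean(kept)
    return scale * factor


# Two noise specks, eight ordinary glyph boxes, one heading-sized box
boxes = [(2, 2)] * 2 + [(30, 30)] * 8 + [(120, 120)]
estimate = estimate_dpi(boxes)
print(estimate)
```

In a dedicated processor, the resulting value would be what gets annotated in the PAGE resolution attributes named above.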

Segment crashes

ocrd_cis/ocropy/common.py:643: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  sepslices = np.array(sepslices)
15:55:07.138 INFO processor.OcropySegment - Found 170 text lines for page "SBB-CROP_Ansiedlung_Korotschin_UZS_Sign_22a_0003"
15:56:49.378 INFO processor.OcropySegment - Found 84 text regions for page "SBB-CROP_Ansiedlung_Korotschin_UZS_Sign_22a_0003"
15:56:55.435 WARNING processor.OcropySegment - Label 1 contour 1 is too small (157/4808) in region "SBB-CROP_Ansiedlung_Korotschin_UZS_Sign_22a_0003"
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/ocrd-cis-ocropy-segment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_segment())
  File "click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "ocrd_cis/ocropy/cli.py", line 53, in ocrd_cis_ocropy_segment
    return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
  File "ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "ocrd/processor/helpers.py", line 107, in run_processor
    processor.process()
  File "ocrd_cis/ocropy/segment.py", line 406, in process
    input_file.pageId, zoom, rogroup=rogroup)
  File "ocrd_cis/ocropy/segment.py", line 680, in _process_element
    min_area=640/zoom/zoom)
  File "ocrd_cis/ocropy/segment.py", line 232, in masks2polygons
    for baseline in baselines], name)
  File "ocrd_cis/ocropy/segment.py", line 232, in <listcomp>
    for baseline in baselines], name)
  File "shapely/geometry/base.py", line 582, in intersection
    return shapely.intersection(self, other, grid_size=grid_size)
  File "shapely/decorators.py", line 77, in wrapped
    return func(*args, **kwargs)
  File "shapely/set_operations.py", line 133, in intersection
    return lib.intersection(a, b, **kwargs)
FloatingPointError: invalid value encountered in intersection

ocrd-cis-align producing unexpected xml

When using ocrd-cis-align together with ocrd-dinglehopper, I noticed some unexpected behavior in the XML generated by ocrd-cis-align.

See qurator-spk/dinglehopper#37 for the way that led me here.

Here is the minimal workflow to reproduce the problem (also in the attached zip file as workflow.sh).

ocrd workspace init
ocrd workspace set-id "OCR-D-CIS-ALIGN-BUG"

ocrd workspace add --file-grp OCR-D-IMG --file-id OCR-D-IMG_f0001 --mimetype image/jpg --page-id PAGE_0001 OCR-D-IMG/FILE_0001.jpg

ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola -P k 0.2
ocrd-cis-ocropy-segment -I OCR-D-BIN  -O OCR-D-SEG-REG -P level-of-operation page
ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-OCR-TESS1 -P textequiv_level word
ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-OCR-TESS2 -P textequiv_level word
ocrd-cis-align -I OCR-D-OCR-TESS1,OCR-D-OCR-TESS2 -O OCR-D-ALIGN

I added a minimal example on how to reproduce the problem using docker and the attached data:

 docker run --rm -it -v ${WORKSPACE}:/data -w /data -- ocrd/all:maximum bash workflow.sh

The unexpected part is that the information for the text line from OCR-D-OCR-TESS2 is split into two XML nodes:

<pc:TextEquiv index="2" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000"/>
<pc:TextEquiv conf="0.639380130767822" dataType="ocrd-cis-line-alignment" dataTypeDetails="OCR-D-OCR-TESS2/OCR-D-BIN_f0001_region0001_line0000">
  <pc:Unicode>上оrеm টрsum</pc:Unicode>
</pc:TextEquiv>

Invalid license?

The current code uses python-Levenshtein which uses GNU General Public License v2 or later.

ocrd-cis-post-correct.sh is not OCR-D ready

There seems to be a mix-up of input and output groups.
ocrd-cis-post-correct.sh -m mets.xml -p anyFile -I OCR-D-GT-SEG-BLOCK,OCR-D-GT-SEG-PAGE -O newOutputGorup
Exception: Invalid input/output file grps:
Output fileGrp[@use='OCR-D-GT-SEG-BLOCK'] already in METS!

ocropy segment does not handle input files properly

ocrd-cis-ocropy-segment does not segment any image files.

Looking at the code in ocrd_cis/ocropy/segment.py, line 220 looks wrong:

for (n, input_file) in enumerate(self.input_files):

Shouldn't this be:

for (n, input_file) in enumerate(self.workspace.mets.find_files(fileGrp=self.input_file_grp)):

Line 113 also seems weird:

if hasattr(self, 'output_file_grp'):
  try:
    self.output_file_grp, self.image_file_grp = self.output_file_grp.split(',')
  except ValueError:
    self.image_file_grp = FALLBACK_FILEGRP_IMG
    LOG.info("No output file group for images specified, falling back to '%s'", FALLBACK_FILEGRP_IMG)

Core dump in ocrd-cis-ocropy-segment

I just (22.03.2022) updated ocrd_all.
After this update, I get an error (core dump) in ocrd-cis-ocropy-segment, like this:

15:29:41.938 INFO processor.OcropySegment - INPUT FILE 0 / P_4074_007817778_00001
15:29:42.644 INFO processor.OcropySegment - Page "OCR-D-REG-DESKEW-4074_007817778_00001" uses 200.000000 DPI
15:29:42.718 INFO processor.OcropySegment - computing line segmentation for region "TR-1"
Segmentation fault (core dumped)

Here I have attached the workspace to reproduce:
ocrd-cis-ocrocy-segment-error.zip

ocropy: add parameters to override DPI meta-data

While waiting for OCR-D/core#376 as a general mechanism in core (sanctioned by spec) at some point in the future (3.0 milestone), we need a quick solution for Ocropy segmentation and recognition now (final workshop): override the meta-data density value with a parameter exposed to the user (as a last resort).

AssertionError from add_baseline(geom)

I have found an issue for a specific page but I am not sure what exactly the problem is other than that the page seems empty:

20:43:59.071 DEBUG ocrd.processor.helpers.run_processor - Running processor <class 'ocrd_cis.ocropy.segment.OcropySegment'>
20:43:59.071 DEBUG ocrd.processor.helpers.run_processor - Processor instance <ocrd_cis.ocropy.segment.OcropySegment object at 0x7f3d684b9220> (ocrd-cis-ocropy-segment v0.1.5 doing layout/segmentation/region)
20:43:59.072 DEBUG ocrd.mets_client[/tmp/ocrd_network_sockets/_vd18_data_PPN831977752_513pages_mets_xml.sock] - find_files({'mimetype': None, 'page_id': 'PHYS_0510', 'file_grp': 'OCR-D-CLIP'})
20:43:59.246 DEBUG ocrd.processor.base - adding file FILE_0510_OCR-D-CLIP for page PHYS_0510 to input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0000.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0001.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0004.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0007.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0008.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0009.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0010.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.processor.base - another file FILE_0510_OCR-D-CLIP_region0011.IMG-CLIP for page PHYS_0510 in input file group OCR-D-CLIP
20:43:59.246 DEBUG ocrd.workspace.download_file - 'local_filename' OCR-D-CLIP/FILE_0510_OCR-D-CLIP.xml already within /vd18_data/PPN831977752_513pages - nothing to do
20:43:59.249 DEBUG ocrd.mets_client[/tmp/ocrd_network_sockets/_vd18_data_PPN831977752_513pages_mets_xml.sock] - find_files({'local_filename': 'DEFAULT/FILE_0510_DEFAULT.jpg'})
20:43:59.706 DEBUG ocrd.mets_client[/tmp/ocrd_network_sockets/_vd18_data_PPN831977752_513pages_mets_xml.sock] - find_files({'local_filename': 'DEFAULT/FILE_0510_DEFAULT.jpg'})
20:44:00.485 DEBUG ocrd.workspace.image_from_page - page 'FILE_0510_OCR-D-CLIP' has border, orientation=0 skew=0.00
20:44:00.485 DEBUG ocrd.workspace.image_from_page - Using AlternativeImage 5 {'', 'deskewed', 'cropped', 'binarized', 'clipped', 'despeckled'} for page 'FILE_0510_OCR-D-CLIP'
20:44:00.485 DEBUG ocrd.mets_client[/tmp/ocrd_network_sockets/_vd18_data_PPN831977752_513pages_mets_xml.sock] - find_files({'local_filename': 'OCR-D-SEG-BLOCK-TESSERACT/FILE_0510_OCR-D-SEG-BLOCK-TESSERACT.IMG-BIN.png'})
20:44:01.273 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [-46 -49]
20:44:01.273 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [-359.  -618.5]
20:44:01.273 DEBUG ocrd.utils.coords.rotate_coordinates - rotating coordinates by 0.00° around [359.  618.5]
20:44:01.273 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [359.  618.5]
20:44:01.274 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [0 0]
20:44:01.277 DEBUG ocrd.utils.crop_image - cropping image to (346, 828, 381, 863)
20:44:01.278 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [-346 -828]
20:44:01.279 DEBUG ocrd.workspace.image_from_segment - segment 'region0000' has orientation=0 skew=0.01
20:44:01.279 DEBUG ocrd.workspace.image_from_segment - Using AlternativeImage 1 {'despeckled', 'clipped', 'binarized'} for segment 'region0000'
20:44:01.279 DEBUG ocrd.mets_client[/tmp/ocrd_network_sockets/_vd18_data_PPN831977752_513pages_mets_xml.sock] - find_files({'local_filename': 'OCR-D-CLIP/FILE_0510_OCR-D-CLIP_region0000.IMG-CLIP.png'})
20:44:02.078 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [-17.5 -17.5]
20:44:02.078 DEBUG ocrd.utils.coords.rotate_coordinates - rotating coordinates by 0.01° around [17.5 17.5]
20:44:02.079 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [17.50371747 17.50371747]
20:44:02.079 DEBUG ocrd.workspace.image_from_segment - Rotating AlternativeImage for segment 'region0000' by 0.01°
20:44:02.079 DEBUG ocrd.utils.rotate_image - rotating image by 0.01°
20:44:02.079 DEBUG ocrd.workspace.image_from_segment - Recropping AlternativeImage for segment 'region0000'
20:44:02.080 DEBUG ocrd.utils.crop_image - cropping image to (0, 0, 35, 35)
20:44:02.080 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [0 0]
20:44:02.083 DEBUG ocrd.utils.crop_image - cropping image to (261, 435, 367, 845)
20:44:02.085 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [-261 -435]
20:44:02.085 DEBUG ocrd.workspace.image_from_segment - segment 'region0001' has orientation=0 skew=0.01
20:44:02.085 DEBUG ocrd.workspace.image_from_segment - Using AlternativeImage 1 {'despeckled', 'clipped', 'binarized'} for segment 'region0001'
20:44:02.085 DEBUG ocrd.mets_client[/tmp/ocrd_network_sockets/_vd18_data_PPN831977752_513pages_mets_xml.sock] - find_files({'local_filename': 'OCR-D-CLIP/FILE_0510_OCR-D-CLIP_region0001.IMG-CLIP.png'})
20:44:02.835 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [ -53. -205.]
20:44:02.836 DEBUG ocrd.utils.coords.rotate_coordinates - rotating coordinates by 0.01° around [ 53. 205.]
20:44:02.836 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [ 53.04355096 205.0112552 ]
20:44:02.836 DEBUG ocrd.workspace.image_from_segment - Rotating AlternativeImage for segment 'region0001' by 0.01°
20:44:02.836 DEBUG ocrd.utils.rotate_image - rotating image by 0.01°
20:44:02.837 DEBUG ocrd.workspace.image_from_segment - Recropping AlternativeImage for segment 'region0001'
20:44:02.839 DEBUG ocrd.utils.crop_image - cropping image to (0, 0, 106, 410)
20:44:02.839 DEBUG ocrd.utils.coords.shift_coordinates - shifting coordinates by [0 0]
20:44:03.055 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-cis-ocropy-segment'
Traceback (most recent call last):
  File "/home/mm/repos/core/build/__editable__.ocrd-2.65.0-py3-none-any/ocrd/processor/helpers.py", line 130, in run_processor
    processor.process()
  File "/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 500, in process
    self._process_element(region, ignore, region_image, region_coords,
  File "/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 788, in _process_element
    line_polygons, _ = masks2polygons(line_labels, baselines, element_bin,
  File "/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 234, in masks2polygons
    base = join_baselines([baseline.intersection(polygon)
  File "/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 959, in join_baselines
    add_baseline(geom)
  File "/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 951, in add_baseline
    assert all(p1[0] < p2[0] for p1, p2 in zip(result[:-1], result[1:])), result
AssertionError: [(52.0, 277.0), (74.5, 279.5), (77.875, 279.875), (88.0, 281.0), (89.0, 275.0), (89.0, 281.0), (90.0, 275.0), (90.0, 281.0), (91.0, 274.0), (91.0, 282.0), (92.0, 274.0), (92.0, 282.0), (93.0, 274.0), (93.0, 282.0), (94.0, 273.0), (94.0, 282.0), (95.0, 273.0), (95.0, 282.0), (96.0, 273.0), (96.0, 283.0), (97.0, 273.0), (97.0, 283.0), (98.0, 273.0), (98.0, 283.0), (99.0, 273.0), (99.0, 283.0), (100.0, 272.0), (100.0, 283.0), (101.0, 272.0), (101.0, 283.0), (102.0, 272.0), (102.0, 283.0), (103.0, 272.0), (103.0, 283.0), (104.0, 272.0), (104.0, 283.0), (105.0, 272.0)]
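
The assertion requires the x coordinates of the joined baseline to be strictly increasing. A small sketch (plain Python, helper names are hypothetical) that locates the first violation in a point list like the one printed above, and one possible mitigation that drops offending points. Dropping points is only an illustration and may not be the fix appropriate for segment.py, since it keeps the first y value seen for each x.

```python
def first_non_increasing(points):
    """Index of the first point whose x is not greater than its predecessor's, or None."""
    for i, (p1, p2) in enumerate(zip(points[:-1], points[1:]), start=1):
        if not p1[0] < p2[0]:
            return i
    return None


def keep_strictly_increasing(points):
    """Drop every point whose x does not exceed the last kept x."""
    result = []
    for point in points:
        if not result or point[0] > result[-1][0]:
            result.append(point)
    return result


# A few consecutive points copied from the AssertionError message.
sample = [(77.875, 279.875), (88.0, 281.0), (89.0, 275.0), (89.0, 281.0), (90.0, 275.0)]
print(first_non_increasing(sample))
print(keep_strictly_increasing(sample))
```

In the reported list, pairs such as (89.0, 275.0) and (89.0, 281.0) share the same x, which is what trips the check.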

The workflow used:

cis-ocropy-binarize      -I DEFAULT                   -O OCR-D-BINPAGE             -P dpi 300
anybaseocr-crop          -I OCR-D-BINPAGE             -O OCR-D-SEG-PAGE-ANYOCR     -P dpi 300
cis-ocropy-denoise       -I OCR-D-SEG-PAGE-ANYOCR     -O OCR-D-DENOISE-OCROPY      -P dpi 300
cis-ocropy-deskew        -I OCR-D-DENOISE-OCROPY      -O OCR-D-DESKEW-OCROPY       -P level-of-operation page
tesserocr-segment-region -I OCR-D-DESKEW-OCROPY       -O OCR-D-SEG-BLOCK-TESSERACT -P dpi 300 -P padding 5.0  -P find_tables false
segment-repair           -I OCR-D-SEG-BLOCK-TESSERACT -O OCR-D-SEGMENT-REPAIR      -P plausibilize true       -P plausibilize_merge_min_overlap 0.7
cis-ocropy-clip          -I OCR-D-SEGMENT-REPAIR      -O OCR-D-CLIP
cis-ocropy-segment       -I OCR-D-CLIP                -O OCR-D-SEGMENT-OCROPY      -P dpi 300
cis-ocropy-dewarp        -I OCR-D-SEGMENT-OCROPY      -O OCR-D-DEWARP
tesserocr-recognize      -I OCR-D-DEWARP              -O OCR-D-OCR                 -P model Fraktur

Here is the problematic image of page 510:
FILE_0510_DEFAULT

It is worth mentioning that other similar pages did not fail. E.g. pages 508 and 509:

FILE_0508_DEFAULT
FILE_0509_DEFAULT

no correction with ocrd-cis-postcorrect

I'm running ocrd-cis-postcorrect on the aligned OCR output of Calamari and Tesserocr. So far, the output seems to be completely identical to the input, even though there are quite a few differences between the results of the two OCR engines. See e.g. the attached example.
postcorrect.zip

How can I achieve some correction results?

Latest release is broken with Shapely 2.0.x

Traceback (most recent call last):
  File "/usr/local/share/pyenv/versions/3.7.16/bin/ocrd-cis-ocropy-segment", line 11, in <module>
    load_entry_point('ocrd-cis==0.1.5', 'console_scripts', 'ocrd-cis-ocropy-segment')()
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/pkg_resources/__init__.py", line 490, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2862, in load_entry_point
    return ep.load()
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2462, in load
    return self.resolve()
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2468, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd_cis/ocropy/cli.py", line 11, in <module>
    from ocrd_cis.ocropy.segment import OcropySegment
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd_cis/ocropy/segment.py", line 8, in <module>
    from shapely.geometry import Polygon, asPolygon
ImportError: cannot import name 'asPolygon' from 'shapely.geometry' (/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/shapely/geometry/__init__.py)

This is 0.1.5, but the current git version also appears to be affected.
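
The asPolygon adapter was deprecated in Shapely 1.8 and removed in 2.0, while the plain Polygon constructor exists in both. A hedged sketch of a version-tolerant lookup follows; the helper name polygon_constructor is hypothetical, and the geometry module is passed in so the example runs without Shapely installed (fake_shapely2 is a stand-in, not a real module).

```python
from types import SimpleNamespace


def polygon_constructor(geometry_module):
    """Pick a polygon constructor that works across Shapely versions."""
    # Polygon is available in Shapely 1.x and 2.x, so prefer it.
    constructor = getattr(geometry_module, "Polygon", None)
    if constructor is None:
        # Only stubbed or unusual modules would end up here.
        constructor = getattr(geometry_module, "asPolygon", None)
    if constructor is None:
        raise ImportError("no polygon constructor found in %r" % (geometry_module,))
    return constructor


# Stand-in for shapely.geometry under Shapely 2.x (asPolygon is absent).
fake_shapely2 = SimpleNamespace(Polygon=lambda coords: ("Polygon", list(coords)))
make_polygon = polygon_constructor(fake_shapely2)
print(make_polygon([(0, 0), (1, 0), (1, 1)]))
```

With the real library this would be called as polygon_constructor(shapely.geometry). In practice, simply replacing asPolygon(x) with Polygon(x) at the call sites is probably sufficient; pinning shapely below 2.0 would be a stopgap.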

Column segmentation failure

workflow:

ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"cis-ocropy-segment -I OCR-D-N5 -O OCR-D-N6 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region" \
"cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8 -P level-of-operation region" \
"cis-ocropy-dewarp -I OCR-D-N8 -O OCR-D-N9" \
"calamari-recognize -I OCR-D-N9 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json"
  • 0001 is not segmented into 3 columns, although the vertical separator line has no breaks after wolf binarization.
  • 0017 does not segment correctly when binarized, but does with the original (color) image.
  • 0042 does not segment correctly.

TIFFs:
0001
0017
0042

OcropyResegment: ValueError: A LinearRing must have at least 3 coordinate tuples

Log:

12:34:29.920 INFO processor.OcropyResegment - INPUT FILE 0 / P_00001
12:34:30.312 INFO processor.OcropyResegment - Page "OCR-D-N11_00001" uses 300.000000 DPI
12:34:30.321 WARNING processor.OcropyResegment - Page "OCR-D-N11_00001" region "region0000" contains only one line
...
12:34:39.847 WARNING ocrd_utils.crop_image - crop coordinates ((-1, 47, 1090, 97)) exceed image (1090x188)
12:34:39.881 WARNING processor.OcropyResegment - Largest label (2) largest contour (0) is small (0/53946) in line "region0004_line0001"
Traceback (most recent call last):
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/bin/ocrd-cis-ocropy-resegment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_resegment())
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/cli.py", line 38, in ocrd_cis_ocropy_resegment
    return ocrd_cli_wrap_processor(OcropyResegment, *args, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 81, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/resegment.py", line 234, in process
    extend_margins=margin, threshold_relative=threshold)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/resegment.py", line 107, in resegment
    polygon = Polygon(polygon).simplify(2).exterior.coords[:-1] # keep open
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/shapely/geometry/polygon.py", line 243, in __init__
    ret = geos_polygon_from_py(shell, holes)
  File "/beegfs/home/hd/hd_hd/hd_wu120/ocrd_all/venv/lib/python3.7/site-packages/shapely/geometry/polygon.py", line 509, in geos_polygon_from_py
    ret = geos_linearring_from_py(shell)
  File "shapely/speedups/_speedups.pyx", line 252, in shapely.speedups._speedups.geos_linearring_from_py
ValueError: A LinearRing must have at least 3 coordinate tuples
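
The preceding warning reports a largest contour of size 0, so the polygon handed to Shapely presumably had fewer than 3 points. A plain-Python sketch of a guard that could run before constructing the polygon; the helper names are hypothetical, and the check is slightly stricter than Shapely's, since it also discards consecutive duplicates and a closing point.

```python
def ring_coordinates(points):
    """Return the points without consecutive duplicates and without a closing duplicate."""
    cleaned = []
    for point in points:
        point = tuple(point)
        if not cleaned or point != cleaned[-1]:
            cleaned.append(point)
    if len(cleaned) > 1 and cleaned[0] == cleaned[-1]:
        cleaned.pop()
    return cleaned


def can_form_polygon(points):
    """True if at least 3 distinct consecutive points remain."""
    return len(ring_coordinates(points)) >= 3
```

A caller could skip the line (and log a warning) when can_form_polygon returns False, instead of letting the constructor raise.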

Workflow:

#!/bin/bash
#SBATCH --partition=single
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=20gb
cd /beegfs/work/ws/hd_wu120-ubhd-ocrd-0/dobbert1861/14.tif >ocrd.log 2>&1 # start ocrd.log afresh
. $HOME/.bashrc >>ocrd.log 2>&1
. $HOME/ocrd_all/venv/bin/activate >>ocrd.log 2>&1
/usr/bin/time ocrd-create-mets.xml >>ocrd.log 2>&1
/usr/bin/time ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf >>ocrd.log 2>&1
/usr/bin/time ocrd-anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 >>ocrd.log 2>&1
/usr/bin/time ocrd-olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page >>ocrd.log 2>&1
/usr/bin/time ocrd-tesserocr-segment-region -I OCR-D-N5 -O OCR-D-N6 >>ocrd.log 2>&1
/usr/bin/time ocrd-segment-repair -I OCR-D-N6 -O OCR-D-N7 -P plausibilize true >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8 -P level-of-operation region >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-deskew -I OCR-D-N8 -O OCR-D-N9 -P level-of-operation region >>ocrd.log 2>&1
/usr/bin/time ocrd-tesserocr-segment-line -I OCR-D-N9 -O OCR-D-N10 >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-clip -I OCR-D-N10 -O OCR-D-N11 -P level-of-operation line >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-resegment -I OCR-D-N11 -O OCR-D-N12 >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-dewarp -I OCR-D-N12 -O OCR-D-N13 >>ocrd.log 2>&1
/usr/bin/time ocrd-calamari-recognize -I OCR-D-N13 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/GT4HistOCR/*.ckpt.json" >>ocrd.log 2>&1

Image:
https://digi.ub.uni-heidelberg.de/diglitData/v/ocrd/dobbert1861_-_14.tif

  • git log of ocrd_all: commit 39a946255f8276a01b17055b300f1ddbc267187a
  • git log of ocrd_all/ocrd_cis: commit 5ec0e34
(venv) [hd_XXX@login5 ocrd_all]$ ocrd-cis-ocropy-resegment --version
Version 0.1.5, ocrd/core 2.19.0

object.__setattr__ related errors "AttributeError: can't set attribute"

ocrd-cis-ocropy-segment currently raises AttributeErrors in some regions. Shapely seems to misbehave. The same error is discussed here: shapely/shapely#1207

The workflow I've used follows this sequence of processors: https://ocr-d.de/en/workflows#best-results-for-selected-pages

I am using the input images from this repository: https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/b22282d5-a206-4def-9021-7302199f7326/data/mangoldt_unternehmergewinn_1855.ocrd.zip

Screenshot-from-2022-02-28-09-35-49

remove unconditional pylab imports

It looks like the package python3-tk should be added to the list of required packages. Without it, I got this error:

(venv-20200906) $ ocrd-cis-ocropy-denoise --help
Traceback (most recent call last):
  File "/usr/lib/python3.5/tkinter/__init__.py", line 36, in <module>
    import _tkinter
ImportError: No module named '_tkinter'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv-20200906/bin/ocrd-cis-ocropy-denoise", line 5, in <module>
    from ocrd_cis.ocropy.cli import ocrd_cis_ocropy_denoise
  File "/venv-20200906/lib/python3.5/site-packages/ocrd_cis/ocropy/cli.py", line 4, in <module>
    from ocrd_cis.ocropy.binarize import OcropyBinarize
  File "/venv-20200906/lib/python3.5/site-packages/ocrd_cis/ocropy/binarize.py", line 25, in <module>
    from . import common
  File "/venv-20200906/lib/python3.5/site-packages/ocrd_cis/ocropy/common.py", line 12, in <module>
    from . import ocrolib
  File "/venv-20200906/lib/python3.5/site-packages/ocrd_cis/ocropy/ocrolib/__init__.py", line 11, in <module>
    from . import default, common
  File "/venv-20200906/lib/python3.5/site-packages/ocrd_cis/ocropy/ocrolib/common.py", line 22, in <module>
    from pylab import (clf, cm, ginput, gray, imshow, ion, subplot,
  File "/venv-20200906/lib/python3.5/site-packages/pylab.py", line 1, in <module>
    from matplotlib.pylab import *
  File "/venv-20200906/lib/python3.5/site-packages/matplotlib/pylab.py", line 245, in <module>
    from matplotlib import cbook, mlab, pyplot as plt
  File "/venv-20200906/lib/python3.5/site-packages/matplotlib/pyplot.py", line 2372, in <module>
    switch_backend(rcParams["backend"])
  File "/venv-20200906/lib/python3.5/site-packages/matplotlib/pyplot.py", line 207, in switch_backend
    backend_mod = importlib.import_module(backend_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/venv-20200906/lib/python3.5/site-packages/matplotlib/backends/backend_tkagg.py", line 1, in <module>
    from . import _backend_tk
  File "/venv-20200906/lib/python3.5/site-packages/matplotlib/backends/_backend_tk.py", line 5, in <module>
    import tkinter as Tk
  File "/usr/lib/python3.5/tkinter/__init__.py", line 38, in <module>
    raise ImportError(str(msg) + ', please install the python3-tk package')
ImportError: No module named '_tkinter', please install the python3-tk package

@stweil Thanks for the report. I am surprised we did not detect this earlier. It is another example of bad ocrolib packaging – these show/plot functions are not needed usually, so the pylab import should not be unconditional (but function-local).

I don't think we should allow dragging in python3-tk (which in turn requires X11 libs).
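
A sketch of the function-local import pattern mentioned above. The function names are hypothetical and this is not the actual ocrolib code; it only illustrates that the plotting dependency is resolved when the debug helper is called, not when the module is imported.

```python
def count_foreground(rows):
    """Pure computation: needs no plotting library."""
    return sum(sum(1 for value in row if value) for row in rows)


def show_image(rows):
    """Debug helper: imports matplotlib only when actually called."""
    # Deferred import, so importing this module does not pull in a GUI backend.
    import matplotlib.pyplot as plt
    plt.imshow(rows, cmap="gray")
    plt.show()


print(count_foreground([[0, 1], [1, 1]]))
```

With this layout, a command such as --help never touches matplotlib or Tk; only an explicit call to the debug helper does.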

Originally posted in OCR-D/ocrd_all#184 (comment)

TopologicalError: GEOSIntersection_r could not be performed

Environment

  • Version: included in Docker Image ocrd/all from 2020-08-04 (docker image id: 158ea3d64eae)

Current Behavior:

When executing something like: docker run --rm -u "40366" -w /data -v "/home/aqayv/project/ulb-it-migration/WORKSPACE_OCR/203074":/data -v /usr/share/tesseract-ocr/4.00/tessdata:/usr/local/share/tessdata/ ocrd/all:2020-08-04 ocrd-make -f ulb-ocrd-vd18-02.mk .:

make: Entering directory '/data'
make -R -C . -I /data/ -f /data/ulb-ocrd-vd18-02.mk  2>&1 | tee ..ulb-ocrd-vd18-02.log
make[1]: Entering directory '/data'
building OCR-D-SEGMENT-OCROPY from OCR-D-CLIP with pattern rule for ocrd-cis-ocropy-segment
STAMP=`test -e OCR-D-SEGMENT-OCROPY && date -Ins -r OCR-D-SEGMENT-OCROPY`; ocrd-cis-ocropy-segment   -I OCR-D-CLIP -p OCR-D-SEGMENT-OCROPY.json -O OCR-D-SEGMENT-OCROPY --overwrite 2>&1 | tee OCR-D-SEGMENT-OCROPY.log && touch -c OCR-D-SEGMENT-OCROPY || { if test -z "$STAMP"; then rm -fr OCR-D-SEGMENT-OCROPY; else touch -c -d "$STAMP" OCR-D-SEGMENT-OCROPY; fi; false; }
05:42:29.063 WARNING matplotlib - Matplotlib created a temporary config/cache directory at /.config/matplotlib because the default path (/tmp/matplotlib-ib2pg3_l) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
05:42:39.158 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 238 1073 at 238 1073
Traceback (most recent call last):
  File "/usr/bin/ocrd-cis-ocropy-segment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_segment())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 54, in ocrd_cis_ocropy_segment
    return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/processor/base.py", line 61, in run_processor
    processor.process()
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 387, in process
    region.id, file_id + '_' + region.id, zoom)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 653, in _process_element
    line_polygon = polygon_for_parent(line_polygon, element)
  File "/usr/lib/python3.6/site-packages/ocrd_cis/ocropy/segment.py", line 676, in polygon_for_parent
    interp = childp.intersection(parentp)
  File "/usr/lib/python3.6/site-packages/shapely/geometry/base.py", line 649, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/usr/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/usr/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7fca99544160>
Makefile:320: recipe for target 'OCR-D-SEGMENT-OCROPY' failed
make[1]: *** [OCR-D-SEGMENT-OCROPY] Error 1
make[1]: Leaving directory '/data'
make: *** [.] Error 2
Makefile:205: recipe for target '.' failed
make: Leaving directory '/data'

Expected Behavior:

Please do not crash; log an error and move on gracefully.
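
A minimal sketch of the requested behavior, assuming a per-region loop. The names process_regions and process_one are hypothetical stand-ins, not the processor's real API, and the demo callback merely simulates a geometry failure.

```python
import logging

LOG = logging.getLogger("sketch.segment")


def process_regions(regions, process_one):
    """Apply process_one to each (id, region) pair; log failures instead of aborting.

    Returns the list of region ids that failed.
    """
    failed = []
    for region_id, region in regions:
        try:
            process_one(region)
        except Exception as err:  # deliberately broad for the sketch
            LOG.error("cannot process region '%s': %s", region_id, err)
            failed.append(region_id)
    return failed


def demo_process(value):
    """Stand-in for the real work: fails on negative input."""
    if value < 0:
        raise ValueError("invalid geometry")


print(process_regions([("r0", 1), ("r1", -1), ("r2", 2)], demo_process))
```

In real code it would be preferable to catch the specific exception (here shapely.errors.TopologicalError) rather than Exception, so that unrelated bugs still surface.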

2020-09-10-bug-203074.zip

Core dump in ocrd-cis-ocropy-denoise

With an update of ocrd_all from 22.03.22 I get this core dump:

ocrd-cis-ocropy-denoise -I OCR-D-BIN-REG -O OCR-D-BIN-REG-DENOISE -P level-of-operation region
08:45:41.815 INFO processor.OcropyDenoise - INPUT FILE 0 / P_4074_007817778_00001
08:45:42.493 INFO processor.OcropyDenoise - Page "OCR-D-BIN-REG-4074_007817778_00001" uses 200.000000 DPI
08:45:42.564 INFO processor.OcropyDenoise - About to despeckle 'OCR-D-BIN-REG-DENOISE-4074_007817778_00001_TR-1'
Segmentation fault (core dumped)

Maybe this is related to #89

Problem with installed pillow version

I do rm -rf venv && python3 -m venv venv from inside ocrd_cis. Then source venv/bin/activate && pip install --upgrade pip -e .

pip freeze | grep -i Pillow produces: Pillow==5.4.1

When I run ocrd-cis-ocropy-binarize I get the following error:

Traceback (most recent call last):
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (Pillow 5.4.1 (/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib/python3.6/site-packages), Requirement.parse('pillow>=6.2.0'), {'ocrd-cis'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/bin/ocrd-cis-ocropy-binarize", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 3251, in <module>
    @_call_aside
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 3235, in _call_aside
    f(*args, **kwargs)
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 3264, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 598, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/nfs/datb/histocrdata/ocrd/experiments/ocrd_cis/venv/lib64/python3.6/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'pillow>=6.2.0' distribution was not found and is required by ocrd-cis

Line detection during dewarping is defective

ocrd-cis-ocropy-dewarp frequently complains about finding more than one text line in a line segment:

09:11:12.258 INFO processor.OcropyDewarp - About to dewarp page 'Page1' region 'Page1_Block16' line 'Page1_Block16_line0007'
09:11:12.273 ERROR processor.OcropyDewarp - cannot dewarp line "Page1_Block16_line0007": found more than 1 textline, most likely from bad cropping

The corresponding lines however look perfectly fine:
FILE_0011_ARESEG-IMG_Page1_Block16_Page1_Block16_line0007

@bertsky Is it possible to adjust the internal heuristics to be more lax here?

ocrd-cis-ocropy-recognize: 'ascii' codec can't decode byte 0xa9

models:

> find . -name *.pyrnn|xargs md5sum
bb90b17321987002afa6b94e650d16fa  ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur.pyrnn
ef3238cd60cb1c35ede74573c8d14766  ./venv/lib/python3.6/site-packages/ocrd_cis/ocropy/models/fraktur-jze.pyrnn

file: https://digi.ub.uni-heidelberg.de/diglitData/jb/ocropy-test.jpg

command:

> ocrd-make -f crop-anyocr-binarize-page-olena-sauvola-denoise-ocropy-deskew-page-ocropy-segment-tesseract-ocropy-dewarp-ocr-ocropy-tesseract.mk
make: Entering directory '/home/jb/workspace/ocrd/ocrd4dwork'
building OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP from OCR-D-SEG-LINE-tesseract-ocropy-DEWARP with pattern rule for ocrd-cis-ocropy-recognize
ocrd workspace remove-group -r OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP 2>/dev/null || true
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE-tesseract-ocropy-DEWARP -O OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP -p OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json 2>&1 | tee OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.log && touch -c OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP || { rm -fr OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP.json OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP; exit 1; }
16:39:06.634 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-LINE-tesseract-ocropy-DEWARP'] output_file_grp=['OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP']
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd-cis-ocropy-recognize", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_recognize())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 49, in ocrd_cis_ocropy_recognize
    return ocrd_cli_wrap_processor(OcropyRecognize, *args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/recognize.py", line 134, in process
    self.network = load_object(self.get_model(), verbose=1)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/ocrolib/common.py", line 459, in load_object
    return unpickler.load()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa9 in position 0: ordinal not in range(128)
Makefile:304: recipe for target 'OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP' failed
make: *** [OCR-D-OCR-OCRO-fraktur-SEG-LINE-tesseract-ocropy-DEWARP] Error 1
make: Leaving directory '/home/jb/workspace/ocrd/ocrd4dwork'
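
One common cause of this exact message is a pickle written by Python 2 being loaded under Python 3, where 8-bit strings are decoded as ASCII by default. Whether that applies to the model files above is an assumption; the report does not confirm it. The sketch below reproduces the error with a tiny hand-written protocol-0 pickle (not a real model) and shows the encoding argument of the standard pickle module that avoids it.

```python
import io
import pickle

# Protocol-0 pickle in the style Python 2 writes for the byte string '\xa9'.
PY2_PICKLE = b"S'\\xa9'\np0\n."


def default_load_fails(data):
    """True if pickle.loads with its default ASCII decoding rejects the data."""
    try:
        pickle.loads(data)
    except UnicodeDecodeError:
        return True
    return False


def load_py2_pickle(stream):
    """Load a Python 2 pickle, decoding 8-bit strings as Latin-1."""
    return pickle.load(stream, encoding="latin1")


print(default_load_fails(PY2_PICKLE))
print(repr(load_py2_pickle(io.BytesIO(PY2_PICKLE))))
```

pickle.Unpickler accepts the same encoding keyword, so a loader built around an Unpickler instance could pass it there.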

Got exception using ocrd-cis-ocropy-resegment with method 'ccomps'

I got the following exception (with loglevel 'trace') using ocrd-cis-ocropy-resegment with method 'ccomps':

15:16:17.857 INFO processor.OcropyResegment - Page "OCR-D-REG-DESKEW-4074_007817778_00001" uses 200.000000 DPI
15:16:17.908 DEBUG ocrd_utils.crop_image - cropping image to (1966, 595, 2151, 682)
15:16:17.926 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-1966  -595]
15:16:17.926 DEBUG ocrd.workspace.image_from_segment - segment 'TR-1' has orientation=0 skew=0.00
15:16:17.927 DEBUG ocrd.workspace.image_from_segment - Using AlternativeImage 3 {'', 'verticallinesremoved', 'binarized', 'deskewed'} for segment 'TR-1'
15:16:17.928 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-REG-VL ID=OCR-D-REG-VL-4074_007817778_00001_TR-1.IMG-DESKEW, mimetype=image/png, url=OCR-D-REG-VL/OCR-D-REG-VL_4074_007817778_00001_TR-1.IMG-DESKEW.png, local_filename=OCR-D-REG-VL/OCR-D-REG-VL_4074_007817778_00001_TR-1.IMG-DESKEW.png]/>  [_recursion_count=0]
15:16:17.929 DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
15:16:17.929 DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 977
15:16:17.930 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-92.5 -43.5]
15:16:17.930 DEBUG ocrd_utils.coords.rotate_coordinates - rotating coordinates by 0.00° around [92.5 43.5]
15:16:17.931 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [92.5 43.5]
15:16:17.931 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [0 0]
15:16:17.940 DEBUG processor.OcropyResegment - unmasking area of text region "TR-1" for "TR-1"
15:16:17.947 DEBUG processor.OcropyResegment - calculating connected component and distance transforms for "TR-1"
15:16:17.948 DEBUG processor.OcropyResegment - estimated scale: 34
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd_all/venv/bin/ocrd-cis-ocropy-resegment", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_resegment())
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/cli.py", line 38, in ocrd_cis_ocropy_resegment
    return ocrd_cli_wrap_processor(OcropyResegment, *args, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators/__init__.py", line 88, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/helpers.py", line 88, in run_processor
    processor.process()
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/resegment.py", line 166, in process
    self._process_segment(region, region_image, region_coords, page_id, zoom, lines, ignore)
  File "/home/ocrdadmin/ocrd_all/venv/lib/python3.6/site-packages/ocrd_cis/ocropy/resegment.py", line 259, in _process_segment
    distances[i] = distances[i] / distances[i].max() * 255
FloatingPointError: invalid value encountered in true_divide
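
For readers trying to reproduce this: NumPy raises FloatingPointError only when its floating-point error handling is set to "raise"; with default settings the same operation merely emits a RuntimeWarning. The sketch below assumes (this is a guess, not confirmed by the log) that the distance map for the region was all zero, so the normalization in the last traceback frame became 0/0. The function name normalize_distances is hypothetical and is not part of ocrd_cis.

```python
import numpy as np


def normalize_distances(dist):
    """Scale a non-negative distance map to the range [0, 255].

    Hypothetical helper: returns zeros instead of dividing when the
    map has no positive maximum, which avoids the 0/0 case.
    """
    peak = dist.max()
    if peak <= 0:
        return np.zeros_like(dist, dtype=float)
    return dist / peak * 255


# Reproduce the failure mode under "raise" error handling.
with np.errstate(all="raise"):
    empty = np.zeros((3, 4))
    try:
        empty / empty.max() * 255  # 0/0 is an invalid operation
    except FloatingPointError as err:
        message = str(err)
    # The guarded version handles the same input without raising.
    safe = normalize_distances(empty)
    scaled = normalize_distances(np.array([0.0, 2.0, 4.0]))
```

Whether returning zeros is the right fallback for a region without usable components is a design question for the processor; skipping the segment might be equally reasonable.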

The PAGE looks like this:

  <pc:Page imageFilename="OCR-D-IMG/4074_007817778_00001.tif" imageWidth="4619" imageHeight="3312" orientation="0.">
    <pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_4074_007817778_00001.IMG-BIN.png" comments=",binarized" />
    <pc:TextRegion id="TR-1" orientation="0.">
      <pc:AlternativeImage filename="OCR-D-BIN-REG/OCR-D-BIN-REG-4074_007817778_00001_TR-1.IMG-BIN.png" comments=",binarized" />
      <pc:AlternativeImage filename="OCR-D-REG-DESKEW/OCR-D-REG-DESKEW-4074_007817778_00001_TR-1.IMG-DESKEW.png" comments=",binarized,deskewed" />
      <pc:AlternativeImage filename="OCR-D-REG-VL/OCR-D-REG-VL_4074_007817778_00001_TR-1.IMG-DESKEW.png" comments=",binarized,deskewed,verticallinesremoved" />
      <pc:Coords points="1966,595 1966,682 2151,682 2151,595" />
      <pc:TextLine id="TR-1_line0001">
        <pc:Coords points="1966,595 1966,682 2151,682 2151,595" />
        <pc:TextEquiv>
          <pc:Unicode>1889</pc:Unicode>
        </pc:TextEquiv>
      </pc:TextLine>

Please clarify ...

ValueError: No PAGE-XML for page 'OCR-D-010_00001' in fileGrp 'OCR-D-011' but multiple matches.
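
One way to check what this message is complaining about is to look at which files the METS assigns to each page in the affected fileGrp. The sketch below uses only the Python standard library and assumes the usual OCR-D METS layout (mets:fileGrp with a USE attribute, a physical structMap with page divs and mets:fptr entries) and the PAGE-XML MIME type application/vnd.prima.page+xml. The function names and the sample IDs are hypothetical, and reading the error as "several files for the page, none of them PAGE-XML" is my interpretation of the message, not something confirmed by the log.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}
PAGE_MIME = "application/vnd.prima.page+xml"


def files_per_page(mets_xml, file_grp):
    """Map page ID -> list of (file ID, MIME type) for one fileGrp."""
    root = ET.fromstring(mets_xml)
    mime_by_id = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        if grp.get("USE") == file_grp:
            for entry in grp.iterfind("mets:file", NS):
                mime_by_id[entry.get("ID")] = entry.get("MIMETYPE")
    pages = defaultdict(list)
    for smap in root.iterfind(".//mets:structMap", NS):
        if smap.get("TYPE") != "PHYSICAL":
            continue
        for div in smap.iter("{%s}div" % METS_NS):
            if div.get("TYPE") != "page":
                continue
            for fptr in div.iterfind("mets:fptr", NS):
                file_id = fptr.get("FILEID")
                if file_id in mime_by_id:
                    pages[div.get("ID")].append((file_id, mime_by_id[file_id]))
    return dict(pages)


def pages_without_page_xml(mets_xml, file_grp):
    """Pages that have several files in the group but no PAGE-XML among them."""
    return sorted(
        page
        for page, files in files_per_page(mets_xml, file_grp).items()
        if len(files) > 1 and all(mime != PAGE_MIME for _, mime in files)
    )


# Tiny made-up METS: two derived images for one page, no PAGE-XML file.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-011">
      <mets:file ID="IMG_A" MIMETYPE="image/png"/>
      <mets:file ID="IMG_B" MIMETYPE="image/png"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="IMG_A"/>
        <mets:fptr FILEID="IMG_B"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

suspect_pages = pages_without_page_xml(SAMPLE, "OCR-D-011")
```

On a real workspace, the content of mets.xml would be read from disk and passed in place of SAMPLE.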

Today I pulled the latest updates for sbb-textline, ran ~/local/bin/git -C core pull origin master, and rebuilt with make NO_UPDATE=1 all.

Now, with this non-sbb-textline workflow:

/usr/bin/time ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-001 -P impl wolf >>ocrd.log 2>&1
/usr/bin/time ocrd-tesserocr-crop -I OCR-D-001 -O OCR-D-002 >>ocrd.log 2>&1
/usr/bin/time ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 -P impl wolf >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-denoise -I OCR-D-003 -O OCR-D-004 >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-deskew -I OCR-D-004 -O OCR-D-005 -P level-of-operation page >>ocrd.log 2>&1
/usr/bin/time ocrd-tesserocr-segment-region -I OCR-D-005 -O OCR-D-006 ; ocrd-segment-repair -I OCR-D-006 -O OCR-D-007 -P plausibilize true >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-clip -I OCR-D-007 -O OCR-D-008 -P level-of-operation region >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-deskew -I OCR-D-008 -O OCR-D-009 -P level-of-operation region >>ocrd.log 2>&1
/usr/bin/time ocrd-tesserocr-segment-line -I OCR-D-009 -O OCR-D-010 >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-resegment -I OCR-D-010 -O OCR-D-011 >>ocrd.log 2>&1
/usr/bin/time ocrd-cis-ocropy-dewarp -I OCR-D-011 -O OCR-D-012 >>ocrd.log 2>&1
/usr/bin/time ocrd-calamari-recognize -I OCR-D-012 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/GT4HistOCR/*.ckpt.json" >>ocrd.log 2>&1

I get this error message:

Fr 6. Nov 13:27:10 CET 2020
13:27:12.486 INFO ocrd.resolver.workspace_from_nothing - Writing METS to 
/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml
13:27:12.489 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif
13:27:14.571 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
1.58user 0.63system 0:03.89elapsed 56%CPU (0avgtext+0avgdata 55828maxresident)k
153099inputs+24outputs (516major+43632minor)pagefaults 0swaps
13:27:29.512 INFO ocrd-olena-binarize - processing image/tiff input file OCR-D-IMG_00001 (P_00001)
13:27:37.486 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
13:27:40.702 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
10.97user 3.51system 0:26.09elapsed 55%CPU (0avgtext+0avgdata 249236maxresident)k
1206448inputs+628outputs (3717major+489510minor)pagefaults 0swaps
13:27:42.631 INFO processor.TesserocrCrop - INPUT FILE 0 / P_00001
13:27:43.389 INFO processor.TesserocrCrop - Page 'P_00001' images will use 400 DPI from image meta-data
13:27:44.890 INFO processor.TesserocrCrop - Updated page border: 1454:2657,264:314
13:27:44.892 INFO processor.TesserocrCrop - Updated page border: 170:2657,264:330
13:27:44.892 INFO processor.TesserocrCrop - Updated page border: 104:2657,264:388
13:27:44.893 INFO processor.TesserocrCrop - Updated page border: 104:2657,264:388
13:27:44.894 INFO processor.TesserocrCrop - Updated page border: 104:2657,264:388
13:27:44.894 INFO processor.TesserocrCrop - Updated page border: 104:2657,264:447
13:27:44.895 INFO processor.TesserocrCrop - Updated page border: 103:2657,264:506
13:27:44.896 INFO processor.TesserocrCrop - Updated page border: 103:2661,264:529
13:27:44.897 INFO processor.TesserocrCrop - Updated page border: 103:2661,264:564
13:27:44.897 INFO processor.TesserocrCrop - Updated page border: 103:2661,264:618
13:27:44.898 INFO processor.TesserocrCrop - Updated page border: 103:2661,264:618
13:27:44.899 INFO processor.TesserocrCrop - Updated page border: 103:2662,264:645
13:27:44.900 INFO processor.TesserocrCrop - Updated page border: 103:2662,264:680
13:27:44.900 INFO processor.TesserocrCrop - Updated page border: 103:2662,264:740
13:27:44.901 INFO processor.TesserocrCrop - Updated page border: 103:2662,264:740
13:27:44.902 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:764
13:27:44.903 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:798
13:27:44.903 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:821
13:27:44.904 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:856
13:27:44.904 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:898
13:27:44.905 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:898
13:27:44.905 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:898
13:27:44.906 INFO processor.TesserocrCrop - Updated page border: 103:2665,264:913
13:27:44.907 INFO processor.TesserocrCrop - Updated page border: 103:2666,264:938
13:27:44.907 INFO processor.TesserocrCrop - Updated page border: 103:2666,264:968
13:27:44.908 INFO processor.TesserocrCrop - Updated page border: 103:2666,264:1031
13:27:44.909 INFO processor.TesserocrCrop - Updated page border: 103:2667,264:1031
13:27:44.909 INFO processor.TesserocrCrop - Updated page border: 103:2667,264:1056
13:27:44.910 INFO processor.TesserocrCrop - Updated page border: 103:2667,264:1056
13:27:44.910 INFO processor.TesserocrCrop - Updated page border: 103:2667,264:1090
13:27:44.911 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1117
13:27:44.912 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1148
13:27:44.912 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1176
13:27:44.913 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1208
13:27:44.913 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1266
13:27:44.914 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1327
13:27:44.914 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1327
13:27:44.915 INFO processor.TesserocrCrop - Updated page border: 103:2668,264:1388
13:27:44.916 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1388
13:27:44.916 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1446
13:27:44.917 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1446
13:27:44.917 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1492
13:27:44.918 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1507
13:27:44.918 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1563
13:27:44.919 INFO processor.TesserocrCrop - Updated page border: 103:2669,264:1622
13:27:44.920 INFO processor.TesserocrCrop - Updated page border: 103:2671,264:1638
13:27:44.920 INFO processor.TesserocrCrop - Updated page border: 103:2671,264:1679
13:27:44.921 INFO processor.TesserocrCrop - Updated page border: 103:2671,264:1698
13:27:44.921 INFO processor.TesserocrCrop - Updated page border: 103:2671,264:1739
13:27:44.922 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1758
13:27:44.923 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1798
13:27:44.923 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1818
13:27:44.924 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1854
13:27:44.924 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1875
13:27:44.925 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1875
13:27:44.925 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1915
13:27:44.926 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1932
13:27:44.926 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:1974
13:27:44.927 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2003
13:27:44.928 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2026
13:27:44.928 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2085
13:27:44.929 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2085
13:27:44.929 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2108
13:27:44.930 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2142
13:27:44.931 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2206
13:27:44.931 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2230
13:27:44.932 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2265
13:27:44.932 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2289
13:27:44.932 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2289
13:27:44.933 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2289
13:27:44.933 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2323
13:27:44.934 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2323
13:27:44.934 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2348
13:27:44.934 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2348
13:27:44.935 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2383
13:27:44.935 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2409
13:27:44.936 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2442
13:27:44.936 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2502
13:27:44.937 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2502
13:27:44.938 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2623
13:27:44.938 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2678
13:27:44.938 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2680
13:27:44.939 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2736
13:27:44.939 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2789
13:27:44.940 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2810
13:27:44.940 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2869
13:27:44.941 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2869
13:27:44.941 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2928
13:27:44.942 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2928
13:27:44.942 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:2988
13:27:44.943 INFO processor.TesserocrCrop - Ignoring region 'region0093' because its width is too small (21)
13:27:44.943 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3045
13:27:44.943 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3076
13:27:44.944 INFO processor.TesserocrCrop - Ignoring region 'region0096' because its width is too small (19)
13:27:44.944 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3104
13:27:44.945 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3162
13:27:44.945 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3221
13:27:44.946 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3221
13:27:44.946 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3256
13:27:44.947 INFO processor.TesserocrCrop - Ignoring region 'region0102' because its width is too small (11)
13:27:44.947 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3313
13:27:44.948 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3376
13:27:44.948 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3376
13:27:44.949 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3432
13:27:44.949 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3440
13:27:44.950 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3440
13:27:44.950 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3495
13:27:44.951 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3495
13:27:44.951 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3495
13:27:44.952 INFO processor.TesserocrCrop - Updated page border: 103:2672,264:3555
13:27:44.952 INFO processor.TesserocrCrop - Padded page border: 99:2676,260:3559
13:27:45.145 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-002_00001.IMG-CROP, file_grp: OCR-D-002, 
path: OCR-D-002/OCR-D-002_00001.IMG-CROP.png
13:27:45.160 INFO ocrd.process.profile - Executing processor 'ocrd-tesserocr-crop' took 2.625491s 
[--input-file-grp='OCR-D-001' --output-file-grp='OCR-D-002' --parameter='{"dpi": -1, "padding": 4}']
13:27:45.161 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
3.30user 0.33system 0:04.47elapsed 81%CPU (0avgtext+0avgdata 145512maxresident)k
155416inputs+475outputs (314major+59107minor)pagefaults 0swaps
13:27:59.546 INFO ocrd-olena-binarize - processing PAGE-XML input file OCR-D-002_00001 (P_00001)
13:28:01.127 INFO ocrd-olena-binarize - found imageFilename 'OCR-D-IMG/0001.tif' for input file ID=OCR-D-002_00001 
(pageId=P_00001)
13:28:08.776 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
13:28:10.378 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
10.90user 3.41system 0:25.22elapsed 56%CPU (0avgtext+0avgdata 182592maxresident)k
1214490inputs+29012outputs (4008major+461754minor)pagefaults 0swaps
13:28:12.939 INFO processor.OcropyDenoise - INPUT FILE 0 / P_00001
13:28:13.688 INFO processor.OcropyDenoise - Page "OCR-D-003_00001" uses 400.000000 DPI
13:28:13.689 INFO processor.OcropyDenoise - About to despeckle 'OCR-D-004_00001'
13:28:15.410 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-004_00001.IMG-DESPECK, file_grp: OCR-D-004, 
path: OCR-D-004/OCR-D-004_00001.IMG-DESPECK.png
13:28:15.436 INFO processor.OcropyDenoise - created file ID: OCR-D-004_00001, file_grp: OCR-D-004, path: 
OCR-D-004/OCR-D-004_00001.xml
13:28:15.437 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-denoise' took 2.499802s 
[--input-file-grp='OCR-D-003' --output-file-grp='OCR-D-004' --parameter='{"noise_maxsize": 3.0, "dpi": 0, 
"level-of-operation": "page"}']
13:28:15.438 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
2.63user 1.27system 0:05.17elapsed 75%CPU (0avgtext+0avgdata 511816maxresident)k
189222inputs+654outputs (545major+358541minor)pagefaults 0swaps
13:28:19.269 INFO processor.OcropyDeskew - INPUT FILE 0 / P_00001
13:28:19.974 INFO processor.OcropyDeskew - About to deskew page 'OCR-D-004_00001'
13:28:37.418 INFO processor.OcropyDeskew - Found angle for page 'OCR-D-004_00001': -1.2
13:28:38.155 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-005_00001.IMG-DESKEW, file_grp: OCR-D-005, 
path: OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png
13:28:38.160 INFO processor.OcropyDeskew - created file ID: OCR-D-005_00001, file_grp: OCR-D-005, path: 
OCR-D-005/OCR-D-005_00001.xml
13:28:38.161 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-deskew' took 18.894748s 
[--input-file-grp='OCR-D-004' --output-file-grp='OCR-D-005' --parameter='{"level-of-operation": "page", "maxskew": 
5.0}']
13:28:38.162 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
17.84user 3.00system 0:22.84elapsed 91%CPU (0avgtext+0avgdata 378764maxresident)k
221350inputs+791outputs (691major+1077857minor)pagefaults 0swaps
13:28:46.100 INFO processor.RepairSegmentation - INPUT FILE 0 / P_00001
13:28:46.136 INFO ocrd.page_validator.validate - Validating input file 'OCR-D-006_00001'
13:28:46.191 INFO ocrd.process.profile - Executing processor 'ocrd-segment-repair' took 0.092767s 
[--input-file-grp='OCR-D-006' --output-file-grp='OCR-D-007' --parameter='{"plausibilize": true, "sanitize": false, 
"plausibilize_merge_min_overlap": 0.9}']
13:28:46.192 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
13:28:49.915 INFO processor.OcropyClip - INPUT FILE 0 / P_00001
13:28:50.831 INFO processor.OcropyClip - Page "OCR-D-006_00001" uses 400.000000 DPI
13:28:53.047 INFO processor.OcropyClip - created file ID: OCR-D-008_00001, file_grp: OCR-D-008, path: 
OCR-D-008/OCR-D-008_00001.xml
13:28:53.103 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-clip' took 3.189316s 
[--input-file-grp='OCR-D-007' --output-file-grp='OCR-D-008' --parameter='{"level-of-operation": "region", "dpi": 0, 
"min_fraction": 0.7}']
13:28:53.104 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
3.29user 1.72system 0:06.93elapsed 72%CPU (0avgtext+0avgdata 521656maxresident)k
190740inputs+42outputs (554major+521368minor)pagefaults 0swaps
13:28:56.254 INFO processor.OcropyDeskew - INPUT FILE 0 / P_00001
13:28:57.313 INFO processor.OcropyDeskew - About to deskew region 'region0000'
13:29:01.742 INFO processor.OcropyDeskew - Found angle for region 'region0000': 0.8
13:29:01.996 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0000.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0000.IMG-DESKEW.png
13:29:02.246 INFO processor.OcropyDeskew - About to deskew region 'region0001'
13:29:02.358 INFO processor.OcropyDeskew - Found angle for region 'region0001': 0.9
13:29:02.371 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0001.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0001.IMG-DESKEW.png
13:29:02.637 INFO processor.OcropyDeskew - About to deskew region 'region0002'
13:29:02.728 INFO processor.OcropyDeskew - Found angle for region 'region0002': 0.0
13:29:02.764 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0002.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0002.IMG-DESKEW.png
13:29:02.983 INFO processor.OcropyDeskew - About to deskew region 'region0003'
13:29:03.834 INFO processor.OcropyDeskew - Found angle for region 'region0003': 1.1
13:29:03.914 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0003.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0003.IMG-DESKEW.png
13:29:04.112 INFO processor.OcropyDeskew - About to deskew region 'region0005'
13:29:04.691 INFO processor.OcropyDeskew - Found angle for region 'region0005': 1.4
13:29:04.739 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0005.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0005.IMG-DESKEW.png
13:29:04.950 INFO processor.OcropyDeskew - About to deskew region 'region0006'
13:29:05.160 INFO processor.OcropyDeskew - Found angle for region 'region0006': 0.8
13:29:05.173 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0006.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0006.IMG-DESKEW.png
13:29:05.331 INFO processor.OcropyDeskew - About to deskew region 'region0008'
13:29:06.674 INFO processor.OcropyDeskew - Found angle for region 'region0008': 1.1
13:29:06.784 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0008.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0008.IMG-DESKEW.png
13:29:06.995 INFO processor.OcropyDeskew - About to deskew region 'region0010'
13:29:07.374 INFO processor.OcropyDeskew - Found angle for region 'region0010': 0.9
13:29:07.395 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0010.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0010.IMG-DESKEW.png
13:29:07.551 INFO processor.OcropyDeskew - About to deskew region 'region0011'
13:29:07.578 INFO processor.OcropyDeskew - Found angle for region 'region0011': 0.0
13:29:07.581 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0011.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0011.IMG-DESKEW.png
13:29:07.751 INFO processor.OcropyDeskew - About to deskew region 'region0013'
13:29:08.803 INFO processor.OcropyDeskew - Found angle for region 'region0013': 1.1
13:29:08.863 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0013.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0013.IMG-DESKEW.png
13:29:09.076 INFO processor.OcropyDeskew - About to deskew region 'region0015'
13:29:09.639 INFO processor.OcropyDeskew - Found angle for region 'region0015': 1.2
13:29:09.669 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0015.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0015.IMG-DESKEW.png
13:29:09.865 INFO processor.OcropyDeskew - About to deskew region 'region0017'
13:29:10.551 INFO processor.OcropyDeskew - Found angle for region 'region0017': 1.4
13:29:10.589 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0017.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0017.IMG-DESKEW.png
13:29:10.793 INFO processor.OcropyDeskew - About to deskew region 'region0018'
13:29:10.817 INFO processor.OcropyDeskew - Found angle for region 'region0018': 0.0
13:29:10.820 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0018.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0018.IMG-DESKEW.png
13:29:10.984 INFO processor.OcropyDeskew - About to deskew region 'region0019'
13:29:11.103 INFO processor.OcropyDeskew - Found angle for region 'region0019': 1.2
13:29:11.110 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0019.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0019.IMG-DESKEW.png
13:29:11.269 INFO processor.OcropyDeskew - About to deskew region 'region0021'
13:29:12.048 INFO processor.OcropyDeskew - Found angle for region 'region0021': 0.9
13:29:12.107 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-009_00001_region0021.IMG-DESKEW, file_grp: 
OCR-D-009, path: OCR-D-009/OCR-D-009_00001_region0021.IMG-DESKEW.png
13:29:12.114 INFO processor.OcropyDeskew - created file ID: OCR-D-009_00001, file_grp: OCR-D-009, path: 
OCR-D-009/OCR-D-009_00001.xml
13:29:12.121 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-deskew' took 15.867691s 
[--input-file-grp='OCR-D-008' --output-file-grp='OCR-D-009' --parameter='{"level-of-operation": "region", "maxskew": 
5.0}']
13:29:12.122 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
16.53user 1.02system 0:19.02elapsed 92%CPU (0avgtext+0avgdata 241896maxresident)k
234303inputs+802outputs (739major+206779minor)pagefaults 0swaps
13:29:14.170 INFO processor.TesserocrSegmentLine - INPUT FILE 0 / P_00001
13:29:14.997 INFO processor.TesserocrSegmentLine - Page 'P_00001' images will use 400 DPI from image meta-data
13:29:19.175 ERROR ocrd.workspace.image_from_segment - segment "region0015" image (binarized,despeckled,deskewed; 
1215x329) has not been cropped properly (1207x301)
13:29:20.027 ERROR ocrd.workspace.image_from_segment - segment "region0019" image (binarized,despeckled,deskewed; 
912x101) has not been cropped properly (910x81)
13:29:20.438 INFO ocrd.process.profile - Executing processor 'ocrd-tesserocr-segment-line' took 6.428814s 
[--input-file-grp='OCR-D-009' --output-file-grp='OCR-D-010' --parameter='{"dpi": -1, "overwrite_lines": true}']
13:29:20.439 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
6.43user 0.53system 0:08.24elapsed 84%CPU (0avgtext+0avgdata 214988maxresident)k
169195inputs+90outputs (370major+142186minor)pagefaults 0swaps
13:29:23.384 INFO processor.OcropyResegment - INPUT FILE 0 / P_00001
13:29:24.225 INFO processor.OcropyResegment - Page "OCR-D-010_00001" uses 400.000000 DPI
13:29:43.961 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0000.IMG-RESEG.png
13:29:44.265 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0001.IMG-RESEG.png
13:29:44.444 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0002.IMG-RESEG.png
13:29:44.583 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0003.IMG-RESEG.png
13:29:44.723 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0004.IMG-RESEG.png
13:29:44.875 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0005.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0005.IMG-RESEG.png
13:29:45.015 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0006.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0006.IMG-RESEG.png
13:29:45.157 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0007.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0007.IMG-RESEG.png
13:29:45.297 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0008.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0008.IMG-RESEG.png
13:29:45.443 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0009.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0009.IMG-RESEG.png
13:29:45.584 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0010.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0010.IMG-RESEG.png
13:29:45.730 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0011.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0011.IMG-RESEG.png
13:29:45.880 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0012.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0012.IMG-RESEG.png
13:29:46.020 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0013.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0013.IMG-RESEG.png
13:29:46.159 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0014.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0014.IMG-RESEG.png
13:29:46.298 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0015.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0015.IMG-RESEG.png
13:29:46.461 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0016.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0016.IMG-RESEG.png
13:29:46.624 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0017.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0017.IMG-RESEG.png
13:29:46.765 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0018.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0018.IMG-RESEG.png
13:29:46.908 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0019.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0019.IMG-RESEG.png
13:29:47.034 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0020.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0020.IMG-RESEG.png
13:29:47.253 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0021.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0021.IMG-RESEG.png
13:29:47.433 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0022.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0022.IMG-RESEG.png
13:29:47.633 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0023.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0023.IMG-RESEG.png
13:29:47.830 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0024.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0024.IMG-RESEG.png
13:29:47.971 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0025.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0025.IMG-RESEG.png
13:29:48.111 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0026.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0026.IMG-RESEG.png
13:29:48.250 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0027.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0027.IMG-RESEG.png
13:29:48.387 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0028.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0028.IMG-RESEG.png
13:29:48.527 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0029.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0029.IMG-RESEG.png
13:29:48.667 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0030.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0030.IMG-RESEG.png
13:29:48.807 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0031.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0031.IMG-RESEG.png
13:29:48.947 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0032.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0032.IMG-RESEG.png
13:29:49.089 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0033.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0033.IMG-RESEG.png
13:29:49.229 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0034.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0034.IMG-RESEG.png
13:29:49.371 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0035.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0035.IMG-RESEG.png
13:29:49.512 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0036.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0036.IMG-RESEG.png
13:29:49.637 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0000_region0000_line0037.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0000_region0000_line0037.IMG-RESEG.png
13:29:49.637 WARNING processor.OcropyResegment - Page "OCR-D-010_00001" region "region0001" contains only one line
13:29:49.637 WARNING processor.OcropyResegment - Page "OCR-D-010_00001" region "region0002" contains only one line
13:29:50.950 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0000.IMG-RESEG.png
13:29:51.025 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0001.IMG-RESEG.png
13:29:51.081 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0002.IMG-RESEG.png
13:29:51.122 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0003.IMG-RESEG.png
13:29:51.164 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0004.IMG-RESEG.png
13:29:51.299 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0005.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0005.IMG-RESEG.png
13:29:51.371 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0006.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0006.IMG-RESEG.png
13:29:51.431 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0003_region0003_line0007.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0003_region0003_line0007.IMG-RESEG.png
13:29:52.058 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0005_region0005_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0005_region0005_line0000.IMG-RESEG.png
13:29:52.310 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0005_region0005_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0005_region0005_line0001.IMG-RESEG.png
13:29:52.373 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0005_region0005_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0005_region0005_line0002.IMG-RESEG.png
13:29:52.405 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0005_region0005_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0005_region0005_line0003.IMG-RESEG.png
13:29:52.745 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0006_region0006_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0006_region0006_line0000.IMG-RESEG.png
13:29:52.765 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0006_region0006_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0006_region0006_line0001.IMG-RESEG.png
13:29:55.198 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0000.IMG-RESEG.png
13:29:55.249 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0001.IMG-RESEG.png
13:29:55.299 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0002.IMG-RESEG.png
13:29:55.348 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0003.IMG-RESEG.png
13:29:55.401 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0004.IMG-RESEG.png
13:29:55.457 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0005.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0005.IMG-RESEG.png
13:29:55.509 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0006.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0006.IMG-RESEG.png
13:29:55.561 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0007.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0007.IMG-RESEG.png
13:29:55.613 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0008.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0008.IMG-RESEG.png
13:29:55.665 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0009.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0009.IMG-RESEG.png
13:29:55.862 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0010.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0010.IMG-RESEG.png
13:29:55.945 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0008_region0008_line0011.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0008_region0008_line0011.IMG-RESEG.png
13:29:56.500 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0010_region0010_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0010_region0010_line0000.IMG-RESEG.png
13:29:56.552 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0010_region0010_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0010_region0010_line0001.IMG-RESEG.png
13:29:56.590 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0010_region0010_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0010_region0010_line0002.IMG-RESEG.png
13:29:56.623 WARNING processor.OcropyResegment - Page "OCR-D-010_00001" region "region0011" contains only one line
13:29:58.287 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0000.IMG-RESEG.png
13:29:58.336 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0001.IMG-RESEG.png
13:29:58.387 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0002.IMG-RESEG.png
13:29:58.432 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0003.IMG-RESEG.png
13:29:58.479 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0004.IMG-RESEG.png
13:29:58.523 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0005.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0005.IMG-RESEG.png
13:29:58.572 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0006.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0006.IMG-RESEG.png
13:29:58.619 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0007.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0007.IMG-RESEG.png
13:29:58.698 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0013_region0013_line0008.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0013_region0013_line0008.IMG-RESEG.png
13:29:58.860 ERROR ocrd.workspace.image_from_segment - segment "region0015" image (binarized,despeckled,deskewed; 
1215x329) has not been cropped properly (1207x301)
13:29:59.680 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0015_region0015_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0015_region0015_line0000.IMG-RESEG.png
13:29:59.738 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0015_region0015_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0015_region0015_line0001.IMG-RESEG.png
13:29:59.788 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0015_region0015_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0015_region0015_line0002.IMG-RESEG.png
13:29:59.827 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0015_region0015_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0015_region0015_line0003.IMG-RESEG.png
13:29:59.861 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0015_region0015_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0015_region0015_line0004.IMG-RESEG.png
13:30:00.495 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0017_region0017_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0017_region0017_line0000.IMG-RESEG.png
13:30:00.517 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0017_region0017_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0017_region0017_line0001.IMG-RESEG.png
13:30:00.543 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0017_region0017_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0017_region0017_line0002.IMG-RESEG.png
13:30:00.568 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0017_region0017_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0017_region0017_line0003.IMG-RESEG.png
13:30:00.677 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0017_region0017_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0017_region0017_line0004.IMG-RESEG.png
13:30:00.677 WARNING processor.OcropyResegment - Page "OCR-D-010_00001" region "region0018" contains only one line
13:30:00.677 WARNING processor.OcropyResegment - Page "OCR-D-010_00001" region "region0019" contains only one line
13:30:01.597 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0000.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0000.IMG-RESEG.png
13:30:01.627 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0001.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0001.IMG-RESEG.png
13:30:01.655 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0002.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0002.IMG-RESEG.png
13:30:01.688 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0003.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0003.IMG-RESEG.png
13:30:01.722 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0004.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0004.IMG-RESEG.png
13:30:01.757 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0005.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0005.IMG-RESEG.png
13:30:01.778 INFO ocrd.workspace.save_image_file - created file ID: 
OCR-D-011_00001_region0021_region0021_line0006.IMG-RESEG, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001_region0021_region0021_line0006.IMG-RESEG.png
13:30:01.784 INFO processor.OcropyResegment - created file ID: OCR-D-011_00001, file_grp: OCR-D-011, path: 
OCR-D-011/OCR-D-011_00001.xml
13:30:01.787 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-resegment' took 38.405861s 
[--input-file-grp='OCR-D-010' --output-file-grp='OCR-D-011' --parameter='{"dpi": 0, "min_fraction": 0.8, 
"extend_margins": 3}']
13:30:01.788 INFO ocrd.workspace.save_mets - Saving mets 
'/beegfs/work/ws/hd_xxx-ubhd-ocrd-0/charis1824/0000aaaf.tif/mets.xml'
37.39user 5.99system 0:41.54elapsed 104%CPU (0avgtext+0avgdata 345484maxresident)k
265195inputs+1001outputs (843major+1157777minor)pagefaults 0swaps
Traceback (most recent call last):
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/bin/ocrd-cis-ocropy-dewarp", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_dewarp())
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/cli.py", line 43, in 
ocrd_cis_ocropy_dewarp
    return ocrd_cli_wrap_processor(OcropyDewarp, *args, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 81, 
in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in 
run_processor
    processor.process()
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/dewarp.py", line 108, in 
process
    for (n, input_file) in enumerate(self.input_files):
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/base.py", line 126, in 
input_files
    ret = self.zip_input_files(mimetype=None, on_error='abort')
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/base.py", line 225, in 
zip_input_files
    file_.pageId, ifg))
ValueError: No PAGE-XML for page 'OCR-D-010_00001' in fileGrp 'OCR-D-011' but multiple matches.
1.40user 0.57system 0:03.28elapsed 60%CPU (0avgtext+0avgdata 96844maxresident)k
146230inputs+11outputs (530major+30910minor)pagefaults 0swaps
Traceback (most recent call last):
  File "/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/bin/ocrd-calamari-recognize", line 8, in 
<module>
    sys.exit(ocrd_calamari_recognize())
  File 
"/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", 
line 829, in __call__
    return self.main(*args, **kwargs)
  File 
"/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", 
line 782, in main
    rv = self.invoke(ctx)
  File 
"/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", 
line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File 
"/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", 
line 610, in invoke
    return callback(*args, **kwargs)
  File 
"/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_calamari/cli.py
", line 13, in ocrd_calamari_recognize
    return ocrd_cli_wrap_processor(CalamariRecognize, *args, **kwargs)
  File 
"/beegfs/home/hd/hd_hd/hd_xxx/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__in
it__.py", line 80, in ocrd_cli_wrap_processor
    raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
	Input fileGrp[@USE='OCR-D-012'] not in METS!
1.20user 0.70system 0:04.33elapsed 43%CPU (0avgtext+0avgdata 73976maxresident)k
119356inputs+10outputs (433major+40131minor)pagefaults 0swaps
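
The `ValueError` above reports several files for one page in fileGrp `OCR-D-011`, which fits the log: the resegmentation step wrote many `IMG-RESEG` PNG files next to the PAGE-XML result. The following is a simplified sketch of how one input per page could be chosen in such a group. It is not the OCR-D core implementation; the function name `pick_input_per_page` and the dictionary layout of the file records are assumptions for illustration.

```python
from collections import defaultdict

# MIME type used for PAGE-XML files in OCR-D METS documents
PAGE_XML = "application/vnd.prima.page+xml"


def pick_input_per_page(files):
    """Map each page ID to one file, preferring PAGE-XML over derived images.

    `files` is a list of dicts with the (hypothetical) keys
    "ID", "pageId" and "mimetype".
    """
    by_page = defaultdict(list)
    for f in files:
        by_page[f["pageId"]].append(f)

    chosen = {}
    for page_id, candidates in by_page.items():
        page_xml = [f for f in candidates if f["mimetype"] == PAGE_XML]
        if len(page_xml) == 1:
            # exactly one PAGE-XML file: unambiguous
            chosen[page_id] = page_xml[0]
        elif len(candidates) == 1:
            # a single non-PAGE file (e.g. the original image)
            chosen[page_id] = candidates[0]
        else:
            ids = [f["ID"] for f in candidates]
            raise ValueError(f"ambiguous input for page {page_id!r}: {ids}")
    return chosen
```

With one PAGE-XML file and any number of PNGs per page, the PAGE-XML file is picked; with only PNGs for a page, the sketch raises an error much like the one in the traceback.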

region segmentation crashes

ocrd-cis-ocropy-segment crashed completely on this picture with the following workflow:
ocrd-cis-ocropy-binarize|MAX|OCR-D-BIN1| | |ERROR
ocrd-anybaseocr-crop|OCR-D-BIN1|OCR-D-CROP| | |ERROR
ocrd-olena-binarize|OCR-D-CROP|OCR-D-BIN| | |ERROR
ocrd-cis-ocropy-deskew|OCR-D-BIN|OCR-D-DESKEW| | /test/data/ocrd/taverna/models/param-cis-deskew-page.json |ERROR
ocrd-cis-ocropy-denoise|OCR-D-DESKEW|OCR-D-DENOISE| | |ERROR
ocrd-cis-ocropy-segment|OCR-D-DENOISE|OCR-D-SEG-REGION| | /test/data/ocrd/taverna/models/param-cis-seg-page.json |ERROR

Make deskewing efficient+robust, and add orientation

As outlined a while ago,

# TODO: make zoomable, i.e. interpolate down to max 300 DPI to be faster
# TODO: sweep through angles very coarse, then hill climbing for precision
# TODO: try with shear (i.e. simply numpy shift) instead of true rotation
# TODO: use square of difference instead of variance as projection score
# (more reliable with background noise or multi-column)
# TODO: offer flip (90°) test (comparing length-normalized projection profiles)
# TODO: offer mirror (180°, or + vs - 90°) test based on ascender/descender signal
# (Latin scripts have more ascenders e.g. bhdkltſ than descenders e.g. qgpyj)

there are plenty of opportunities to improve ocrd-cis-ocropy-deskew:

  • downscale during estimation when pixel density (or resolution) is large
  • use hill-climbing approach with increasing precision (and interpretable performance/quality trade-off parameter) to find best angle (instead of exhaustive linear sweep)
  • approximate expensive rotation by cheap shear operation during estimation
  • use square of difference between rows of projection profile as score (instead of variance) to be more robust against noise and non-aligning text columns
  • clip/delete extremely large connected components during estimation to be more robust against separators, images and borders
  • also detect orientation (³) (multiples of 90°) by:
    • comparing horizontal and vertical projection profiles after length normalization: if the (best) vertical profile scores much better than the (best) horizontal profile, then the image needs to be rotated by 90° – because straight pages' text lines align horizontally, causing maximal fg/bg variance in the horizontal profile (¹)
    • comparing the steepness of the foreground/background transition flanks: if (going from "top to bottom") the gradient is (sufficiently) larger from bg to fg than from fg to bg, then the image needs to be rotated by 180° – because ascenders are much more frequent than descenders in Latin-based scripts, and thus the gradient above the line is supposed to be less steep than below (²)
  1. This would of course have to be inverted for vertical text like traditional Chinese or Japanese
  2. This might of course not work for other scripts. Preliminary OCR might be the only choice there.
  3. Both steps can be combined: -90° = 90°+180°

Got "Killed" running ocrd-cis-ocropy-clip

Running the workflow below on an image ends with the process being "Killed":

09:12:59.552 INFO processor.OcropyClip - INPUT FILE 0 / PAGE-1
09:12:59.922 INFO processor.OcropyClip - Page "OCR-D-SEG-REG-DESKEW-1" uses 300.000000 DPI
Killed

To get the image, please contact me on Gitter.

Workflow used:

ocrd-cis-ocropy-binarize \
	-I OCR-D-IMG \
	-O OCR-D-BIN
ocrd-tesserocr-segment-region \
	-I OCR-D-BIN \
	-O OCR-D-SEG-REG
ocrd-tesserocr-deskew \
	-I OCR-D-SEG-REG \
	-O OCR-D-SEG-REG-DESKEW
ocrd-cis-ocropy-clip \
	-I OCR-D-SEG-REG-DESKEW \
	-O OCR-D-SEG-REG-DESKEW-CLIP

ocrd-cis-align --dump-json does not produce valid JSON

Calling ocrd-cis-align --dump-json with the Docker image ocrd/all:2020-12-28 gives the following standard output (notice the last three lines):

{
 "executable": "ocrd-cis-align",
 "categories": [
  "Text recognition and optimization"
 ],
 "steps": [
  "recognition/post-correction"
 ],
 "input_file_grp": [
  "OCR-D-OCR-1",
  "OCR-D-OCR-2",
  "OCR-D-OCR-N"
 ],
 "output_file_grp": [
  "OCR-D-ALIGNED"
 ],
 "description": "Align multiple OCRs and/or GTs"
}
11:13:38.440 CRITICAL root - getLogger was called before initLogging. Source of the call:
11:13:38.441 CRITICAL root -   File "/build/ocrd_cis/ocrd_cis/align/cli.py", line 35, in __init__
11:13:38.441 CRITICAL root -     self.log = getLogger('cis.Processor.Aligner')

This crashes OCR-D when calling ocrd process "cis-align …" …:

Traceback (most recent call last):
  File "/usr/bin/ocrd", line 33, in <module>
    sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
  …
  File "/build/core/ocrd/ocrd/task_sequence.py", line 72, in validate
    param_validator = ParameterValidator(self.ocrd_tool_json)
  File "/build/core/ocrd/ocrd/task_sequence.py", line 53, in ocrd_tool_json
    self._ocrd_tool_json = json.loads(result.stdout)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 19 column 1 (char 312)
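
The failure can be reproduced with the standard library alone: `json.loads` rejects any non-whitespace text after the first JSON value, which is what the trailing log lines are. The sketch below shows this, plus a defensive way for a caller to read only the leading JSON object with `json.JSONDecoder.raw_decode`. The helper name `parse_leading_json` and the shortened sample output are assumptions for illustration; this is a consumer-side workaround, not the fix for the logging call in `ocrd_cis/align/cli.py`.

```python
import json


def parse_leading_json(text):
    """Decode the first JSON value in `text`.

    Returns the decoded value and the unparsed remainder of the string.
    """
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace, so do that here
    start = len(text) - len(text.lstrip())
    value, end = decoder.raw_decode(text, start)
    return value, text[end:]


# Shortened stand-in for the tool's standard output
stdout = (
    '{\n "executable": "ocrd-cis-align"\n}\n'
    "11:13:38.440 CRITICAL root - getLogger was called before initLogging.\n"
)

try:
    json.loads(stdout)
except json.JSONDecodeError as err:
    print("json.loads failed:", err.msg)

tool, rest = parse_leading_json(stdout)
print(tool["executable"])
print("trailing text:", rest.strip())
```

Keeping standard output reserved for the JSON and sending log messages to standard error would avoid the problem at the source.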

Bug: OcropyClip: TypeError: function takes exactly 1 argument (2 given)

Workflow:

. /usr/local/ocrd_all/venv/bin/activate
export TMPDIR=/dwork/tmp
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
ocrd-create-mets.xml
( /usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"tesserocr-segment-region -I OCR-D-N5 -O OCR-D-N6" \
"segment-repair -I OCR-D-N6 -O OCR-D-N7 -P plausibilize true" \
"cis-ocropy-deskew -I OCR-D-N7 -O OCR-D-N8 -P level-of-operation region" \
"cis-ocropy-clip -I OCR-D-N8 -O OCR-D-N9 -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-N9 -O OCR-D-N10" \
"cis-ocropy-clip -I OCR-D-N10 -O OCR-D-N11 -P level-of-operation line" \
"cis-ocropy-resegment -I OCR-D-N11 -O OCR-D-N12" \
"cis-ocropy-dewarp -I OCR-D-N12 -O OCR-D-N13" \
"calamari-recognize -I OCR-D-N13 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_historical/*.ckpt.json"

) >cmd.log 2>&1

Log:

02:57:04.632 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-clip -I OCR-D-N8 -O OCR-D-N9 -p '{"level-of-op
eration": "region", "dpi": 0, "min_fraction": 0.7}''
Traceback (most recent call last):
  File "/usr/local/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/process.py", line 26, in pro
cess_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/task_sequence.py", line 149, in 
run_tasks
    raise Exception("%s exited with non-zero return value %s. STDOUT:\n%s\nSTDERR:\n%s" % (task.executable, returncode, out, err))
Exception: ocrd-cis-ocropy-clip exited with non-zero return value 1. STDOUT:

STDERR:
02:57:06.605 INFO processor.OcropyClip - INPUT FILE 0 / P_00001
02:57:07.682 INFO processor.OcropyClip - Page "OCR-D-N8_00001" uses 300.000000 DPI
Traceback (most recent call last):
  File "/usr/local/ocrd_all/venv/bin/ocrd-cis-ocropy-clip", line 8, in <module>
    sys.exit(ocrd_cis_ocropy_clip())
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/cli.py", line 33, in ocrd_cis_ocropy_clip
    return ocrd_cli_wrap_processor(OcropyClip, *args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 81, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 68, in run_processor
    processor.process()
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd_cis/ocropy/clip.py", line 131, in process
    background_image = Image.new('L', page_image.size, background)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/PIL/Image.py", line 2613, in new
    return im._new(core.fill(mode, size, color))
TypeError: function takes exactly 1 argument (2 given)

Command exited with non-zero status 1

resegment: running for 155 minutes(?)...

and still running.

Workflow:

. /usr/local/ocrd_all/venv/bin/activate
export TMPDIR=/dwork/tmp
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
ocrd-create-mets.xml
( /usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page" \
"pc-segmentation -I OCR-D-N5 -O OCR-D-N6" \
"cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8" \
"cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9" \
"cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10" \
"calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json"

) >cmd.log 2>&1
ps axf
ls       66073  0.0  0.0   4384   744 pts/0    S    14:40   0:00                                  |   \_ /usr/bin/time ocrd process olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls       66074  0.0  0.0 2423620 68968 pts/0   S    14:40   0:05                                  |       \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd process olena-binarize -I OCR-D-IMG -O OCR-D-N1 -P impl wolf anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2 olena-binarize -I OCR-D-N2 -O OCR-D-N3 -P impl wolf cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -P level-of-operation page cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -P level-of-operation page pc-segmentation -I OCR-D-N5 -O OCR-D-N6 cis-ocropy-deskew -I OCR-D-N6 -O OCR-D-N7 -P level-of-operation region tesserocr-segment-line -I OCR-D-N7 -O OCR-D-N8 cis-ocropy-resegment -I OCR-D-N8 -O OCR-D-N9 cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10 calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -P checkpoint /usr/local/ocrd_models/calamari/calamari_models-0.3/fraktur_19th_century/*.ckpt.json
ls        2747  116  0.3 11505348 519324 pts/0 Rl   16:44 160:53                                  |           \_ /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/python3.7 /dwork/ocrd-schroot-ubuntu-eoan/usr/local/ocrd_all/venv/bin/ocrd-cis-ocropy-resegment --working-dir /_digi8+9/digitalisate8/ocr-d/testset/x,pc-segmentation,tesserocr-segment-line,calamari-frak19th --mets mets.xml --input-file-grp OCR-D-N8 --output-file-grp OCR-D-N9 --parameter {"dpi": 0, "min_fraction": 0.8, "extend_margins": 3}

@bertsky: same image set as in last email.

PS: no cis-ocropy-clip for obvious reasons :-)

Bad argument in function `pointPolygonTest` - pt[0] has a wrong type.

The error occurred with the 5 workspaces I have tried. Strangely, our test workspace still works (probably due to missing segmentation features). I am using the latest ocrd_all maximum image. Workspace used: https://gdz.sub.uni-goettingen.de/mets/PPN1023134829.mets.xml

N E X T F L O W  ~  version 21.04.3                                  
Launching `/scratch1/users/mmustaf/operandi/slurm_workspaces/eda2f2cc-dec8-4d5b-9da9-6267663ec455/user_workflow.nf` [prickly_bartik] - revision: d87a716c22
O P E R A N D I - H P C - D E F A U L T  P I P E L I N E
===========================================
input_file_group    : MAX
mets                : /scratch1/users/mmustaf/operandi/slurm_workspaces/eda2f2cc-dec8-4d5b-9da9-6267663ec455/cccdfe69-b449-4367-9eea-66e16e3050fa/mets.xml
volume_map_dir      : /scratch1/users/mmustaf/operandi/slurm_workspaces/eda2f2cc-dec8-4d5b-9da9-6267663ec455
models_mapping      : /scratch1/users/mmustaf/ocrd_models:/usr/local/share
sif_path            : /scratch1/users/mmustaf/ocrd_all_maximum_image.sif
singularity_wrapper : singularity exec --bind /scratch1/users/mmustaf/operandi/slurm_workspaces/eda2f2cc-dec8-4d5b-9da9-6267663ec455 --bind /scratch1/users/mmustaf/ocrd_models:/usr/local/share --env OCRD_METS_CACHING=true /scratch1/users/mmustaf/ocrd_all_maximum_image.sif

[0c/9d6c01] Submitted process > ocrd_cis_ocropy_binarize
[75/78eb46] Submitted process > ocrd_anybaseocr_crop
[45/e1c09c] Submitted process > ocrd_skimage_binarize
[c6/c6a10d] Submitted process > ocrd_skimage_denoise
[74/b99277] Submitted process > ocrd_tesserocr_deskew
[6b/214ea6] Submitted process > ocrd_cis_ocropy_segment
Error executing process > 'ocrd_cis_ocropy_segment'

Caused by:
  Process `ocrd_cis_ocropy_segment` terminated with an error exit status (1)

Command executed:

  singularity exec --bind /scratch1/users/mmustaf/operandi/slurm_workspaces/eda2f2cc-dec8-4d5b-9da9-6267663ec455 --bind /scratch1/users/mmustaf/ocrd_models:/usr/local/share --env OCRD_METS_CACHING=true /scratch1/users/mmustaf/ocrd_all_maximum_image.sif ocrd-cis-ocropy-segment -m mets.xml -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -p '{"level-of-operation": "page"}'

Command exit status:
  1

Command output:
  (empty)

Command error:
  13:38:12.729 INFO processor.OcropySegment - Found 20 separators for page "FILE_0001_OCR-D-BIN-DENOISE-DESKEW"
  13:38:12.767 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-cis-ocropy-segment'
  Traceback (most recent call last):
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
      processor.process()
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 404, in process
      self._process_element(page, ignore, page_image, page_coords,
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 750, in _process_element
      sep_polygons, _ = masks2polygons(seplines, None, element_bin,
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 139, in masks2polygons
      hole_idx = np.argmin([cv2.pointPolygonTest(contour, tuple(pt[0]), True)
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 139, in <listcomp>
      hole_idx = np.argmin([cv2.pointPolygonTest(contour, tuple(pt[0]), True)
  cv2.error: OpenCV(4.7.0) :-1: error: (-5:Bad argument) in function 'pointPolygonTest'
  > Overload resolution failed:
  >  - Can't parse 'pt'. Sequence item with index 0 has a wrong type
  >  - Can't parse 'pt'. Sequence item with index 0 has a wrong type
  
  Traceback (most recent call last):
    File "/usr/local/bin/ocrd-cis-ocropy-segment", line 8, in <module>
      sys.exit(ocrd_cis_ocropy_segment())
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
      rv = self.invoke(ctx)
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
      return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
      return __callback(*args, **kwargs)
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/cli.py", line 53, in ocrd_cis_ocropy_segment
      return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
    File "/build/core/ocrd/ocrd/decorators/__init__.py", line 116, in ocrd_cli_wrap_processor
      run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 131, in run_processor
      raise err
    File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
      processor.process()
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 404, in process
      self._process_element(page, ignore, page_image, page_coords,
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 750, in _process_element
      sep_polygons, _ = masks2polygons(seplines, None, element_bin,
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 139, in masks2polygons
      hole_idx = np.argmin([cv2.pointPolygonTest(contour, tuple(pt[0]), True)
    File "/usr/local/lib/python3.8/site-packages/ocrd_cis/ocropy/segment.py", line 139, in <listcomp>
      hole_idx = np.argmin([cv2.pointPolygonTest(contour, tuple(pt[0]), True)
  cv2.error: OpenCV(4.7.0) :-1: error: (-5:Bad argument) in function 'pointPolygonTest'
  > Overload resolution failed:
  >  - Can't parse 'pt'. Sequence item with index 0 has a wrong type
  >  - Can't parse 'pt'. Sequence item with index 0 has a wrong type
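
The "wrong type" message is consistent with the coordinates in `pt[0]` reaching OpenCV as NumPy scalars rather than built-in Python numbers. That cause is an assumption here; the log does not confirm it. A minimal sketch of a conversion helper follows. The name `to_native_point` and the stand-in class `FakeInt32` are hypothetical and are used only so the example runs without NumPy or OpenCV:

```python
def to_native_point(pt):
    """Return a 2-element point as a tuple of built-in Python floats."""
    x, y = pt
    return (float(x), float(y))


class FakeInt32:
    """Stand-in for a NumPy scalar such as numpy.int32 (illustration only)."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        return float(self.value)


# A point whose coordinates are not built-in numbers ...
point = to_native_point((FakeInt32(3), FakeInt32(4)))
# ... becomes a plain tuple of floats.
print(point, [type(c).__name__ for c in point])  # (3.0, 4.0) ['float', 'float']
```

If the assumption holds, the call in the traceback could be changed to something like `cv2.pointPolygonTest(contour, to_native_point(pt[0]), True)`; this is a possible workaround, not a confirmed fix.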

Improve code quality

LGTM gives ocrd_cis only a moderate code quality grade (C) and offers numerous hints on how it could be improved.

cannot find model

To me, 1eb4cb1 (Improve searching for model files) is a regression: The old method delegated to ocrolib.load_object(), which itself would try various paths (the system's datarootdir, /usr/local/share/ocropus, the CWD, the directory of the source file) and its subdirectories, and would also try with and without .gz suffix. It thus covered many use cases (in venvs, in system, in docker containers).

But the new method with an additional get_model() raises an exception before even allowing load_object() to do its work.
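
The fallback behaviour described above (try several directories, with and without the .gz suffix, and only then give up) can be sketched roughly as follows. The function name `find_model` and its parameters are hypothetical; this is not the code of `get_model()` or `ocrolib.load_object()`, and the subdirectory search mentioned above is left out for brevity:

```python
from pathlib import Path


def find_model(name, search_dirs):
    """Return the first existing file for `name` or `name + '.gz'`, else None.

    Hypothetical helper: `name` is tried as given first (absolute, or
    relative to the current working directory), then inside each directory
    of `search_dirs`, in order.
    """
    candidates = [Path(name)]
    candidates.extend(Path(directory) / name for directory in search_dirs)
    for candidate in candidates:
        gz_variant = candidate.with_name(candidate.name + ".gz")
        for path in (candidate, gz_variant):
            if path.is_file():
                return path
    return None  # let the caller decide whether to raise or delegate further


# Example call; the result depends on what is installed on the system.
print(find_model("en-default.pyrnn", ["/usr/local/share/ocropus", "."]))
```

Returning `None` instead of raising leaves room for a later loader to attempt its own search, which is the point made above.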

calamari_ocr dependency outdated and heavy

We still have calamari_ocr == 0.3.5 in install_requires. That's quite a drag IMO: it is quite outdated, pulls in TensorFlow/Keras (themselves outdated), and is probably not really necessary (since we've had ocrd_calamari for quite a while now).

Can we remove this?

(I see references to Calamari in div.eval, div.auswerter.runcalamari and aio.runcalamari.)

'GeometryCollection' object has no attribute 'exterior'

Environment

  • Version: included in Docker Image ocrd/all:maximum from 2020-09-10 (docker image id: 9e71ab5d7d53)

Current behavior

When executing docker run -v 1085:/data -w /data -v calamari_models:/models -- ocrd/all:maximum ocrd process ⟨here omitting the “best results for selected pages” workflow⟩, I receive the following error:

Traceback (most recent call last):
  File "/usr/bin/ocrd", line 33, in <module>
    sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/build/core/ocrd/ocrd/cli/process.py", line 28, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/build/core/ocrd/ocrd/task_sequence.py", line 149, in run_tasks
    raise Exception("%s exited with non-zero return value %s. STDOUT:\n%s\nSTDERR:\n%s" % (task.executable, returncode, out, err))
Exception: ocrd-cis-ocropy-segment exited with non-zero return value 1. STDOUT:

STDERR:
Traceback (most recent call last):
  File "/usr/bin/ocrd-cis-ocropy-segment", line 33, in <module>
    sys.exit(load_entry_point('ocrd-cis', 'console_scripts', 'ocrd-cis-ocropy-segment')())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/build/ocrd_cis/ocrd_cis/ocropy/cli.py", line 53, in ocrd_cis_ocropy_segment
    return ocrd_cli_wrap_processor(OcropySegment, *args, **kwargs)
  File "/build/core/ocrd/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/build/core/ocrd/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/build/ocrd_cis/ocrd_cis/ocropy/segment.py", line 381, in process
    region.id, file_id + '_' + region.id, zoom)
  File "/build/ocrd_cis/ocrd_cis/ocropy/segment.py", line 648, in _process_element
    line_polygon = polygon_for_parent(line_polygon, element)
  File "/build/ocrd_cis/ocrd_cis/ocropy/segment.py", line 677, in polygon_for_parent
    return interp.exterior.coords[:-1] # keep open
AttributeError: 'GeometryCollection' object has no attribute 'exterior'

I uploaded the contents of the directory 1085 here (removed now) to make reproduction easier. See also an example input image.
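
The AttributeError suggests that `interp` in `polygon_for_parent` was a geometry collection rather than a single polygon, presumably the result of a geometric operation such as an intersection; a collection has no `exterior`. One possible guard is sketched below, under the assumption that geometries expose `geom_type`, `geoms` and `area` the way Shapely objects do. The function `largest_polygon` and the stand-in objects are hypothetical, and keeping the largest part is only one possible policy:

```python
from types import SimpleNamespace


def largest_polygon(geom):
    """Return the largest Polygon contained in `geom`, or None if there is none."""
    if getattr(geom, "geom_type", None) == "Polygon":
        return geom
    # Collections and multi-part geometries expose their members via `geoms`.
    parts = [largest_polygon(part) for part in getattr(geom, "geoms", [])]
    parts = [part for part in parts if part is not None]
    return max(parts, key=lambda part: part.area, default=None)


# Stand-ins for Shapely geometries, so the sketch runs without Shapely.
small = SimpleNamespace(geom_type="Polygon", area=1.0)
large = SimpleNamespace(geom_type="Polygon", area=9.0)
line = SimpleNamespace(geom_type="LineString", area=0.0)
collection = SimpleNamespace(geom_type="GeometryCollection",
                             geoms=[small, line, large])

print(largest_polygon(collection) is large)  # True
print(largest_polygon(line))  # None
```

A caller would still need to handle the `None` case, for example by skipping the line or keeping the original polygon.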
