ocrd_tesserocr

Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr

Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)

It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.

Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
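
To make the PAGE conventions above more concrete, here is a minimal sketch that reads Coords/@points from a PAGE document at several hierarchy levels using only the Python standard library. The sample document, the element ids and the helper names (parse_points, list_segments) are assumptions made for illustration; they are not part of ocrd_tesserocr or OCR-D core.

```python
import xml.etree.ElementTree as ET

# PAGE namespace version as used in the example output further down in this document
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15"

# Hypothetical sample document, for illustration only
SAMPLE = """<pc:PcGts xmlns:pc="%s">
  <pc:Page imageFilename="example.png" imageWidth="100" imageHeight="50">
    <pc:TextRegion id="r1">
      <pc:Coords points="0,0 100,0 100,50 0,50"/>
      <pc:TextLine id="r1l1">
        <pc:Coords points="5,5 95,5 95,20 5,20"/>
        <pc:TextEquiv><pc:Unicode>Hello world</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>""" % PAGE_NS


def parse_points(points):
    """Turn a PAGE points string like '0,0 10,0' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def list_segments(page_xml):
    """Return (level, id, polygon) for every region, line and word with Coords."""
    root = ET.fromstring(page_xml)
    result = []
    for level in ("TextRegion", "TextLine", "Word"):
        for elem in root.iter("{%s}%s" % (PAGE_NS, level)):
            coords = elem.find("{%s}Coords" % PAGE_NS)
            if coords is not None:
                result.append((level, elem.get("id"), parse_points(coords.get("points"))))
    return result


for level, ident, polygon in list_segments(SAMPLE):
    print(level, ident, polygon)
```

In a real workflow, the OCR-D core API would normally be used to read and write PAGE files instead of raw XML parsing.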

Installation

With docker

This is the best option if you want to run the software in a container.

You need to have Docker installed. Then pull the image:

docker pull ocrd/tesserocr

To run with docker:

docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...

From PyPI and Tesseract provided by system

If your operating system / distribution already provides Tesseract 4.1 or newer, then just install its development package:

# on Debian / Ubuntu:
sudo apt install libtesseract-dev

Otherwise, recent Tesseract packages for Ubuntu are available via PPA alex-p, which has up-to-date builds of Tesseract and its dependencies:

# on Debian / Ubuntu
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install libtesseract-dev

Once Tesseract is available, just install ocrd_tesserocr from PyPI:

pip install ocrd_tesserocr

We strongly recommend setting up a venv first.

From git

Use this option if there is no suitable prebuilt version of Tesseract available on your system, or you want to change the source code or install the latest, unpublished changes.

git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
# install Tesseract:
sudo make deps-ubuntu # system dependencies just for the build
make deps
# install tesserocr and ocrd_tesserocr:
make install

We strongly recommend setting up a venv first.

Models

Tesseract comes with synthetically trained models for languages (tesseract-ocr-{eng,deu,deu_latf,...}) or scripts (tesseract-ocr-script-{latn,frak,...}). In addition, various models trained on scan data are available from the community.

Since all OCR-D processors must resolve file/data resources in a standardized way, and we want to stay interoperable with standalone Tesseract (which uses a single compile-time tessdata directory), ocrd-tesserocr-recognize expects the recognition models to be installed in its module resource location only. The module location is determined by the underlying Tesseract installation (compile-time tessdata directory, or run-time $TESSDATA_PREFIX environment variable). Other resource locations (data/system/cwd) will be ignored, and should not be used when installing models with the Resource Manager (ocrd resmgr download).

To see the module resource location of your installation:

ocrd-tesserocr-recognize -D

For a full description of available commands for resource management, see:

ocrd resmgr --help
ocrd resmgr list-available --help
ocrd resmgr download --help
ocrd resmgr list-installed --help

Note: In previous versions, the resource locations of standalone Tesseract and the OCR-D wrapper were different. If you already have models under $XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize (usually ~/.local/share/ocrd-resources/ocrd-tesserocr-recognize), then consider moving them to the new default reported by ocrd-tesserocr-recognize -D (usually /usr/share/tesseract-ocr/4.00/tessdata), or alternatively override the module directory by setting TESSDATA_PREFIX=$XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize in the environment.

Cf. OCR-D model guide.

Models always use the filename suffix .traineddata, but are just loaded by their basename. You will need at least eng and osd installed (even for segmentation and deskewing), probably also Latin and Fraktur etc. So to get minimal models, do:

ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata
ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata

(This will already be installed if using the Docker or git installation option.)
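
Since models are addressed by their name without the .traineddata suffix, a quick way to see which names are available is to scan the module resource directory. The following is a minimal sketch under that assumption; list_models is a hypothetical helper, not the function used by ocrd-tesserocr-recognize, and the directory you pass in would be the one printed by ocrd-tesserocr-recognize -D.

```python
import os

SUFFIX = ".traineddata"


def list_models(tessdata_dir):
    """Return model names (path relative to tessdata_dir, suffix stripped)."""
    models = []
    for dirpath, _dirnames, filenames in os.walk(tessdata_dir):
        for name in filenames:
            if name.endswith(SUFFIX):
                rel = os.path.relpath(os.path.join(dirpath, name), tessdata_dir)
                # Normalize to forward slashes so subdirectory models read like "script/Fraktur"
                models.append(rel[:-len(SUFFIX)].replace(os.sep, "/"))
    return sorted(models)
```

For example, a directory containing eng.traineddata, osd.traineddata and script/Fraktur.traineddata would yield the names eng, osd and script/Fraktur.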

As of v0.13.1, you can configure ocrd-tesserocr-recognize to select models dynamically segment by segment, either via custom conditions on the PAGE-XML annotation (presented as XPath rules), or by automatically choosing the model with highest confidence.
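
The idea of choosing the model with the highest confidence can be illustrated with a small sketch. It assumes that each candidate model has already produced a text and a confidence for the same segment; the function name pick_best_model and the result layout are hypothetical and do not reflect the processor's internal implementation.

```python
def pick_best_model(results):
    """Pick the winning model for one segment.

    results maps a model name to a (text, confidence) pair, where the
    confidence is a float between 0 and 1. Returns (model, text, confidence).
    """
    if not results:
        raise ValueError("no recognition results to choose from")
    best = max(results, key=lambda model: results[model][1])
    text, confidence = results[best]
    return best, text, confidence


# Hypothetical results for a single text line
candidates = {
    "Fraktur": ("Die Aufklärung", 0.91),
    "Latin": ("Die Auftlärung", 0.64),
}
print(pick_best_model(candidates))
```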

Usage

For details, see docstrings in the individual processors and ocrd-tool.json descriptions, or simply --help.

Available OCR-D processors are:

  • ocrd-tesserocr-crop (simplistic)
    • sets Border of pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
    • sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-binarize (Otsu – not recommended, unless already binarized and using tiseg)
    • adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation (optionally)
    • adds TextRegions to TableRegions and sets their @orientation (optionally)
    • adds TextLines to TextRegions (optionally)
    • adds Words to TextLines (optionally)
    • adds Glyphs to Words (optionally)
    • adds TextEquiv
  • ocrd-tesserocr-segment (all-in-one segmentation – recommended; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation
    • adds TextRegions to TableRegions and sets their @orientation
    • adds TextLines to TextRegions
    • adds Words to TextLines
    • adds Glyphs to Words
  • ocrd-tesserocr-segment-region (only regions – with overlapping bboxes; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
  • ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
    • adds TextRegions to TableRegions
  • ocrd-tesserocr-segment-line (only lines – from overlapping regions; delegates to recognize)
    • adds TextLines to TextRegions
  • ocrd-tesserocr-segment-word (only words; delegates to recognize)
    • adds Words to TextLines
  • ocrd-tesserocr-fontshape (only text style – via Tesseract 3 models)
    • adds TextStyle to Words

The text region @types detected are (from Tesseract's PolyBlockType):

  • paragraph: normal block (aligned with others in the column)
  • floating: unaligned block (is in a cross-column pull-out region)
  • heading: block that spans more than one column
  • caption: block for text that belongs to an image

If you are unhappy with these choices, then consider post-processing with a dedicated custom processor in Python, or by modifying the PAGE files directly (e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
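
As a rough Python counterpart to the xmlstarlet call, the following sketch retypes text regions with the standard library only. The function name retype_text_regions and the sample document are assumptions for illustration. Note that ElementTree rewrites namespace prefixes on output (for example pc: becomes ns0:), which is harmless for XML consumers but changes the file textually.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one paragraph and one floating region
SAMPLE_PAGE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15">
  <pc:Page imageFilename="example.png" imageWidth="100" imageHeight="50">
    <pc:TextRegion id="r1" type="paragraph"/>
    <pc:TextRegion id="r2" type="floating"/>
  </pc:Page>
</pc:PcGts>"""


def retype_text_regions(page_xml, old_type="floating", new_type="paragraph"):
    """Replace @type on all matching TextRegion elements; return (xml, count)."""
    root = ET.fromstring(page_xml)
    changed = 0
    for elem in root.iter():
        # Match on the local name so the sketch works for any PAGE namespace version
        if elem.tag.endswith("}TextRegion") and elem.get("type") == old_type:
            elem.set("type", new_type)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


new_xml, count = retype_text_regions(SAMPLE_PAGE)
print(count, "region(s) retyped")
```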

All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:

  • after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization, or ocrd-cis-ocropy-clip on the line level
  • after region segmentation: use ocrd-segment-repair with plausibilize (and sanitize after line segmentation)

It also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,

  • prefer ocrd-tesserocr-recognize with segmentation_level=region
    over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize,
    if you want to do all in one with Tesseract,
  • prefer ocrd-tesserocr-recognize with segmentation_level=line
    over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize,
    if you want to do everything but region segmentation with Tesseract,
  • prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region
    followed by (ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line,
    if you want to do everything but recognition with Tesseract.

However, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize with shrink_polygons=True to get polygons by post-processing each segment, shrinking to the convex hull of all its symbol outlines.
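
To illustrate what shrinking to a convex hull means geometrically, here is a self-contained sketch using Andrew's monotone chain algorithm on the corner points of some symbol boxes. This is not the code used by the processors; the function name and the sample coordinates are assumptions for illustration.

```python
def convex_hull(points):
    """Return the convex hull of 2D points in counter-clockwise order (y axis pointing up)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one
    return lower[:-1] + upper[:-1]


# Hypothetical outlines of two symbols in one word: a tall one and a smaller one
symbol_a = [(0, 0), (10, 0), (10, 10), (0, 10)]
symbol_b = [(12, 2), (20, 2), (20, 8), (12, 8)]
print(convex_hull(symbol_a + symbol_b))
```

The resulting polygon hugs the symbols more tightly than the bounding box (0,0)-(20,10) would, which is the effect described for shrink_polygons.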

Testing

make test

This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).

ocrd_tesserocr's People

Contributors

bertsky, cneud, joschrew, kba, m3ssman, mikegerber, noahmetzger, stweil, wrznr

ocrd_tesserocr's Issues

Use orientation attribute on page level

As soon as we move to PAGE XML 2019, deskewing on page level should not only create an AlternativeImage but also set the corresponding orientation attribute.

recognize: use primaryScript or TextStyle to load model

In the current state, the OCR model has to be selected in the fixed parameter JSON for the whole pipeline (all pages, all regions, all lines). We should at least offer a setting like dynamic that instead looks into ...

  • mods:language of the workspace's METS file
  • @primaryScript and @secondaryScript of the elements to be processed (or their parents), depending on textequiv_level
  • TextStyle/@fontFamily of the elements to be processed (or their parents), depending on textequiv_level – as described by the spec

...and combines this information somehow to select one of the predefined models. (Predefined could include custom built models, though. So maybe this must be more than a single new value in the parameter file.)
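
One possible shape of such a lookup is sketched here. Everything in it is an assumption for illustration: the mapping from script values to model names, the function name select_model, and the idea of passing the @primaryScript values ordered from the element itself up to its ancestors.

```python
# Hypothetical mapping from PAGE script annotations to installed model names
SCRIPT_TO_MODEL = {
    "Latn - Latin": "script/Latin",
    "Latf - Latin (Fraktur variant)": "script/Fraktur",
}


def select_model(scripts, default="eng"):
    """Return the model for the most specific known script annotation.

    scripts lists the @primaryScript values found on the element to be
    processed and then on its parents; entries may be None when unset.
    """
    for script in scripts:
        if script in SCRIPT_TO_MODEL:
            return SCRIPT_TO_MODEL[script]
    return default


# A line without annotation inside a region annotated as Fraktur
print(select_model([None, "Latf - Latin (Fraktur variant)"]))
```

A complete solution would also have to weigh mods:language and TextStyle/@fontFamily, which this sketch leaves out.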

Allow the use of multiple models (at once)

Tesseract allows for recognition with multiple models at once via the + operator:

$ tesseract 1930.cropped.jpg 1930 -l Fraktur+Latin

The same is not possible yet with ocrd_tesserocr. Using

{
  "model" : "Fraktur+Latin"
}

leads to

Exception: configured model Fraktur+Latin is not installed

Please note that tesserocr is able to interpret the + operator.
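
A check that accepts the + operator could validate each component separately rather than the combined string. This is a minimal sketch of that idea; missing_models is a hypothetical helper and the set of installed models is made up.

```python
def missing_models(model, installed):
    """Split a combined model spec on '+' and return the parts that are not installed."""
    return [sub_model for sub_model in model.split("+") if sub_model not in installed]


installed = {"Fraktur", "Latin", "eng", "osd"}
missing = missing_models("Fraktur+Latin", installed)
if missing:
    raise Exception("configured model " + missing[0] + " is not installed")
print("all sub-models available")
```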

common: rare OverflowError in crop_image

Sometimes, I get an OverflowError: signed integer is greater than maximum in common.image_from_segment | common.crop_image | PIL.Image.new. Maybe something went wrong with the background color estimation, or the coordinates are off scale. Investigating...

Comment "cropped" is used by deskewing (for non-cropped pages)

Running

ocrd-tesserocr-deskew -m mets.xml -I ORIGINAL -O DESKEW -p <(echo '{"operation_level": "page"}')

results in

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 1.0.0b10</pc:Creator>
        <pc:Created>2019-07-09T15:37:12.528892</pc:Created>
        <pc:LastChange>2019-07-09T15:37:12.528892</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/deskewing" value="ocrd-tesserocr-deskew">
            <pc:Labels>
                <pc:Label value="page" type="operation_level"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/GottDie_453779263_tif/jpegs/00000033.tif.original.jpg" imageWidth="1187" imageHeight="1687" readingDirection="left-to-right" textLineOrder="top-to-bottom">
        <pc:AlternativeImage filename="OCR-D-IMG-DESKEW/FILE_0033_OCR-D-IMG-DESKEW.png" comments="cropped"/>
    </pc:Page>
</pc:PcGts>

The comment should be deskewed, right?

Original files being copied?

Here is my minimal METS example:

<?xml version="1.0" encoding="UTF-8"?>
<mets:mets xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:vls="http://semantics.de/vls" xmlns:mets="http://www.loc.gov/METS/" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd">
  <mets:dmdSec ID="dmdSec_0001">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
          <mods:identifier type="purl">http://www.deutschestextarchiv.de/wundt_grundriss_1896</mods:identifier>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/jpeg" ID="OCR-D-IMG_0001">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="x/06.jpg" OTHERLOCTYPE="FILE"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="Monograph" DMDID="dmdSec_0001" ID="loc_0001">
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence" ID="physroot">
      <mets:div ID="phys_0001" TYPE="page" DMDID="DMGT_0001" ORDER="1">
        <mets:fptr FILEID="OCR-D-IMG_0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

Why is the file x/06.jpg being copied to OCR-D-IMG/OCR-D-IMG_0001.jpg after using ocrd-tesserocr-deskew -I OCR-D-IMG -O OCR-D-DESKEW?

To prevent naming conflicts afterwards?

"segment-line" produces "GeometryCollection" (which has no coords)

I receive the following error when running segment-line:

Traceback (most recent call last):
  File "/home/kmw/OCR-D/env/bin/ocrd-tesserocr-segment-line", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_line())
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
    return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd_tesserocr/segment_line.py", line 115, in process
    line_polygon = line_poly.exterior.coords
AttributeError: 'GeometryCollection' object has no attribute 'exterior'

on the image
FILE_0024_ORIGINAL

v0.2.1 in pypi

Please also upload version 0.2.1 to PyPI, and maybe change the URL from the kba to the ocr-d namespace.

Why does 'ocrd-tesserocr-deskew' need 'osd.traineddata' as a model?

Without this model the deskewing fails with the message:

Traceback (most recent call last):
  File "/usr/bin/ocrd-tesserocr-deskew", line 8, in <module>
    sys.exit(ocrd_tesserocr_deskew())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 40, in ocrd_tesserocr_deskew
    return ocrd_cli_wrap_processor(TesserocrDeskew, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/processor/base.py", line 56, in run_processor
    processor.process()
  File "/usr/lib/python3.6/site-packages/ocrd_tesserocr/deskew.py", line 75, in process
    psm=PSM.AUTO_OSD
  File "tesserocr.pyx", line 1189, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1202, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /tess/data/path

recognize: use PSM_RAW_LINE instead of PSM_SINGLE_LINE

Our OCR-D wrappers have the advantage of allowing subtasks to be isolated in a fine-grained manner. Tesseract's CLI on the other hand must always provide a good all-in-one compromise, even when called with different PSMs specifically. (E.g. it will always binarize when still necessary for layout analysis, and attempt baseline+xheight+ascender prediction even in PSM_SINGLE_LINE.)

Now, for some workflows it might be beneficial to suppress any additional Tesseract-internal segmentation on the provided line images – for instance when that image is cropped and masked from a line polygon already, or clipped already. Under these circumstances, we should rather use PSM_RAW_LINE.

But other workflows will just enter with the same rough bounding boxes that Tesseract's CLI would also create internally. Then PSM_SINGLE_LINE is a better choice.

So how do we encapsulate this without confusing users, while giving them the best possible results? Do we check the line segment's number of points (for polygon vs bbox workflow) and decide automatically, or expose this as a (thoroughly described) parameter?
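
The automatic variant could be sketched as follows. The helper names are hypothetical, and the returned strings merely stand in for the corresponding page segmentation modes (PSM.RAW_LINE and PSM.SINGLE_LINE in tesserocr); whether point counting is a reliable enough signal is exactly the open question.

```python
def is_bounding_box(points):
    """True if the polygon is an axis-aligned rectangle given by 4 corner points."""
    if len(points) != 4:
        return False
    xs = {x for x, _ in points}
    ys = {y for _, y in points}
    return len(xs) == 2 and len(ys) == 2


def choose_line_psm(points):
    """Pick a page segmentation mode name for a text line outline."""
    # Rough boxes still benefit from Tesseract's internal line finding;
    # precise polygons were presumably masked or clipped already.
    return "SINGLE_LINE" if is_bounding_box(points) else "RAW_LINE"


print(choose_line_psm([(0, 0), (100, 0), (100, 20), (0, 20)]))
print(choose_line_psm([(0, 2), (50, 0), (100, 3), (100, 22), (50, 20), (0, 21)]))
```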

@wrznr @kba @stweil

ocrd-tesserocr-segment-line does not find any lines

ocrd-tesserocr-segment-line does not give results for any of the files I tested. For example:

cd `mktemp -d`
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/8d8aa287-94ca-48e3-84a8-1ee602871550/data/lohenstein_agrippina_1665.ocrd.zip
dtrx lohenstein_agrippina_1665.ocrd.zip
cd lohenstein_agrippina_1665.ocrd/data
ocrd-tesserocr-segment-line -l DEBUG -m mets.xml -I OCR-D-IMG -O OCR-D-SEG-LINE
cat OCR-D-SEG-LINE/OCR-D-SEG-LINE_0001

yields:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 1.0.0b9</pc:Creator>
        <pc:Created>2019-06-20T16:08:42.841929</pc:Created>
        <pc:LastChange>2019-06-20T16:08:42.841929</pc:LastChange>
    </pc:Metadata>
    <pc:Page imageFilename="OCR-D-IMG/OCR-D-IMG_0001" imageWidth="1214" imageHeight="1916"/>
</pc:PcGts>
% pip list | grep tesserocr     
ocrd-tesserocr             0.2.2       
tesserocr                  2.4.0       

Respect alternative image (if present)

According to the OCR-D functional model, binarization can take place prior to block and line segmentation. Both processing steps should use the alternative image (if present).

Missing v0.5.0 tag

PyPI has 0.5.0, while this repo has 0.4.0 as the latest version. (Found this while investigating a problem with 0.5.0)

Related: The releases page in this repo says that 0.2.2 is the newest version while also offering to show newer tags – with 0.3.0 and 0.4.0 🙈 I have no idea what GitHub is doing here.

Index out of range in tessapi.AllWordConfidences()

The calculation of the word confidence fails if the returned list is empty (this happens with the Fraktur model for blumbach_anatomie_1805_0049.xml). I am not sure why the confidence list is empty and what the best way to fix this is.

But maybe it is sufficient to just set the word_conf to 0.0 if the returned list is empty.

Error trace:

Traceback (most recent call last):
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/ocrd-tesserocr-recognize", line 11, in <module>
    load_entry_point('ocrd-tesserocr', 'console_scripts', 'ocrd-tesserocr-recognize')()
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/ocrd_tesserocr/ocrd_tesserocr/cli.py", line 27, in ocrd_tesserocr_recognize
    return ocrd_cli_wrap_processor(TesserocrRecognize, *args, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/ocrd/decorators.py", line 28, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/lib/python3.7/site-packages/ocrd/processor/base.py", line 63, in run_processor
    processor.process()
  File "/run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 153, in process
    word_conf = tessapi.AllWordConfidences()[0]/100.0
IndexError: list index out of range
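
The fallback proposed in this issue (use 0.0 when no confidences are returned) could be written as a small guard. The helper name first_word_confidence is hypothetical; the input stands for the list returned by tessapi.AllWordConfidences(), with values on a 0 to 100 scale as implied by the division in the traceback.

```python
def first_word_confidence(confidences, default=0.0):
    """Return the first word confidence scaled to 0..1, or default if the list is empty."""
    if not confidences:
        return default
    return confidences[0] / 100.0


print(first_word_confidence([87, 45]))
print(first_word_confidence([]))
```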

Additional parameter for DPI override

Instead of retransferring OCR-D/core#376: There will have to be a general mechanism in core (sanctioned by spec) at some point in the future (3.0 milestone), but for now (final workshop) we just want a quick solution for Tesseract segmentation and recognition: override the user_defined_dpi not with the value from the meta-data density, but a parameter exposed to the user (as a last resort).
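
The precedence described here (an explicit user parameter wins over the image meta-data) can be sketched as a small resolver. The function name effective_dpi and the convention that values of 0 or below mean "not set" are assumptions for illustration, not the processor's actual parameter semantics.

```python
def effective_dpi(param_dpi, meta_dpi):
    """Resolve the pixel density to pass on as user_defined_dpi.

    param_dpi is the value given by the user (assumed: <= 0 or None means unset),
    meta_dpi is the density read from the image meta-data (may be None or 0).
    Returns None if neither source provides a usable value.
    """
    if param_dpi is not None and param_dpi > 0:
        return param_dpi
    if meta_dpi is not None and meta_dpi > 0:
        return meta_dpi
    return None


print(effective_dpi(300, 72))
print(effective_dpi(0, 72))
```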

RFC: Harmonize parameters for similar processors

Both ocrd-tesserocr-deskew and ocrd-cis-ocropy-deskew are used to deskew a page or all regions of a page. They also both have a parameter to decide which operation is desired:

  • ocrd-tesserocr-deskew: operation_level ["page", "region"]
  • ocrd-cis-ocropy-deskew: level-of-operation ["page", "region"]

Ideally the parameter names for identical functionality should be identical for all processors.
It would also be nice to harmonize the use of underscore or minus sign in parameter names.

pip3 installation does not work with pure Ubuntu 18.04 = Bionic

Trying to install ocrd_tesserocr with pip3 install ocrd_tesserocr on Ubuntu 18.04 (Bionic Beaver) leads to the following error message:

...
  Failed building wheel for tesserocr
  Running setup.py clean for tesserocr
  Running setup.py bdist_wheel for bagit ... done
  Stored in directory: /root/.cache/pip/wheels/8d/77/f7/8f91043ef3c99bbab558f578d19ce5938896e37e57609f9786
  Running setup.py bdist_wheel for wrapt ... done
  Stored in directory: /root/.cache/pip/wheels/d7/de/2e/efa132238792efb6459a96e85916ef8597fcb3d2ae51590dfd
Successfully built bagit wrapt
Failed to build tesserocr
Installing collected packages: lxml, numpy, Pillow, ocrd-utils, ocrd-models, bagit, bagit-profile, ocrd-modelfactory, ocrd-validators, wrapt, Deprecated, atomicwrites, Jinja2, itsdangerous, Werkzeug, Flask, opencv-python-headless, ocrd, tesserocr, ocrd-tesserocr
  Found existing installation: Jinja2 2.10
    Not uninstalling jinja2 at /usr/lib/python3/dist-packages, outside environment /usr
  Running setup.py install for tesserocr ... error
    Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-0u0i5oh5/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-c5ui7pzc-record/install-record.txt --single-version-externally-managed --compile:
    Supporting tesseract v4.0.0
    Building with configs: {'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
    /usr/lib/python3.6/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
      warnings.warn(msg)
    running install
    running build
    running build_ext
    building 'tesserocr' extension
    creating build
    creating build/temp.linux-x86_64-3.6
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
    tesserocr.cpp: In function ‘PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)’:
    tesserocr.cpp:12198:43: error: ‘class tesseract::ResultIterator’ has no member named ‘GetBestLSTMSymbolChoices’
       __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                               ^~~~~~~~~~~~~~~~~~~~~~~~
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
   
    ----------------------------------------
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-0u0i5oh5/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-c5ui7pzc-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-0u0i5oh5/tesserocr/

@kba emailed me that the Tesseract library in 18.04 is too old and that the Docker containers are based on Ubuntu 19.10.

root@Ubuntu_18.04:/home/jb# pkg-config --modversion  tesseract
4.0.0-beta.1

Improve dockerization

  • Ensure ocrd/core is up-to-date and auto-building
  • Test Circle CI with ocrd/core base image
  • Test Circle CI with ocrd/tesserocr base image
  • Test running make test-cli through docker (i.e. instead of ocrd-tesserocr-recognize: docker run ocrd/tesserocr ocrd-tesserocr-recognize)

pip install ocrd_tesserocr error

When I run pip install ocrd_tesserocr, I get the following error:

[screenshot not preserved]

But I had already installed libleptonica-dev

[screenshot not preserved]

and installed tesserocr-2.4.0-cp37-cp37m-win_amd64.whl

[screenshot not preserved]

The problem still cannot be resolved.

Ensure multilevel TextEquiv consistency after recognition

By the new consistency rules for TextEquiv on different levels, the recognition result should be propagated to the levels above textequiv_level afterwards (using the concatenation rules). This simplifies many tasks (e.g. line level evaluation instead of alignment) and makes using GT workspaces easier (which usually contain more annotation than advertised by their fileGrp USE).
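
The upward propagation can be pictured with plain dictionaries. This sketch assumes the usual concatenation convention (words joined by a space to form the line text, lines joined by a newline to form the region text); the data layout and the function name propagate_text are hypothetical and stand in for the PAGE object model.

```python
def propagate_text(region):
    """Fill line and region text from the lowest annotated level upwards."""
    for line in region["lines"]:
        words = line.get("words")
        if words:
            # Line text is rebuilt from its words
            line["text"] = " ".join(word["text"] for word in words)
    # Region text is rebuilt from its lines
    region["text"] = "\n".join(line["text"] for line in region["lines"])
    return region


# Hypothetical region recognized on the word level
region = {
    "lines": [
        {"words": [{"text": "Was"}, {"text": "ist"}]},
        {"words": [{"text": "Aufklärung?"}]},
    ]
}
print(propagate_text(region)["text"])
```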

sudo needed?

sudo is not available in many docker images. Default user within a docker container is root. The deps-ubuntu target is used only for building the docker container for dockerhub and circle ci. I see no need to have sudo in the makefile.

@bertsky ?

TESSDATA_PREFIX is interpreted incorrectly

I'm trying to get the https://github.com/bertsky/workflow-configuration/blob/master/crop-anyocr-binarize-page-olena-sauvola-denoise-ocropy-deskew-page-ocropy-segment-tesseract-ocropy-dewarp-ocr-ocropy-tesseract.mk running on https://github.com/OCR-D/assets/tree/master/data/kant_aufklaerung_1784/data

ocrd-make -f gt-binarize-page-olena-sauvola-denoise-ocropy-deskew-page-ocropy-clip-deskew-region-tesseract-resegment-dewarp-ocr-ocropy-tesseract-extract-lines.mk INPUT=OCR-D-IMG LOGLEVEL=DEBUG

If I set TESSDATA_PREFIX as it is supposed to be set, i.e. to the parent directory of tessdata, I get:

19:38:21.064 DEBUG processor.TesserocrRecognize - TESSDATA: /home/kba/ocrd_all/venv/share/, installed Tesseract models: ['tessdata/Fraktur_50000000.334_450937-best', 'tessdata/Fraktur_50000000.334_450937-fast', 'tessdata/best/Fraktur_50000000.334_450937', 'tessdata/eng', 'tessdata/equ', 'tessdata/fast/Fraktur_50000000.334_450937', 'tessdata/osd', 'tessdata/script/Fraktur']
Traceback (most recent call last):
  File "/home/kba/ocrd_all/venv/bin/ocrd-tesserocr-recognize", line 8, in <module>
    sys.exit(ocrd_tesserocr_recognize())
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 36, in ocrd_tesserocr_recognize
    return ocrd_cli_wrap_processor(TesserocrRecognize, *args, **kwargs)
  File "/home/kba/monorepo/core/ocrd/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/kba/monorepo/core/ocrd/ocrd/processor/base.py", line 60, in run_processor
    processor.process()
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/recognize.py", line 87, in process
    raise Exception("configured model " + sub_model + " is not installed")
Exception: configured model script/Fraktur is not installed

If I set TESSDATA_PREFIX=$TESSDATA_PREFIX/tessdata, it still cannot find the model:

19:41:33.510 DEBUG processor.TesserocrRecognize - TESSDATA: /home/kba/ocrd_all/venv/share/tessdata/, installed Tesseract models: ['Fraktur_50000000.334_450937-best', 'Fraktur_50000000.334_450937-fast', 'best/Fraktur_50000000.334_450937', 'eng', 'equ', 'fast/Fraktur_50000000.334_450937', 'osd', 'script/Fraktur']
Traceback (most recent call last):
  File "/home/kba/ocrd_all/venv/bin/ocrd-tesserocr-recognize", line 8, in <module>
    sys.exit(ocrd_tesserocr_recognize())
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 36, in ocrd_tesserocr_recognize
    return ocrd_cli_wrap_processor(TesserocrRecognize, *args, **kwargs)
  File "/home/kba/monorepo/core/ocrd/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/kba/monorepo/core/ocrd/ocrd/processor/base.py", line 60, in run_processor
    processor.process()
  File "/home/kba/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/recognize.py", line 87, in process
    raise Exception("configured model " + sub_model + " is not installed")
Exception: configured model script/Latin is not installed
Makefile:307: recipe for target 'OCR-D-OCR-TESS-Fraktur+Latin-BINPAGE-sauvola-DENOISE-ocropy-DESKEW-tesseract-DESKEW-ocropy-CLIP-DESKEW-tesseract-DESKEW-ocropy-RESEG-DEWARP' failed

Does anybody have an idea what I am doing wrong?

also fill PAGE's glyphs and its variants and confidences via GetIterator() in recognize.py

For OCR postcorrection, TextLine.Word.Glyph.TextEquiv can be more valuable than just TextLine.TextEquiv. It makes it possible to build up a lattice (or rather, a confusion network) of alternative character hypotheses from which to (re)build words and phrases. The PAGE notion of character hypotheses is glyph variants, i.e. a sequence of TextEquiv with index and conf (confidence) attributes. This does not help in addressing segmentation ambiguity (especially on the word level, since PAGE enforces a hierarchy of Word). But most ambiguity on the character level can still be captured.

Example:

<TextLine id="...">
  <Coords points="..."/>
  <Word id="...">
    <Coords points="..."/>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv>
        <Unicode>a</Unicode>
      </TextEquiv>
    </Glyph>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv index="0" conf="0.6">
        <Unicode>m</Unicode>
      </TextEquiv>
      <TextEquiv index="1" conf="0.3">
        <Unicode>rn</Unicode>
      </TextEquiv>
      <TextEquiv index="2" conf="0.1">
        <Unicode>in</Unicode>
      </TextEquiv>
    </Glyph>
  </Word>
  <Word id="...">
    ...
  </Word>
  <TextEquiv>
    <Unicode>am Ende</Unicode>
  </TextEquiv>
</TextLine>

So this part of the wrapper should also dive into the word and character/glyph substructure as a complementary level of annotation. Tesseract's API seems to be straightforward for this use case: baseapi.h contains GetIterator(), which returns a ResultIterator that can be used to iterate at the RIL_SYMBOL level of PageIteratorLevel. For each glyph, GetUTF8Text() and Confidence() then yield what we need.
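As a rough illustration of the target structure, the following sketch builds one Glyph element with ranked TextEquiv variants from a list of (text, confidence) pairs. The helper name glyph_element and the input list are hypothetical; in the wrapper, the pairs would come from the symbol-level iterator. Namespaces and Coords are left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

def glyph_element(glyph_id, hypotheses):
    """Build a PAGE-like Glyph element from (text, confidence) hypotheses.

    Hypothetical helper: namespaces and Coords are omitted for brevity.
    """
    glyph = ET.Element("Glyph", id=glyph_id)
    # Rank best hypothesis first, so that index 0 carries the highest confidence.
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    for index, (text, conf) in enumerate(ranked):
        equiv = ET.SubElement(
            glyph, "TextEquiv", index=str(index), conf="{:.2f}".format(conf)
        )
        ET.SubElement(equiv, "Unicode").text = text
    return glyph

# Stand-in for what a symbol-level iterator might deliver for one glyph.
glyph = glyph_element("g1", [("rn", 0.3), ("m", 0.6), ("in", 0.1)])
print(ET.tostring(glyph, encoding="unicode"))
```

Sorting by confidence keeps the variant order independent of the order in which the hypotheses arrive.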

bug: segment-region produces empty OrderedGroup

 <pc:ReadingOrder>
            <pc:OrderedGroup id="reading-order"/>
  </pc:ReadingOrder>

The schema requires at least one of the following elements inside the group in the ReadingOrder container.
See the schema snippet:

<choice minOccurs="1" maxOccurs="unbounded">
    <element name="RegionRefIndexed" type="pc:RegionRefIndexedType"/>
    <element name="OrderedGroupIndexed" type="pc:OrderedGroupIndexedType"/>
    <element name="UnorderedGroupIndexed" type="pc:UnorderedGroupIndexedType"/>
</choice>
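One possible workaround, sketched here under the assumption that an absent ReadingOrder is acceptable when no regions were found, is to drop the container whenever its groups are empty. The function names are hypothetical, and the sample document omits the PAGE namespace for brevity (the tag comparison ignores namespaces).

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def drop_empty_reading_order(root):
    """Remove ReadingOrder elements whose groups contain no references."""
    removed = 0
    # Materialize the iteration first, because the tree is modified below.
    for parent in list(root.iter()):
        for child in list(parent):
            is_reading_order = local_name(child.tag) == "ReadingOrder"
            if is_reading_order and all(len(group) == 0 for group in child):
                parent.remove(child)
                removed += 1
    return removed

page = ET.fromstring(
    '<Page><ReadingOrder><OrderedGroup id="reading-order"/></ReadingOrder></Page>'
)
removed = drop_empty_reading_order(page)
print(removed, ET.tostring(page, encoding="unicode"))
```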

new locale assertions in Tesseract are incompatible with Click

Ever since Tesseract 4 had to introduce an assertion that localization be plain POSIX (C) to ensure certain legacy assumptions in its code are always met, we have to override the current locale before initializing tesserocr API, too. This cannot be done by the user before calling any ocrd_tesserocr CLI, because we depend on the Click library, which itself is incompatible (in Python 3) with that locale (it requires at least C.UTF-8). So we have a deadlock.

We could perhaps reset the locale after click and before tesserocr though.
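A minimal sketch of that idea, using only Python's standard locale module: a context manager switches to the plain C locale for the duration of a block and restores the previous setting afterwards. The name posix_locale is hypothetical, and where exactly the tesserocr initialization would go inside the block is an assumption.

```python
import locale
from contextlib import contextmanager

@contextmanager
def posix_locale():
    """Temporarily switch all locale categories to plain "C"."""
    saved = locale.setlocale(locale.LC_ALL)  # query only, does not change anything
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        # Restore whatever was active before (e.g. what Click needed).
        locale.setlocale(locale.LC_ALL, saved)

before = locale.setlocale(locale.LC_ALL)
with posix_locale():
    # Assumption: the tesserocr API would be initialized here.
    inside = locale.setlocale(locale.LC_ALL)
after = locale.setlocale(locale.LC_ALL)
```

Note that setlocale affects the whole process, so this approach is not safe when other threads depend on the locale at the same time.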

image_from_page / image_from_segment: Need for workspace?

page_image = workspace.resolve_image_as_pil(page.imageFilename)

Can we change the signatures of these methods to avoid relying on the workspace?

AFAICS the workspace is only required to access the resolver for accessing images as PIL Image. Does the convenience of not having to worry about retrieving remote URL and caching images outweigh the benefits of having all these utility methods in ocrd_utils?

Not sure about the consequences but before I investigate further, do you think it would be worth it to have these functions in ocrd_utils rather than as methods of Workspace?

TypeError: object of type 'list_reverseiterator' has no len()

With this workspace I get the following error:

% ocrd-tesserocr-recognize -I OCR-D-GT-PAGE-BINPAGE-sauvola -O OCR-D-OCR-TESS-frk+deu-OCR-D-GT-PAGE-BINPAGE-sauvola -p '{ "textequiv_level" : "glyph", "overwrite_words": true, "model" : "frk+deu" }'
16:14:59.323 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-GT-PAGE-BINPAGE-sauvola'] output_file_grp=['OCR-D-OCR-TESS-frk+deu-OCR-D-GT-PAGE-BINPAGE-sauvola']
16:14:59.940 INFO processor.TesserocrRecognize - Using model 'frk+deu' in /usr/share/tesseract//tessdata/ for recognition at the glyph level
16:14:59.940 INFO processor.TesserocrRecognize - INPUT FILE 0 / 00000055
16:14:59.982 INFO processor.TesserocrRecognize - Page '00000055' images will use 300 DPI from image meta-data
16:14:59.982 INFO processor.TesserocrRecognize - Processing page '00000055'
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/bin/ocrd-tesserocr-recognize", line 8, in <module>
    sys.exit(ocrd_tesserocr_recognize())
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 36, in ocrd_tesserocr_recognize
    return ocrd_cli_wrap_processor(TesserocrRecognize, *args, **kwargs)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/ocrd/decorators.py", line 60, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/ocrd_tesserocr/recognize.py", line 189, in process
    page_update_higher_textequiv_levels(maxlevel, pcgts)
  File "/home/mike/.virtualenvs/bug-list_reverseiterator-has-no-len/lib/python3.7/site-packages/ocrd_tesserocr/recognize.py", line 525, in page_update_higher_textequiv_levels
    word_conf /= len(glyphs)
TypeError: object of type 'list_reverseiterator' has no len()

Using ocrd-tesserocr 0.8.0:

% pip list | grep tess     
ocrd-tesserocr         0.8.0     
tesserocr              2.5.0  
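The traceback suggests that glyphs is the result of reversed(), which returns an iterator without a length. A minimal sketch of a fix, assuming the averaging looks roughly like this (mean_confidence is a hypothetical helper, not the code in recognize.py), is to materialize the iterator into a list before calling len():

```python
def mean_confidence(confidences):
    """Average confidences from any iterable, including reversed() iterators."""
    values = list(confidences)  # iterators have no len(), lists do
    if not values:
        return 0.0
    return sum(values) / len(values)

# reversed() on a list yields a list_reverseiterator, as in the traceback.
glyph_confs = reversed([0.5, 1.0])
word_conf = mean_confidence(glyph_confs)
```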

PAGE xml Word tokenization

Hello Community,

I've encountered unexpected behavior using ocrd-tesserocr-recognize at the word level (parameter textequiv_level): it seems to have trouble finding the correct word boundaries and therefore garbles the data at the word level.

ulb-dd-zeitungsdigitalisierung-02.zip

Actual

    <pc:TextLine id="region0035_line0026">
        <pc:Coords points="405,2412 2485,2412 2485,2508 405,2508"/>
        <pc:Word id="region0035_line0026_word0000">
            <pc:Coords points="405,2412 523,2412 523,2508 405,2508"/>
            <pc:TextEquiv conf="0.59">
                <pc:Unicode>ſind</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:Word id="region0035_line0026_word0001">
            <pc:Coords points="538,2412 693,2412 693,2508 538,2508"/>
            <pc:TextEquiv conf="0.96">
                <pc:Unicode>mehr</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:Word id="region0035_line0026_word0002">
            <pc:Coords points="727,2412 915,2412 915,2508 727,2508"/>
            <pc:TextEquiv conf="0.95">
                <pc:Unicode>pikant</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:Word id="region0035_line0026_word0003">
            <pc:Coords points="933,2412 1445,2412 1445,2508 933,2508"/>
            <pc:TextEquiv conf="0.95">
                <pc:Unicode>als regelmäßig,</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:Word id="region0035_line0026_word0004">
            <pc:Coords points="1452,2412 1951,2412 1951,2508 1452,2508"/>
            <pc:TextEquiv conf="0.94">
                <pc:Unicode>der Teint ihres</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:Word id="region0035_line0026_word0005">
            <pc:Coords points="1972,2412 2485,2412 2485,2508 1972,2508"/>
            <pc:TextEquiv conf="0.08">
                <pc:Unicode>„Gefichtes blei,</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:TextEquiv>
            <pc:Unicode>ſind mehr pikant als regelmäßig, der Teint ihres „Gefichtes blei,</pc:Unicode>
        </pc:TextEquiv>
    </pc:TextLine>

Expected

 <pc:Word id="region0035_line0026_word0003">
            <pc:Coords points="933,2412 1445,2412 1445,2508 933,2508"/>
            <pc:TextEquiv conf="0.95">
                <pc:Unicode>als</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
        <pc:Word id="region0035_line0026_word0004">
            <pc:Coords points="1452,2412 1951,2412 1951,2508 1452,2508"/>
            <pc:TextEquiv conf="0.94">
                <pc:Unicode> regelmäßig,</pc:Unicode>
            </pc:TextEquiv>
        </pc:Word>
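To find affected words in a result file, a small diagnostic along the following lines might help. It is only a sketch: the function name is hypothetical, and the sample omits the pc: namespace that real PAGE files use.

```python
import xml.etree.ElementTree as ET

def words_with_spaces(line):
    """Return (id, text) for every Word whose text holds more than one token."""
    hits = []
    for word in line.iter("Word"):
        text = word.findtext("TextEquiv/Unicode", default="")
        if len(text.split()) > 1:
            hits.append((word.get("id"), text))
    return hits

# Shortened, namespace-free sample modelled on the output shown above.
sample = ET.fromstring(
    "<TextLine>"
    '<Word id="w2"><TextEquiv><Unicode>pikant</Unicode></TextEquiv></Word>'
    '<Word id="w3"><TextEquiv><Unicode>als regelmäßig,</Unicode></TextEquiv></Word>'
    "</TextLine>"
)
hits = words_with_spaces(sample)
```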

Platform

Ubuntu 18.04 LTS
tesseract 4.1.1-rc2-17-g6343
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1

segment-line: annotate polygon or clipped image

Currently all we get is bounding boxes, which for historic print often overlap heavily.

Tesseract internally of course "knows" (already decided) which component belongs to which line, but how do we get that information via API? There are 2 general paths:

  1. polygon coordinates via baseline; either via existing/old API or via new API we have to get into Tesseract, cf. tesseract-ocr/tesseract#2971 (comment)
  2. retrieving a clipped line image for each line individually, perhaps via GetTextlines or GetComponentImages.

@wrznr what do you think?

File names

There are three files {0006,0007,0008}.xml that all belong to the same filegroup gt. If I run ocrd-tesserocr-recognize on the filegroup gt with output filegroup tess, recognize looks up the files of that filegroup in the mets.xml file. If for some reason (files were not added to the workspace in numerical order?) the files are not returned in numerical order, for example 0007, 0008, 0006, then recognize generates the files tess-0001.xml (0007.xml), tess-0002.xml (0008.xml) and tess-0003.xml (0006.xml).

This destroys the mapping between gt and ocr pages.

A simple solution would be to use:

self.workspace.add_file(
  ID=ID,
  file_grp=self.output_file_grp,
  basename=self.output_file_grp + '-' + os.path.basename(input_file.url),
  mimetype=MIMETYPE_PAGE,
  content=to_xml(pcgts),
)

to add the new files to the workspace.

pip install ocrd_tesserocr fails with tesseract version 4.0.0-beta-26-gfd49

I use pip install ocrd_tesserocr to install ocrd_tesserocr into my virtualenv environment. The installation fails with:

...
  Running setup.py bdist_wheel for tesserocr ... error
  Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-q7mozwr8 --python-tag cp37:
  Supporting tesseract v4.0.0
  Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
  running bdist_wheel
  running build
  running build_ext
  building 'tesserocr' extension
  creating build
  creating build/temp.linux-x86_64-3.7
  gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
  tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
  tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
     __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                             ^~~~~~~~~~~~~~~~~~~~~~~~
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for tesserocr
  Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr, ocrd-tesserocr
  Running setup.py install for tesserocr ... error
    Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-fc87h61b/install-record.txt --single-version-externally-managed --compile --install-headers /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/include/site/python3.7/tesserocr:
    Supporting tesseract v4.0.0
    Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
    running install
    running build
    running build_ext
    building 'tesserocr' extension
    creating build
    creating build/temp.linux-x86_64-3.7
    gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
    tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
    tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
       __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                               ^~~~~~~~~~~~~~~~~~~~~~~~
    error: command 'gcc' failed with exit status 1
...

tesseract is installed on the system:

tesseract 4.0.0-beta.4-26-gfd49
 leptonica-1.77.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.1) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1
 Found AVX
 Found SSE

Setting locale fails on macOS

See the log:

$ ocrd-tesserocr-recognize 
Traceback (most recent call last):
  File "/Users/stweil/ocr-d/20190824/bin/ocrd-tesserocr-recognize", line 6, in <module>
    from ocrd_tesserocr.cli import ocrd_tesserocr_recognize
  File "/Users/stweil/ocr-d/20190824/lib/python3.7/site-packages/ocrd_tesserocr/__init__.py", line 7, in <module>
    locale.setlocale(locale.LC_ALL, 'C.UTF-8')
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/locale.py", line 604, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

Setting the locale is no longer necessary for Tesseract 4.1.0, which is also the recommended stable version.

locale setting for tesserocr does not work on ubuntu 19.04

Here:

locale.setlocale(locale.LC_ALL, 'C.UTF-8')

the locale should be set to C.UTF-8, but this doesn't work because the locale C.UTF-8 gets aliased to en_US.UTF-8 (at least on my Ubuntu 19.04 system, see below).

Afterwards the import here:

from tesserocr import (

fails with the good old:

!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 209

The alias comes from the file:
/usr/share/X11/locale/locale.alias
In line 63:
C.UTF-8 en_US.UTF-8

This is the default as can be seen here:

https://github.com/mirror/libX11/blob/87c77a1e6d7034536e9d25ce24a667ebf53486a7/nls/locale.alias.pre#L18

Superfluous newlines

At the moment, superfluous newlines are appended to the TextEquiv/Unicode entries:

                    <pc:TextEquiv>
                        <pc:Unicode>Groſzmaͤchtigſter</pc:Unicode>
                    </pc:TextEquiv>
                    <pc:TextEquiv>
                        <pc:Unicode>stzmächtigstcr
</pc:Unicode>
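A minimal sketch of a fix, assuming the recognized text reaches the wrapper as a plain string with trailing line breaks (clean_ocr_text is a hypothetical helper name):

```python
def clean_ocr_text(text):
    """Drop trailing line breaks, keep everything else untouched."""
    return text.rstrip("\n")

cleaned = clean_ocr_text("stzmächtigstcr\n")
```

Only trailing newlines are removed, so meaningful spaces inside or at the end of the text survive.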

recognize: expose character white/blacklisting parameters

We should pass the following additional parameters/variables from ocrd-tool.json to Tesseract's API:

Parameter                   Description
tessedit_char_whitelist     Whitelist of chars to recognize
tessedit_char_blacklist     Blacklist of chars not to recognize
tessedit_char_unblacklist   List of chars to override tessedit_char_blacklist

(But maybe omit the tessedit_ prefix.)
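The mapping could look roughly like the sketch below. The unprefixed parameter names (char_whitelist etc.) and the helper tesseract_variables are assumptions for illustration; the resulting name/value pairs could then be handed to tesserocr's SetVariable one by one.

```python
TESSERACT_PREFIX = "tessedit_"
# Hypothetical ocrd-tool.json parameter names, without the tessedit_ prefix.
CHAR_LIST_PARAMS = ("char_whitelist", "char_blacklist", "char_unblacklist")

def tesseract_variables(parameter):
    """Translate processor parameters into Tesseract variable names."""
    variables = {}
    for name in CHAR_LIST_PARAMS:
        value = parameter.get(name, "")
        if value:  # skip unset or empty lists, keeping Tesseract's defaults
            variables[TESSERACT_PREFIX + name] = value
    return variables

variables = tesseract_variables({"char_whitelist": "0123456789", "model": "eng"})
```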

Memory leaks

The memory usage of ocrd-tesserocr-segment-region increases for each page, resulting in a total of about 7 GB for 200 pages, 8 GB for 248 pages, 10 GB for 282 pages, 11 GB for 313 pages (observed for http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-22977).

ocrd-tesserocr-segment-line shows a similar effect.

For that book, a machine with 8 GB RAM would have started swapping, thus slowing down the process extremely. Even a large server would get memory problems when processing large books with more than 1000 pages in parallel.

segment-line: Self-intersection at or near point ...

ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola-ms-split\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW" \
  "anybaseocr-crop -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-CROP" \
  "cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{\"level-of-operation\":\"page\"}'" \
  "tesserocr-segment-line -I OCR-D-SEG-REG -O OCR-D-SEG-LINE" \
  "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -p '{\"level-of-operation\":\"line\"}'" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-CLIP-DEWARP" \
  "tesserocr-recognize -I OCR-D-SEG-LINE-CLIP-DEWARP -O OCR-D-OCR -p '{\"textequiv_level\":\"glyph\",\"overwrite_words\":true,\"model\":\"GT4HistOCR_50000000.75_322098+GT4HistOCR_50000000.78_258336+GT4HistOCR_5000000-20.95_147211\"}'"

Original image: https://digi.ub.uni-heidelberg.de/diglitData/jb/02_-_arndt1710_-_000_096.tif (58 MB)

16:14:18.969 INFO processor.TesserocrSegmentLine - INPUT FILE 1 / P_0002
16:14:19.028 ERROR ocrd.workspace - page "P_0002" image (binarized,despeckled,deskewed,cropped; 3031x6660) has not been reshaped properly (3089x6687) during rotation
16:14:19.029 INFO processor.TesserocrSegmentLine - Page 'P_0002' images will use 1200 DPI from image meta-data
16:14:22.742 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 2023 443 at 2023 443
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd-tesserocr-segment-line", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_line())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
    return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/segment_line.py", line 114, in process
    line_poly = line_poly.intersection(region_poly).convex_hull
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/shapely/geometry/base.py", line 649, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f271e147390>
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/cli/process.py", line 26, in process_cli
    run_tasks(mets, log_level, page_id, tasks)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/task_sequence.py", line 131, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-tesserocr-segment-line exited with non-zero return value 1
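Since the failing line takes the convex hull after the intersection anyway, one conceivable workaround is to make each input outline convex (and thereby free of self-intersections) before intersecting. The following is a plain-Python sketch of Andrew's monotone chain algorithm to illustrate that step; it is not what segment_line.py does, and whether the region outline may be treated this way is an assumption.

```python
def convex_hull(points):
    """Convex hull via Andrew's monotone chain, in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive if o -> a -> b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

# A "bow-tie" outline: its edges cross, so it is invalid as a polygon.
bow_tie = [(0, 0), (2, 2), (2, 0), (0, 2)]
hull = convex_hull(bow_tie)
```

The hull of a self-intersecting outline is a valid polygon by construction, at the cost of possibly covering more area than the original shape.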

Running in a docker volume doesn't work

wget 'https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/736a2f9a-92c6-4fe3-a457-edfa3eab1fe3/data/wundt_grundriss_1896.ocrd.zip'
unzip wundt_grundriss_1896.ocrd.zip
cd data
docker run -u $(id -u) -w /data -v $PWD:/data -- ocrd/tesserocr:edge ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER

This will run ocrd-tesserocr-binarize but will only change the serialization of the mets.xml and add the agent but not do the actual work. What am I doing wrong?

@mikegerber @bertsky @wrznr Input appreciated, thanks!
