Implementation of full document OCR based on Olena & Tesseract

This addon needs the Olena command line utility to analyze digital documents available in common picture formats (e.g. png, tif, gif, jpeg).

Olena's development and support for document analysis, as well as its integration in Nuxeo through this addon, were funded as part of the Scribo (http://scribo.ws) R&D project.

Olena 2.0 and Tesseract 3 are not yet packaged by default in most Linux distributions, hence some manual build steps are required.

http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/Olena200

Notes

The quality of the extraction is good only for high resolution pictures. For instance, photos of a newspaper taken with a mobile phone will likely yield unusable output.
Supporting PDF files would require an additional step to use Apache PDFBox to extract the sizeable pictures from the PDF file to pass to Olena / Tesseract. This is not implemented in the current version.

Building the olena command line tool used by the Nuxeo Addon

Here are some instructions to build it under Ubuntu and Debian Linux.

1- Build Tesseract 3 by following the official instructions

2- Install the following packaged dependencies:

$ sudo apt-get install \
  build-essential \
  graphicsmagick-libmagick-dev-compat \
  libmagics++-dev \
  xsltproc \
  fop \
  hevea \
  latex2html \
  autoconf

3- Build Olena itself:

$ wget http://www.lrde.epita.fr/dload/olena/2.0/olena-2.0.tar.bz2
$ tar jxvf olena-*.tar.bz2
$ cd olena-2.0/
$ mkdir _build
$ cd _build
$ ../configure && make
$ cd scribo/src
$ make

You should then have a program content_in_doc; you can test it with:

$ ./content_in_doc /path/to/a/picture.png /path/to/result.xml
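To try the tool on more than one picture, the test command above can be wrapped in a small loop. The following is only a minimal sketch and not part of the addon: the function name analyze_pictures and its arguments are hypothetical, it only looks at .png files, and it assumes the analyzer takes the input picture and the output XML path as its two arguments, as in the command above.

```shell
# Hypothetical helper (not shipped with Olena or the addon).
# Runs an analyzer command over every .png file in a directory and
# writes one XML result next to each picture.
#   $1: directory to scan
#   $2: analyzer command (defaults to ./content_in_doc)
analyze_pictures() {
  dir=$1
  analyzer=${2:-./content_in_doc}
  for picture in "$dir"/*.png; do
    [ -e "$picture" ] || continue        # no .png file matched the pattern
    result="${picture%.png}.xml"         # same name, .xml extension
    if "$analyzer" "$picture" "$result"; then
      echo "ok: $picture -> $result"
    else
      echo "failed: $picture" >&2
    fi
  done
}
```

For example, `analyze_pictures /path/to/pictures` would run ./content_in_doc on each .png file found in that folder.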

Install the content_in_doc program somewhere in your system path so that Nuxeo can pick it up to analyze image documents and extract text annotations.
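A quick way to verify the installation is to ask the shell whether it can resolve the program name. This is a minimal sketch: the helper name find_on_path is hypothetical, and /usr/local/bin is only mentioned as an example of a directory that is commonly on the path. Note that the check applies to the shell you run it from, so the user running the Nuxeo server may see a different path.

```shell
# Hypothetical helper: print where a program resolves on the current PATH.
# Returns 0 when the program is found, a non-zero status otherwise.
find_on_path() {
  command -v "$1" 2>/dev/null
}

if find_on_path content_in_doc >/dev/null; then
  echo "content_in_doc found at: $(find_on_path content_in_doc)"
else
  # /usr/local/bin is an assumption; any directory listed in PATH works.
  echo "content_in_doc is not on the PATH (try copying it to /usr/local/bin)" >&2
fi
```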

Building the Nuxeo Addon itself

Using Maven 2.2.1 or later, from the root of the nuxeo-platform-ocr folder:

$ mvn install

Then copy the jar target/nuxeo-platform-ocr-*-SNAPSHOT.jar into the nxserver/bundles folder of your Nuxeo DM or DAM instance (assuming the default Tomcat package).
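The copy step can be scripted so that it fails early when the jar has not been built or the server folder is wrong. This is a sketch under assumptions: the function name deploy_addon and both of its arguments are hypothetical, and the server root must be whatever folder contains nxserver/ in your installation.

```shell
# Hypothetical helper: copy the built addon jar into a Nuxeo server.
#   $1: addon build folder (the one containing target/)
#   $2: Nuxeo server root (the folder containing nxserver/)
deploy_addon() {
  build_dir=$1
  nuxeo_home=$2
  bundles="$nuxeo_home/nxserver/bundles"
  if [ ! -d "$bundles" ]; then
    echo "no such folder: $bundles" >&2
    return 1
  fi
  for jar in "$build_dir"/target/nuxeo-platform-ocr-*-SNAPSHOT.jar; do
    # The pattern stays unexpanded when no jar exists yet.
    [ -e "$jar" ] || { echo "no jar found; run 'mvn install' first" >&2; return 1; }
    cp "$jar" "$bundles/" && echo "deployed: $(basename "$jar")"
  done
}
```

For example, `deploy_addon . /path/to/nuxeo` run from the nuxeo-platform-ocr folder after the build. The server has to be restarted to load the new bundle.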

Using the addon

To test the addon, find a high resolution picture of a digitized newspaper or other text document such as:

http://www.google.com/images?as_q=magazine+article&biw=1280

In Nuxeo DM or DAM, import the picture as a new File or Picture document and wait approximately 5 seconds (the OCR runs asynchronously in the background). Then go to the preview tab and have a look at the annotated text areas.

About Nuxeo

Nuxeo provides a modular, extensible Java-based open source software platform for enterprise content management and packaged applications for document management, digital asset management and case management. Designed by developers for developers, the Nuxeo platform offers a modern architecture, a powerful plug-in model and extensive packaging capabilities for building content applications.

More information on: http://www.nuxeo.com/
