alf-tengine-ocr's Introduction

Alfresco Transformer from PDF to OCRd PDF

This project includes a simple Alfresco Transformer from PDF to OCRd PDF, for use with ACS Community 7.0+.

The OCR transformation is performed by ocrmypdf, a wrapper around Tesseract that adds features to improve the accuracy of the process.

The Transformer ats-transformer-ocr uses the new Alfresco Local Transform API, which allows a Spring Boot application to be registered as a local transformation service.

The folder embed-metadata-action includes an Alfresco Repository Addon that enables the embed-metadata action in the Folder Rules feature.

ACS Community 7.4 or later requires modifying the default configuration for HTTP request timeouts. Increase the default values (5000 ms / 5 s) to a larger value, as in the following sample, which uses 500000 ms / 500 s:

httpclient.config.transform.socketTimeout=500000
httpclient.config.transform.connectionRequestTimeout=500000
httpclient.config.transform.connectionTimeout=500000
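
As a quick sanity check for the settings above, the following sketch reports which of the three timeout properties are missing or still at (or below) the 5000 ms default. It assumes the properties live in a plain properties file such as alfresco-global.properties; the function name, the file location and the threshold are illustrative and not part of this project.

```shell
#!/usr/bin/env bash
# Sketch: report transform HTTP timeouts that are missing or too low.
# Assumption: the properties are stored one per line in a properties file.

check_transform_timeouts() {
    local props_file="$1"      # hypothetical path, e.g. alfresco-global.properties
    local min_ms="${2:-5001}"  # values at or below the 5000 ms default are flagged
    local status=0 key value

    for key in socketTimeout connectionRequestTimeout connectionTimeout; do
        # If a key is repeated in the file, this sketch looks at the last entry.
        value="$(sed -n "s/^httpclient\.config\.transform\.${key}=//p" "$props_file" \
                 | tail -n 1 | tr -d '[:space:]')"
        case "$value" in
            '')
                echo "MISSING httpclient.config.transform.${key}"
                status=1
                ;;
            *[!0-9]*)
                echo "INVALID httpclient.config.transform.${key}=${value}"
                status=1
                ;;
            *)
                if [ "$value" -lt "$min_ms" ]; then
                    echo "LOW httpclient.config.transform.${key}=${value}"
                    status=1
                else
                    echo "OK httpclient.config.transform.${key}=${value}"
                fi
                ;;
        esac
    done
    return "$status"
}

# Usage (not executed here):
#   check_transform_timeouts /path/to/alfresco-global.properties
```

The function returns a non-zero status when any property needs attention, so it can be used as a guard in a deployment script.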

Local testing

Build Docker Image for Alfresco OCR Transformer

Building the Alfresco OCR Transformer Docker Image is required before running the provided Docker Compose template.

$ cd ats-transformer-ocr

$ mvn clean package

Maven will create a Docker Image named alfresco/tengine-ocr:latest

Starting

$ docker run -p 8090:8090 alfresco/tengine-ocr:latest

Testing

A sample web page is provided to test that the transformer is working:

http://localhost:8090
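
The container may need a few seconds before the test page responds. The sketch below retries a probe command until it succeeds; `wait_until` and `probe_ocr_transformer` are hypothetical helper names, and the attempt count and delay in the usage line are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: retry a probe command until it succeeds or the attempts run out.

wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Probe the transformer test page; -f makes curl fail on HTTP error statuses.
probe_ocr_transformer() {
    curl -fsS -o /dev/null "${1:-http://localhost:8090}"
}

# Usage (not executed here):
#   wait_until 30 2 probe_ocr_transformer && echo "transformer is up"
```

Keeping the retry loop separate from the probe makes it possible to reuse `wait_until` with any other command.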

Deployment with ACS Stack

Obtaining Repository Addon to enable Embed Metadata Action

Before deploying Alfresco OCR Transformer, the embed-metadata-action Repository Addon must be built.

$ cd embed-metadata-action

$ mvn clean package

$ ls target/embed-metadata-action-1.0.0.jar
target/embed-metadata-action-1.0.0.jar

Alternatively, embed-metadata-action-1.0.0.jar can be downloaded from Releases.

Deploying Repository Addon to enable Embed Metadata Action

Use any of the available alternatives to deploy embed-metadata-action-1.0.0.jar to the alfresco service, such as adding the JAR to the alfresco/modules/jar folder when using the Alfresco Docker Installer tool.

Adding Alfresco OCR Transformer to Docker Compose (Local Transformer - HTTP) - Community Edition

Check that the following configuration is applied to the docker-compose.yml file.

services:
    alfresco:
        environment:
            JAVA_OPTS : "
                -DlocalTransform.core-aio.url=http://transform-core-aio:8090/
                -DlocalTransform.ocr.url=http://transform-ocr:8090/
            "

    transform-core-aio:
        image: alfresco/alfresco-transform-core-aio:2.3.10
        mem_limit: 1536m
        environment:
            JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"

    transform-ocr:
        image: alfresco/tengine-ocr:latest
        mem_limit: 1536m
        environment:
            JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"

  • Include the localTransform URL for the OCR Transformer in the alfresco Docker container (http://transform-ocr:8090/ by default)
  • Declare the new transform-ocr Docker container
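
The two points above can be checked with a small script before starting the stack. This is a rough, text-based sketch (it does not parse YAML), and `check_local_ocr_wiring` is a hypothetical helper name: it extracts the host from `-DlocalTransform.ocr.url` and verifies that a service with that name is declared in the same file.

```shell
#!/usr/bin/env bash
# Sketch: verify that localTransform.ocr.url points at a declared Compose service.
# Assumption: the URL uses http:// and the service key sits alone on its line.

check_local_ocr_wiring() {
    local compose_file="$1" host
    host="$(sed -n 's|.*-DlocalTransform\.ocr\.url=http://\([^:/]*\).*|\1|p' "$compose_file" \
            | head -n 1)"
    if [ -z "$host" ]; then
        echo "localTransform.ocr.url is not set"
        return 1
    fi
    # An indented "<host>:" line is taken as the service declaration.
    if grep -Eq "^[[:space:]]+${host}:[[:space:]]*$" "$compose_file"; then
        echo "OK: localTransform.ocr.url points at declared service '${host}'"
    else
        echo "service '${host}' is not declared"
        return 1
    fi
}

# Usage (not executed here):
#   check_local_ocr_wiring docker-compose.yml
```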

Remember that you need to build the Docker image alfresco/tengine-ocr before running this composition.

Start the ACS Stack from the folder containing the docker-compose.yml file.

$ docker-compose up --build --force-recreate

A sample deployment is available in the docker folder.

Adding Alfresco OCR Transformer to Docker Compose (Async Transformer - ActiveMQ) - Enterprise Edition

Check that the following configuration is applied to the docker-compose.yml file.

services:
    alfresco:
        environment:
            JAVA_OPTS : "
              -Dlocal.transform.service.enabled=true
              -Dtransform.service.enabled=true
              -Dtransform.service.url=http://transform-router:8095
              -Dsfs.url=http://shared-file-store:8099/
            "

    transform-router:
      image: quay.io/alfresco/alfresco-transform-router:${TRANSFORM_ROUTER_TAG}
      environment:
        JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"
        ACTIVEMQ_URL: "nio://activemq:61616"
        CORE_AIO_URL: "http://transform-core-aio:8090"
        TRANSFORMER_URL_OCR: "http://transform-ocr:8090"
        TRANSFORMER_QUEUE_OCR: "ocr-engine-queue"
        FILE_STORE_URL: "http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file"

    transform-ocr:
      image: alfresco/tengine-ocr:latest
      mem_limit: 1536m
      environment:
        JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
          -Docrmypdf.path=ocrmypdf -Docrmypdf.arguments=--skip-text -Dqueue.engineRequestQueue=ocr-engine-queue
          "
        ACTIVEMQ_URL: "nio://activemq:61616"
        FILE_STORE_URL: "http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file"

  • You can optionally disable the Local Transform Service (local.transform.service.enabled) in the alfresco Docker container and enable the Transform Service (asynchronous). The Local Transform Service and the Transform Service (which supports only asynchronous requests) can be enabled or disabled independently of each other. Keep in mind that when your deployment includes Share, SOLR (for full-text indexing), or both, you need both services enabled and running. The Repository tries to transform content using the Transform Service via the T-Router if possible and falls back to the Local Transform Service. Share makes use of both, so functionality such as preview will be unavailable if the Local Transform Service is disabled.
  • Add the OCR Transformer configuration to the transform-router Docker container: URL (http://transform-ocr:8090/ by default) and queue name (ocr-engine-queue, as declared in ats-transformer-ocr/src/main/resources/application-default.yaml)
  • Declare the new transform-ocr Docker container using the ActiveMQ and Shared File Store services
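
Since the queue name appears in two places (TRANSFORMER_QUEUE_OCR on the router and -Dqueue.engineRequestQueue on the engine), a mismatch is easy to introduce. The sketch below compares the two values with plain text matching; `check_ocr_queue_names` is a hypothetical helper name, and the script assumes each setting appears once in the file.

```shell
#!/usr/bin/env bash
# Sketch: check that the router and the OCR engine agree on the queue name.
# Assumption: text matching is good enough; this does not parse YAML.

check_ocr_queue_names() {
    local compose_file="$1" router_queue engine_queue
    # Value of TRANSFORMER_QUEUE_OCR, with an optional opening quote removed.
    router_queue="$(sed -n 's/.*TRANSFORMER_QUEUE_OCR:[[:space:]]*"\{0,1\}\([^"[:space:]]*\).*/\1/p' \
                    "$compose_file" | head -n 1)"
    # Value passed to the engine through JAVA_OPTS.
    engine_queue="$(sed -n 's/.*-Dqueue\.engineRequestQueue=\([^"[:space:]]*\).*/\1/p' \
                    "$compose_file" | head -n 1)"

    if [ -z "$router_queue" ] || [ -z "$engine_queue" ]; then
        echo "queue name not found (router='${router_queue}', engine='${engine_queue}')"
        return 1
    fi
    if [ "$router_queue" != "$engine_queue" ]; then
        echo "MISMATCH: router uses '${router_queue}', engine uses '${engine_queue}'"
        return 1
    fi
    echo "OK: router and engine both use queue '${router_queue}'"
}

# Usage (not executed here):
#   check_ocr_queue_names docker-compose.yml
```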

Remember that you need to build the Docker image alfresco/tengine-ocr before running this composition.

Start the ACS Stack from the folder containing the docker-compose.yml file.

$ docker-compose up --build --force-recreate

A sample deployment is available in the docker-enterprise folder.

Defining the OCR Rule in Alfresco Share

Use your browser to access the Alfresco Share app (by default available at http://localhost:8080/share/).

Create a folder and add the following rule (Manage Rules folder option):

  • When: Items are created or enter this folder
  • If all criteria are met: Mimetype is 'Adobe PDF Document'
  • Perform Action: Embed properties as metadata in content

To limit the number of parallel OCR processing threads, tick the Run rule in background checkbox.

From that point on, every PDF file uploaded to the folder will be OCRd. The original PDF file remains as version 1.0, while the one with the text layer is labeled version 1.1.

Customizing ocrmypdf arguments

By default, Alfresco OCR Transformer provides the following ocrmypdf configuration.

# Executable command for ocrmypdf program
ocrmypdf.path=ocrmypdf

# Arguments for ocrmypdf invocation. This is the optimized option. 
# If --skip-text is issued, then no image processing or OCR will be performed on pages that already have text.
ocrmypdf.arguments=--skip-text

# To force OCR, use the following instead:
# ocrmypdf.arguments=--force-ocr

Configuration can be changed from the command line using Docker environment variables.

$ docker run -p 8090:8090 -e OCRMYPDF_ARGUMENTS='--skip-text -l eng' alfresco/tengine-ocr:latest

Or with the equivalent notation in docker-compose.yml:

transform-ocr:
    image: alfresco/tengine-ocr:latest
    mem_limit: 1536m
    environment:
      JAVA_OPTS: "-XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80 -Dqueue.engineRequestQueue=ocr-engine-queue"
      OCRMYPDF_ARGUMENTS: "--skip-text -l eng"
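
When several OCR languages are needed, ocrmypdf accepts them joined with `+` in the `-l` option (for example `-l eng+spa`). The sketch below builds such a value for OCRMYPDF_ARGUMENTS; `ocrmypdf_arguments` is a hypothetical helper name, and the sketch assumes the Tesseract language data for every listed language is already installed in the image, which it does not check.

```shell
#!/usr/bin/env bash
# Sketch: compose an OCRMYPDF_ARGUMENTS value from a mode and a list of languages.
# Assumption: the language packs are present in the Docker image.

ocrmypdf_arguments() {
    local mode="$1"   # --skip-text or --force-ocr, the two options shown above
    shift
    local IFS='+'     # "$*" joins the remaining arguments with '+'
    if [ "$#" -gt 0 ]; then
        echo "${mode} -l $*"
    else
        echo "${mode}"
    fi
}

# Usage (not executed here):
#   docker run -p 8090:8090 \
#     -e OCRMYPDF_ARGUMENTS="$(ocrmypdf_arguments --skip-text eng spa)" \
#     alfresco/tengine-ocr:latest
```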

Additional contributors

  • Thanks to dgradecak for the embed-metadata action approach: #2

alf-tengine-ocr's Issues

Creates no new document version on higher execution time

Hey @aborroy - first off: thanks for the OCR transformer, it looks nice and lean!

I'm struggling with a migration from a hand-rolled OCR pipeline with Alfresco 5.0 (CE) to your OCR transformer with Alfresco 7.4 (CE). The direct integration as a folder rule would be much simpler. My setup works so far that I can upload the quick.pdf from this repo and the OCR magic (new document version) works as expected. That's great!

Here's my problem: when I upload a real PDF file (426 KB, one page, PDF version 1.4), no new document version is ever created. My guess is that the issue is caused by resource limits. I've experimented with file size and I think it's more related to the execution time. A bigger file (508 KB, one page, PDF version 1.4) sometimes results in a new document version, but not always. I'm pretty sure it's not the file size, as the OCR transformer does not configure maxSourceSizeBytes, which defaults to -1 (no limit) according to the docs.

I searched for transformer timeouts and configured on the repository the following settings:

-Dtransformer.timeout.default=300
-Dtransformserver.transformationTimeout=300
-Dcontent.transformer.default.timeoutMs=300000    

but this does not change the situation. Unfortunately, I was not able to figure out where the transformOptions.get(TIMEOUT) comes from or how to set it properly.

While digging into this, I noticed that when the execution time is less than 5 seconds, the new document version is created. I didn't find any defaults for the transformOptions regarding the timeout.

Maybe you could give me a hint? :)

Migrate from previous 6.2 with simple-ocr

I know this is probably not the best way to ask this question, but I can't find another place where to ask.

What would be the procedure (if possible) for migrating/upgrading from Alfresco 6.2 with simple-ocr (docker) to Alfresco 7.1 with TransformEngine OCR (docker)?

I've successfully done a backup of my data from 6.2 and restored it to the 7.1 installation, and almost everything works as expected.
But in some cases, for instance when trying to access the User folder from ACA, there is an error that I've tracked down to:

{"error":{"errorKey":"framework.exception.ApiDefault","statusCode":500,"briefSummary":"A namespace prefix is not registered for uri http://www.keensoft.es/model/content/ocr/1.0","stackTrace":"For security reasons the stack trace is no longer displayed, but the property is kept for previous versions","descriptionURL":"https://api-explorer.alfresco.com","logId":"ae148681-aa5b-4cf3-b33d-a46201617545"}}

I guess my documents have reference to the namespace of the previous module, and it is not found.

What should be done to fix the situation?

Error building alfresco/tengine-ocr on Windows 10 (WSL2 enabled DockerDesktop) environment

Environment Details:

wsl -l -v

NAME                   STATE           VERSION
* docker-desktop-data    Running         2
 docker-desktop         Running         2

docker -v
Docker version 20.10.20, build 9fdeb9c
docker-compose -v
Docker Compose version v2.12.1

java -version

java version "11.0.16.1" 2022-08-18 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.16.1+1-LTS-1)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.16.1+1-LTS-1, mixed mode)

mvn -v

Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: C:\Abhinav\Softwares\Java\Maven\apache-maven-3.8.6
Java version: 11.0.16.1, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk-11.0.16.1
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"

Steps:

1- Cloned "https://github.com/aborroy/alf-tengine-ocr.git"
2- Changed directory to: C:\Downloads\alf-tengine-ocr\ats-transformer-ocr
3- Started Docker Desktop (WSL2 enabled on Windows 10)
4- Executed the mvn clean install command.
5- Build failed with error:

[INFO] --- spring-boot-maven-plugin:2.5.4:repackage (repackage) @ ats-transformer-ocr ---
[INFO] Replacing main artifact with repackaged archive
[INFO]
[INFO] --- spring-boot-maven-plugin:2.5.4:repackage (default) @ ats-transformer-ocr ---
[INFO] Replacing main artifact with repackaged archive
[INFO]
[INFO] --- docker-maven-plugin:0.34.1:build (build-image) @ ats-transformer-ocr ---
[INFO] Building tar: C:\Downloads\alf-tengine-ocr\ats-transformer-ocr\target\docker\alfresco\tengine-ocr\latest\tmp\docker-build.tar
[INFO] DOCKER> [alfresco/tengine-ocr:latest]: Created docker-build.tar in 559 milliseconds
[ERROR] DOCKER> Unable to build image [alfresco/tengine-ocr:latest] : "The command '/bin/sh -c set -eux;     ARCH=\"$(dpkg --print-architecture)\";     case \"${ARCH}\" in        armhf)          ESUM='c6b1fda3f8807028cbfcc34a4ded2e8a5a6b6239d2bcc1f06673ea6b1530df94';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_arm_linux_hotspot_11.0.5_10.tar.gz';          ;;        ppc64el|ppc64le)          ESUM='d763481ddc29ac0bdefb24216b3a0bf9afbb058552682567a075f9c0f7da5814';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_ppc64le_linux_hotspot_11.0.5_10.tar.gz';          ;;        amd64|x86_64)          ESUM='6dd0c9c8a740e6c19149e98034fba8e368fd9aa16ab417aa636854d40db1a161';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.5_10.tar.gz';          ;;        *)          echo \"Unsupported arch: ${ARCH}\";          exit 1;          ;;     esac;     curl -LfsSo /tmp/openjdk.tar.gz ${BINARY_URL};     echo \"${ESUM} */tmp/openjdk.tar.gz\" | sha256sum -c -;     mkdir -p /opt/java/openjdk;     cd /opt/java/openjdk;     tar -xf /tmp/openjdk.tar.gz --strip-components=1;     rm -rf /tmp/openjdk.tar.gz;' returned a non-zero code: 35"  ["The command '/bin/sh -c set -eux;     ARCH=\"$(dpkg --print-architecture)\";     case \"${ARCH}\" in        armhf)          ESUM='c6b1fda3f8807028cbfcc34a4ded2e8a5a6b6239d2bcc1f06673ea6b1530df94';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_arm_linux_hotspot_11.0.5_10.tar.gz';          ;;        ppc64el|ppc64le)          ESUM='d763481ddc29ac0bdefb24216b3a0bf9afbb058552682567a075f9c0f7da5814';          
BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_ppc64le_linux_hotspot_11.0.5_10.tar.gz';          ;;        amd64|x86_64)          ESUM='6dd0c9c8a740e6c19149e98034fba8e368fd9aa16ab417aa636854d40db1a161';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.5_10.tar.gz';          ;;        *)          echo \"Unsupported arch: ${ARCH}\";          exit 1;          ;;     esac;     curl -LfsSo /tmp/openjdk.tar.gz ${BINARY_URL};     echo \"${ESUM} */tmp/openjdk.tar.gz\" | sha256sum -c -;     mkdir -p /opt/java/openjdk;     cd /opt/java/openjdk;     tar -xf /tmp/openjdk.tar.gz --strip-components=1;     rm -rf /tmp/openjdk.tar.gz;' returned a non-zero code: 35" ]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  7.080 s
[INFO] Finished at: 2022-11-10T13:35:02-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal io.fabric8:docker-maven-plugin:0.34.1:build (build-image) on project ats-transformer-ocr: Unable to build image [alfresco/tengine-ocr:latest] : "The command '/bin/sh -c set -eux;     ARCH=\"$(dpkg --print-architecture)\";     case \"${ARCH}\" in        armhf)          ESUM='c6b1fda3f8807028cbfcc34a4ded2e8a5a6b6239d2bcc1f06673ea6b1530df94';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_arm_linux_hotspot_11.0.5_10.tar.gz';          ;;        ppc64el|ppc64le)          ESUM='d763481ddc29ac0bdefb24216b3a0bf9afbb058552682567a075f9c0f7da5814';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_ppc64le_linux_hotspot_11.0.5_10.tar.gz';          ;;        amd64|x86_64)          ESUM='6dd0c9c8a740e6c19149e98034fba8e368fd9aa16ab417aa636854d40db1a161';          BINARY_URL='https://github.com/AdoptOpenJDK/openjdk11-binaries/releases/download/jdk-11.0.5%2B10/OpenJDK11U-jdk_x64_linux_hotspot_11.0.5_10.tar.gz';          ;;        *)          echo \"Unsupported arch: ${ARCH}\";          exit 1;          ;;     esac;     curl -LfsSo /tmp/openjdk.tar.gz ${BINARY_URL};     echo \"${ESUM} */tmp/openjdk.tar.gz\" | sha256sum -c -;     mkdir -p /opt/java/openjdk;     cd /opt/java/openjdk;     tar -xf /tmp/openjdk.tar.gz --strip-components=1;     rm -rf /tmp/openjdk.tar.gz;' returned a non-zero code: 35"  -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException


Incomplete transformation on bulk processing!?

Hey @aborroy,

I noticed problems with processing multiple documents at once. Some documents are OCRed, some are not.

What I did & observed

  • I prepared a folder with the embed-metadata-action (as described).
  • I uploaded multiple documents (10 in this case) via webdav to that folder.
  • The embed-metadata-action kicked in and processed all those documents - obviously in parallel, as I've seen a whole bunch of ocrmypdf/tesseract processes.
  • I waited until no such processes were present anymore before inspecting the documents in the UI.
  • Within the Share module, I opened each document and verified whether OCR had been performed (by selecting text).
  • Surprisingly, some documents showed no OCR treatment.
  • I deleted the documents and repeated the test. This time, the same documents were copied slowly, one after another, to the folder.
  • Verification in the UI showed that all documents were OCRed.

Conclusions/Assumptions:

  • A missing OCR is not caused by the particular document.
  • There is some race-condition or timeout issue, that aborts (?) the OCR processing for some documents.

I wonder:

  • Is there any timeout in Alfresco when calling out to the t-engine? When processing multiple documents in parallel, transforming takes (of course) significantly longer. Maybe too long for some transformations?
  • What is the best practice to debug this?

Usage question

It's not really an issue, but can you please point me to some documentation that shows how to use a Share rule to convert PDF to OCRed PDF?
I only found the REST call to make a rendition, but no really useful way to integrate this.

My scenario is that every file uploaded to Alfresco should automatically be OCRed and be full-text searchable. I can't figure out how to do this...

any help is appreciated

regards
stefan

Brazilian Portuguese support

Hi there Angel,

I'm trying to rebuild the Docker image to include Portuguese support for alf-tengine-ocr, as it's not enabled by default. When running mvn clean package in the ats-transformer-ocr folder, I get the following error:

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /tmp/1/alf-tengine-ocr/ats-transformer-ocr/src/main/java/org/alfresco/transformer/executors/OcrmypdfCommandExecutor.java:[29,44] cannot access org.alfresco.transformer.util.RequestParamMap
  bad class file: /root/.m2/repository/org/alfresco/alfresco-transformer-base/2.5.4/alfresco-transformer-base-2.5.4.jar(org/alfresco/transformer/util/RequestParamMap.class)
    class file has wrong version 55.0, should be 52.0
    Please remove or make sure it appears in the correct subdirectory of the classpath.
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:04 min
[INFO] Finished at: 2024-01-29T09:19:24-03:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project ats-transformer-ocr: Compilation failure
[ERROR] /tmp/1/alf-tengine-ocr/ats-transformer-ocr/src/main/java/org/alfresco/transformer/executors/OcrmypdfCommandExecutor.java:[29,44] cannot access org.alfresco.transformer.util.RequestParamMap
[ERROR]   bad class file: /root/.m2/repository/org/alfresco/alfresco-transformer-base/2.5.4/alfresco-transformer-base-2.5.4.jar(org/alfresco/transformer/util/RequestParamMap.class)
[ERROR]     class file has wrong version 55.0, should be 52.0
[ERROR]     Please remove or make sure it appears in the correct subdirectory of the classpath.
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Can you please advise? Also, is it possible to publish a Docker image on Docker Hub with English, Portuguese, French and Spanish support?
