Handprint

An experiment with handwritten text optical recognition on Caltech Archives materials.

Authors: Michael Hucka
Repository: https://github.com/caltechlibrary/handprint
License: BSD/MIT derivative – see the LICENSE file for more information

☀ Introduction

Handprint (Handwritten Page Recognition Test) is a small project to examine the use of alternative optical character recognition (OCR) and handwritten text recognition (HTR) methods on documents from the Caltech Archives. Tests include the use of Google's OCR/HTR capabilities in their Google Cloud Vision API and Tesseract.

✺ Installation instructions

Handprint is a program written in Python 3 that works by invoking cloud-based services. Installation requires both obtaining a copy of Handprint itself, and also signing up for access to the cloud service providers.

⓵   Install Handprint on your computer

The following is probably the simplest and most direct way to install this software on your computer:

sudo pip3 install git+https://github.com/caltechlibrary/handprint.git --upgrade

Alternatively, you can clone this GitHub repository and then run setup.py manually. First, create a directory somewhere on your computer where you want to store the files, and cd to it from a terminal shell. Next, execute the following commands:

git clone https://github.com/caltechlibrary/handprint.git
cd handprint
sudo python3 -m pip install . --upgrade

⓶   Obtain cloud service credentials

Credentials for different services need to be provided to Handprint in the form of JSON files. Each service needs a separate JSON file named after the service (e.g., microsoft.json) and placed in a directory that Handprint searches. By default, Handprint searches for the files in a subdirectory named creds where Handprint is installed, but an alternative directory can be indicated at run-time using the -c command-line option (or /c on Windows).

The specific contents and forms of the files differ depending on the particular service, as described below.

Microsoft

Microsoft's approach to credentials in Azure involves the use of subscription keys. The credentials file for Handprint just needs to contain a single field:

{
 "subscription_key": "YOURKEYHERE"
}

The value of "YOURKEYHERE" will be a string such as "18de248475134eb49ae4a4e94b93461c". To sign up for Azure and obtain a key, visit https://portal.azure.com and sign in using your Caltech Access email address/login. (Note: you will need to turn off browser security plugins such as Ad Block and uMatrix if you have them, or else the site will not work.) It will redirect you to the regular Caltech Access login page and then (after you log in) back to the Dashboard https://portal.azure.com, from where you can create credentials. Some notes about this can be found in the project Wiki pages.

When signing up for an Azure cloud service account, make sure to choose "Western US" as the region so that the service URL begins with "https://westus.api.cognitive.microsoft.com".

Google

Credentials for using a Google service account are stored in a JSON file containing many fields. The overall form looks like this:

{
  "type": "service_account",
  "project_id": "theid",
  "private_key_id": "thekey",
  "private_key": "-----BEGIN PRIVATE KEY-----anotherkey-----END PRIVATE KEY-----\n",
  "client_email": "emailaddress",
  "client_id": "id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "someurl"
}

Getting one of these files is unfortunately a complicated process. It's summarized in the Google Cloud documentation for Creating a service account, but some more explicit instructions can be found in our Handprint project Wiki pages.
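Once the files are in place, a quick check that they parse and contain the expected fields can catch a mistake before a run. The following is a minimal sketch and not part of Handprint: the function name check_credentials and the choice of which fields to require are assumptions for illustration, based only on the file layouts shown above.

```python
import json
from pathlib import Path

# Fields this sketch looks for, taken from the example files above.
# Which fields Handprint itself requires is an assumption.
EXPECTED_FIELDS = {
    "microsoft": {"subscription_key"},
    "google": {"type", "project_id", "private_key", "client_email"},
}


def _check_file(path, fields):
    """Return a list of problems found in one credentials file."""
    if not path.is_file():
        return [f"missing file {path.name}"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    return [f"missing field {name}" for name in sorted(fields - set(data))]


def check_credentials(creds_dir):
    """Map each service name to the list of problems in its credentials file."""
    return {
        service: _check_file(Path(creds_dir) / f"{service}.json", fields)
        for service, fields in EXPECTED_FIELDS.items()
    }
```

Calling check_credentials("creds") returns an empty list for every service whose file looks complete, and a short description of the problem otherwise.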

▶︎ Running Handprint

Handprint is a command-line driven program. There is a single command-line interface program called handprint. You can run it by starting a terminal shell and cd'ing to the directory where you installed Handprint, and then running the program bin/handprint from there. For example:

bin/handprint -h

Alternatively, you should be able to run Handprint from anywhere using the normal approach to running Python modules:

python3 -m handprint -h

The -h option (/h on Windows) will make handprint display some help information and exit immediately. To make Handprint do more, you can supply other arguments that instruct Handprint to process image files (or alternatively, URLs pointing to image files at a network location) and run handwritten text recognition (HTR) or optical character recognition (OCR) algorithms on them, as explained below.

File formats recognized

Whether the images are stored locally or accessible via URLs, each image should be a single page of a document in which text should be recognized. The formats accepted by the cloud services at this time are JPEG, PNG, GIF, and BMP only, but Handprint can convert a few other formats into JPEG if necessary. Specifically, Handprint also handles JPEG 2000 and TIFF formats, which it converts to JPEG before sending them to the different methods for text recognition.
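To check a batch of files before a run, the format rules above can be expressed as a small lookup. This is only a sketch: the function name classify_image and the mapping from file extensions to formats (for example, .jp2 for JPEG 2000) are assumptions, and Handprint's own detection logic may work differently.

```python
from pathlib import Path

# Formats the cloud services are described as accepting directly.
ACCEPTED = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}
# Formats Handprint is described as converting to JPEG first.
# The extension spellings are assumptions of this sketch.
CONVERTIBLE = {".jp2", ".tif", ".tiff"}


def classify_image(path):
    """Return 'accepted', 'convert', or 'unsupported' based on the extension."""
    ext = Path(path).suffix.lower()
    if ext in ACCEPTED:
        return "accepted"
    if ext in CONVERTIBLE:
        return "convert"
    return "unsupported"
```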

Supported HTR/OCR methods

Handprint can contact more than one cloud service for OCR and HTR. You can use the -l option (/l on Windows) to make Handprint display a list of the methods currently implemented:

# bin/handprint -l
Known methods (for use as values for option -m):
   microsoft
   google

By default, Handprint will run each known method in turn. To invoke only one specific method, use the -m option (/m on Windows) followed by a method name:

bin/handprint -m microsoft /path/to/images

Service account credentials

Handprint looks for credentials files in a subdirectory named creds in the directory where it is installed, but you can put the credentials files in another directory and tell Handprint where to find them using the -c option (/c on Windows). Example of use:

bin/handprint -c ~/handprint-credentials /path/to/images

Files versus URLs

Handprint can work both with files and with URLs. By default, arguments are interpreted as being files or directories of files, but if given the -u option (/u on Windows), the arguments are interpreted instead as URLs pointing to images.

A challenge with using URLs is how to name the files that Handprint writes for the results. Some CMS systems store content using opaque schemes that provide no clear names in the URLs, making it impossible for a software tool such as Handprint to guess what file name would make sense to use for local storage. Worse, some systems create extremely long URLs, making it impractical to use the full URL itself as the file name. For example, the following is a real URL pointing to an image in Caltech Archives today:

https://hale.archives.caltech.edu/adore-djatoka//resolver?rft_id=https%3A%2F%2Fhale.archives.caltech.edu%2Fislandora%2Fobject%2Fhale%253A85240%2Fdatastream%2FJP2%2Fview%3Ftoken%3D7997253eb6195d89b2615e8fa60708a97204a4cdefe527a5ab593395ac7d4327&url_ver=Z39.88-2004&svc_id=info%3Alanl-repo%2Fsvc%2FgetRegion&svc_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajpeg2000&svc.format=image%2Fjpeg&svc.level=4&svc.rotate=0

To deal with this situation, Handprint manufactures its own file names when the -u option is used. The scheme is simple: by default, Handprint will use a base name of document-N, where N is an integer. The integers start from 1 for every run of Handprint, and they count the URLs found either on the command line or in the file indicated by the -f option. The image found at a given URL is stored in a file named document-N.E, where E is the format extension (e.g., document-1.jpeg or document-1.png). The URL itself is stored in another file named document-N.url. Thus, the files produced by Handprint will look like this when the -u option is used:

document-1.jpeg
document-1.url
document-1.google.txt
document-1.google.json
document-1.microsoft.txt
document-1.microsoft.json

document-2.jpeg
document-2.url
document-2.google.txt
document-2.google.json
document-2.microsoft.txt
document-2.microsoft.json

document-3.jpeg
document-3.url
document-3.google.txt
document-3.google.json
document-3.microsoft.txt
document-3.microsoft.json

...

The base name can be changed using the -r option (/r on Windows). For example, running Handprint with the option -r einstein will cause the outputs to be named einstein-1.jpeg, einstein-1.url, etc. (assuming, for the sake of this example, that the image file format is jpeg).
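The naming scheme can be summarized in a few lines of code. The function below is a hypothetical illustration of the pattern described above, not code taken from Handprint; the name output_names and its parameters are assumptions.

```python
def output_names(index, image_ext, methods, base="document"):
    """List the file names written for the index-th URL (counting from 1)."""
    stem = f"{base}-{index}"
    # The downloaded image and the file recording its source URL.
    names = [f"{stem}.{image_ext}", f"{stem}.url"]
    # One text file and one JSON file per recognition method.
    for method in methods:
        names.append(f"{stem}.{method}.txt")
        names.append(f"{stem}.{method}.json")
    return names
```

For instance, output_names(1, "jpeg", ["google", "microsoft"]) yields the six names listed for the first document above, and passing base="einstein" mirrors the effect of -r einstein.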

The use of the -u option also requires the use of the -o option (/o on Windows) to tell Handprint where to store the results. This is a consequence of the fact that, without being provided with files or directories on the local disk, Handprint can't infer where to write its output.

Example of use:

bin/handprint -u -f /tmp/urls-to-read.txt -o /tmp/results/

Finally, note that providing URLs on the command line can be problematic due to how terminal shells interpret certain characters, and so when supplying URLs, it's usually better to list the URLs in a file in combination with the -f option (/f on Windows).
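When Handprint is driven from a script, the same approach avoids shell quoting problems entirely. The sketch below writes the URLs to a file and builds the argument list for a run. It rests on several assumptions: that the file given to -f holds one URL per line, that Handprint is installed so that python3 -m handprint works, and that Unix-style option prefixes are in use. The function name prepare_url_run is hypothetical.

```python
import sys
from pathlib import Path


def prepare_url_run(urls, url_file, output_dir):
    """Write URLs to url_file and return the argument list for a Handprint run."""
    # Assumption: the file read by -f contains one URL per line.
    Path(url_file).write_text("\n".join(urls) + "\n", encoding="utf-8")
    # Create the output directory in case it does not exist yet.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # -u marks the inputs as URLs; -o is required whenever -u is used.
    return [
        sys.executable, "-m", "handprint",
        "-u", "-f", str(url_file), "-o", str(output_dir),
    ]
```

The returned list can be passed to subprocess.run(args, check=True). Because the arguments are passed as a list rather than as a shell command string, characters such as & and ? in the URLs are never interpreted by a shell.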

Command line options

The following table summarizes all the command line options available. (Note: on Windows computers, / must be used as the prefix character instead of -):

| Short | Long form | Meaning | Default |
|-------|-----------|---------|---------|
| -c D | --creds-dir D | Look for credentials in directory D | creds |
| -f F | --from-file F | Read file names or URLs from file F | Use names or URLs given on command line |
| -l | --list | Display list of known methods | |
| -m M | --method M | Use method M | "all" |
| -o O | --output O | Write outputs to directory O | Same directories where images are found |
| -u | --given-urls | Inputs are URLs, not files or dirs | Assume files and/or directories of files |
| -r R | --root-name R | Write outputs to files named R-n | Use the base names of the image files |
| -q | --quiet | Don't print messages while working | Be chatty while working |
| -C | --no-color | Don't color-code the output | Use colors in the terminal output |
| -D | --debug | Debugging mode | Normal mode |
| -V | --version | Print program version info and exit | Do other actions instead |

⚑   The -o option (/o on Windows) must be provided if the -u option (/u on Windows) is used: the results must be written to the local disk somewhere, because it is not possible to write the results in the network locations represented by the URLs.

✦   If -u is used (meaning, the inputs are URLs and not files or directories), then the outputs will be written by default to names of the form document-n, where n is an integer. Examples: document-1.jpeg, document-1.google.txt, etc. This is because images located in network content management systems may not have any clear names in their URLs.

⚛︎ Data returned

Handprint tries to gather all the data that each service returns for text recognition, and outputs the results in two forms: a .json file containing all the results, and a .txt file containing just the document text. The exact content of the .json file differs for each service.
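The two output forms are easy to collect for further processing. The following sketch gathers the text and raw JSON for one document across methods. It assumes the files follow the base.method.txt and base.method.json naming pattern shown earlier and are UTF-8 encoded; the function name load_results is hypothetical, and the JSON is loaded without interpreting it because its structure differs by service.

```python
import json
from pathlib import Path


def load_results(directory, base_name, methods=("google", "microsoft")):
    """Return {method: {"text": str, "raw": parsed JSON}} for the outputs found."""
    results = {}
    for method in methods:
        txt_path = Path(directory) / f"{base_name}.{method}.txt"
        json_path = Path(directory) / f"{base_name}.{method}.json"
        if not (txt_path.is_file() and json_path.is_file()):
            continue  # skip methods without a complete pair of output files
        results[method] = {
            "text": txt_path.read_text(encoding="utf-8"),
            "raw": json.loads(json_path.read_text(encoding="utf-8")),
        }
    return results
```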

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

☺︎ Acknowledgments

The vector artwork of a hand used as a logo for Handprint was created by Kevin from the Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license.

Handprint makes use of numerous open-source packages, without which it would have been effectively impossible to develop Handprint with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

☮︎ Copyright and license

Copyright (C) 2018, Caltech. This software is freely distributed under a BSD/MIT type license. Please see the LICENSE file for more information.
