PDF-Image Highlighter

I am building a highlighter for groups of images in a PDF file. It is a long journey.

Prerequisites

  • Install Google Tesseract on your PC. Follow the installation tutorial here. This project uses Tesseract with the configuration --psm 3 --oem 1: --oem 1 selects the LSTM engine, --psm 3 is fully automatic page segmentation, and I chose this combination for small text detection. To install Tesseract on an Ubuntu machine:
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
  • Python 3.x.
  • You will need to install pillow, pytesseract, pdf2image and textdistance.
pip install pillow
pip install pytesseract
pip install pdf2image
pip install textdistance
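
To confirm that the prerequisites above are in place before running the examples, a small check like the following may help. It is only a sketch and not part of the project: the function name check_prerequisites is hypothetical, and it only tests whether the tesseract command and the four Python modules can be found, not whether they work.

```python
import importlib.util
import shutil

# pip package name -> module name to import (Pillow is imported as PIL)
PYTHON_PACKAGES = {
    "pillow": "PIL",
    "pytesseract": "pytesseract",
    "pdf2image": "pdf2image",
    "textdistance": "textdistance",
}

# Command-line tools that must be on PATH
COMMANDS = ("tesseract",)


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True (found) or False."""
    status = {}
    for command in COMMANDS:
        status[command] = shutil.which(command) is not None
    for package, module in PYTHON_PACKAGES.items():
        status[package] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Note that pdf2image also relies on the poppler utilities (for example pdftoppm) at runtime, so those may need to be installed separately.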

Example #1: Detect and read the text.

I made some examples and placed the files in the example directory. First, change your active directory to example.

cd example

After that, run this command:

# python example.py <image_file>
python example.py textimage2.png

You will get the result (shortened version):

Example Image

The text extraction result is: 
Attn: Pearlene Then
Sub-BU: Technology Department - PUBO9
To: Public Utilities Board
Accounts Payable Section, Finance Department,
...

Data: 
5;1;1;1;1;1;6;22;35;13;93;Attn:
5;1;1;1;1;2;49;22;68;13;92;Pearlene
5;1;1;1;1;3;122;22;39;13;96;Then
5;1;1;1;2;1;7;43;64;13;90;Sub-BU:
...
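
Each row in the Data section appears to follow the column order of Tesseract's word-level TSV output (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), joined here with semicolons. That reading is an assumption based on the sample rows, not something taken from example.py. Under that assumption, a row can be parsed as in this sketch, where Word and parse_row are hypothetical names:

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One detected word, using Tesseract's TSV column order (assumed)."""
    level: int
    page_num: int
    block_num: int
    par_num: int
    line_num: int
    word_num: int
    left: int
    top: int
    width: int
    height: int
    conf: float
    text: str

    @property
    def box(self):
        """Bounding box as (left, top, right, bottom) in pixels."""
        return (self.left, self.top, self.left + self.width, self.top + self.height)


def parse_row(row, sep=";"):
    """Parse one semicolon-separated data row into a Word."""
    # Split at most 11 times so a separator inside the text field survives.
    parts = row.split(sep, 11)
    if len(parts) != 12:
        raise ValueError(f"expected 12 fields, got {len(parts)}: {row!r}")
    *numbers, conf, text = parts
    return Word(*(int(n) for n in numbers), float(conf), text)


if __name__ == "__main__":
    word = parse_row("5;1;1;1;1;2;49;22;68;13;92;Pearlene")
    print(word.text, word.conf, word.box)
```

The box property is the form that drawing functions usually expect, which is why the width and height are converted to right and bottom coordinates.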

Example #2: Highlight text in an image

In this example, I highlight specific text inside an image. Follow these steps to try it:

  1. Specify the text that you want to search for in the image by changing the soe variable at line 128 of process.py. Feel free to change its value. By default, soe contains phrases such as salta and El zorro, which means I want to search for those phrases inside the given image.
soe = ["salta","El zorro"]
  2. Run process.py followed by the path of the image that you want to process.
# python process.py <image_path>
python process.py example/textimage.png

  3. You will see the text detection result for the phrases that you put inside soe.

Result
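
The idea behind the steps above can be sketched as follows: slide a window over the detected words, compare each window with every phrase in soe, and merge the boxes of the matching words. This is not the code in process.py. The names find_phrases and draw_highlights, the 0.8 threshold, and the (text, left, top, width, height) word format are assumptions, and the standard library's difflib stands in for textdistance so the matching part runs without extra packages.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_phrases(words, soe, threshold=0.8):
    """Find each phrase of soe among OCR words.

    words: list of (text, left, top, width, height) tuples in reading order.
    Returns a list of (phrase, (left, top, right, bottom)) hits.
    """
    hits = []
    for phrase in soe:
        n = len(phrase.split())  # number of words in the phrase
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            candidate = " ".join(w[0] for w in window)
            if similarity(candidate, phrase) >= threshold:
                # Merge the word boxes into one box around the whole match.
                left = min(w[1] for w in window)
                top = min(w[2] for w in window)
                right = max(w[1] + w[3] for w in window)
                bottom = max(w[2] + w[4] for w in window)
                hits.append((phrase, (left, top, right, bottom)))
    return hits


def draw_highlights(image_path, hits, out_path):
    """Draw a red rectangle around every hit (requires Pillow)."""
    from PIL import Image, ImageDraw  # imported lazily; only needed for drawing

    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for _, box in hits:
        draw.rectangle(box, outline="red", width=2)
    image.save(out_path)
```

A fuzzy threshold rather than exact equality tolerates small OCR mistakes. One limitation of this sketch is that a window may span two text lines, in which case the merged box covers both.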

My upcoming updates

My next plans for this project are:

  • Read a PDF file, convert it to a set of page images, and detect the defined soe texts inside them.
  • Build an API for this project using Flask.
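
The first planned item is not implemented yet. As a rough sketch of how it might look, the code below renders PDF pages with pdf2image, runs Tesseract on each page, and reports which pages contain each soe phrase. The function names, the dpi value, and the plain substring matching are all assumptions made for illustration.

```python
def load_pdf_pages(pdf_path, dpi=200):
    """Render every PDF page to a PIL image (requires pdf2image and poppler)."""
    from pdf2image import convert_from_path  # imported lazily

    return convert_from_path(pdf_path, dpi=dpi)


def ocr_text(image):
    """Extract text from one page image with the project's Tesseract config."""
    import pytesseract  # imported lazily

    return pytesseract.image_to_string(image, config="--psm 3 --oem 1")


def pages_containing(pages, soe, ocr=ocr_text):
    """Return {phrase: [1-based page numbers where the phrase occurs]}.

    ocr is any callable that turns a page into text, so it can be swapped
    out for testing without Tesseract installed.
    """
    found = {phrase: [] for phrase in soe}
    for number, page in enumerate(pages, start=1):
        # Collapse whitespace so phrases broken across lines still match.
        text = " ".join(ocr(page).lower().split())
        for phrase in soe:
            if phrase.lower() in text:
                found[phrase].append(number)
    return found
```

With the real dependencies installed, a call such as pages_containing(load_pdf_pages("input.pdf"), ["salta", "El zorro"]) would run the whole pipeline, where input.pdf is a placeholder file name.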
