
pdf-ocr-overlay's People

Contributors

pankrat

pdf-ocr-overlay's Issues

Error: No such file or directory: '/tmp/tmp8mXOjp/-000.html'

16:48:06 [INFO] OCR complete. Merge pages into 'output.pdf'
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "./ocr.py", line 174, in process_page
    ocr_page(image, lang=lang, width=width, height=height)
  File "./ocr.py", line 155, in ocr_page
    html = os.open(hocr, os.O_RDONLY)
OSError: [Errno 2] No such file or directory: '/tmp/tmp8mXOjp/-000.html'

FileNotFoundError

Traceback (most recent call last):
  File "ocr.py", line 236, in <module>
    for name, version in system_info():
  File "ocr.py", line 55, in system_info
    pdfimages = Popen(['pdfimages', '-v'], stderr=PIPE).communicate()[1]
  File "C:\Python37\lib\subprocess.py", line 756, in __init__
    restore_signals, start_new_session)
  File "C:\Python37\lib\subprocess.py", line 1155, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
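
The traceback above shows `Popen` failing because `pdfimages` could not be found on the system. A minimal sketch of a pre-flight check follows, assuming Python 3.3 or later (for `shutil.which`). The function name `find_missing_tools` and the `REQUIRED_TOOLS` list are hypothetical and not part of `ocr.py`; the tool names are taken from the log output posted in this thread.

```python
import shutil

# Assumption: these are the external programs the script shells out to,
# based on the version lines that appear in the logs in this thread.
REQUIRED_TOOLS = ["pdfimages", "tesseract", "gs", "identify", "convert"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def check_dependencies():
    """Exit with a readable message instead of a raw traceback."""
    missing = find_missing_tools()
    if missing:
        raise SystemExit("Missing required programs: " + ", ".join(missing))
```

Calling `check_dependencies()` before any `Popen` call would turn the `FileNotFoundError` into a one-line message that names the missing program.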

Missing dependencies, wrong call to commands and ungraceful crash; fails to recognize AutoCAD pdf output

Using it under Ubuntu 14.04.3 server.

  • The README should mention that it requires either ImageMagick or GraphicsMagick; both provide the needed identify command (line 72).
  • With the GraphicsMagick package, the correct version option is '-version', with only one hyphen. This applies to both identify and convert (line 75).
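
As a rough illustration of the second point, the version query could default to the single-hyphen flag, which this report says GraphicsMagick requires and which ImageMagick's tools should also accept. This is only a sketch: `tool_version` is a hypothetical helper, not the function used in `ocr.py`.

```python
from subprocess import PIPE, Popen


def tool_version(command, flag="-version"):
    """Return the first line of the tool's version output.

    Returns None if the program is not installed.
    """
    try:
        out, err = Popen([command, flag], stdout=PIPE, stderr=PIPE).communicate()
    except OSError:
        # Raised when the executable cannot be found or started.
        return None
    # Some tools print their version on stderr rather than stdout.
    text = (out or err).decode("utf-8", "replace").strip()
    return text.splitlines()[0] if text else ""
```

For example, `tool_version("identify")` and `tool_version("convert")` would both use `-version`, while a tool that needs a different flag can pass it explicitly, as in `tool_version("pdfimages", "-v")`.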

It fails to recognize an AutoCAD PDF and crashes. If it cannot process the file, it should exit gracefully instead. Output follows:

[email protected][21:40]:~/Test$~/pdfocroverlay/pdf-ocr-overlay-master/ocr.py M101.pdf M101-searchable-pdfocroverlay.pdf
21:41:26 [INFO] pdfimages   : pdfimages version 0.24.5
21:41:26 [INFO] tesseract   : tesseract 3.03
21:41:26 [INFO] gs          : 9.14
21:41:26 [INFO] identify    : GraphicsMagick 1.3.18 2013-03-10 Q8 http://www.GraphicsMagick.org/
21:41:26 [INFO] convert     : GraphicsMagick 1.3.18 2013-03-10 Q8 http://www.GraphicsMagick.org/
21:41:33 [INFO] 1 pages, 1219mm*914mm
21:41:33 [INFO] Extract pages from M101.pdf
Syntax Error: Unknown character collection 'PDFAUTOCAD-Indentity0'
Syntax Error: Unknown character collection 'PDFAUTOCAD-Indentity0'
21:41:35 [INFO] Process 1 pages with 1 threads
21:41:35 [INFO] Page  1: Run OCR ...
21:41:35 [DEBUG] Page=3456x2592 Image=160x1093 DPI=3x30
Tesseract Open Source OCR Engine v3.03 with Leptonica
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 14 blob text block, but using orientation anyway: 0
21:41:37 [INFO] OCR complete. Merge pages into 'M101-searchable-pdfocroverlay.pdf'
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/pdfocroverlay/pdf-ocr-overlay-master/ocr.py", line 176, in process_page
    ocr_page(image, lang=lang, width=width, height=height)
  File "/home/ubuntu/pdfocroverlay/pdf-ocr-overlay-master/ocr.py", line 157, in ocr_page
    html = os.open(hocr, os.O_RDONLY)
OSError: [Errno 2] No such file or directory: '/tmp/tmpa9uFbe/-000.html'
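
Both tracebacks in this thread end at the point where the hOCR file is opened without checking that Tesseract actually wrote it. One possible way to fail gracefully is sketched below. `read_hocr` is a hypothetical helper, not code from `ocr.py`, and how the caller treats a skipped page is left open.

```python
import logging
import os

log = logging.getLogger(__name__)


def read_hocr(path):
    """Return the hOCR file contents as bytes, or None if the file is missing."""
    if not os.path.isfile(path):
        # Tesseract may skip a page (e.g. "Too few characters") and write nothing.
        log.error("No hOCR output at %r; skipping page", path)
        return None
    with open(path, "rb") as handle:
        return handle.read()
```

A caller could then check for `None` and leave the page without a text layer instead of letting the worker thread die with an `OSError`.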
