Comments (21)

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Version information -
tesseract 3.05.00dev
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

from tesseract.

zdenop avatar zdenop commented on April 28, 2024

Did you read https://groups.google.com/d/msg/tesseract-ocr/ToWcnyHqF4c/P7HDEKsR1cEJ ?

from tesseract.

matzeri avatar matzeri commented on April 28, 2024

on cygwin x86_64 and same on x86:
$ tesseract --version
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

$ tesseract eurotext.tif eurotext -l eng+deu pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

$ ls -lrt eurotext.pdf
-rw-r--r-- 1 marco Administrators 13K Jul 26 21:36 eurotext.pdf

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Marco, the version I tested was 'v3.05.00dev' based on the master branch from git (built by Simon).

Could it be that one of the newer commits has caused this issue?

from tesseract.

matzeri avatar matzeri commented on April 28, 2024

I doubt it. More likely you are missing some additional library or program,
or some configuration.


from tesseract.

jbreiden avatar jbreiden commented on April 28, 2024

Tesseract knows that PDF creation failed and returns an error code. So at least this is not silent data corruption. I'd like to know if the problem is present for PNG input or if it is restricted to TIFF.
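Since the failure is reported through the exit status, a wrapper script can catch it rather than trusting that an output file appeared. The sketch below is only an illustration and not part of Tesseract: the helper name `run_and_check` is hypothetical, and it assumes nothing beyond a nonzero exit code on failure, as described above. As an extra precaution it also rejects a missing or zero-byte output file.

```python
import os
import subprocess


def run_and_check(cmd, output_path):
    """Run cmd and report whether it produced a usable output file.

    Hypothetical helper: returns True only if the command exited with
    status 0 and output_path exists with a nonzero size.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # The tool itself reported a failure through its exit status.
        return False
    # Guard against an empty or missing file being left behind.
    return os.path.isfile(output_path) and os.path.getsize(output_path) > 0
```

For example, `run_and_check(["tesseract", "eurotext.tif", "eurotext", "-l", "eng+deu", "pdf"], "eurotext.pdf")` would wrap the command shown earlier in this thread.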

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Jeff, PDF output worked for PNG and JPG input. This is using the versions compiled by Simon.

C:\Users\User\Downloads\TESS>tesseract -v
tesseract 3.05.00dev
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

C:\Users\User\Downloads\TESS>tesseract testing/phototest.gif testing/phototest.gif -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/leptonica/847980_4108_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.

C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing/phototest.tif format is 4; unreadable
Error during processing.

C:\Users\User\Downloads\TESS>tesseract testing/phototest.tif testing/phototest.tif -l eng
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

C:\Users\User\Downloads\TESS>tesseract testing/phototest.png testing/phototest.png -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

C:\Users\User\Downloads\TESS>tesseract testing/phototest.jpg testing/phototest.jpg -l eng pdf
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Directory of C:\Users\User\Downloads\TESS\testing

07/28/15 08:10 55,504 phototest.gif
07/28/15 08:19 0 phototest.gif.pdf
08/28/14 20:38 57,772 phototest.jpg
07/28/15 08:20 61,460 phototest.jpg.pdf
08/28/14 20:38 5,265 phototest.png
07/28/15 08:20 8,890 phototest.png.pdf
07/24/15 12:15 38,668 phototest.tif
07/28/15 08:20 2,910 phototest.tif.pdf
07/28/15 08:20 287 phototest.tif.txt

from tesseract.

jbreiden avatar jbreiden commented on April 28, 2024

Hmmm.... interesting. I suspect this is related to that classic Windows problem
where you can't pass file pointers between different DLLs, especially if they use
different runtimes. If so, we may be in trouble.

from tesseract.

jbreiden avatar jbreiden commented on April 28, 2024

Or... do we still have some ifdefs in the code to do Windows streaming I/O a little differently? I vaguely remember writing some back in the day. Maybe they are misbehaving under Cygwin? Can't seem to find them at the moment.

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Marco is able to get the pdf output from the 3.04.00 version he packaged
for cygwin.

I was testing based on the (3.05.dev version) files that were built by Simon. I do not have
cygwin installed but will try downloading the files from the mirrors Marco
suggested and see what happens.

FYI, I downloaded the MSYS2 tesseract-ocr package for 3.04.00 (packaged by
Alex at
https://github.com/Alexpux/MINGW-packages/tree/master/mingw-w64-tesseract-ocr)
and am able to get the pdf output from it.

ShreeDevi



from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Just to clarify, I am referring to the PDF output from TIF input in the above post.

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

It works with 3.04.00 as packaged by Marco for Cygwin:

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif phototest.tif
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif testing/phototest.tif pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract testing/phototest.tif testing/phototest.tif hocr
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

ra@Shree ~/tesseract-ocr
$ tesseract --list-langs
List of available languages (2):
eng
osd

ra@Shree ~/tesseract-ocr
$ tesseract -v
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.17 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.3

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

ra@Shree ~/tesseract-ocr/testing
$ ls -lrt
total 165
-rwx---r-x 1 ra ra 38668 Jul 29 11:45 phototest.tif
-rwx---r-x 1 ra ra 102598 Jul 29 11:45 eurotext.tif
-rw----r-- 1 ra ra 7712 Jul 29 11:47 phototest.tif.pdf
-rw----r-- 1 ra ra 287 Jul 29 11:48 phototest.tif.txt
-rw----r-- 1 ra ra 8394 Jul 29 11:48 phototest.tif.hocr

from tesseract.

LeeBear35 avatar LeeBear35 commented on April 28, 2024

I went into pbrush and created a Hello World image and saved it as bmp, gif, jpg, png, and tif. When I process those files using tesseract.exe imagefile textfile -l eng, all the files process correctly except the GIF file. I included the GIF and the output below:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/199506_720_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.
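[image: helloworld]
https://cloud.githubusercontent.com/assets/11964590/18092293/6a92bfd8-6e91-11e6-8c27-2e66a0da3114.gif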

Also here is the version dump:

tesseract 3.05.00dev
leptonica-1.73
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3

If I were a guessing man, I would say the problem is the temporary file name /tmp/199506_720_mem.gif, which likely does not conform to MS Windows path conventions.

A little more information: the pixReadMemGif routine makes a call to get a temporary file name, and in doing so that routine tries to ensure that the tmp directory exists. When I created a tmp directory at the root of the drive where I am running tesseract, the GIF file was read correctly. The routine in question is genTempFilename in the Leptonica utils.c file.
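The failure and the workaround can be reproduced in miniature. The sketch below is written in Python rather than Leptonica's C, and `write_temp_file` is a hypothetical helper, not Leptonica's genTempFilename. It only illustrates that writing to a temp path fails when the parent directory is missing and succeeds once the directory has been created.

```python
import os


def write_temp_file(tmp_dir, name, data, ensure_dir=True):
    """Write data to tmp_dir/name and return the full path.

    With ensure_dir=False this mimics the failing case: if tmp_dir does
    not exist, open() raises FileNotFoundError.
    With ensure_dir=True the directory is created first.
    """
    if ensure_dir:
        os.makedirs(tmp_dir, exist_ok=True)
    path = os.path.join(tmp_dir, name)
    with open(path, "wb") as stream:
        stream.write(data)
    return path
```

Creating the directory by hand, as described above, has the same effect as `ensure_dir=True` here.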

from tesseract.

Shreeshrii avatar Shreeshrii commented on April 28, 2024

Maybe Leptonica is not built with the GIF library.


from tesseract.

LeeBear35 avatar LeeBear35 commented on April 28, 2024

After further research, the issue is with the Leptonica utils.c genTempFilename routine. It attempts to ensure that the tmp directory exists on the drive where the program is executing, but fails to create the directory, so the temp file name it returns cannot be created or used. If the tmp directory is created, then the GIF file is processed and extracted correctly.

I updated my post when I discovered this shortcoming.


from tesseract.

matzeri avatar matzeri commented on April 28, 2024

/tmp/199506_720_mem.gif is fine for Cygwin. Are you using a Cygwin build without a proper directory structure?

from tesseract.

jbreiden avatar jbreiden commented on April 28, 2024

I have a number of tempfile patches already written for Leptonica to make these calls more
secure and less brittle, and there is ongoing work on this topic. I actually don't know if
cygwin is using the Unix or Windows code path for temporary files, but just want to
mention that there is activity. Don't know why you are getting bad results compared to
other cygwin users.

https://sources.debian.net/src/leptonlib/1.73-5/debian/patches/

from tesseract.

matzeri avatar matzeri commented on April 28, 2024

@jbreiden Starting from 1.73, it follows the Unix tmp path.

from tesseract.

LeeBear35 avatar LeeBear35 commented on April 28, 2024

It might be that I am running on the e: drive instead of the c: drive and that there was no e:\tmp. It was just a matter of the routine not swapping out /tmp for the Windows temporary directory.
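The kind of substitution meant here can be sketched as follows. This is an illustration in Python, not Leptonica code: `resolve_temp_path` is a hypothetical helper, and it assumes the only paths that need rewriting are those beginning with `/tmp/`.

```python
import os
import tempfile


def resolve_temp_path(unix_path, temp_dir=None):
    """Map a '/tmp/...' style path onto the platform's temporary directory.

    Hypothetical helper. Paths that do not start with '/tmp/' are returned
    unchanged. temp_dir defaults to tempfile.gettempdir(), which consults
    the TMPDIR, TEMP and TMP environment variables before falling back to
    platform defaults.
    """
    prefix = "/tmp/"
    if not unix_path.startswith(prefix):
        return unix_path
    if temp_dir is None:
        temp_dir = tempfile.gettempdir()
    # Split on '/' so nested names such as 'leptonica/x.gif' are rebuilt
    # with the separator of the current platform.
    relative = unix_path[len(prefix):]
    return os.path.join(temp_dir, *relative.split("/"))
```

With such a mapping the temp file no longer depends on a tmp directory existing at the root of whichever drive the program runs from.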


from tesseract.
