
tesseract-ocr / tesseract


Tesseract Open Source OCR Engine (main repository)

Home Page: https://tesseract-ocr.github.io/

License: Apache License 2.0

C++ 95.98% C 1.06% Shell 0.35% Java 0.94% Makefile 0.87% CMake 0.75% M4 0.04% Python 0.01%
tesseract tesseract-ocr ocr lstm machine-learning ocr-engine hacktoberfest

tesseract's People

Contributors

amitdo, bertsky, brlin-tw, chrismamo1, cjmayo, egorpugin, eighttails, gerhobbelt, guidovranken, jbreiden, jimregan, nagadomi, nickjwhite, noahmetzger, parryword, randomdsdevel, rfschtkt, robinwatts, robyer, romitat, shatur, shreeshrii, stweil, sundarcf, tfmorris, theraysmith, wikinaut, zamazan4ik, zdenop, zhuangzhuang

tesseract's Issues

stdin option doesn't work for psm 0 and 2

I am using the master branch from GitHub (3.05.00dev) and am running into problems when using the stdin option together with the -psm 0 and -psm 2 options.

cat <image file> | tesseract stdin stdout -psm 0

gives me the error

Error in fopenReadStream: file not found
Error in pixRead: image file not found: stdin
Cannot open input file: stdin

However, the command

cat <image file> | tesseract stdin stdout -psm 1

works as expected.

Am I missing something?

Regards,
Caleb
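
One possible workaround, assuming the failure is limited to reading stdin in these two modes, is to spool the piped bytes to a temporary file and pass that path to tesseract instead. This is only a sketch: the helper names (spool_to_tempfile, build_command) are hypothetical, and the -psm spelling is copied from the commands above.

```python
import os
import tempfile


def spool_to_tempfile(data: bytes, suffix: str = "") -> str:
    """Write piped image bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


def build_command(image_path: str, psm: int) -> list:
    """Build the argument list, using the -psm spelling from the report."""
    return ["tesseract", image_path, "stdout", "-psm", str(psm)]
```

A caller would read sys.stdin.buffer, pass the bytes to spool_to_tempfile, run the list from build_command with subprocess.run, and remove the temporary file afterwards.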

Compile to Emscripten/asm.js

For those of us who know nothing of C, might someone be kind enough to use Emscripten/asm.js to compile to JavaScript on our behalf for use in the browser (without Node.js, etc.)? It would no doubt be quite slow but would be handy for some web apps. The other existing ports (Ocrad and GOCR) do not seem to hold a candle to the quality of Tesseract. Thanks!

Page segmentation output ocr_float

I'm trying to reproduce results achieved at the ICDAR page segmentation competitions [1,2] with tesseract. I'm struggling to get the tool to output the hOCR tags that I'm expecting for tables, figures, etc. [3]. At the moment I'm calling tesseract with pagesegmode 1. Should I be adding other options via a config file to get the full extent of tesseract's segmentation and labelling ability? (I'm less interested in the character recognition element.)

  1. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Book Recognition – HBR2013
  2. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical Newspaper Layout Analysis – HNLA2013
  3. Breuel (2010) The hOCR Embedded OCR Workflow and Output Format
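
To see which hOCR element classes a given run actually produced, the output can be tallied by class name. A minimal sketch using only the Python standard library; the helper names and the sample markup are made up for illustration and are not real tesseract output.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Count class names beginning with ocr_ or ocrx_ in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if cls.startswith(("ocr_", "ocrx_")):
                        self.counts[cls] += 1


def count_hocr_classes(hocr_text: str) -> Counter:
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Hypothetical sample, only to show the shape of the input.
SAMPLE = (
    "<div class='ocr_page'><div class='ocr_carea'>"
    "<p class='ocr_par'><span class='ocr_line'>"
    "<span class='ocrx_word'>Hi</span></span></p></div></div>"
)
```

A count of zero for ocr_float in the result means the run emitted no elements of that class.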

OpenMP support

In ccmain/par_control.cpp you can see this code:

#pragma omp parallel for num_threads(10)

for (int b = 0; b < blobs.size(); ++b) {
...
}

configure.ac should have an option to activate OpenMP in the tesseract code (maybe with AC_OPENMP?).

[Suggestion for discussion] Redesign of tessdata installation e.g. via a text file of needed languages and/or as git submodule (to clone or pull/update all languages automatically)

I would like to suggest a slight redesign of the layout and installation of the training data.

At present, one can install a single language (or a set of languages) by downloading the prepackaged training data as described in https://github.com/tesseract-ocr/tesseract/wiki/Compiling#Language%20Data or by downloading the git repo https://github.com/tesseract-ocr/tessdata .

The process is cumbersome if you always want to use the latest version of, say, the eng, deu, fra, spa, and ita training data automatically.

Without presenting a concrete idea or a PR, I just wanted to start a discussion on whether a redesign is wanted and possible, and what it could look like. It should stay compatible with the current way of downloading and installing language data.

Perhaps a small json or text parameter file could indicate which languages are needed (or are currently installed), and only these files are then updated from git.

Or a "git submodule" approach that automatically clones (or pulls) all languages from https://github.com/tesseract-ocr/tessdata .
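
The text-file idea could look roughly like the sketch below: one language code per line, with '#' starting a comment, turned into a list of download URLs. The file format, the helper names, and the base URL (including the branch name) are assumptions for illustration, not an existing tesseract convention, and a language may need more than a single .traineddata file.

```python
# Assumed location of the raw files; the branch name is a guess.
TESSDATA_RAW = "https://github.com/tesseract-ocr/tessdata/raw/master"


def parse_language_list(text: str) -> list:
    """Return language codes in order, skipping blanks, comments, duplicates."""
    langs = []
    for line in text.splitlines():
        code = line.split("#", 1)[0].strip()
        if code and code not in langs:
            langs.append(code)
    return langs


def traineddata_urls(langs, base=TESSDATA_RAW):
    """Map each language code to the URL of its .traineddata file."""
    return [f"{base}/{lang}.traineddata" for lang in langs]
```

An updater would then fetch only the listed files, which keeps the current manual installation path working unchanged.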

OpenCL segfault

Hardware is an Ubuntu 14.04 laptop with integrated Intel graphics.

./configure --enable-opencl --enable-debug
...
gdb api/.libs/lt-tesseract

(gdb) run testing/phototest.tif -

Starting program: api/.libs/lt-tesseract testing/phototest.tif -
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.

Program received signal SIGSEGV, Segmentation fault.
strlen () at ../sysdeps/x86_64/strlen.S:106
106 ../sysdeps/x86_64/strlen.S: No such file or directory.
(gdb) backtrace
#0  strlen () at ../sysdeps/x86_64/strlen.S:106
#1  0x00007ffff77fe549 in writeProfileToFile (profile=0x81c810, 
    serializer=0x7ffff780752f <serializeScore(ds_device*, void**, unsigned int*)>, file=0x7ffff78aa5e0 "tesseract_opencl_profile_devices.dat")
    at opencl_device_selection.h:268
#2  0x00007ffff7807a09 in OpenclDevice::getDeviceSelection ()
    at openclwrapper.cpp:3427
#3  0x00007ffff7800356 in OpenclDevice::InitOpenclRunEnv_DeviceSelection (
    argc=0) at openclwrapper.cpp:527
#4  0x00007ffff7800074 in OpenclDevice::InitEnv () at openclwrapper.cpp:431
#5  0x00007ffff75f01af in tesseract::TessBaseAPI::Init (this=0x7fffffffda40, 
    datapath=0x0, language=0x405a13 "eng", oem=tesseract::OEM_DEFAULT, 
    configs=0x7fffffffe6a0, configs_size=0, vars_vec=0x7fffffffda00, 
    vars_values=0x7fffffffda20, set_only_non_debug_params=false)
    at baseapi.cpp:299
#6  0x0000000000404317 in main (argc=3, argv=0x7fffffffe688)
    at tesseractmain.cpp:181
...

box.train completes with no errors but does not create .tr output

Hi,
I'm running 3.05.00dev on Ubuntu 14.04 LTS.
When running:
tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train
I'm getting a simple one-line output:
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica

However, no .tr output file is created (anywhere in the filesystem).

My work dir listing is:

Arial.ttf
common.punc
eng.Arial.exp0.box
eng.Arial.exp0.tif
training-text.txt

Running with gdb doesn't give anything additional.

Is there anything I can look at for extra info? Any ideas what might be causing this?

Thanks

corrupt pdf output on cygwin

Using Windows binaries compiled by Simon on Cygwin from http://domasofan.spdns.eu/tesseract/

$ tesseract testing\eurotext.tif testing\eurotext -l eng+deu pdf

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing\eurotext.tif format is 4; unreadable
Error during processing.

The PDF is created, but it cannot be opened.
Adobe Reader reports that it is corrupted.

(Forum thread - https://groups.google.com/forum/#!msg/tesseract-ocr/ToWcnyHqF4c/FHWGlQhd6poJ )

-Wtautological-undefined-compare warning in tabvector.cpp

Another -Wtautological-undefined-compare warning when building for Android using clang and tess-two:

jni/com_googlecode_tesseract_android/src/textord/tabvector.cpp:526:7: warning: 'this' pointer cannot be null in well-defined C++ code; comparison may be assumed to always evaluate to false [-Wtautological-undefined-compare]
  if (this == NULL) {
      ^~~~    ~~~~

Unable to compile 3.04-rc1 under ubuntu 14.04

The same issue is reported in https://code.google.com/p/tesseract-ocr/issues/detail?can=2&start=0&num=100&q=&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&groupby=&sort=&id=1307

But the problem still exists and the ticket is closed.

I got the error while compiling.

The error is:
libtool: link: g++ -std=c++11 -o .libs/tesseract tesseract-tesseractmain.o ./.libs/libtesseract.so -lrt -llept -lpthread
./.libs/libtesseract.so: undefined reference to `l_generateCIDataForPdf'
./.libs/libtesseract.so: undefined reference to `l_CIDataDestroy'


tesseract -v
tesseract 3.03
leptonica-1.72

libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

And here are the deb packages I installed:
"autoconf",
"automake",
"libtool",
"libpng12-dev",
"libjpeg-turbo8-dev",
"g++",
"libtiff5-dev",
"libopencv-dev",
"libopencv-objdetect-dev",
"libopencv-highgui-dev",
"libopencv-legacy-dev",
"libopencv-contrib-dev",
"libopencv-videostab-dev",
"libopencv-superres-dev",
"libopencv-ocl-dev",
"libcv-dev",
"libhighgui-dev",
"libcvaux-dev",
"libtesseract-dev",
"git",
"cmake",
"build-essential",
"libleptonica-dev",
"liblog4cplus-dev",
"libcurl3-dev",
"python2.7-dev",
"tk8.5",
"tcl8.5",
"tk8.5-dev",
"tcl8.5-dev",
"imagemagick"

I basically followed the instructions at https://realpython.com/blog/python/setting-up-a-simple-ocr-server/

Please help. Thanks.
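
Undefined references to l_generateCIDataForPdf and l_CIDataDestroy at link time may mean that -llept resolves to a Leptonica build that does not export these functions. A minimal sketch for checking this, assuming the shared library can be located under the name "lept"; the helper names are hypothetical.

```python
import ctypes
import ctypes.util

# Symbol names taken from the linker errors above.
REQUIRED = ["l_generateCIDataForPdf", "l_CIDataDestroy"]


def missing_symbols(library, names):
    """Return the names that cannot be resolved as attributes of library."""
    return [name for name in names if not hasattr(library, name)]


def load_leptonica():
    """Load the Leptonica shared library, or return None if it is not found."""
    path = ctypes.util.find_library("lept")
    return ctypes.CDLL(path) if path else None
```

If load_leptonica() returns a library and missing_symbols(lib, REQUIRED) is not empty, the Leptonica found on the system likely differs from the one the sources expect.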

Error in findTiffCompression: function not present

I have successfully installed tesseract:
tesseract 3.02.02
leptonica-1.71
libjpeg 6b : libpng 1.2.49 : zlib 1.2.8

My libtiff is located in /usr/lib64:
locate libtiff
/usr/lib64/libtiff.so.3
/usr/lib64/libtiff.so.3.9.4
/usr/lib64/libtiffxx.so.3
/usr/lib64/libtiffxx.so.3.9.4

So I added the following to ~/.bash_profile
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64
and logged in again.

But the error is still there:
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Error in findTiffCompression: function not present
Error in pixReadStreamTiff: function not present
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Unsupported image type.

How can I solve it? Thanks a lot.
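
The version line in this report lists libjpeg, libpng and zlib but no libtiff, unlike the "tesseract -v" output in the 3.04-rc1 build report above. That may indicate that Leptonica was built without TIFF support, which is when it reports "function not present". A small sketch that checks the version output for this; the function names are hypothetical.

```python
import re


def image_libraries(version_output: str) -> dict:
    """Parse lines such as 'libjpeg 6b : libpng 1.2.49 : zlib 1.2.8'."""
    libs = {}
    for line in version_output.splitlines():
        if " : " not in line:
            continue
        for part in line.split(" : "):
            match = re.match(r"\s*(\S+)\s+(\S+)", part)
            if match:
                libs[match.group(1)] = match.group(2)
    return libs


def has_tiff_support(version_output: str) -> bool:
    """True if the version output names libtiff among the image libraries."""
    return "libtiff" in image_libraries(version_output)
```

Setting LD_LIBRARY_PATH would not help in that case, because the TIFF functions would be missing from the Leptonica build itself.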

Cube and combined modes don't work in 3.03

What steps will reproduce the problem?

  1. Try to recognize the attached image with Cube mode. Whitelist is '0123456789'. The result is wrong. It's 1234669890 (6 instead of 5, 9 instead of 7).
  2. Try to recognize the same image with Combined mode. There is a crash with the following error:
    init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 203
    [1] 5562 abort tesseract image_sample.jpg stdout -l rus rus-test
    It seems to happen with the eng locale as well as with the rus locale.

What is the expected output? What do you see instead?
In both cases the output should be 1234567890.

What version of the product are you using? On what operating system?
I've tried tesseract 3.03 both on mac and iOS.

Please provide any additional information below.
There is a related thread in the Tesseract-OCR-iOS wrapper, where the issue was originally found: gali8/Tesseract-OCR-iOS#140. You may ask for any additional info there.

image_sample

error LNK2019: unresolved external symbol __imp__l_CIDataDestroy referenced in function - libtesseract304

Hi team,
We can't figure out why we are getting the errors below while linking, even though it compiles fine.

Versions: tesseract304, leptonica-1.72, liblept168d.lib + other libs (from leptonica-1.68-win32-lib-include-dirs.zip), Visual Studio 2010.

Please help us resolve this issue.

Error details:
Error 1 error LNK2019: unresolved external symbol __imp__l_CIDataDestroy referenced in function "private: static bool __cdecl tesseract::TessPDFRenderer::imageToPDFObj(struct Pix *,char *,long,char * *,long *)" (?imageToPDFObj@TessPDFRenderer@tesseract@@CA_NPAUPix@@PADJPAPADPAJ@Z) D:\tesseract-master\vs2010\libtesseract\pdfrenderer.obj libtesseract304
Error 2 error LNK2019: unresolved external symbol __imp__l_generateCIDataForPdf referenced in function "private: static bool __cdecl tesseract::TessPDFRenderer::imageToPDFObj(struct Pix *,char *,long,char * *,long *)" (?imageToPDFObj@TessPDFRenderer@tesseract@@CA_NPAUPix@@PADJPAPADPAJ@Z) D:\tesseract-master\vs2010\libtesseract\pdfrenderer.obj libtesseract304
Error 3 error LNK2019: unresolved external symbol __imp__pixGenerateCIData referenced in function "private: static bool __cdecl tesseract::TessPDFRenderer::imageToPDFObj(struct Pix *,char *,long,char * *,long *)" (?imageToPDFObj@TessPDFRenderer@tesseract@@CA_NPAUPix@@PADJPAPADPAJ@Z) D:\tesseract-master\vs2010\libtesseract\pdfrenderer.obj libtesseract304
Error 4 error LNK2019: unresolved external symbol __imp__pixSetSpp referenced in function "private: static bool __cdecl tesseract::TessPDFRenderer::imageToPDFObj(struct Pix *,char *,long,char * *,long *)" (?imageToPDFObj@TessPDFRenderer@tesseract@@CA_NPAUPix@@PADJPAPADPAJ@Z) D:\tesseract-master\vs2010\libtesseract\pdfrenderer.obj libtesseract304
Error 5 error LNK2019: unresolved external symbol __imp__pixGetSpp referenced in function "private: static bool __cdecl tesseract::TessPDFRenderer::imageToPDFObj(struct Pix *,char *,long,char * *,long *)" (?imageToPDFObj@TessPDFRenderer@tesseract@@CA_NPAUPix@@PADJPAPADPAJ@Z) D:\tesseract-master\vs2010\libtesseract\pdfrenderer.obj libtesseract304
Error 6 error LNK2019: unresolved external symbol __imp__pixForegroundFraction referenced in function "protected: float __thiscall tesseract::EquationDetect::ComputeForegroundDensity(class TBOX const &)" (?ComputeForegroundDensity@EquationDetect@tesseract@@IAEMABVTBOX@@@Z) D:\tesseract-master\vs2010\libtesseract\equationdetect.obj libtesseract304
Error 7 error LNK2019: unresolved external symbol __imp__pixaConvertToPdf referenced in function "public: static void __cdecl tesseract::LineFinder::FindAndRemoveLines(int,bool,struct Pix *,int *,int *,struct Pix * *,class tesseract::TabVector_LIST *,class tesseract::TabVector_LIST *)" (?FindAndRemoveLines@LineFinder@tesseract@@SAXH_NPAUPix@@PAH2PAPAU3@PAVTabVector_LIST@2@4@Z) D:\tesseract-master\vs2010\libtesseract\linefind.obj libtesseract304
Error 8 error LNK1120: 7 unresolved externals D:\tesseract-master\vs2010\DLL_Debug\libtesseract304d.dll libtesseract304

What is the official source for training data?

Hi,
I'm packaging tesseract for Gentoo Linux. What is the up-to-date download location for training data? We used to download from Google Code, but that looks like it is going to close eventually. Ideally we would like to have pre-packaged .tar.gz files for each language (like we used to have on Google Code). Would that also be possible on GitHub?

Should we still use the language files from here with 3.04.00?

Thanks!

Windows Installer

The wiki says:
"Windows
An installer is available for Windows from our download page."

But as far as I can see, there are no Windows installers to download. Maybe I'm wrong, so please tell me where I can get a current Windows installer.

Issue 1: [ 1669644 ] Crash in letter_is_okay() with trigger

https://code.google.com/p/tesseract-ocr/issues/detail?id=1

Reported by tmbdev, Mar 7, 2007
Filip Gieszczykiewicz - filipg(sf)

recognizing attached tif with v1.03 crashes as follows:
pppppspppppppppspppppppppsppppppppppppppppppppppppppp
Program received signal SIGSEGV, Segmentation fault.
0x080fb8a8 in letter_is_okay (dawg=0xb7f09008, node=0xbf815a04,
char_index=7, prevchar=0 '\0',
word=0xbf815bff "proto-ft", word_end=0) at dawg.cpp:49
49 if (edge_occupied (dawg, edge)) {
(gdb) bt
#0 0x080fb8a8 in letter_is_okay (dawg=0xb7f09008, node=0xbf815a04,
char_index=7, prevchar=0 '\0',
word=0xbf815bff "proto-ft", word_end=0) at dawg.cpp:49
#1 0x080f3b26 in append_next_choice (dawg=0xb7f09008, node=108107,
permuter=5 '\005',
word=0xbf815bff "proto-ft", choices=0x82e7ad0, char_index=7,
this_choice=0x8260df0,
prevchar=0 '\0', limit=0xbf815c28, rating=0, certainty=-1.15637732,
rating_array=0xbf815ab4,
certainty_array=0xbf815b58, word_ending=0, last_word=0,
result=0xbf815a58) at permdawg.cpp:202
#2 0x080f3f03 in dawg_permute (dawg=0xb7f09008, node=108107, permuter=5
'\005',
choices=0x82e7ad0, char_index=7, limit=0xbf815c28, word=0xbf815bff
"proto-ft", rating=0,
certainty=0, rating_array=0xbf815ab4, certainty_array=0xbf815b58,
last_word=0)
at permdawg.cpp:273
#3 0x080f40b3 in dawg_permute_and_select (string=0x814f9fc "system
words:", dawg=0xb7f09008,
permuter=5 '\005', character_choices=0x82e7ad0, best_choice=0x8260d40,
system_words=1)
at permdawg.cpp:334
#4 0x080f5640 in permute_words (char_choices=0x82e7ad0, rating_limit=1000)
at permute.cpp:1611
#5 0x080f6549 in permute_all (char_choices=0x82e7ad0, rating_limit=1000,
raw_choice=0xbf815dc8)
at permute.cpp:1092
#6 0x080f6952 in permute_characters (char_choices=0x82e7ad0, limit=1000,
best_choice=0xbf815dd8,
raw_choice=0xbf815dc8) at permute.cpp:1146
#7 0x080d1ef6 in chop_word_main (word=0x826f830, fx=1,
best_choice=0xbf815dd8,
raw_choice=0xbf815dc8, tester=0 '\0', trainer=0 '\0') at
chopper.cpp:476
#8 0x080cf426 in cc_recog (tessword=0x826f830, best_choice=0xbf815dd8,
best_raw_choice=0xbf815dc8, tester=0 '\0', trainer=0 '\0') at
tface.cpp:247
#9 0x08069a94 in recog_word_recursive (word=0x826e9f0, denorm=0x826be54,
matcher=0x80684a0 <tess_default_matcher(PBLOB*, PBLOB*, PBLOB*, WERD*,
DENORM*, BLOB_CHOICE_LIST&)>, tester=0, trainer=,
testing=0 '\0', raw_choice=@0x826be7c,
blob_choices=0xbf8162b8, outword=@0x826be50) at tfacepp.cpp:191
#10 0x0806a380 in recog_word (word=0x826e9f0, denorm=0x826be54,
matcher=0x80684a0 <tess_default_matcher(PBLOB*, PBLOB*, PBLOB*, WERD*,
DENORM*, BLOB_CHOICE_LIST&)>, tester=0, trainer=0, testing=0 '\0',
raw_choice=@0x826be7c, blob_choices=0xbf8162b8,
outword=@0x826be50) at tfacepp.cpp:90

I don't think it's related to issue 1546972

It is dependent on the specific image - recreating a new TIF with pbmtext
of the contained text does not crash. Also, scaling image -2.0 or +2.0 does
not crash - just this one does.

Argh, image too big for sf.net - see
http://tesseract-ocr.repairfaq.org/downloads/b37by2.tif

box.train segfault

I have no idea what the box.train config is supposed to do, or what
missing data it needs. I just don't like segfaults.

(gdb) run testing/phototest.tif - box.train
Starting program: /tmp/plang/tesseract-3.04.00/api/.libs/lt-tesseract testing/phototest.tif - box.train
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Page 1

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff760fe94 in ELIST_ITERATOR::set_to_list (this=0x7fffffffd3e0, list_to_iterate=0x8) at ../ccutil/elst.h:308
308   prev = list->last;
(gdb) backtrace
#0  0x00007ffff760fe94 in ELIST_ITERATOR::set_to_list (this=0x7fffffffd3e0, list_to_iterate=0x8)
    at ../ccutil/elst.h:308
#1  0x00007ffff77dd367 in PAGE_RES_IT::start_page (this=0x7fffffffd390, empty_ok=false) at pageres.cpp:1510
#2  0x00007ffff76116c7 in PAGE_RES_IT::restart_page (this=0x7fffffffd390) at ../ccstruct/pageres.h:681
#3  0x00007ffff76116a7 in PAGE_RES_IT::PAGE_RES_IT (this=0x7fffffffd390, the_page_res=0x0) at ../ccstruct/pageres.h:665
#4  0x00007ffff761e84f in tesseract::Tesseract::ApplyBoxTraining (this=0x808c00, fontname=..., page_res=0x0)
    at applybox.cpp:780
#5  0x00007ffff7609478 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffd9f0, monitor=0x0) at baseapi.cpp:883
#6  0x00007ffff760a4d9 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffd9f0, pix=0x83fa10, page_index=0, 
    filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, timeout_millisec=0, renderer=0x13850f0)
    at baseapi.cpp:1222
#7  0x00007ffff7609d4e in tesseract::TessBaseAPI::ProcessPagesMultipageTiff (this=0x7fffffffd9f0, 
    data=0x138fc08 "II*", size=38668, filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, 
    timeout_millisec=0, renderer=0x13850f0, tessedit_page_number=-1) at baseapi.cpp:1057
#8  0x00007ffff760a29b in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffd9f0, 
    filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, timeout_millisec=0, renderer=0x13850f0)
    at baseapi.cpp:1176
#9  0x00007ffff7609dc5 in tesseract::TessBaseAPI::ProcessPages (this=0x7fffffffd9f0, 
    filename=0x7fffffffe883 "testing/phototest.tif", retry_config=0x0, timeout_millisec=0, renderer=0x13850f0)
    at baseapi.cpp:1074
#10 0x00000000004031a3 in main (argc=4, argv=0x7fffffffe5f8) at tesseractmain.cpp:316
...

building tesseract under cygwin: training tools don't build

Hi,

I just tried to build the training tools using Cygwin.
The normal tesseract program seems to work fine.

Thanks for helping.
You are all doing a good job, and the recognition rate is now also very good for German texts.

This is what was shown after typing make training:

$ make training
make[1]: Entering directory '/home/Besitzer/tesseractsrc/training'
depbase=`echo boxchar.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT boxchar.lo -MD -MP -MF $depb
ase.Tpo -c -o boxchar.lo boxchar.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT boxchar.lo -MD -MP -MF .deps/boxchar.Tpo -c boxchar.cpp -o b
oxchar.o
depbase=`echo commandlineflags.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT commandlineflags.lo -MD -MP
-MF $depbase.Tpo -c -o commandlineflags.lo commandlineflags.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT commandlineflags.lo -MD -MP -MF .deps/commandlineflags.Tpo -
c commandlineflags.cpp -o commandlineflags.o
depbase=`echo commontraining.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT commontraining.lo -MD -MP -M
F $depbase.Tpo -c -o commontraining.lo commontraining.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT commontraining.lo -MD -MP -MF .deps/commontraining.Tpo -c co
mmontraining.cpp -o commontraining.o
depbase=`echo degradeimage.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT degradeimage.lo -MD -MP -MF
$depbase.Tpo -c -o degradeimage.lo degradeimage.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT degradeimage.lo -MD -MP -MF .deps/degradeimage.Tpo -c degrad
eimage.cpp -o degradeimage.o
depbase=`echo fileio.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT fileio.lo -MD -MP -MF $depba
se.Tpo -c -o fileio.lo fileio.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT fileio.lo -MD -MP -MF .deps/fileio.Tpo -c fileio.cpp -o file
io.o
depbase=`echo ligature_table.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT ligature_table.lo -MD -MP -M
F $depbase.Tpo -c -o ligature_table.lo ligature_table.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT ligature_table.lo -MD -MP -MF .deps/ligature_table.Tpo -c li
gature_table.cpp -o ligature_table.o
depbase=`echo normstrngs.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT normstrngs.lo -MD -MP -MF $d
epbase.Tpo -c -o normstrngs.lo normstrngs.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT normstrngs.lo -MD -MP -MF .deps/normstrngs.Tpo -c normstrngs
.cpp -o normstrngs.o
depbase=`echo pango_font_info.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT pango_font_info.lo -MD -MP -
MF $depbase.Tpo -c -o pango_font_info.lo pango_font_info.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT pango_font_info.lo -MD -MP -MF .deps/pango_font_info.Tpo -c
pango_font_info.cpp -o pango_font_info.o
pango_font_info.cpp: In member function 'bool tesseract::PangoFontInfo::ParseFon
tDescription(const PangoFontDescription*)':
pango_font_info.cpp:223:46: error: 'strcasestr' was not declared in this scope
is_fraktur_ = (strcasestr(family, "Fraktur") != NULL);
^
Makefile:875: recipe for target 'pango_font_info.lo' failed
make[1]: *** [pango_font_info.lo] Error 1
make[1]: Leaving directory '/home/Besitzer/tesseractsrc/training'
Makefile:880: recipe for target 'training' failed
make: *** [training] Error 2

Besitzer@simon ~/tesseractsrc
$ make training-install
make[1]: Entering directory '/home/Besitzer/tesseractsrc/training'
depbase=`echo pango_font_info.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;
/bin/sh ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -O2
-DNDEBUG -DUSE_STD_NAMESPACE -DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../cc
util -I../ccstruct -I../viewer -I../textord -I../dict -I../classify -I../display
-I../wordrec -I../cutil -I../vs2010/port -I/usr/include/leptonica -D_REENTRANT
-I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I
/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/in
clude/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng16 -I/usr/include/f
reetype2 -I/usr/include/libpng16 -std=gnu++11 -MT pango_font_info.lo -MD -MP -
MF $depbase.Tpo -c -o pango_font_info.lo pango_font_info.cpp &&
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: g++ -DHAVE_CONFIG_H -I. -I.. -O2 -DNDEBUG -DUSE_STD_NAMESPACE
-DPANGO_ENABLE_ENGINE -I../ccmain -I../api -I../ccutil -I../ccstruct -I../viewe
r -I../textord -I../dict -I../classify -I../display -I../wordrec -I../cutil -I..
/vs2010/port -I/usr/include/leptonica -D_REENTRANT -I/usr/include/pango-1.0 -I/u
sr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/cairo -I/usr/incl
ude/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/
freetype2 -I/usr/include/libpng16 -I/usr/include/freetype2 -I/usr/include/libpng
16 -std=gnu++11 -MT pango_font_info.lo -MD -MP -MF .deps/pango_font_info.Tpo -c
pango_font_info.cpp -o pango_font_info.o
pango_font_info.cpp: In member function 'bool tesseract::PangoFontInfo::ParseFon
tDescription(const PangoFontDescription*)':
pango_font_info.cpp:223:46: error: 'strcasestr' was not declared in this scope
is_fraktur_ = (strcasestr(family, "Fraktur") != NULL);
^
Makefile:875: recipe for target 'pango_font_info.lo' failed
make[1]: *** [pango_font_info.lo] Error 1
make[1]: Leaving directory '/home/Besitzer/tesseractsrc/training'
Makefile:882: recipe for target 'training-install' failed
make: *** [training-install] Error 2

Besitzer@simon ~/tesseractsrc
$
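
strcasestr is a nonstandard extension, so some toolchains do not declare it by default. One possible workaround is a small local fallback. The sketch below is only an illustration: the name portable_strcasestr is hypothetical and not part of Tesseract, and whether a fallback or a feature-test macro is the better fix for this build is an assumption left open.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical portable stand-in for strcasestr(); the name is illustrative
// and not part of Tesseract. Returns a pointer to the first case-insensitive
// match of needle in haystack, or nullptr when there is none.
const char* portable_strcasestr(const char* haystack, const char* needle) {
  assert(haystack != nullptr && needle != nullptr);
  const std::size_t needle_len = std::strlen(needle);
  if (needle_len == 0) return haystack;  // an empty needle matches at the start
  for (const char* start = haystack; *start != '\0'; ++start) {
    std::size_t i = 0;
    // Compare character by character, ignoring case, until a mismatch or
    // the end of either string.
    while (i < needle_len && start[i] != '\0' &&
           std::tolower(static_cast<unsigned char>(start[i])) ==
               std::tolower(static_cast<unsigned char>(needle[i]))) {
      ++i;
    }
    if (i == needle_len) return start;
  }
  return nullptr;
}
```

With such a helper, the failing line could read `is_fraktur_ = (portable_strcasestr(family, "Fraktur") != NULL);`.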

-Wtautological-undefined-compare warning in split.cpp

I get a -Wtautological-undefined-compare warning when building for Android using clang and tess-two:

jni/com_googlecode_tesseract_android/src/ccstruct/split.cpp:223:7: warning: 'this' pointer cannot be null in well-defined C++ code; comparison may be assumed to always evaluate to true [-Wtautological-undefined-compare]
  if (this != NULL) {
      ^~~~    ~~~~
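
A common way to silence this warning is to drop the `this` check from the member function and test the pointer at the call site instead. The sketch below shows the pattern only: Node and ValueOrDefault are hypothetical names, not the code in split.cpp.

```cpp
// Hypothetical stand-in for a class whose member function used to test
// `this != NULL`; it is not the real class from split.cpp.
struct Node {
  int value;
  // The member assumes a valid object, so no `this` check is needed.
  int Value() const { return value; }
};

// The null test moves to the caller, where comparing a pointer with nullptr
// is well defined.
int ValueOrDefault(const Node* node, int fallback) {
  return node != nullptr ? node->Value() : fallback;
}
```

The caller-side check keeps the old "null means nothing to do" behaviour without calling a member function through a null pointer.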

TessBaseAPICreate/Init causes segmentation fault when executed multiple times.

I am putting this here again, as nobody seems to be interested in it on the Google Code forums:

I am not sure if this behavior is intended or if I'm just using the C-API wrong.

I use the C API of the latest git master version of Tesseract through Python's ctypes module (loading the shared object). Calling TessBaseAPICreate() multiple times works like a charm, but initializing more than 3 of the returned API handles without first destroying them with TessBaseAPIDelete() causes a segmentation fault. I have two gdb dumps: the first is from the latest git Tesseract with debug symbols, the second from Tesseract 3.03 from the Debian jessie repositories.

[0] http://l.unchti.me/tess_dbg.txt
[1] http://l.unchti.me/gdb.txt

how to use ProcessPages api through C code

Previously ProcessPages returned a char*, but now it returns only true or false.
Where can I now find the text detected by OCR?

I see there are result renderers, but they write to a text or PDF file. How do I get the text returned as a char*?

--version: git version number is not updated when only doing "git pull; make; sudo make install"

I have a working repo and installation of Tesseract. When I pull the latest version from git, then make and install, the git version information printed by tesseract --version is not updated.

I am reporting this as a bug, although I am not sure that is the correct term. Please close the issue if the report is wrong because the short sequence (pull, make, install) is not supposed to be used.

I simply do not know whether the following short sequence was ever designed to work.

How to reproduce the bug:

git pull
make
sudo make install

The following sequence however works, but recompiles everything:

make distclean
git pull
./autogen.sh
make
sudo make install

Training tesseract to recognize text from images

Is there a way to train Tesseract to recognize a limited amount of text from an image? I am making a small app that recognizes a printed list of topics, and so far, using the tess-two library, Tesseract does not fully recognize any of the text in the image. I am quite new to OCR, so I'm not sure how to make this work. All the training instructions I've seen so far require a font file, which I don't have. All I have are different images of the printed text.

How do I train tesseract to recognize the text from that? Where do I start?

orientation detection affected by output renderer

Orientation detection works if we ask for PDF output and fails if we ask for text output. Weird.

tesseract -l deu+eng /bw_2015-03-01_081225.png bad
tesseract -l deu+eng /bw_2015-03-01_081225.png good pdf

good.txt

R etwas zu studieren, zu verstehen und
dann an andere weiterzugeben; Material zu
entwickeln, mit dem etwas leicht und
verständlich vermittelt werden kann. 2 1 O ?

bad.txt

.CONuQmDNED S1H Ev.<80><94>0.1 Cw
DHCHQHQU md? <80><9E>CO<80><94>UMD? HMUBOL DNNU EQ<81><80><94>Om<82>wz
ohu<81><82>d mmm<81>v <80><9E>Gwhm<81>v<80><94>ho 5N\Ch<81><80><94>_ 3N Om

bw_2015-03-01_081225

OS X Yosemite with OpenCL: --enable-opencl now compiles but OCR fails

It seems that any OpenCL operation on my OS X Yosemite machine tries to allocate an extremely large memory block, and the allocation fails.

The size of the attempted allocation is 1125865547108352 bytes or in hex, 0x3fff800001000, which looks special.
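
As a quick arithmetic check of why the number looks special: it is just under 1 PiB and decomposes into a few powers of two. The constants below are illustrative names, and the decomposition says nothing certain about the cause. It is merely consistent with a corrupted or uninitialised size rather than a real request.

```cpp
#include <cstdint>

// The failed allocation size reported by malloc, in bytes.
constexpr std::uint64_t kFailedSize = 1125865547108352ULL;

// The same value written as hex and as a sum of powers of two.
constexpr std::uint64_t kAsHex = 0x3fff800001000ULL;
constexpr std::uint64_t kAsPowers =
    (1ULL << 50) - (1ULL << 35) + (1ULL << 12);

// Both checks are evaluated at compile time.
static_assert(kFailedSize == kAsHex, "decimal and hex forms agree");
static_assert(kFailedSize == kAsPowers, "size is 2^50 - 2^35 + 2^12");
```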

OpenCL otherwise works on my machine. I use the Python OpenCV library and a commercial application that uses OpenCL.

Aside from whatever is happening here, it also looks like a bug that the profile data gets written even if OpenCL fails. I highly doubt my graphics card and processor give identical performance, so it looks like some invalid calculation takes place and the results are then saved.

Testing --list-langs

Checking for languages in an OpenCL binary:

set -x TESSDATA_PREFIX /usr/local/Cellar/tesseract/3.03rc1_3/share   # Homebrew tesseract 3.03
/opt/tesseract-opencl/bin/tesseract --list-langs

Results

[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.

[DS] Device: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL) evaluation...
tesseract(9135,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
OpenCL error code is -54 at   when clEnqueueNDRangeKernel .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannels .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannelsReduction .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_ThresholdRectToPix .
[DS] Device: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL) evaluated
[DS]          composeRGBPixel: 1540962513.969949 (w=1.2)
[DS]            HistogramRect: 1540962513.969949 (w=2.4)
[DS]       ThresholdRectToPix: 1540962513.969949 (w=4.5)
[DS]        getLineMasksMorph: 1204940900.030019 (w=5.0)
[DS]                    Score: 18506500096.000000

[DS] Device: "GeForce GT 755M" (OpenCL) evaluation...
tesseract(9135,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
[DS] Device: "GeForce GT 755M" (OpenCL) evaluated
[DS]          composeRGBPixel: 1540962513.969949 (w=1.2)
[DS]            HistogramRect: 1540962513.969949 (w=2.4)
[DS]       ThresholdRectToPix: 1540962513.969949 (w=4.5)
[DS]        getLineMasksMorph: 1204940900.030019 (w=5.0)
[DS]                    Score: 18506500096.000000

[DS] Device: "(null)" (Native) evaluation...
[DS] Device: "(null)" (Native) evaluated
[DS]          composeRGBPixel: 256.000000 (w=1.2)
[DS]            HistogramRect: 256.000000 (w=2.4)
[DS]       ThresholdRectToPix: 256.000000 (w=4.5)
[DS]        getLineMasksMorph: 4294966736.000000 (w=5.0)
[DS]                    Score: 21474836480.000000
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 1:Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz score is 18506500096.000000
[DS] Device[2] 1:GeForce GT 755M score is 18506500096.000000
[DS] Device[3] 0:(null) score is 21474836480.000000
[DS] Selected Device[1]: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL)
tesseract(9135,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
List of available languages (2):
eng
osd

Subsequent executions try to use the OpenCL profile results but still get errors:

[DS] Profile read from file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 1:Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz score is 18506500096.000000
[DS] Device[2] 1:GeForce GT 755M score is 18506500096.000000
[DS] Device[3] 0:(null) score is 21474836480.000000
[DS] Selected Device[1]: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL)
tesseract(9139,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
List of available languages (2):
eng
osd

Testing OCR of JPEG to PDF

set -x TESSDATA_PREFIX /usr/local/Cellar/tesseract/3.03rc1_3/share   # Homebrew tesseract 3.03
/opt/tesseract-opencl/bin/tesseract tests/resources/congress.jpg tessopencl -l eng pdf

Result:

Tesseract Open Source OCR Engine v3.04.01dev with Leptonica
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.

[DS] Device: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL) evaluation...
tesseract(9120,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
OpenCL error code is -54 at   when clEnqueueNDRangeKernel .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannels .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannelsReduction .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_ThresholdRectToPix .
[DS] Device: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL) evaluated
[DS]          composeRGBPixel: 1539474209.312102 (w=1.2)
[DS]            HistogramRect: 1539474209.312102 (w=2.4)
[DS]       ThresholdRectToPix: 1539474209.312102 (w=4.5)
[DS]        getLineMasksMorph: 1345623668.687865 (w=5.0)
[DS]                    Score: 19197859840.000000
[DS] Device: "GeForce GT 755M" (OpenCL) evaluation...
tesseract(9120,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
[DS] Device: "GeForce GT 755M" (OpenCL) evaluated
[DS]          composeRGBPixel: 1539474209.312102 (w=1.2)
[DS]            HistogramRect: 1539474209.312102 (w=2.4)
[DS]       ThresholdRectToPix: 1539474209.312102 (w=4.5)
[DS]        getLineMasksMorph: 1345623668.687865 (w=5.0)
[DS]                    Score: 19197859840.000000

[DS] Device: "(null)" (Native) evaluation...
[DS] Device: "(null)" (Native) evaluated
[DS]          composeRGBPixel: 256.000000 (w=1.2)
[DS]            HistogramRect: 256.000000 (w=2.4)
[DS]       ThresholdRectToPix: 256.000000 (w=4.5)
[DS]        getLineMasksMorph: 4294966736.000000 (w=5.0)
[DS]                    Score: 21474836480.000000
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 1:Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz score is 19197859840.000000
[DS] Device[2] 1:GeForce GT 755M score is 19197859840.000000
[DS] Device[3] 0:(null) score is 21474836480.000000
[DS] Selected Device[1]: "Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz" (OpenCL)
tesseract(9120,0x7fff7a778300) malloc: *** mach_vm_map(size=1125865547108352) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Warning in pixReadMemJpeg: work-around: writing to a temp file
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannels .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannelsReduction .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_ThresholdRectToPix .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannels .
OpenCL error code is -54 at   when clEnqueueNDRangeKernel kernel_HistogramRectAllChannelsReduction .

Versions

tesseract 3.04.01dev
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.6.18 : libtiff 4.0.4 : zlib 1.2.5

 OpenCL info:
  Found 1 platforms.
  Platform name: Apple.
  Version: OpenCL 1.2 (May 10 2015 19:38:45).
  Found 2 devices.
    Device 1 name: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz.
    Device 2 name: GeForce GT 755M.

copy the title instead of holding the pointer

This is a report from someone who makes Tesseract work on Android.
We can make his life easier by copying the string instead of holding onto
the pointer.

https://github.com/tesseract-ocr/tesseract/blob/master/api/renderer.cpp#L55

In separate news, I see that the renderers do not properly escape
that title before they put it into PDF or hOCR output. Maybe somebody
will worry about that some day.

Here's the relevant part of the report.

I can pass a title from Java through JNI to Tesseract's BeginDocument() just
fine and that title will show up properly in the PDF. But if after calling
BeginDocument() I release the array of bytes representing that title using
ReleaseStringUTFChars in my
Java_com_googlecode_tesseract_android_TessBaseAPI_nativeBeginDocument
method in JNI [1], then the title will show up in the PDF as garbled text, apparently
read from uninitialized memory. I'm guessing this means that Tesseract needs
the reference to that char* to stay around.

https://gist.github.com/rmtheis/19965abdfca5c2c9eb26
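
The proposed change amounts to storing an owned copy of the title rather than the caller's pointer. A minimal sketch of that idea follows. TitleHoldingRenderer is a hypothetical class used only for illustration and is not Tesseract's renderer.

```cpp
#include <string>

// Hypothetical renderer that only illustrates the ownership issue; it is not
// Tesseract's TessResultRenderer.
class TitleHoldingRenderer {
 public:
  // Copying into std::string means the caller may free or reuse its buffer
  // right after this call returns.
  void BeginDocument(const char* title) { title_ = title ? title : ""; }

  const std::string& title() const { return title_; }

 private:
  std::string title_;  // owned copy, not a borrowed pointer
};
```

With an owned copy, a JNI caller could release the UTF chars immediately after BeginDocument() without affecting the title written later.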

Please fix the compile with clang

In file included from ./blamer.h:27:
./matrix.h:292:63: error: reinterpret_cast from 'nullptr_t' to 'BLOB_CHOICE_LIST *' is not allowed
    : BandTriMatrix<BLOB_CHOICE_LIST *>(dimension, bandwidth, NOT_CLASSIFIED) {}
                                                              ^~~~~~~~~~~~~~
./matrix.h:33:24: note: expanded from macro 'NOT_CLASSIFIED'
#define NOT_CLASSIFIED reinterpret_cast<BLOB_CHOICE_LIST*>(NULL)
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

clang-3.4.1
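
The error suggests that NULL expands to nullptr in this toolchain, and the language does not allow reinterpret_cast from nullptr_t. One possible fix is to use static_cast, which accepts any null pointer constant. In the sketch below, the forward declaration and DefaultCell are stand-ins so the example compiles on its own. They are not the definitions from matrix.h.

```cpp
// Stand-in forward declaration so the sketch compiles on its own; the real
// type lives in Tesseract's headers.
class BLOB_CHOICE_LIST;

// static_cast accepts a null pointer constant, whether it is spelled 0 or
// nullptr, so this form avoids the reinterpret_cast error.
#define NOT_CLASSIFIED static_cast<BLOB_CHOICE_LIST*>(nullptr)

// Hypothetical example use: a default value for an unclassified cell.
inline BLOB_CHOICE_LIST* DefaultCell() { return NOT_CLASSIFIED; }
```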

Getting "Error: Illegal sample size!" when using custom language

Hello there!

I have done some training and created my own combined language file. When using my custom language, I get error code 5000:

$ tesseract waa.whatevva.exp0.tif out -l waa

Error: Illegal sample size!
signal_termination_handler:Error:Signal_termination_handler called:Code 5000
Abort trap: 6

When I ran combine_tessdata, I got the following output:

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is 140
Offset for type 1 is 141
Offset for type 2 is -1
Offset for type 3 is 2214
Offset for type 4 is 299822
Offset for type 5 is 300066
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 300067
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1

Does anybody have an idea of what I'm doing wrong?

Format specifier warning in pdfrenderer.cpp

I get a format specifier warning when building for Android using clang and tess-two:

jni/com_googlecode_tesseract_android/src/api/pdfrenderer.cpp:537:28: warning: format specifies type 'long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
               "stream\n", len);
                           ^~~

I think it should be %zd or %zu instead of %ld, but I'm not sure if that works on Visual Studio too.
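
A small sketch of the two usual options follows. `%zu` is the C99/C++11 conversion for size_t. Older Visual Studio runtimes may not support it, in which case a cast to unsigned long with `%lu` is a common fallback. The function names and the "bytes" format string are illustrative and not taken from pdfrenderer.cpp.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a size_t with "%zu", the standard conversion for size_t, so no
// cast is needed on conforming C libraries.
std::string FormatLength(std::size_t len) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%zu bytes", len);
  return std::string(buf);
}

// Fallback for C libraries without "%zu": cast to unsigned long and use
// "%lu". This truncates only where size_t is wider than unsigned long.
std::string FormatLengthPortable(std::size_t len) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%lu bytes",
                static_cast<unsigned long>(len));
  return std::string(buf);
}
```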

tesseract-3.04.00 fails to compile with --disable-graphics

When ScrollView is disabled in configure, linking fails because the library still uses ScrollView:

libtool: link: g++ -std=c++11 -o .libs/tesseract tesseract-tesseractmain.o  ./.libs/libtesseract.so -lrt -llept -lpthread
./.libs/libtesseract.so: undefined reference to `ScrollView::Brush(ScrollView::Color)'
...
./.libs/libtesseract.so: undefined reference to `window_wait(ScrollView*)'
./.libs/libtesseract.so: undefined reference to `ScrollView::TextAttributes(char const*, int, bool, bool, bool)'
collect2: error: ld returned 1 exit status
Makefile:577: recipe for target 'tesseract' failed

Issue 2: [ 1608107 ] Conditional jump or move depends on uninitialised value(s)

https://code.google.com/p/tesseract-ocr/issues/detail?id=2

2007-03-07T22:24:30.000Z
Reported by tmbdev, Mar 7, 2007
Emmanuel Fleury - efleury(sf)

Hi all,

I ran valgrind on the bokeoa-64bit-branch branch (with a patch submitted by
me) and I found the following problem:

==12801== Conditional jump or move depends on uninitialised value(s)
==12801== at 0x4A4B5E: IntegerMatcher(INT_CLASS_STRUCT*, unsigned long*,
unsigned long*, unsigned short, short, INT_FEATURE_STRUCT*, int, unsigned
char, INT_RESULT_STRUCT*, int) (intmatcher.cpp:1038)
==12801== by 0x49D7F9: AdaptToChar(blobstruct*, LINE_STATS*, unsigned
char, float) (adaptmatch.cpp:1285)
==12801== by 0x49E3DE: AdaptToWord(wordstruct*, textrowstruct*, char
const*, char const*, char const*) (adaptmatch.cpp:725)
==12801== by 0x427AD6: tess_adapter(WERD*, DENORM*, char const*, char
const*, char const*) (tessbox.cpp:349)
==12801== by 0x40DC33: classify_word_pass1(WERD_RES*, ROW*, unsigned
char, CHAR_SAMPLES_LIST*, CHAR_SAMPLE_LIST*) (control.cpp:611)
==12801== by 0x40E06C: recog_all_words(PAGE_RES*, ETEXT_DESC volatile*)
(control.cpp:295)
==12801== by 0x4044D6: recognize_page(STRING&) (tessedit.cpp:159)
==12801== by 0x4034A3: main (tesseractmain.cpp:104)

This bug is also present on 32-bit systems, so it does not depend on the
architecture.

Comments

Date: 2007-02-01 00:48
Sender: efleury
Logged In: YES
user_id=122014
Originator: YES

Great !

But, did you checkout ??? I cannot get my local CVS archive to get any
update... :-/

Date: 2007-01-31 15:34
Sender: theraysmith (Project Admin)
Logged In: YES
user_id=1515161
Originator: NO

This is fixed in 1.03. It was causing the adaptive classifier to not get
used enough.

Date: 2006-12-15 13:30
Sender: filipg
Logged In: YES
user_id=37894
Originator: NO

I think this is a FALSE-ALARM from valgrind (another one below):

Breakpoint 1, IntegerMatcher (ClassTemplate=0x925d8d8,
ProtoMask=0x91de6f0,
ConfigMask=0xbf925a78, BlobLength=47, NumFeatures=53, Features=0xbf925270,
min_misses=0,
NormalizationFactor=0 '\0', Result=0xbf926114, Debug=0) at
intmatcher.cpp:1043
1043 if (Features[Feature].CP_misses >= min_misses) {
(gdb) list
1042 for (Feature = 0, used_features = 0; Feature < NumFeatures;
Feature++) {
1043 if (Features[Feature].CP_misses >= min_misses) {
1044 IMUpdateTablesForFeature (ClassTemplate, ProtoMask,
ConfigMask,
1045 Feature, &(Features[Feature]),
1046 FeatureEvidence, SumOfFeatureEvidence,
1047 ProtoEvidence, Debug);
1048 used_features++;
1049 }
(gdb) print min_misses
$5 = 0
(gdb) print Feature
$6 = 0
(gdb) print Features[Feature]
$7 = {X = 97 'a', Y = 30 '\036', Theta = 192 '\300', CP_misses = 0
'\0'}

Looks OK to me... The same seems to be true for "Source and destination
overlap in strcpy". Take a look:

Breakpoint 1, fix_quotes (string=0x86b5b31 "\"'License'');",
word=0x87dc530,
blob_choices=0xbfca8f58) at control.cpp:1034
1034 strcpy (ptr + 1, ptr + 2); //shuffle up
(gdb) list 1029
1029 for (ptr = string;
1030 *ptr != '\0'; ptr++, blob_it.forward (), choice_it.forward ())
{
1031 if ((*ptr == '\'' || *ptr == '`')
1032 && (*(ptr + 1) == '\'' || *(ptr + 1) == '`')) {
1033 *ptr = '"'; //turn to double
1034 strcpy (ptr + 1, ptr + 2); //shuffle up
(gdb) print ptr+1
$1 = 0x86b5b32 "'License'');"
(gdb) print ptr+2
$2 = 0x86b5b33 "License'');"

Looks fine to me. Valgrind pointed to the above twice, as both recognition
passes call fix_quotes():

==1993== 3 errors in context 1 of 4:
==1993== Source and destination overlap in strcpy(0x469B2FA, 0x469B2FB)
==1993== at 0x4006AAD: strcpy (mc_replace_strmem.c:106)
==1993== by 0x805333E: fix_quotes(char*, WERD*,
BLOB_CHOICE_LIST_CLIST*) (control.cpp:1034)
==1993== by 0x8054DBD: classify_word_pass1(WERD_RES*, ROW*, unsigned
char, CHAR_SAMPLES_LIST*, CHAR_SAMPLE_LIST*) (control.cpp:592)
==1993== by 0x80554C2: recog_all_words(PAGE_RES*, ETEXT_DESC volatile*)
(control.cpp:317)
==1993== by 0x804B9EB: recognize_page(STRING&) (tessedit.cpp:187)
==1993== by 0x804A869: main (tesseractmain.cpp:454)
==1993==
==1993== 6 errors in context 2 of 4:
==1993== Source and destination overlap in strcpy(0x46998D2, 0x46998D3)
==1993== at 0x4006AAD: strcpy (mc_replace_strmem.c:106)
==1993== by 0x805333E: fix_quotes(char*, WERD*,
BLOB_CHOICE_LIST_CLIST*) (control.cpp:1034)
==1993== by 0x8053B4C: match_word_pass2(WERD_RES*, ROW*, float)
(control.cpp:913)

Guess tesseract needs its own valgrind_suppressions.sh...

I've been meaning to play with valgrind for an unrelated reason, and looked
into this report while installing it. Very nice program. Found my rare
corruption issue in an in-house app with it, and 6 other potential problems!

Check it out if you haven't: http://www.valgrind.org/ (painless Linux
install)

Cheers,
Fil
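
A note on the strcpy (ptr + 1, ptr + 2) report in the comment above: the C standard leaves strcpy undefined when source and destination overlap, even if a particular implementation happens to copy correctly, so the valgrind message is arguably not a false alarm. A minimal sketch of an overlap-safe replacement using memmove follows. ShuffleUp is a hypothetical helper name, not a function in control.cpp.

```cpp
#include <cassert>
#include <cstring>

// Removes the character at ptr[1] by shifting the rest of the string,
// including the terminating '\0', one position to the left. memmove is
// defined for overlapping ranges, unlike strcpy.
void ShuffleUp(char* ptr) {
  // The string must hold at least two characters before its terminator.
  assert(ptr != nullptr && ptr[0] != '\0' && ptr[1] != '\0');
  std::memmove(ptr + 1, ptr + 2, std::strlen(ptr + 2) + 1);
}
```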

Add a new config file 'pdftxt' to create PDF and TEXT output at the same time

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-dev/XllxjvK5HtU/C4mebS6lcJoJ

Jeff suggested that users create a myconfig file. I think it would be useful to actually ship that configuration as 'pdftxt'.

tessedit_create_txt 1
tessedit_create_pdf 1

Then make sure that you invoke the command line such that
Tesseract writes to files instead of stdout, e.g.

tesseract myimage.tif myoutput pdftxt

This will read myimage.tif and pdftxt (config file), and produce myoutput.pdf and myoutput.txt

STATS output on library usage

Using Tesseract as a library, I get a ton of this information printed to the console:

Total count=0
Min=0.00 Really=0
Lower quartile=0.00
Median=0.00, ile(0.5)=0.00
Upper quartile=0.00
Max=0.00 Really=0
Range=1
Mean= 0.00
SD= 0.00
Bottom=0, top=38, base=0, x=0

Is there any option or way to disable this?

Crash in ComputeGradient (coutln.cpp)

I'm currently seeing a SIGBUS crash in ComputeGradient in an Android app I'm working on. I can reliably reproduce the crash on a specific phone (Samsung Galaxy S4 Mini, Snapdragon 400), on another device (OnePlus One, Snapdragon 801)

Backtrace (app name removed):

D/CrashAnrDetector( 656): backtrace:
D/CrashAnrDetector( 656): #00 pc 000a746c /data/app-lib/APPNAME/libtess.so
D/CrashAnrDetector( 656): #1 pc 000a8935 /data/app-lib/APPNAME/libtess.so (C_OUTLINE::ComputeEdgeOffsets(int, Pix*)+160)
D/CrashAnrDetector( 656): #2 pc 000b81b1 /data/app-lib/APPNAME/libtess.so
D/CrashAnrDetector( 656): #3 pc 000a3a1d /data/app-lib/APPNAME/libtess.so (BLOBNBOX::ComputeEdgeOffsets(Pix*, Pix*, BLOBNBOX_LIST*)+212)
D/CrashAnrDetector( 656): #4 pc 000a42b7 /data/app-lib/APPNAME/libtess.so (TO_BLOCK::ComputeEdgeOffsets(Pix*, Pix*)+14)
D/CrashAnrDetector( 656): #5 pc 0011f099 /data/app-lib/APPNAME/libtess.so (tesseract::Textord::TextordPage(tesseract::PageSegMode, FCOORD const&, int, int, Pix*, Pix*, Pix*, bool, BLOBNBOX_LIST*, BLOCK_LIST*, TO_BLOCK_LIST*)+72)

I added some debug output to ComputeGradient, and it turns out it crashes when y = -2. After adding a few more lines of debug logging in ComputeEdgeOffsets, I see that start.y() is 2 larger than 'height'. If it crashes, it always does so on line 2 of ComputeGradient. SIGBUS would imply an unaligned memory access, but as far as I can tell that function only deals with single-byte accesses, which shouldn't cause an issue. The -2 y coordinate also seems suspect, but I don't know enough about Tesseract and Pix to know whether that might be a problem.
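
If the out-of-range row does turn out to be the problem, one defensive option is to clamp coordinates before touching pixel memory. The sketch below shows that guard on a plain byte buffer. ClampRow and PixelAt are hypothetical helpers, the buffer layout is an assumption, and none of this is Leptonica's Pix or the actual ComputeGradient code. It is a guard pattern, not a diagnosis.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: clamps a row index into [0, height - 1] so a lookup
// one or two rows outside the image reads the nearest valid row instead.
int ClampRow(int y, int height) {
  assert(height > 0);
  return std::max(0, std::min(y, height - 1));
}

// Hypothetical 8-bit grey image stored row by row; not Leptonica's Pix.
// Both coordinates are clamped before the buffer is indexed.
unsigned char PixelAt(const unsigned char* data, int width, int height,
                      int x, int y) {
  assert(data != nullptr && width > 0);
  const int cx = std::max(0, std::min(x, width - 1));
  return data[ClampRow(y, height) * width + cx];
}
```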

PDF output without image

Hello,

I noticed the new "pdf" option in Tesseract, which creates a PDF file with the image and the background text. That's great!

But usually, the image given to Tesseract is not as nice as the starting image (because it is optimized for OCR, not for human viewing). Maybe it would be useful to also offer the intermediate step, i.e. a PDF of the generated text without the image, so that the user can overlay it as background text with pdftk, for example.
