Comments (14)
Not enough input. In short, box.train needs both an image, and a box file, and from those it creates training data. For a more complete explanation, see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training
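As a rough illustration of the "image plus box file" pairing, the sketch below derives the box file name that would be expected next to an image. It assumes the convention of replacing the image's extension with `.box`; the helper name `BoxFileNameSketch` is hypothetical and is not Tesseract's own function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the Tesseract API.
// Assumption: the box file sits next to the image and shares its base name,
// with the extension replaced by ".box".
std::string BoxFileNameSketch(const std::string& image_filename) {
  // Cut at the last '.'; if there is none, keep the whole name.
  // Note: a '.' inside a directory name would also match here.
  const std::string::size_type dot = image_filename.rfind('.');
  const std::string base = (dot == std::string::npos)
                               ? image_filename
                               : image_filename.substr(0, dot);
  return base + ".box";
}

// Small usage example for the helper above.
void BoxFileNameExample() {
  assert(BoxFileNameSketch("testing/phototest.tif") == "testing/phototest.box");
}
```

If no such box file is available, box.train has nothing to pair with the image, which is the "not enough input" situation described above.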
from tesseract.
Tesseract should return an error when there is insufficient input, not segfault.
from tesseract.
Re-opening, as requested.
from tesseract.
Something like this? (Using the variable also supports box.train.stderr)
diff --git a/api/tesseractmain.cpp b/api/tesseractmain.cpp
index e7abadf..7ddbc93 100644
--- a/api/tesseractmain.cpp
+++ b/api/tesseractmain.cpp
@@ -306,6 +306,11 @@ int main(int argc, char **argv) {
   if (b) renderers.push_back(new tesseract::TessBoxTextRenderer(outputbase));
   api.GetBoolVariable("tessedit_create_txt", &b);
   if (b) renderers.push_back(new tesseract::TessTextRenderer(outputbase));
+  api.GetBoolVariable("tessedit_train_from_boxes", &b);
+  if (b && !strcmp(outputbase, "-")) {
+    fprintf(stderr, "Box input from stdin not supported in box training.\n");
+    exit(1);
+  }
   if (!renderers.empty()) {
     // Since the PointerVector auto-deletes, null-out the renderers that are
     // added to the root, and leave the root in the vector.
from tesseract.
Pretty good! But an even more robust fix is to locate the lower-level function that crashes
on bad data, then modify it to return an error instead of crashing. That protects us
even if it gets called from a different code path.
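A minimal sketch of that idea, using stand-in types rather than the real Tesseract classes: the low-level routine checks its own input and reports failure through its return value, so every caller is protected. The names `PageResultsSketch` and `ApplyTrainingSketch` are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for a page-results object; NOT the real PAGE_RES.
struct PageResultsSketch {
  int word_count;
};

// Hypothetical low-level routine. Returns 0 on success, -1 on bad input,
// instead of dereferencing a null pointer.
int ApplyTrainingSketch(const PageResultsSketch* page_res) {
  if (page_res == nullptr) {
    std::fprintf(stderr, "No page results to train from.\n");
    return -1;
  }
  std::printf("Generated training data for %d words\n", page_res->word_count);
  return 0;
}

// Usage example: both the bad-input and the normal path return cleanly.
void ApplyTrainingExample() {
  assert(ApplyTrainingSketch(nullptr) == -1);
  const PageResultsSketch page = {60};
  assert(ApplyTrainingSketch(&page) == 0);
}
```

The guard lives in the callee, so a second code path that reaches the same routine with missing data gets the same error rather than a crash.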
from tesseract.
I think it's a matter for broader discussion.
On the one hand, it's The Right Thing, and I've already done The Wrong Thing by closing an issue that mentions a segfault... but it's an exceptional case. One, because it overloads what is normally the output file position to be a secondary input, and two, because it's not a frequent use case.
from tesseract.
I tried it on openSUSE 13.2 64-bit and it did not crash:
tesseract testing/phototest.tif - box.train
Page 1
APPLY_BOXES:
Boxes read from boxfile: 225
Found 225 good blobs.
Generated training data for 60 words
Warning in pixReadMemTiff: tiff page 1 not found
ls -t | head -n 1
-.tr
Just OCR to stdout worked as expected:
tesseract testing/phototest.tif -
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
Warning in pixReadMemTiff: tiff page 1 not found
from tesseract.
TESSDATA_PREFIX=/usr/share/tesseract-ocr valgrind api/.libs/lt-tesseract testing/phototest.tif testing/phototest.tif - box.train
==11666== Memcheck, a memory error detector
==11666== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11666== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11666== Command: api/.libs/lt-tesseract testing/phototest.tif testing/phototest.tif - box.train
==11666==
Tesseract Open Source OCR Engine v3.05.00dev-11-gd937659 with Leptonica
Page 1
==11666== Invalid read of size 8
==11666== at 0x4FA8E9C: ELIST_ITERATOR::set_to_list(ELIST*) (elst.h:308)
==11666== by 0x514C108: PAGE_RES_IT::start_page(bool) (pageres.cpp:1510)
==11666== by 0x4FAA6CE: PAGE_RES_IT::restart_page() (pageres.h:681)
==11666== by 0x4FAA6AE: PAGE_RES_IT::PAGE_RES_IT(PAGE_RES*) (pageres.h:665)
==11666== by 0x4FB781E: tesseract::Tesseract::ApplyBoxTraining(STRING const&, PAGE_RES*) (applybox.cpp:797)
==11666== by 0x4FA2477: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:883)
==11666== by 0x4FA34D8: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1222)
==11666== by 0x4FA2D4D: tesseract::TessBaseAPI::ProcessPagesMultipageTiff(unsigned char const*, unsigned long, char const*, char const*, int, tesseract::TessResultRenderer*, int) (baseapi.cpp:1057)
==11666== by 0x4FA329A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1176)
==11666== by 0x4FA2DC4: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1074)
==11666== by 0x403192: main (tesseractmain.cpp:318)
==11666== Address 0x8 is not stack'd, malloc'd or (recently) free'd
==11666==
==11666==
==11666== Process terminating with default action of signal 11 (SIGSEGV)
==11666== Access not within mapped region at address 0x8
==11666== at 0x4FA8E9C: ELIST_ITERATOR::set_to_list(ELIST*) (elst.h:308)
==11666== by 0x514C108: PAGE_RES_IT::start_page(bool) (pageres.cpp:1510)
==11666== by 0x4FAA6CE: PAGE_RES_IT::restart_page() (pageres.h:681)
==11666== by 0x4FAA6AE: PAGE_RES_IT::PAGE_RES_IT(PAGE_RES*) (pageres.h:665)
==11666== by 0x4FB781E: tesseract::Tesseract::ApplyBoxTraining(STRING const&, PAGE_RES*) (applybox.cpp:797)
==11666== by 0x4FA2477: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:883)
==11666== by 0x4FA34D8: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1222)
==11666== by 0x4FA2D4D: tesseract::TessBaseAPI::ProcessPagesMultipageTiff(unsigned char const*, unsigned long, char const*, char const*, int, tesseract::TessResultRenderer*, int) (baseapi.cpp:1057)
==11666== by 0x4FA329A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1176)
==11666== by 0x4FA2DC4: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1074)
==11666== by 0x403192: main (tesseractmain.cpp:318)
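The faulting address 0x8 is what one would expect when code takes the address of a member through a null object pointer and then reads through it. The sketch below illustrates this with a stand-in struct; the layout (two 32-bit counters followed by a list member) is an assumption made for illustration and is not claimed to be the actual PAGE_RES definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; the field layout is assumed for illustration only.
struct FakeList {
  void* last;  // a list header holding a single pointer
};

struct FakePageRes {
  std::int32_t char_count;
  std::int32_t rej_count;
  FakeList block_res_list;
};

// If a FakePageRes pointer were null (address 0), the address of its
// block_res_list member would equal the member's offset. offsetof computes
// that number safely, without dereferencing anything.
std::size_t MemberAddressInNullObject() {
  return offsetof(FakePageRes, block_res_list);
}

// Usage example: on common 32-bit and 64-bit ABIs the two 4-byte counters
// place the list member at offset 8.
void NullOffsetExample() {
  assert(MemberAddressInNullObject() == 8);
}
```

Under that assumption, an 8-byte read of the list header through a null page pointer lands at address 0x8, which matches the "Invalid read of size 8" and "Address 0x8" lines in the trace.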
from tesseract.
@jbreiden: On openSUSE 13.2 I do not have api/.libs/lt-tesseract, just api/.libs/tesseract. Also, you pass testing/phototest.tif twice, so the output goes to testing/phototest.tif.tr
And I got this:
TESSDATA_PREFIX=/usr/share/ valgrind api/.libs/tesseract testing/phototest.tif testing/phototest.tif - box.train
==21845== Memcheck, a memory error detector
==21845== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==21845== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==21845== Command: api/.libs/tesseract testing/phototest.tif testing/phototest.tif - box.train
==21845==
Tesseract Open Source OCR Engine v3.05.00dev-11-gd937659 with Leptonica
Page 1
APPLY_BOXES:
Boxes read from boxfile: 225
Found 225 good blobs.
Generated training data for 60 words
Warning in pixReadMemTiff: tiff page 1 not found
==21845==
==21845== HEAP SUMMARY:
==21845== in use at exit: 19,795,264 bytes in 33 blocks
==21845== total heap usage: 874,639 allocs, 874,606 frees, 60,486,144 bytes allocated
==21845==
==21845== LEAK SUMMARY:
==21845== definitely lost: 0 bytes in 0 blocks
==21845== indirectly lost: 0 bytes in 0 blocks
==21845== possibly lost: 19,753,408 bytes in 24 blocks
==21845== still reachable: 41,856 bytes in 9 blocks
==21845== suppressed: 0 bytes in 0 blocks
==21845== Rerun with --leak-check=full to see details of leaked memory
==21845==
==21845== For counts of detected and suppressed errors, rerun with: -v
==21845== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
For the "stdout version" I got this:
TESSDATA_PREFIX=/usr/share/ valgrind api/.libs/tesseract testing/phototest.tif - box.train
==11238== Memcheck, a memory error detector
==11238== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11238== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==11238== Command: api/.libs/tesseract ../tesseract-ocr/testing/phototest.tif - box.train
==11238==
Page 1
APPLY_BOXES:
Boxes read from boxfile: 225
Found 225 good blobs.
Generated training data for 60 words
Warning in pixReadMemTiff: tiff page 1 not found
==11238==
==11238== HEAP SUMMARY:
==11238== in use at exit: 19,795,264 bytes in 33 blocks
==11238== total heap usage: 874,629 allocs, 874,596 frees, 60,485,260 bytes allocated
==11238==
==11238== LEAK SUMMARY:
==11238== definitely lost: 0 bytes in 0 blocks
==11238== indirectly lost: 0 bytes in 0 blocks
==11238== possibly lost: 19,753,408 bytes in 24 blocks
==11238== still reachable: 41,856 bytes in 9 blocks
==11238== suppressed: 0 bytes in 0 blocks
==11238== Rerun with --leak-check=full to see details of leaked memory
==11238==
==11238== For counts of detected and suppressed errors, rerun with: -v
==11238== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
from tesseract.
@jbreiden, please test with latest commit.
from tesseract.
make
make install
gdb /usr/local/bin/tesseract
(gdb) run testing/phototest.tif - box.train
Program received signal SIGSEGV, Segmentation fault.
PAGE_RES_IT::start_page (this=this@entry=0x7fffffffde10, empty_ok=empty_ok@entry=false) at pageres.cpp:1510
1510 block_res_it.set_to_list(&page_res->block_res_list);
(gdb) backtrace
#0 PAGE_RES_IT::start_page (this=this@entry=0x7fffffffde10, empty_ok=empty_ok@entry=false) at pageres.cpp:1510
#1 0x00007ffff76e6f29 in restart_page (this=0x7fffffffde10) at ../ccstruct/pageres.h:681
#2 PAGE_RES_IT (the_page_res=<optimized out>, this=0x7fffffffde10) at ../ccstruct/pageres.h:665
#3 tesseract::Tesseract::ApplyBoxTraining (this=0x819810, fontname=..., page_res=<optimized out>) at applybox.cpp:797
#4 0x00007ffff76dd926 in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffffffe450, monitor=monitor@entry=0x0) at baseapi.cpp:883
#5 0x00007ffff76ddc2a in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffe450, pix=0x84fd10, page_index=<optimized out>, filename=<optimized out>,
retry_config=0x0, timeout_millisec=0, renderer=0x0) at baseapi.cpp:1224
#6 0x00007ffff76de10b in tesseract::TessBaseAPI::ProcessPagesMultipageTiff (this=0x7fffffffe450, data=0x0, data@entry=0x13a0828 "II*", size=8, filename=0x0,
filename@entry=0x7fffffffe85a "testing/phototest.tif", retry_config=retry_config@entry=0x0, timeout_millisec=20909344, timeout_millisec@entry=0,
renderer=0x0, tessedit_page_number=-1) at baseapi.cpp:1057
#7 0x00007ffff76de5fe in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffffffe450, filename=<optimized out>,
retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x0) at baseapi.cpp:1176
#8 0x00007ffff76dea40 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffe450, filename=<optimized out>,
retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1074
#9 0x0000000000401dff in main (argc=<optimized out>, argv=0x7fffffffe5e8) at tesseractmain.cpp:432
from tesseract.
I can reproduce this.
I reread this issue. Jim's explanation is still true.
box.train needs both an image, and a box file, and from those it creates training data. For a more complete explanation, see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training
from tesseract.
I'd suggest something like this. I didn't check to see if we leak memory
if we go down this error path, but no matter what it is better than a segfault.
--- baseapi.cpp.orig 2016-02-04 01:09:07.790101916 +0000
+++ baseapi.cpp 2016-02-04 01:07:15.464620603 +0000
@@ -851,6 +851,9 @@
     page_res_ = new PAGE_RES(false,
                              block_list_, &tesseract_->prev_word_best_choice_);
   }
+  if (page_res_ == NULL) {
+    return -1;
+  }
   if (tesseract_->tessedit_make_boxes_from_boxes) {
     tesseract_->CorrectClassifyWords(page_res_);
     return 0;
from tesseract.
Tested. It works - no segfault.
from tesseract.