
box.train segfault (tesseract issue, closed)

tesseract-ocr commented on April 28, 2024
box.train segfault

from tesseract.

Comments (14)

jimregan commented on April 28, 2024

Not enough input. In short, box.train needs both an image, and a box file, and from those it creates training data. For a more complete explanation, see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training
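
A pre-flight check for the two inputs could look roughly like the sketch below. It assumes the box file sits next to the image with the extension replaced by `.box`; `BoxPathFor`, `IsReadable` and `HasTrainingInputs` are hypothetical helper names for illustration, not part of the Tesseract API.

```cpp
#include <fstream>
#include <string>

// Assumption: the box file lives next to the image and has the same name
// with the extension replaced by ".box".
std::string BoxPathFor(const std::string& image_path) {
  const std::string::size_type dot = image_path.rfind('.');
  const std::string::size_type slash = image_path.find_last_of("/\\");
  // Only treat the dot as an extension separator if it is in the file name,
  // not in one of the parent directories.
  const bool has_ext = dot != std::string::npos &&
                       (slash == std::string::npos || dot > slash);
  return (has_ext ? image_path.substr(0, dot) : image_path) + ".box";
}

// True if the file can be opened for reading.
bool IsReadable(const std::string& path) {
  std::ifstream file(path.c_str());
  return file.good();
}

// Hypothetical check to run before box training: both inputs must exist.
bool HasTrainingInputs(const std::string& image_path) {
  return IsReadable(image_path) && IsReadable(BoxPathFor(image_path));
}
```

A caller would run `HasTrainingInputs(image_path)` before starting training and report a normal error when it returns false.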


jbreiden commented on April 28, 2024

Tesseract should return an error when there is insufficient input, not segfault.


jimregan commented on April 28, 2024

Re-opening, as requested.


jimregan commented on April 28, 2024

Something like this? (Using the variable also supports box.train.stderr)

diff --git a/api/tesseractmain.cpp b/api/tesseractmain.cpp
index e7abadf..7ddbc93 100644
--- a/api/tesseractmain.cpp
+++ b/api/tesseractmain.cpp
@@ -306,6 +306,11 @@ int main(int argc, char **argv) {
   if (b) renderers.push_back(new tesseract::TessBoxTextRenderer(outputbase));
   api.GetBoolVariable("tessedit_create_txt", &b);
   if (b) renderers.push_back(new tesseract::TessTextRenderer(outputbase));
+  api.GetBoolVariable("tessedit_train_from_boxes", &b);
+  if (b && !strcmp(outputbase, "-")) {
+    fprintf(stderr, "Box input from stdin not supported in box training.\n");
+    exit(1);
+  }
   if (!renderers.empty()) {
     // Since the PointerVector auto-deletes, null-out the renderers that are
     // added to the root, and leave the root in the vector.


jbreiden commented on April 28, 2024

Pretty good! But even more robust is to locate the lower level function that is crashing
due to bad data. Then modify it to return an error instead of crashing. That protects us
even if it gets called from a different code path.
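
The idea of guarding the lower-level function might be sketched as follows. `PageResults`, `PageIterator` and `ApplyTraining` are simplified, hypothetical stand-ins for illustration only; they are not the real `PAGE_RES`, `PAGE_RES_IT` or `ApplyBoxTraining`.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the page results structure.
struct PageResults {
  std::vector<int> blocks;
};

// Hypothetical stand-in for the iterator that walks the page results.
class PageIterator {
 public:
  // Returns false instead of dereferencing a null page, so every caller gets
  // a checkable error no matter which code path reached this point.
  bool Start(const PageResults* page_res) {
    if (page_res == nullptr) {
      std::fprintf(stderr, "PageIterator: no page results to iterate over\n");
      return false;
    }
    page_res_ = page_res;
    index_ = 0;
    return true;
  }

  bool AtEnd() const {
    return page_res_ == nullptr || index_ >= page_res_->blocks.size();
  }

 private:
  const PageResults* page_res_ = nullptr;
  std::size_t index_ = 0;
};

// Hypothetical caller: turns the failed start into a negative status code.
int ApplyTraining(const PageResults* page_res) {
  PageIterator it;
  if (!it.Start(page_res)) return -1;
  return 0;
}
```

With the check at this level, a command-line guard becomes an extra convenience rather than the only thing standing between bad input and a crash.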


jimregan commented on April 28, 2024

I think it's a matter for broader discussion.

On the one hand, it's The Right Thing, and I've already done The Wrong Thing by closing an issue that mentions a segfault... but it's an exceptional case. One, because it overloads what is normally the output file position to be a secondary input, and two, because it's not a frequent use case.


zdenop commented on April 28, 2024

I tried it on openSUSE 13.2 64-bit and it did not crash:

tesseract testing/phototest.tif - box.train
Page 1
APPLY_BOXES:
Boxes read from boxfile: 225
Found 225 good blobs.
Generated training data for 60 words

Warning in pixReadMemTiff: tiff page 1 not found

ls -t | head -n 1
-.tr

Just OCR to stdout worked as expected:

tesseract testing/phototest.tif -
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

Warning in pixReadMemTiff: tiff page 1 not found


jbreiden commented on April 28, 2024

TESSDATA_PREFIX=/usr/share/tesseract-ocr  valgrind api/.libs/lt-tesseract testing/phototest.tif testing/phototest.tif - box.train
==11666== Memcheck, a memory error detector
==11666== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11666== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11666== Command: api/.libs/lt-tesseract testing/phototest.tif testing/phototest.tif - box.train
==11666== 
Tesseract Open Source OCR Engine v3.05.00dev-11-gd937659 with Leptonica
Page 1
==11666== Invalid read of size 8
==11666==    at 0x4FA8E9C: ELIST_ITERATOR::set_to_list(ELIST*) (elst.h:308)
==11666==    by 0x514C108: PAGE_RES_IT::start_page(bool) (pageres.cpp:1510)
==11666==    by 0x4FAA6CE: PAGE_RES_IT::restart_page() (pageres.h:681)
==11666==    by 0x4FAA6AE: PAGE_RES_IT::PAGE_RES_IT(PAGE_RES*) (pageres.h:665)
==11666==    by 0x4FB781E: tesseract::Tesseract::ApplyBoxTraining(STRING const&, PAGE_RES*) (applybox.cpp:797)
==11666==    by 0x4FA2477: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:883)
==11666==    by 0x4FA34D8: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1222)
==11666==    by 0x4FA2D4D: tesseract::TessBaseAPI::ProcessPagesMultipageTiff(unsigned char const*, unsigned long, char const*, char const*, int, tesseract::TessResultRenderer*, int) (baseapi.cpp:1057)
==11666==    by 0x4FA329A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1176)
==11666==    by 0x4FA2DC4: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1074)
==11666==    by 0x403192: main (tesseractmain.cpp:318)
==11666==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==11666== 
==11666== 
==11666== Process terminating with default action of signal 11 (SIGSEGV)
==11666==  Access not within mapped region at address 0x8
==11666==    at 0x4FA8E9C: ELIST_ITERATOR::set_to_list(ELIST*) (elst.h:308)
==11666==    by 0x514C108: PAGE_RES_IT::start_page(bool) (pageres.cpp:1510)
==11666==    by 0x4FAA6CE: PAGE_RES_IT::restart_page() (pageres.h:681)
==11666==    by 0x4FAA6AE: PAGE_RES_IT::PAGE_RES_IT(PAGE_RES*) (pageres.h:665)
==11666==    by 0x4FB781E: tesseract::Tesseract::ApplyBoxTraining(STRING const&, PAGE_RES*) (applybox.cpp:797)
==11666==    by 0x4FA2477: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:883)
==11666==    by 0x4FA34D8: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1222)
==11666==    by 0x4FA2D4D: tesseract::TessBaseAPI::ProcessPagesMultipageTiff(unsigned char const*, unsigned long, char const*, char const*, int, tesseract::TessResultRenderer*, int) (baseapi.cpp:1057)
==11666==    by 0x4FA329A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1176)
==11666==    by 0x4FA2DC4: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1074)
==11666==    by 0x403192: main (tesseractmain.cpp:318)
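
One plausible reading of `Address 0x8` in the trace above is a null page-results pointer: taking the address of a member that lives 8 bytes into the structure and then reading through it touches address 0x8. The sketch below only illustrates that arithmetic. The struct layout is an assumption made for the example, not the actual `PAGE_RES` definition.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical list head; reading its first field reads offset 0 of the list.
struct List {
  void* last;
};

// Hypothetical layout: two 32-bit counters followed by the list member.
struct PageResults {
  std::int32_t char_count;
  std::int32_t rej_count;
  List block_list;
};

// Offset of the list member inside the structure. If the structure pointer
// is null, code that reads the list head ends up reading this address.
constexpr std::size_t kBlockListOffset = offsetof(PageResults, block_list);

// Address that would be touched when reading block_list.last via a null page.
constexpr std::uintptr_t FaultAddressForNullPage() {
  return static_cast<std::uintptr_t>(kBlockListOffset + offsetof(List, last));
}
```

Under this assumed layout the computed address is 8, which would match both the "Invalid read of size 8" line and the SIGSEGV address.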


zdenop commented on April 28, 2024

@jbreiden: In openSUSE 13.2 I do not have api/.libs/lt-tesseract, just api/.libs/tesseract. Also, your command lists testing/phototest.tif twice, so the output is testing/phototest.tif.tr

And I got this:

TESSDATA_PREFIX=/usr/share/ valgrind api/.libs/tesseract testing/phototest.tif testing/phototest.tif - box.train
==21845== Memcheck, a memory error detector
==21845== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==21845== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==21845== Command: api/.libs/tesseract testing/phototest.tif testing/phototest.tif - box.train
==21845== 
Tesseract Open Source OCR Engine v3.05.00dev-11-gd937659 with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     225
   Found 225 good blobs.
Generated training data for 60 words
Warning in pixReadMemTiff: tiff page 1 not found
==21845== 
==21845== HEAP SUMMARY:
==21845==     in use at exit: 19,795,264 bytes in 33 blocks
==21845==   total heap usage: 874,639 allocs, 874,606 frees, 60,486,144 bytes allocated
==21845== 
==21845== LEAK SUMMARY:
==21845==    definitely lost: 0 bytes in 0 blocks
==21845==    indirectly lost: 0 bytes in 0 blocks
==21845==      possibly lost: 19,753,408 bytes in 24 blocks
==21845==    still reachable: 41,856 bytes in 9 blocks
==21845==         suppressed: 0 bytes in 0 blocks
==21845== Rerun with --leak-check=full to see details of leaked memory
==21845== 
==21845== For counts of detected and suppressed errors, rerun with: -v
==21845== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

For the "stdout version" I got this:

TESSDATA_PREFIX=/usr/share/ valgrind api/.libs/tesseract testing/phototest.tif - box.train
==11238== Memcheck, a memory error detector
==11238== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==11238== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==11238== Command: api/.libs/tesseract ../tesseract-ocr/testing/phototest.tif - box.train
==11238== 
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     225
   Found 225 good blobs.
Generated training data for 60 words









Warning in pixReadMemTiff: tiff page 1 not found
==11238== 
==11238== HEAP SUMMARY:
==11238==     in use at exit: 19,795,264 bytes in 33 blocks
==11238==   total heap usage: 874,629 allocs, 874,596 frees, 60,485,260 bytes allocated
==11238== 
==11238== LEAK SUMMARY:
==11238==    definitely lost: 0 bytes in 0 blocks
==11238==    indirectly lost: 0 bytes in 0 blocks
==11238==      possibly lost: 19,753,408 bytes in 24 blocks
==11238==    still reachable: 41,856 bytes in 9 blocks
==11238==         suppressed: 0 bytes in 0 blocks
==11238== Rerun with --leak-check=full to see details of leaked memory
==11238== 
==11238== For counts of detected and suppressed errors, rerun with: -v
==11238== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)


amitdo commented on April 28, 2024

@jbreiden, please test with latest commit.


jbreiden commented on April 28, 2024

make
make install
gdb /usr/local/bin/tesseract
(gdb) run testing/phototest.tif - box.train

Program received signal SIGSEGV, Segmentation fault.
PAGE_RES_IT::start_page (this=this@entry=0x7fffffffde10, empty_ok=empty_ok@entry=false) at pageres.cpp:1510
1510      block_res_it.set_to_list(&page_res->block_res_list);
(gdb) backtrace
#0  PAGE_RES_IT::start_page (this=this@entry=0x7fffffffde10, empty_ok=empty_ok@entry=false) at pageres.cpp:1510
#1  0x00007ffff76e6f29 in restart_page (this=0x7fffffffde10) at ../ccstruct/pageres.h:681
#2  PAGE_RES_IT (the_page_res=<optimized out>, this=0x7fffffffde10) at ../ccstruct/pageres.h:665
#3  tesseract::Tesseract::ApplyBoxTraining (this=0x819810, fontname=..., page_res=<optimized out>) at applybox.cpp:797
#4  0x00007ffff76dd926 in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffffffe450, monitor=monitor@entry=0x0) at baseapi.cpp:883
#5  0x00007ffff76ddc2a in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffe450, pix=0x84fd10, page_index=<optimized out>, filename=<optimized out>, 
    retry_config=0x0, timeout_millisec=0, renderer=0x0) at baseapi.cpp:1224
#6  0x00007ffff76de10b in tesseract::TessBaseAPI::ProcessPagesMultipageTiff (this=0x7fffffffe450, data=0x0, data@entry=0x13a0828 "II*", size=8, filename=0x0, 
    filename@entry=0x7fffffffe85a "testing/phototest.tif", retry_config=retry_config@entry=0x0, timeout_millisec=20909344, timeout_millisec@entry=0, 
    renderer=0x0, tessedit_page_number=-1) at baseapi.cpp:1057
#7  0x00007ffff76de5fe in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffffffe450, filename=<optimized out>, 
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=0x0) at baseapi.cpp:1176
#8  0x00007ffff76dea40 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffe450, filename=<optimized out>, 
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at baseapi.cpp:1074
#9  0x0000000000401dff in main (argc=<optimized out>, argv=0x7fffffffe5e8) at tesseractmain.cpp:432


amitdo commented on April 28, 2024

I can reproduce this.

I reread this issue. Jim's explanation is still true.

box.train needs both an image, and a box file, and from those it creates training data. For a more complete explanation, see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#run-tesseract-for-training


jbreiden commented on April 28, 2024

I'd suggest something like this. I didn't check to see if we leak memory
if we go down this error path, but no matter what it is better than a segfault.

--- baseapi.cpp.orig    2016-02-04 01:09:07.790101916 +0000
+++ baseapi.cpp 2016-02-04 01:07:15.464620603 +0000
@@ -851,6 +851,9 @@
     page_res_ = new PAGE_RES(false,
                              block_list_, &tesseract_->prev_word_best_choice_);
   }
+  if (page_res_ == NULL) {
+    return -1;
+  }
   if (tesseract_->tessedit_make_boxes_from_boxes) {
     tesseract_->CorrectClassifyWords(page_res_);
     return 0;
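
On the open question of leaks along the new error path, one common way to make an early return safe is to hold intermediate allocations in owning smart pointers. The following is a minimal sketch under that assumption. `BlockList`, `PageResults`, `ApplyBoxes` and `Recognize` here are simplified, hypothetical stand-ins and do not reflect how baseapi.cpp actually manages its members.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins for the structures built during recognition.
struct BlockList {
  std::vector<int> blocks;
};
struct PageResults {
  std::vector<int> words;
};

// Hypothetical: yields no page results when the box data cannot be applied.
std::unique_ptr<PageResults> ApplyBoxes(bool boxes_available) {
  if (!boxes_available) return nullptr;
  return std::unique_ptr<PageResults>(new PageResults());
}

// Returns -1 on missing page results and 0 on success. The unique_ptr
// members release their memory on every return path, including the early one.
int Recognize(bool boxes_available) {
  std::unique_ptr<BlockList> block_list(new BlockList());
  std::unique_ptr<PageResults> page_res = ApplyBoxes(boxes_available);
  if (page_res == nullptr) {
    return -1;  // block_list is freed automatically here
  }
  return 0;
}
```

The caller only needs to treat a negative return value as a failed page.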


amitdo commented on April 28, 2024

Tested. It works - no segfault.

