Comments (11)
I'd like to support the original wish. Having something like
tesseract OCR.tif ORIGINAL pdf-overlay
to produce only the text overlay in a PDF file would provide a lot of flexibility. With this, you could write frontends to tesseract that place the invisible text layer on top of something other than OCR.tif (e.g. a full-color version of OCR.tif).
from tesseract.
@olcc Tesseract is a raw OCR engine. Have a look at my project, OCRmyPDF, which provides a nice wrapper around Tesseract and takes care of many details to improve visualization.
from tesseract.
@olcc, the way PDFs are produced has changed significantly in Tesseract 8.04, so I plan to change this in future commits. I'll take your idea into consideration. But as I remember, the new implementation does not produce the text anymore; it outputs directly to the file. Even so, you can still read the file manually and modify it as you wish.
from tesseract.
@olcc: tesseract puts into the PDF the image that you provided as input (i.e. the file you see in the PDF is not optimized for OCR, as you claim). If your experience is different, please provide an example. Otherwise, close the "issue".
from tesseract.
@jbarlow83: Thanks for pointing to the "OCRmyPDF" wrapper.
@ws233: Tesseract 8.04? I'm quite late, I only have 3.04! ;-) (from Debian)
@zdenop: Sorry, I didn't understand your message. Maybe my English is not good enough. My process is the following:
- ORIGINAL.jpg -> OCR.tif (remove colors, apply threshold, etc.)
- tesseract OCR.tif result -l eng pdf
If you say that showing OCR.tif in the PDF is the right thing to do, I disagree in general. I agree this is a very nice feature. However, most people want to have ORIGINAL.jpg with the OCR text.
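For anyone scripting the second step of the process above, here is a minimal Python sketch that only builds the command line shown in the list. The function name `tesseract_pdf_command` is made up for illustration; the arguments are exactly the ones from the command above (`OCR.tif`, `result`, `-l eng`, `pdf`).

```python
import shlex


def tesseract_pdf_command(image, output_base, lang="eng"):
    """Build the argv for `tesseract IMAGE OUTPUTBASE -l LANG pdf`.

    Hypothetical helper: it only assembles the argument list and does
    not run Tesseract.
    """
    # Same argument order as the command quoted in the thread.
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


cmd = tesseract_pdf_command("OCR.tif", "result")
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run it, assuming `tesseract` is on the PATH. Note that this still embeds OCR.tif in the PDF, not ORIGINAL.jpg.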
from tesseract.
What I want to say is that if you run:
tesseract OCR.tif ORIGINAL pdf
then ORIGINAL.tif is included in ORIGINAL.pdf WITHOUT any modification. If you want to include ORIGINAL.jpg instead of OCR.tif, then that is not a tesseract issue ;-)
from tesseract.
@olcc we here fully rely on these "mixed-mode" PDFs as generated by
tesseract OCR.tif ORIGINAL pdf
which works with very high quality, depending on the quality of what you input to tesseract. I hope that the present "pdf" option ( -c tessedit_create_pdf=1 ) will never be dropped from the code.
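A small Python sketch of the two spellings mentioned above, for use in a wrapper script. It assumes, as this comment implies, that the trailing `pdf` argument and `-c tessedit_create_pdf=1` request the same output; the function name `pdf_command_variants` is hypothetical.

```python
def pdf_command_variants(image, output_base):
    """Return two argv lists that both request PDF output.

    Hypothetical helper: it only builds the argument lists and does
    not run Tesseract.
    """
    # Form 1: the "pdf" argument, as used in the commands in this thread.
    via_config = ["tesseract", image, output_base, "pdf"]
    # Form 2: setting the variable directly with -c.
    via_variable = ["tesseract", image, output_base, "-c", "tessedit_create_pdf=1"]
    return via_config, via_variable


via_config, via_variable = pdf_command_variants("OCR.tif", "ORIGINAL")
print(" ".join(via_config))
print(" ".join(via_variable))
```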
from tesseract.
@zdenop, is this functionality documented anywhere?
Could you point me to the exact place in the code where it's implemented?
from tesseract.
@amitdo: it is implemented in pdfrenderer
This is not a real issue (no bug in tesseract), so I am closing it. Please use the tesseract user forum for questions and support.
from tesseract.
ORIGINAL.tif is included in ORIGINAL.pdf WITHOUT any modification
Whenever possible. The design intent is to copy the image bytes without a decompress/compress cycle whenever we can. Sometimes that is impossible (TIFF is an enormously flexible graphics format) and sometimes we haven't quite gotten there. For example, TIFF CCITT Group 4 still goes through a lossless decompress/compress, simply because we haven't done the work to optimize this code path in Tesseract / Leptonica. All relevant Tesseract code is in ai/pdfrenderer.cc, but we try to push the image heavy lifting into Leptonica.
https://en.wikipedia.org/wiki/Tagged_Image_File_Format#TIFF_Compression_Tag
from tesseract.