Comments (33)
from tesseract.
According to Ray, Cube is a dead end and may be removed from the code soon (see e.g. [1]), so you cannot expect any progress...
[1] https://groups.google.com/d/msg/tesseract-dev/mtIJXoUpfJc/6f0EwVNXOM8J
from tesseract.
@zdenop Since Cube is going away, perhaps this can be closed?
from tesseract.
Let's wait for Ray...
from tesseract.
It has been 'going away' for several years now... :-)
from tesseract.
Now Cube is being discontinued, yet its training procedure has never been published.
Somehow I got the feeling that Cube was purposely sabotaged and withheld from the public.
from tesseract.
The new LSTM based engine is here.
@theraysmith, I see that the Cube engine is still present in the code. Are you going to drop it in the final 4.0 release?
from tesseract.
Actually I have a comment for this.
There is one reason why cube has survived this long: For Hindi cube+tesseract has half the error rate of either on their own. I haven't actually tested that against the new LSTM engine yet, but I will on Monday, and if the new LSTM engine is better, then yes, cube is likely to get the chop for 4.00, and the ifdefs will be very useful.
from tesseract.
Since the hardware requirements for 4.0 are going to be higher than for the 3.xx versions, it would be good to keep the Hindi cube+tesseract version available as well.
Which version are the Hindi accuracy results you mention for: 3.02, 3.03, or 3.04?
from tesseract.
Tests complete. Decision made. Cube is going away in 4.00.
Results:
Engine | Total char errors | Word Recall Errors | Word Precision Errors | Walltime | CPUtime* |
---|---|---|---|---|---|
Tess 3.04 | 13.9 | 30 | 31.2 | 3.0 | 2.8 |
Cube | 15.1 | 29.5 | 30.7 | 3.4 | 3.1 |
Tess+Cube | 11.0 | 24.2 | 25.4 | 5.7 | 5.3 |
LSTM | 7.6 | 20.9 | 20.8 | 1.5 | 2.5 |
Note in the table above that LSTM is faster than Tess 3.04 (without adding Cube) in both wall time and CPU time, and in wall time by a factor of 2.
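To make the comparison concrete, here is a minimal sketch that recomputes the ratios from the figures in the table. The dictionary keys and the `compare` helper are names made up for this illustration; the numbers are copied from the table as reported, with no units assumed.

```python
# Figures copied from the results table above (units as reported there).
# The key names and the compare() helper are hypothetical, for illustration only.
RESULTS = {
    "Tess 3.04": {"char_err": 13.9, "wall": 3.0, "cpu": 2.8},
    "Cube": {"char_err": 15.1, "wall": 3.4, "cpu": 3.1},
    "Tess+Cube": {"char_err": 11.0, "wall": 5.7, "cpu": 5.3},
    "LSTM": {"char_err": 7.6, "wall": 1.5, "cpu": 2.5},
}


def compare(baseline, candidate="LSTM"):
    """Return speedups and relative char-error reduction of candidate vs baseline."""
    b, c = RESULTS[baseline], RESULTS[candidate]
    return {
        "wall_speedup": b["wall"] / c["wall"],
        "cpu_speedup": b["cpu"] / c["cpu"],
        "char_err_reduction": 1 - c["char_err"] / b["char_err"],
    }


if __name__ == "__main__":
    for engine in ("Tess 3.04", "Cube", "Tess+Cube"):
        stats = compare(engine)
        print(
            f"LSTM vs {engine}: "
            f"wall x{stats['wall_speedup']:.2f}, "
            f"cpu x{stats['cpu_speedup']:.2f}, "
            f"char errors -{stats['char_err_reduction']:.0%}"
        )
```

Against Tess 3.04 the wall-time ratio comes out at exactly 2, matching the note above; the CPU-time gain is much smaller.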
from tesseract.
Can you provide some details about the hardware used for the test?
Did you also run the test on a single-core CPU to see the difference?
from tesseract.
And what about the language model used for the test? Is it already available so I can use it for my own tests?
from tesseract.
I'm going to push the data files now.
Got the first ones. My first test with a simple screenshot gave significantly better results with LSTM, but needed 16 minutes of CPU time (instead of 9 seconds) with a debug build of Tesseract (-O0). A release build (-O2) needs 17 seconds with LSTM and 4 seconds without for the same image.
Are there also new data files planned for old German (deu_frak)? I was surprised that the default English model with LSTM could recognize some words.
from tesseract.
I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Old Shoppe for English?
Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly, even with the English data Tesseract was able to recognize at least some words written in Fraktur. Could you give me some hints on how to create the data for deu_frak?
There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and it basically started the modern German language (High German), which is still used today.
from tesseract.
I think it would be great to move this discussion to the (developers) forum. We are already outside the scope of the original issue, and many more people may be interested in the "Fraktur topic"...
from tesseract.
Stefan, please share the binaries for 4.0 alpha for Windows.
@Shreeshrii, they are online now at the usual location. See also the related pull request #511. Please report results either in the developer forum as suggested by @zdenop or by personal mail to me.
from tesseract.
Is there a 3.04 vs 4.0 branch in tessdata for the traineddata files?
https://github.com/tesseract-ocr/tessdata/tree/3.04.00
from tesseract.
Amit. Please add the info to the wiki also, if you have not already done so.
You can do it yourself... :)
from tesseract.
Thank you! I tested a few Devanagari pages with the 4.0 alpha Windows binaries and traineddata for Hindi, Sanskrit, Marathi and Nepali. This was on a Windows 10 netbook: Intel Atom 1.33 GHz CPU, x64-based processor, 32-bit OS, 2 GB RAM. I tested only with single-page images and there was no performance problem on this basic netbook. The accuracy is much improved in the LSTM version. This is from just eyeballing the output (not using any software for comparison).
From a user's point of view, better accuracy may be preferred to speed. So the LSTM-based engine seems the way to go, at least for Devanagari scripts. I will test some of the other Indian languages later.
I have noticed some differences in processing between Hindi and the other Devanagari based languages and will add issues to the tessdata repository.
Thanks to the developers at Google and the tesseract community!
from tesseract.
I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand.
I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I've yet to publish it, but if you have somewhere where I could send/upload it, I'd be glad to.
Or is there a way to create the necessary training files myself? I've had a cursory look through the OCR code and it looked like it needed lstmf files, but I haven't yet found what these are supposed to look like.
from tesseract.
500k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first.
The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.
But even better if there are instructions, looking forward to playing around with training!
from tesseract.
Cube is gone! Removal completed as of 9d6e4f6
from tesseract.
Sad news: Cube is no longer with us.
Cube, you will be missed...
from tesseract.
@jbaiter have you tried 4.0 training for Fraktur?
@theraysmith Is there a way to use the old box-tiff pairs at https://github.com/paalberti/tesseract-dan-fraktur for LSTM training?
Also see tesseract related issue at paalberti/tesseract-dan-fraktur#3
from tesseract.
Is there a way to use the old box-tiff pairs at https://github.com/paalberti/tesseract-dan-fraktur for LSTM training?
There will be a way to generate a box file from a tiff image. The box file will be written in the textline format
#659 (comment)
I started working on this today. I wrote the needed code and it seems to output the desired format, but I need to do some tests before publishing it.
from tesseract.
@amitdo Not sure if that will work for Devanagari, because of the length of unicode string.
Is it possible to just add a box with the tab character at end of each line for existing box files?
from tesseract.
Not sure if that will work for Devanagari, because of the length of unicode string.
We will wait and see...
Is it possible to just add a box with the tab character at end of each line for existing box files?
You mean manually?
You should add box coordinates, not just tab.
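As a rough illustration of that point, the sketch below appends an end-of-line entry with real coordinates to the boxes of one text line. It assumes the usual box-file layout of `symbol left bottom right top page` per line. The helper name `add_line_end_box` and the choice of a 1-pixel-wide box just to the right of the last glyph are assumptions made for this example, not something confirmed in this thread. Splitting an existing box file into text lines is left to the caller.

```python
def add_line_end_box(line_boxes):
    """Append a tab box to the box-file entries of ONE text line.

    line_boxes: list of strings, each 'symbol left bottom right top page'.
    Hypothetical helper: the placement of the tab box is an assumption.
    """
    if not line_boxes:
        return []

    # rsplit keeps the symbol intact even if it is a multi-codepoint string.
    parsed = [entry.rsplit(" ", 5) for entry in line_boxes]

    # Vertical extent of the whole text line (box files use bottom < top).
    bottom = min(int(fields[2]) for fields in parsed)
    top = max(int(fields[4]) for fields in parsed)

    # Place a 1-pixel-wide box immediately right of the last glyph (assumption).
    last_right = int(parsed[-1][3])
    page = parsed[-1][5]
    tab_entry = f"\t {last_right + 1} {bottom} {last_right + 2} {top} {page}"

    return list(line_boxes) + [tab_entry]


if __name__ == "__main__":
    boxes = ["a 10 20 30 40 0", "b 32 18 50 42 0"]
    for entry in add_line_end_box(boxes):
        print(repr(entry))
```

Because the symbol is split off from the right, long Devanagari strings in the first field pass through unchanged.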
from tesseract.