Comments (23)
ocropus-rtrain creates the codec (the union of target characters) via read_text + lstm.normalize_nfkc.
ocrolib.read_text() calls ocrolib.normalize_text(), which applies unicodedata.normalize('NFC', s) internally.
lstm.normalize_nfkc() calls unicodedata.normalize('NFKC', s).
During the training loop, the correct text (the transcript) is loaded by ocrolib.read_text(base+".gt.txt"), so the transcript does not go through NFKC normalization.
Doesn't this cause a problem?
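The mismatch can be seen directly with Python's unicodedata module. A minimal illustration, where a fullwidth Latin 'F' (U+FF26, common in Japanese text) stands in for any compatibility character that might appear in ground truth:

```python
import unicodedata

s = "\uFF26ractur"  # starts with fullwidth 'F' (U+FF26)

nfc = unicodedata.normalize("NFC", s)    # what read_text applies
nfkc = unicodedata.normalize("NFKC", s)  # what the codec is built with

print(nfc)          # NFC leaves the compatibility character untouched
print(nfkc)         # NFKC folds U+FF26 down to ASCII 'F'
print(nfc == nfkc)  # → False
```

Since the codec is built from NFKC-folded text while the training transcripts are only NFC-normalized, a compatibility character like U+FF26 can appear in a transcript yet be absent from the codec.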
from dup-ocropy.
@adnanulhasan One of the papers from your group mentions the availability of a ground-truth Devanagari database called 'Dev-DB'. Is there a possibility you can link me to it?
There are no default models, but you can train one easily, either using some training data from real scanned images or artificial data generated using ocropus-linegen. We have used it for Devanagari and Greek script with a lot of success. Some researchers reported results on Arabic Handwriting recognition using OCRopus. I can help in running a basic model if you decide to train your own models.
Thanks so much Adnan!
Your help would be very appreciated. Can you point me to what you did with the Devanagari or Greek languages? We can also take this offline if you prefer.
You are welcome!
The only thing we did differently for Devanagari is the text-line normalization: instead of using the default ocropus line normalization, we used a different method.
I think it would be better if we could talk off this platform. You can email me at [email protected].
Hi, Thanks for this wonderful project.
I am trying to test it on Japanese text.
As you may or may not know, Japanese characters look like this:
"日本語でFracturは亀の子文字という"
Yes, there are characters with more than 20 strokes, and Japanese uses around 5,000 different characters.
Which tuning parameters do I have to care about? Rough suggestions are appreciated; I will try them.
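Before tuning anything, it is worth checking how many distinct character classes the codec will contain, since that sets the size of the network's output layer. A quick count over the ground-truth files (the glob pattern is a hypothetical layout; the NFC step mirrors what ocrolib's read_text applies):

```python
import glob
import unicodedata

charset = set()
for path in glob.glob("book/*/*.gt.txt"):  # hypothetical GT layout
    with open(path, encoding="utf-8") as f:
        text = unicodedata.normalize("NFC", f.read())
    charset.update(c for c in text if not c.isspace())

print(len(charset))  # number of output classes the model must separate
```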
In ocropus-rtrain, I changed repr to unicode:
print " TRU:",unicode(transcript)
print " ALN:",unicode(gta[:len(transcript)+5])
print " OUT:",unicode(pred[:len(transcript)+5])
You are great !!!
My Mac is learning 2705 characters now. It's just like a kid, trying to read.
Model data is over 50 MB.
After 4 million iterations with 2,402 kinds of Japanese characters, it does not seem to converge. I'll try the C++ version.
How big was your dataset?
I generated 2,000 lines of random text (UTF-8) from the 2,402 characters (the official common-use characters).
The C++ version seems to run without any modification.
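A generator along those lines can be sketched as follows; the character set, line length, and seed below are illustrative assumptions, not what was actually used:

```python
import random

# Stand-in for the 2,402 common-use characters; any string of
# distinct characters works here.
CHARSET = "日本語常用漢字亀の子読書"

def random_lines(n_lines=2000, line_len=30, seed=0):
    """Generate n_lines random lines drawn uniformly from CHARSET."""
    rng = random.Random(seed)
    return ["".join(rng.choice(CHARSET) for _ in range(line_len))
            for _ in range(n_lines)]

lines = random_lines()
print(len(lines), len(lines[0]))  # → 2000 30
```

Each line would then be rendered to a .bin.png and paired with a .gt.txt file for training.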
For Chinese characters, you probably need a much larger number of hidden units, and possibly some other tricks as well. Please share what you come up with.
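A rough back-of-the-envelope supports this: the output layer of the network scales with the number of character classes, so a CJK codec dominates the parameter count long before the hidden layer does. The sketch below assumes a plain bidirectional LSTM with ocropy's default 48-pixel line height and 100 hidden units, ignoring peephole weights:

```python
def lstm_param_count(ninput, nhidden, nclasses):
    # One LSTM direction: 4 gates, each with input weights, recurrent
    # weights, and a bias (peephole weights are ignored in this sketch).
    per_direction = 4 * (nhidden * ninput + nhidden * nhidden + nhidden)
    # Output softmax over the concatenated forward/backward states.
    softmax = (2 * nhidden + 1) * nclasses
    return 2 * per_direction + softmax

# ocropy-ish defaults: 48-pixel line height, 100 hidden units.
print(lstm_param_count(48, 100, 100))   # → 139300 (Latin-sized codec)
print(lstm_param_count(48, 100, 5000))  # → 1124200 (CJK-sized codec)
```

Going from 100 to 5,000 classes multiplies the total roughly eightfold with the hidden layer unchanged, which is why the hidden layer usually has to grow as well.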
@isaomatsunami Have you made any progress in training Japanese characters? I'm trying to train ocropy to recognize Chinese now.
No. I tried ocropy with 200 hidden nodes and found, as far as I can estimate, that it began to learn one character by forgetting another.
I am now training clstm on 3,877 classes of Chinese/Japanese characters with 800 hidden nodes.
After 150,000 iterations it stays at a 3.8-5% error rate. See the clstm section.
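The error rate quoted here is a character error rate: edit distance between prediction and ground truth, divided by the ground-truth length. A minimal sketch of that computation (this mirrors what ocropus-errs reports in spirit, not ocropy's actual code):

```python
def levenshtein(a, b):
    """Edit distance by the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_error_rate(pred, truth):
    return levenshtein(pred, truth) / max(1, len(truth))

print(char_error_rate("kitten", "sitting"))  # → 0.42857...
```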
Any updates about Chinese? I have read Adnan's PhD thesis, and I have 2 million documents (PDF or XPS, which we can convert to JPEG) containing both Chinese and English characters. I need some help and tips on how to train a model.
Do we need to specify the DPI of the images?
Hi,
It would be interesting to see how LSTM would work on Chinese. Can you send me some sample pages?
Kind regards,
Adnan Ul-Hasan
On Sat, Apr 16, 2016 at 9:06 PM -0700, "wanghaisheng" [email protected] wrote:
@adnanulhasan
You can reach me at [email protected].
@isaomatsunami Sir, how did you get all your ground-truth data?
I am using the https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth approach right now, but I want to generate it from existing characters.
Hi guys,
I am working on a model for Telugu, an Indic language, and I am stuck at this point.
I want to train it with the Telugu character set only, but ocropus-rtrain loads all characters, digits, and so on. I even created a telugu='' variable in ocrolib/chars.py, but did not succeed.
Please help me.
Hi,
Training ocropy for Telugu should be straightforward. You can use the -c parameter to include the characters from your GT text files.
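If you still want an explicit Telugu string (for example in ocrolib/chars.py), note that Telugu occupies a single Unicode block, U+0C00-U+0C7F, so the list can be generated rather than typed out. A sketch, filtering unassigned code points:

```python
import unicodedata

# Telugu occupies the Unicode block U+0C00..U+0C7F.
telugu = "".join(
    chr(cp) for cp in range(0x0C00, 0x0C80)
    if unicodedata.name(chr(cp), None) is not None  # skip unassigned points
)
print(len(telugu))
```

The exact count depends on the Unicode version Python was built with, since newer versions assign more code points in the block.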
Hi @adnanulhasan, thanks for your response.
I'm trying the command as:
ocropus-rtrain -o te book/0001/010000.bin.png -c telugucharacters
But it's not working.
Give the path to gt.txt files instead of mentioning telugucharacters.
-c book/0001/010000.gt.txt
@adnanulhasan thanks dude,
Sometimes a traceback error comes up during training.
Is it still open?
@adnanulhasan If I want to train an Arabic model, do you suggest using ocropy or clstm?
What changes should I make to ocropy's chars.py?