The cipher_identifier from mklarz

The current codebase is messy; split the methods and constants into separate files.

Fix training data wordlist running out when the cipher charset has a limited number of characters

Looks like the wordlist is running out of words, perhaps randomize the cases if the charset contains upper- and lowercase characters?

$ python scripts/generate_train_data.py babylonian-numbers
Generating 10000 training images for cipher: babylonian-numbers
Cipher image count: 15
Cipher charset: 0123456789ABCDE
Traceback (most recent call last):
  File "scripts/generate_train_data.py", line 534, in <module>
    generate_train_data(cipher, wordlist, limit=limit)
  File "scripts/generate_train_data.py", line 434, in generate_train_data
    sentences = generate_sentences(
  File "scripts/generate_train_data.py", line 163, in generate_sentences
    word = wordlist.pop()
IndexError: pop from an empty deque
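
One possible direction, sketched under the assumption that the generator filters the wordlist against the cipher charset and then consumes words one at a time (which is what the traceback suggests): adapt each word's case to the charset and cycle through the usable words instead of popping them. The function names `usable_words` and `endless_words` are hypothetical and do not come from the repository; this reuses words rather than randomizing case.

```python
import itertools
import random


def usable_words(words, charset):
    """Return the words that can be spelled with the charset, adapting case when needed."""
    allowed = set(charset)
    result = []
    for word in words:
        # Try the word as-is first, then fully upper- and lowercased.
        for candidate in (word, word.upper(), word.lower()):
            if candidate and set(candidate) <= allowed:
                result.append(candidate)
                break
    return result


def endless_words(words, charset, seed=None):
    """Yield usable words forever, reshuffling after each pass so the supply never runs out."""
    pool = usable_words(words, charset)
    if not pool:
        raise ValueError("no word in the wordlist fits the charset")
    rng = random.Random(seed)
    while True:
        rng.shuffle(pool)
        yield from pool


# Charset copied from the babylonian-numbers output above; the words are made up.
charset = "0123456789ABCDE"
words = ["bad", "cab", "zebra", "Deed", "ace"]
stream = endless_words(words, charset, seed=0)
sample = list(itertools.islice(stream, 10))
```

With a small charset the same few words will repeat often, so this trades variety for never hitting an empty deque.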

Add additional ciphers / alphabets

See:
https://omniglot.com/conscripts/
https://omniglot.com/writing/
http://conscripts.s4.bizhat.com/conscripts.html
http://web.archive.org/web/20200404055359/http://www.ancientscripts.com:80/

Add unit tests

We need to somehow make sure the confidence of each model is above a certain expected threshold.

Add unit tests that randomly generate images for the ciphers and use the models to verify the accuracy of the models.
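
A minimal sketch of what such a check could look like. Everything here is a stand-in: `random_samples` produces random strings instead of rendered images, `stub_predict` replaces a trained model, and the threshold value is assumed because the issue does not name one.

```python
import random


def measure_accuracy(predict, samples):
    """Return the fraction of (input, expected) samples the predictor gets right."""
    if not samples:
        raise ValueError("need at least one sample")
    correct = sum(1 for data, expected in samples if predict(data) == expected)
    return correct / len(samples)


def random_samples(charset, count, length=8, seed=0):
    """Build random strings from the charset; a stand-in for rendered cipher images."""
    rng = random.Random(seed)
    texts = ["".join(rng.choice(charset) for _ in range(length)) for _ in range(count)]
    return [(text, text) for text in texts]


def stub_predict(data):
    """Placeholder for running a trained model on an image."""
    return data


EXPECTED_THRESHOLD = 0.9  # assumed value, not taken from the project

samples = random_samples("0123456789ABCDE", count=50)
accuracy = measure_accuracy(stub_predict, samples)
assert accuracy >= EXPECTED_THRESHOLD
```

Fixing the seed keeps the randomly generated samples reproducible, so a failing run can be repeated.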

Create a script that finds the best possible image preprocessing settings for each cipher

1. Create a (small) training set for each cipher
2. Generate a list of settings for each cipher
3. Run through the list of settings and find the one that returns the highest accuracy / confidence
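
The steps above could be sketched as a plain grid search. The setting names and `toy_score` are hypothetical; in practice the score function would preprocess the small training set with the given settings, run the model and return the measured accuracy or confidence.

```python
import itertools


def settings_grid(options):
    """Expand {name: [values]} into a list of every settings combination."""
    names = sorted(options)
    return [dict(zip(names, values))
            for values in itertools.product(*(options[name] for name in names))]


def best_settings(candidates, score):
    """Return (settings, score) for the highest-scoring candidate (first one wins ties)."""
    scored = [(score(settings), settings) for settings in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical on/off switches for three preprocessing steps.
options = {
    "grayscale": [True],
    "median_blur": [False, True],
    "threshold": [False, True],
}


def toy_score(settings):
    # Stand-in for a real evaluation: pretend blur and thresholding hurt accuracy.
    return 1.0 - 0.3 * settings["median_blur"] - 0.4 * settings["threshold"]


candidates = settings_grid(options)
winner, winner_score = best_settings(candidates, toy_score)
```

An exhaustive grid grows quickly with the number of settings, so a real script may need to sample the grid rather than walk all of it.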

Fix training data generation for atlantean-language

Seems like we don't correctly check the charset of the cipher.

$ python scripts/generate_train_data.py atlantean-language
Generating 10000 training images for cipher: atlantean-language
Cipher image count: 39
Cipher charset: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZcst
Generating image #1 [sentence="HMS nutriment shantytown", Traceback (most recent call last):
  File "scripts/generate_train_data.py", line 534, in <module>
    generate_train_data(cipher, wordlist, limit=limit)
  File "scripts/generate_train_data.py", line 442, in generate_train_data
    symbols = get_symbols_from_text(symbol_mapping, sentence)
  File "scripts/generate_train_data.py", line 401, in get_symbols_from_text
    symbols.append(symbol_mapping[character])
KeyError: 'n'
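
A sketch of a stricter, case-aware charset check, assuming the failure comes from a lowercase character whose uppercase form is the only one in the charset. `fit_to_charset` is a hypothetical helper, and whether the lowercase `c`, `s` and `t` in this charset stand for separate symbols is not something this sketch knows.

```python
def fit_to_charset(sentence, charset):
    """Map each character onto the charset, switching case where that helps.

    Returns None when some character has no counterpart, so the caller can
    skip the sentence instead of hitting a KeyError later.
    """
    allowed = set(charset)
    fitted = []
    for character in sentence:
        if character.isspace():
            fitted.append(character)  # assumption: whitespace is handled by the renderer
            continue
        # Prefer the exact character, then its upper- and lowercase forms.
        for candidate in (character, character.upper(), character.lower()):
            if candidate in allowed:
                fitted.append(candidate)
                break
        else:
            return None
    return "".join(fitted)


# Charset and sentence copied from the output above.
charset = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZcst"
fitted = fit_to_charset("HMS nutriment shantytown", charset)
```

Every non-space character of the result is guaranteed to be a key in a mapping built from the charset.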

Add image preprocessing settings to the ciphers' information file

The different ciphers need to have separate image preprocessing settings as some of the settings could remove essential information. Need to figure out what the optimal settings are and save these to the cipher's information file. See below for a bad image preprocessing for the Braille alphabet:

Original

Step 1. Grayscaling

Step 2. Noise removal (median blur)

Step 3. Thresholding

Step 4. Auto crop

Step 4.1. Thresholding with binary invert

Step 4.2. Dilation

Step 4.3. Find contours

Step 4.4. Choose contour

Step 4.5. Crop image

In the case of Braille this should ideally stop at grayscaling.
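
One way to express "stop at grayscaling" per cipher, sketched without any image library. The step names follow the list above; the `preprocessing` and `stop_after` keys and the cipher names are hypothetical.

```python
PIPELINE = ["grayscale", "median_blur", "threshold", "auto_crop"]


def steps_for(cipher_info, pipeline=PIPELINE):
    """Return the ordered steps to run for a cipher, up to and including its last allowed step.

    `cipher_info` mimics a per-cipher information file; the "preprocessing"
    and "stop_after" keys are assumptions made for this sketch.
    """
    stop_after = cipher_info.get("preprocessing", {}).get("stop_after")
    if stop_after is None:
        return list(pipeline)  # no restriction: run the whole pipeline
    if stop_after not in pipeline:
        raise ValueError(f"unknown preprocessing step: {stop_after}")
    return pipeline[: pipeline.index(stop_after) + 1]


braille = {"name": "braille-alphabet", "preprocessing": {"stop_after": "grayscale"}}
default = {"name": "some-other-cipher"}

braille_steps = steps_for(braille)
default_steps = steps_for(default)
```

Here `braille_steps` contains only the grayscale step, while a cipher without a restriction gets all four.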

Add instructions for training models

Add additional symbol ciphers

Add an information file per cipher

We need an information file (cipher.json?) that includes the name of the cipher, a description, charset and the list of supported characters.

This would also have to be added when downloading the ciphers from dcode.

In addition, each cipher should have a README.md in their directory with this information, for easier viewing in GitHub.
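
A minimal sketch of writing and reading such a file with the standard library. The file name `cipher.json` comes from the suggestion above; the key names and the description text are assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_cipher_info(directory, name, description, charset):
    """Write cipher.json into the cipher's directory and return its path."""
    info = {
        "name": name,
        "description": description,
        "charset": charset,
        "characters": sorted(set(charset)),  # list of supported characters
    }
    path = Path(directory) / "cipher.json"
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def read_cipher_info(directory):
    """Load cipher.json from the cipher's directory."""
    return json.loads((Path(directory) / "cipher.json").read_text(encoding="utf-8"))


# A temporary directory stands in for the cipher's real directory.
with tempfile.TemporaryDirectory() as tmp:
    write_cipher_info(tmp, "babylonian-numbers", "Placeholder description.", "0123456789ABCDE")
    loaded = read_cipher_info(tmp)
```

The same dictionary could be used to render the per-cipher README.md, so both files stay in sync.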

Add initial trained models for the ciphers

Add CI

Just to test linting to make sure we have a proper coding style throughout the project.

Add support for QR and bar codes

Add the package to PyPI

Add a Dockerfile for the project

Look into replacements for Tesseract

There are other OCR engines that may perform better in our case (OCR on symbols). Test the variants and see if the accuracy of the recognition improves.

Also see https://github.com/OCR4all/OCR4all

Add documentation

Properly document the methods and add a simple Read the Docs page.

Update README with documentation for each script

Find a new name for the project

It's not limited to dcode, but can actually be used for any (symbol, for now) cipher. Find a new, better-fitting name.

Add image processing before passing the input to Tesseract

We need to improve the quality of the input images before we pass them to Tesseract to increase the chance of correctly guessing the cipher.
See https://tesseract-ocr.github.io/tessdoc/ImproveQuality

Also see https://tesseract-ocr.github.io/tessdoc/ImproveQuality#examples

Consider creating a web app that can process input

Should be easy enough to create a web app that processes input images and outputs the ciphers' confidence and text. Easier than having to install the package locally.

Add an additional weight calculation for the OCR output text

Currently we only base the "weight" on the confidence from Tesseract and each model. We should consider adding a weight calculation step after getting the text output from the OCR that checks whether the text contains any words from a wordlist (e.g. the british-english one).
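
A sketch of how the two signals might be blended, assuming the OCR confidence has been scaled to the range 0 to 1. The function names and the `text_share` knob are hypothetical, not values from the project.

```python
import re


def wordlist_ratio(text, wordlist):
    """Return the fraction of words in the text that appear in the wordlist (case-insensitive)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    known = {word.lower() for word in wordlist}
    return sum(word in known for word in words) / len(words)


def combined_weight(confidence, text, wordlist, text_share=0.5):
    """Blend an OCR confidence in [0, 1] with the wordlist ratio.

    `text_share` is an assumed tuning knob: 0 ignores the text, 1 ignores the confidence.
    """
    return (1 - text_share) * confidence + text_share * wordlist_ratio(text, wordlist)


wordlist = ["hello", "world", "cipher"]
good = combined_weight(0.6, "Hello world", wordlist)
bad = combined_weight(0.6, "Hxllo wqrld", wordlist)
```

With equal OCR confidence, the output made of real words ends up with the higher weight.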

Look into handling text input to decipher more than just symbol ciphers

It should in theory be possible to handle most (text) ciphers by attempting to decipher them for each available cipher and checking the output against a language wordlist.

This should be fairly easy considering we don't need to do any image processing and can do a weight calculation on how many of the deciphered words exist in the provided wordlist.

See https://github.com/dhondta/python-codext for some ciphers
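
The idea can be sketched with a few standard-library decoders instead of python-codext. The `DECODERS` table and `rank_decoders` are hypothetical names, and the three decoders are only examples of what "each available cipher" might include.

```python
import codecs
import string

# Atbash maps a<->z, b<->y, and so on, for both cases.
_ATBASH = str.maketrans(
    string.ascii_lowercase + string.ascii_uppercase,
    string.ascii_lowercase[::-1] + string.ascii_uppercase[::-1],
)

DECODERS = {
    "rot13": lambda text: codecs.decode(text, "rot_13"),
    "atbash": lambda text: text.translate(_ATBASH),
    "reverse": lambda text: text[::-1],
}


def rank_decoders(ciphertext, wordlist, decoders=DECODERS):
    """Try each decoder and rank by the share of output words found in the wordlist."""
    known = {word.lower() for word in wordlist}
    results = []
    for name, decode in decoders.items():
        plaintext = decode(ciphertext)
        words = plaintext.lower().split()
        weight = sum(word in known for word in words) / len(words) if words else 0.0
        results.append((weight, name, plaintext))
    # Highest weight first; ties keep the decoder order.
    return sorted(results, key=lambda item: item[0], reverse=True)


wordlist = ["attack", "at", "dawn"]
ranking = rank_decoders("nggnpx ng qnja", wordlist)
best_weight, best_name, best_text = ranking[0]
```

Keyed ciphers would need a search over keys on top of this, which the sketch does not attempt.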

Generate variants of the training images

It might be worth generating variants of the training images, since the current ones are too perfect. Create variants that are skewed, pixelated, noisy, etc.
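
A dependency-free sketch of three such variants. It represents a grayscale image as a list of rows of 0 to 255 values purely for illustration; a real implementation would more likely use an imaging library, and all function names and parameters here are hypothetical.

```python
import random


def add_noise(image, amount=0.05, seed=None):
    """Set roughly `amount` of the pixels to pure black or white (salt-and-pepper noise)."""
    rng = random.Random(seed)
    return [[rng.choice((0, 255)) if rng.random() < amount else pixel for pixel in row]
            for row in image]


def pixelate(image, block=2):
    """Replace each block x block tile with its average value."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [(y, x) for y in range(top, min(top + block, height))
                    for x in range(left, min(left + block, width))]
            average = sum(image[y][x] for y, x in tile) // len(tile)
            for y, x in tile:
                result[y][x] = average
    return result


def skew(image, max_shift=2, background=255):
    """Shear horizontally: shift each row right by an amount that grows with its row index.

    Assumes max_shift is not larger than the image width.
    """
    height = len(image)
    result = []
    for y, row in enumerate(image):
        shift = (max_shift * y) // max(height - 1, 1)
        result.append([background] * shift + row[: len(row) - shift])
    return result


# A 4x4 "glyph": black square on a white background.
glyph = [[255, 255, 255, 255],
         [255, 0, 0, 255],
         [255, 0, 0, 255],
         [255, 255, 255, 255]]
variants = [add_noise(glyph, seed=1), pixelate(glyph), skew(glyph)]
```

Each function returns a new image of the same size, so the variants can be chained or mixed at random per training sample.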
