cipher_identifier's People

Contributors

mklarz

Forkers

kernelzeroday

cipher_identifier's Issues

Fix training data wordlist running out when the cipher charset has a small number of characters

The wordlist appears to run out of words when few of them fit the charset. Perhaps randomize the case of the words if the charset contains both upper- and lowercase characters?

$ python scripts/generate_train_data.py babylonian-numbers
Generating 10000 training images for cipher: babylonian-numbers
Cipher image count: 15
Cipher charset: 0123456789ABCDE
Traceback (most recent call last):
  File "scripts/generate_train_data.py", line 534, in <module>
    generate_train_data(cipher, wordlist, limit=limit)
  File "scripts/generate_train_data.py", line 434, in generate_train_data
    sentences = generate_sentences(
  File "scripts/generate_train_data.py", line 163, in generate_sentences
    word = wordlist.pop()
IndexError: pop from an empty deque
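
The traceback above shows `wordlist.pop()` being called on an empty deque. One possible direction, sketched here under assumptions rather than taken from the script itself: keep only the words the charset can spell, cycle through them instead of consuming them, and pick a casing per character that the charset supports. The names `usable_words`, `fit_case` and this version of `generate_sentences` are hypothetical, and the case handling assumes ASCII letters.

```python
import itertools
import random


def usable_words(wordlist, charset):
    """Keep words that can be spelled with the charset, ignoring case."""
    folded = {c.lower() for c in charset}
    return [w for w in wordlist if w and all(c.lower() in folded for c in w)]


def fit_case(word, charset, rng):
    """Choose, per character, a casing that exists in the charset."""
    out = []
    for c in word:
        # sorted(set(...)) removes the duplicate for caseless characters (digits).
        options = sorted({v for v in (c.lower(), c.upper()) if v in charset})
        out.append(rng.choice(options))
    return "".join(out)


def generate_sentences(wordlist, charset, count, words_per_sentence=3, seed=None):
    """Hypothetical replacement: reuse words instead of popping them."""
    rng = random.Random(seed)
    words = usable_words(wordlist, charset)
    if not words:
        raise ValueError("no word in the wordlist fits the charset")
    rng.shuffle(words)
    supply = itertools.cycle(words)  # never runs dry, unlike deque.pop()
    sentences = []
    for _ in range(count):
        chosen = [fit_case(next(supply), charset, rng) for _ in range(words_per_sentence)]
        sentences.append(" ".join(chosen))
    return sentences
```

With a charset such as `0123456789ABCDE` very few dictionary words survive the filter, so sentences will repeat; whether that is acceptable for training is a separate question.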

Add unit tests

We need some way to make sure that the confidence of each model stays above an expected threshold.

Add unit tests that randomly generate images for the ciphers and run them through the models to verify their accuracy.
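
A minimal sketch of what such a test could look like, assuming each model can be wrapped in a `predict(sample)` callable that returns a `(label, confidence)` pair. That interface, the helper `check_model_confidence`, and the stand-in model below are hypothetical; a real test would render a cipher image and call the trained model instead.

```python
import random
import unittest


def check_model_confidence(predict, make_sample, label, runs=20, threshold=0.9, seed=0):
    """Return the (predicted_label, confidence) pairs that miss the target."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        sample = make_sample(rng)
        predicted, confidence = predict(sample)
        if predicted != label or confidence < threshold:
            failures.append((predicted, confidence))
    return failures


class ModelConfidenceTest(unittest.TestCase):
    def test_confidence_above_threshold(self):
        # Stand-ins for a random image generator and a trained model.
        def make_sample(rng):
            return rng.random()

        def predict(sample):
            return ("pigpen-cipher", 0.95)

        failures = check_model_confidence(predict, make_sample, "pigpen-cipher")
        self.assertEqual(failures, [])
```

Seeding the random generator keeps a failing run reproducible, which matters when the inputs are generated randomly.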

Fix training data generation for atlantean-language

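It seems we don't correctly check the generated text against the charset of the cipher. In the run below the charset contains the uppercase letters but only c, s and t in lowercase, so the lowercase 'n' has no symbol.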

$ python scripts/generate_train_data.py atlantean-language
Generating 10000 training images for cipher: atlantean-language
Cipher image count: 39
Cipher charset: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZcst
Generating image #1 [sentence="HMS nutriment shantytown", Traceback (most recent call last):
  File "scripts/generate_train_data.py", line 534, in <module>
    generate_train_data(cipher, wordlist, limit=limit)
  File "scripts/generate_train_data.py", line 442, in generate_train_data
    symbols = get_symbols_from_text(symbol_mapping, sentence)
  File "scripts/generate_train_data.py", line 401, in get_symbols_from_text
    symbols.append(symbol_mapping[character])
KeyError: 'n'
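
The `KeyError` above comes from looking up a character that has no entry in the symbol mapping. One way to handle it, sketched under the assumption that falling back to the other letter case is acceptable for a cipher with a mixed-case charset: resolve each character to a key that actually exists, and drop words that cannot be resolved. The helpers `resolve_character` and `normalize_word` are hypothetical and not part of the script.

```python
def resolve_character(symbol_mapping, character):
    """Return the key in symbol_mapping that can stand for character, or None."""
    for candidate in (character, character.upper(), character.lower()):
        if candidate in symbol_mapping:
            return candidate
    return None


def normalize_word(symbol_mapping, word):
    """Rewrite word using only mapped characters; None if that is impossible."""
    resolved = [resolve_character(symbol_mapping, c) for c in word]
    if any(r is None for r in resolved):
        return None
    return "".join(resolved)


# Charset copied from the output above; the mapped values are placeholders.
charset = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZcst"
mapping = {c: f"symbol-{c}" for c in charset}
print(normalize_word(mapping, "nutriment"))  # NUtRIMENt
```

Words for which `normalize_word` returns `None` could be filtered out of the wordlist before any sentence is built.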

Add image preprocessing settings to the ciphers' information file

The ciphers need separate image preprocessing settings, as some of the steps can remove essential information. We need to figure out the optimal settings for each cipher and save them to the cipher's information file. The steps below show preprocessing going wrong for the Braille alphabet:

Original

Step 1. Grayscaling

Step 2. Noise removal (median blur)

Step 3. Thresholding

Step 4. Auto crop

Step 4.1. Thresholding with binary invert

Step 4.2. Dilation

Step 4.3. Find contours

Step 4.4. Choose contour

Step 4.5. Crop image

In the case of Braille this should ideally stop at grayscaling.
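
A small sketch of how per-cipher settings could switch steps on and off. The `preprocessing` key, the step names, and the helper `preprocessing_steps` are assumptions made for illustration; the issue does not define the file layout, and the actual image operations are left out.

```python
import json

# Hypothetical defaults: run every step unless the cipher says otherwise.
DEFAULT_PREPROCESSING = {
    "grayscale": True,
    "median_blur": True,
    "threshold": True,
    "auto_crop": True,
}

# The order in which enabled steps would be applied.
STEP_ORDER = ["grayscale", "median_blur", "threshold", "auto_crop"]


def preprocessing_steps(cipher_info):
    """Return the enabled step names for a cipher, in pipeline order."""
    settings = {**DEFAULT_PREPROCESSING, **cipher_info.get("preprocessing", {})}
    unknown = set(settings) - set(STEP_ORDER)
    if unknown:
        raise ValueError(f"unknown preprocessing steps: {sorted(unknown)}")
    return [step for step in STEP_ORDER if settings[step]]


# Example information file content for Braille: stop after grayscaling.
braille_info = json.loads("""
{
  "name": "braille-alphabet",
  "preprocessing": {"median_blur": false, "threshold": false, "auto_crop": false}
}
""")
print(preprocessing_steps(braille_info))  # ['grayscale']
```

Keeping the defaults in code means a cipher's file only has to list the steps it wants to disable.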

Add an information file per cipher

We need an information file (cipher.json?) that includes the name of the cipher, a description, charset and the list of supported characters.

This would also have to be added when downloading the ciphers from dcode.

In addition, each cipher should have a README.md in its directory with this information, for easier viewing on GitHub.
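
As a rough illustration, the two files could be written from the same data. The field names, the README layout, and the function `write_cipher_info` are assumptions; in particular, the issue does not say how "charset" and "the list of supported characters" differ, so the list here is simply derived from the charset string.

```python
import json
from pathlib import Path


def write_cipher_info(directory, name, description, charset):
    """Write cipher.json and README.md into the cipher's directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    info = {
        "name": name,
        "description": description,
        "charset": charset,
        # Assumption: the supported characters are the unique charset characters.
        "characters": sorted(set(charset)),
    }
    (directory / "cipher.json").write_text(
        json.dumps(info, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    readme = "\n".join([f"# {name}", "", description, "", f"Charset: `{charset}`", ""])
    (directory / "README.md").write_text(readme, encoding="utf-8")
    return info
```

The same function could be called from the code that downloads the ciphers, so that new ciphers get both files from the start.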

Add initial trained models for the ciphers

  • acere-cipher
  • ancients-stargate-alphabet
  • arthur-invisibles-cipher
  • atlantean-language
  • aurebesh-alphabet
  • babylonian-numbers
  • betamaze-cipher
  • braille-alphabet
  • chinese-code
  • daggers-alphabet
  • dancing-men-cipher
  • dorabella-cipher
  • dotsies-writing
  • draconic-dragon-language
  • elder-futhark
  • enochian-language
  • french-sign-language
  • futurama-alien-alphabet
  • gerudo-language
  • gnommish-alphabet
  • goron-cipher
  • gravity-falls-author-cipher
  • gravity-falls-bill-cipher
  • hylian-language-a-link-between-worlds
  • hylian-language-breath-of-the-wild
  • hylian-language-skyward-sword
  • hylian-language-twilight-princess
  • hymnos-alphabet
  • inuktitut-language
  • iokharic-language
  • klingon-language
  • lingua-ignota-code
  • maritime-signals-code
  • mary-stuart-code
  • mayan-numbers
  • mirror-digits
  • music-sheet-cipher
  • ogham-alphabet
  • pigpen-cipher
  • pokemon-unown-alphabet
  • rosicrucian-cipher
  • semaphore-flag
  • semaphore-trousers-cipher
  • sheikah-language
  • simlish-language
  • standard-galactic-alphabet
  • symbol-font
  • templars-cipher
  • theban-alphabet
  • tic-tac-toe-cipher
  • voynich-manuscript
  • webdings-font
  • wingdings-font
  • zodiac-killer-cipher

Add CI

Add CI that runs linting, to make sure we keep a consistent coding style throughout the project.

Add an additional weight calculation for the OCR output text

Currently we only base the "weight" on the confidence from Tesseract and each model. We should consider adding a weight calculation step that runs after the OCR produces its text output and checks whether the text contains any words from a wordlist (e.g. the british-english one).
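
One way such a weight could be computed, as a sketch only: score the OCR text by the share of its tokens that appear in the wordlist, then blend that score with the existing confidence. The functions, the tokenization, and the 50/50 blend are assumptions, not the project's current behaviour.

```python
import re


def dictionary_score(text, words):
    """Fraction of alphabetic tokens in text that appear in the wordlist."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in words)
    return hits / len(tokens)


def combined_weight(confidence, text, words, dictionary_share=0.5):
    """Blend a confidence in [0, 1] with the dictionary score."""
    score = dictionary_score(text, words)
    return (1 - dictionary_share) * confidence + dictionary_share * score


# A tiny stand-in wordlist; a real one would be loaded from a file into a set.
words = {"hello", "world"}
print(dictionary_score("Hello wrld", words))  # 0.5
```

Using a set for the wordlist keeps each lookup cheap even for a large dictionary.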

Generate variants of the training images

It might be worth generating variants of the training images, as they are currently too perfect. Create variants that are skewed, pixelated, noisy, etc.
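
To make the idea concrete, here is a dependency-free sketch that treats a grayscale image as a list of rows of 0-255 values. This representation and all function names are assumptions chosen so the example runs on its own; the real script would more likely apply the equivalent operations with an imaging library.

```python
import random


def add_noise(image, amount, rng):
    """Set a fraction of pixels to black or white (salt-and-pepper noise)."""
    return [
        [rng.choice((0, 255)) if rng.random() < amount else pixel for pixel in row]
        for row in image
    ]


def pixelate(image, block):
    """Replace each block x block tile with its average value."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            ys = range(top, min(top + block, height))
            xs = range(left, min(left + block, width))
            tile = [image[y][x] for y in ys for x in xs]
            average = sum(tile) // len(tile)
            for y in ys:
                for x in xs:
                    out[y][x] = average
    return out


def skew(image, shift_per_row, fill=255):
    """Shear horizontally: shift row y right by round(y * shift_per_row) pixels."""
    width = len(image[0])
    out = []
    for y, row in enumerate(image):
        shift = min(width, max(0, round(y * shift_per_row)))
        out.append([fill] * shift + row[: width - shift])
    return out


def make_variants(image, seed=None):
    """Return one noisy, one pixelated and one skewed copy of the image."""
    rng = random.Random(seed)
    return {
        "noisy": add_noise(image, 0.05, rng),
        "pixelated": pixelate(image, 2),
        "skewed": skew(image, 0.5),
    }
```

Each function returns a new image of the same size and leaves the input untouched, so the variants can be combined or generated independently.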
