thomaslech / crohme_extractor

CROHME dataset extractor for OFFLINE-text-recognition task.

Home Page: http://blog.mathocr.com


crohme_extractor's Introduction

Abstract

The CROHME datasets were originally designed for the online handwriting recognition task.
Besides the encoded pen traces, the inkml files also record the time at which each trace was drawn. For offline recognition we therefore need to extract a new feature map, namely matrices of pixel intensities.
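
As a rough illustration of what turning traces into a pixel matrix involves, here is a minimal sketch in pure Python. It is not the code from extract.py: the function names rasterize, flatten and one_hot are hypothetical, traces are assumed to be already parsed from the inkml file into non-empty lists of (x, y) points, and stroke thickness is ignored. The flattening and one-hot helpers mirror the later steps listed for extract.py.

```python
def rasterize(traces, box_size):
    """Render traces into a box_size x box_size bitmap (0 = ink, 1 = background).

    Assumes `traces` is a non-empty list of non-empty lists of (x, y) points.
    """
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    min_x, min_y = min(xs), min(ys)
    # Scale uniformly so the longer side fills the box; avoid dividing by zero
    # when the symbol is a single point.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (box_size - 1) / span

    image = [[1] * box_size for _ in range(box_size)]
    for trace in traces:
        pixels = [(round((x - min_x) * scale), round((y - min_y) * scale))
                  for x, y in trace]
        # Consecutive points form line segments; a one-point trace is a dot.
        segments = list(zip(pixels, pixels[1:])) or [(pixels[0], pixels[0])]
        for (c0, r0), (c1, r1) in segments:
            steps = max(abs(c1 - c0), abs(r1 - r0), 1)
            for i in range(steps + 1):
                col = c0 + round(i * (c1 - c0) / steps)
                row = r0 + round(i * (r1 - r0) / steps)
                image[row][col] = 0
    return image


def flatten(image):
    """Flatten a 2-D bitmap into a one-dimensional vector, row by row."""
    return [pixel for row in image for pixel in row]


def one_hot(label, classes):
    """Encode `label` as a one-hot vector over the ordered list `classes`."""
    return [1 if cls == label else 0 for cls in classes]
```

For example, rasterize([[(0, 0), (10, 10)]], 5) draws a diagonal stroke across a 5x5 bitmap.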

The following scripts will get you started with the offline math symbol recognition task.

Setup

All code is compatible with Python 3.5.*.

  1. Extract the contents of CROHME_full_v2.zip (found inside the data directory) before running any of the scripts below.

  2. Install specified dependencies with pip (Python Package Manager) using the following shell command:

pip install -U -r requirements.txt

Scripts info

  1. extract.py

    • Extracts trace groups from inkml files.
    • Converts extracted trace groups into images. The images are square bitmaps containing only black (value 0) and white (value 1) pixels. Black denotes the pattern (ROI).
    • Labels those images (according to inkml files).
    • Flattens images to one-dimensional vectors.
    • Converts labels to one-hot format.
    • Dumps training and testing sets separately into outputs folder.

    Command line arguments: -b [BOX_SIZE] -d [DATASET_VERSION] -c [CATEGORY] -t [THICKNESS]

    Example usage: python extract.py -b 50 -d 2011 2012 2013 -c digits lowercase_letters operators -t 5

    Caution: the script doesn't work properly for images larger than 200x200 (for a yet unknown reason).

  2. balance.py script balances the overall distribution of classes.

    Command line arguments: -b [BOX_SIZE] -ub [UPPER_BOUND] (optional)

    Example usage: python balance.py -b 50 -ub 6000

  3. visualize.py script plots a single figure depicting a random batch of extracted data.

    Command line arguments: -b [BOX_SIZE] -n [N_SAMPLES] -c [COLUMNS]

    Example usage: python visualize.py -b 50 -n 40 -c 8

    Sample Plot: crohme_extractor_plot

  4. extract_hog.py script extracts HoG features.
    This script accepts one command line argument, hog_cell_size, which corresponds to the pixels_per_cell parameter of the skimage.feature.hog function used to extract the features.
    Example usage: python extract_hog.py 5 (sets pixels_per_cell=(5, 5))
    This script loads the data previously dumped by extract.py and dumps its own outputs (train, test) separately.

  5. extract_phog.py script extracts PHoG features.
    For PHoG features, HoG feature maps computed with different cell sizes are concatenated into a single feature vector.
    The script therefore takes an arbitrary number of hog_cell_size values (the corresponding HoG features must have been extracted beforehand with extract_hog.py).
    Example usage: python extract_phog.py 5 10 20 (loads HoGs with 5x5, 10x10 and 20x20 cell sizes, respectively)

  6. The histograms folder contains histograms of the label distribution for the different label categories. These diagrams help you better understand the extracted data.
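
To make the PHoG idea from item 5 more concrete, the sketch below concatenates per-cell orientation histograms computed at several cell sizes. It is only an illustration under stated assumptions: simple_hog is a hypothetical, simplified stand-in written in pure Python (unsigned gradients, 9 bins, no block normalization), not skimage.feature.hog, which the actual scripts use, and phog is likewise a hypothetical name. The image is assumed to be a square list of lists of pixel values.

```python
import math


def simple_hog(image, cell_size, n_bins=9):
    """Simplified HoG: one unsigned-orientation histogram per cell.

    Stand-in for illustration only; there is no block normalization.
    """
    size = len(image)
    n_cells = size // cell_size
    features = []
    for cell_row in range(n_cells):
        for cell_col in range(n_cells):
            hist = [0.0] * n_bins
            for r in range(cell_row * cell_size, (cell_row + 1) * cell_size):
                for c in range(cell_col * cell_size, (cell_col + 1) * cell_size):
                    # Central differences, clamped at the image border.
                    gx = image[r][min(c + 1, size - 1)] - image[r][max(c - 1, 0)]
                    gy = image[min(r + 1, size - 1)][c] - image[max(r - 1, 0)][c]
                    magnitude = math.hypot(gx, gy)
                    if magnitude == 0:
                        continue
                    # Unsigned orientation in [0, 180) degrees.
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    bin_index = min(int(angle / (180.0 / n_bins)), n_bins - 1)
                    hist[bin_index] += magnitude
            features.extend(hist)
    return features


def phog(image, cell_sizes):
    """Concatenate HoG vectors for each cell size into one feature vector."""
    features = []
    for cell_size in cell_sizes:
        features.extend(simple_hog(image, cell_size))
    return features
```

With a 20x20 image, phog(image, [5, 10]) yields (16 + 4) cells times 9 bins, so 180 values. The finer cell sizes contribute more cells and therefore a larger share of the vector.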

Distribution of classes

Figure: all_labels_distribution. Labels were combined from train and test sets.
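
A skewed label distribution like this one is what balance.py is meant to address. The sketch below shows one way an upper bound per class could be applied, assuming the -ub value simply caps the number of samples kept for each label. That is an assumption: the actual script may balance classes differently. The name cap_classes and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def cap_classes(samples, labels, upper_bound, seed=0):
    """Keep at most `upper_bound` randomly chosen samples per label.

    Returns a shuffled list of (sample, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    balanced = []
    for label, group in by_label.items():
        if len(group) > upper_bound:
            # Undersample over-represented classes without replacement.
            group = rng.sample(group, upper_bound)
        balanced.extend((sample, label) for sample in group)
    rng.shuffle(balanced)
    return balanced
```

Classes that are already below the bound are left untouched, so this only trims the most frequent labels and does not oversample rare ones.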


crohme_extractor's Issues

No such file or directory: 'outputs/validation/validation.pickle' Help

python extract_hog.py 5

==============================================================

Script flags: <hog_cell_size>

Restoring training set ...
Restoring test set ...
Traceback (most recent call last):
File "extract_hog.py", line 44, in
with open(os.path.join(validation_dir, 'validation.pickle'), 'rb') as validation:
FileNotFoundError: [Errno 2] No such file or directory: 'outputs/validation/validation.pickle'
