Arabic-Font-Recognition

Overview
Project Pipeline
Final Results
Contributors
License

Overview

Given an image of a paragraph written in Arabic, the system classifies the paragraph's font as one of four classes (coded 0 to 3) using classical machine learning techniques.

Font Code	Font Name
0	Scheherazade New
1	Times New Roman
2	Lemonada
3	IBM Plex Sans Arabic

Project Pipeline

Preprocessing Module

The preprocessing module cleans up text images before segmentation and feature extraction. It consists of the following steps:

Salt and Pepper Noise Detection and Removal:

The detect_salt_and_pepper function checks whether the image contains salt-and-pepper noise.
If noise is detected, a median filter (median_filter) is applied to reduce it.
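
The noise check and median filter described above can be sketched as follows. This is a minimal NumPy-only illustration, not the project's actual code: the heuristic (counting extreme pixels that differ sharply from their 3x3 median) and the parameters ratio_threshold and jump are assumptions chosen for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter(gray, size=3):
    """Replace each pixel with the median of its size x size neighborhood."""
    pad = size // 2
    padded = np.pad(gray, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1)).astype(gray.dtype)


def detect_salt_and_pepper(gray, ratio_threshold=0.01, jump=100):
    """Heuristic (assumed): flag the image as noisy when more than
    ratio_threshold of its pixels are pure black or white AND differ
    from their local median by more than jump gray levels."""
    med = median_filter(gray).astype(np.int16)
    diff = np.abs(gray.astype(np.int16) - med)
    extreme = (gray == 0) | (gray == 255)
    return bool((extreme & (diff > jump)).mean() > ratio_threshold)


# Synthetic example: a flat gray image with isolated black and white pixels.
clean = np.full((30, 30), 128, dtype=np.uint8)
noisy = clean.copy()
noisy[::5, ::5] = 0        # "pepper"
noisy[2::5, 2::5] = 255    # "salt"

restored = median_filter(noisy) if detect_salt_and_pepper(noisy) else noisy
```

A median filter suits this kind of noise because an isolated outlier never becomes the median of its neighborhood, whereas a mean filter would smear it into the surrounding pixels.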

Binarization:

The binarizeImage function converts the grayscale image to a binary image, making text clearer for extraction.
It ensures the result has black text on a white background.
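
One common way to implement this step is Otsu thresholding followed by a polarity check. The sketch below assumes that approach; the README does not state which thresholding method binarizeImage uses, and the function names here are illustrative. The polarity check also assumes that text covers less than half of the image.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # probability of class "<= t"
    mu = np.cumsum(prob * np.arange(256))     # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def binarize(gray):
    """Binarize a uint8 grayscale image to black text on a white background."""
    t = otsu_threshold(gray)
    binary = np.where(gray > t, 255, 0).astype(np.uint8)
    # Assumption: text is the minority class. If white pixels are the
    # minority, the input was light-on-dark, so invert it.
    if (binary == 255).mean() < 0.5:
        binary = 255 - binary
    return binary


# Synthetic example: light "text" on a dark background.
gray = np.full((20, 20), 30, dtype=np.uint8)
gray[8:12, 5:15] = 220
binary = binarize(gray)
```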

Hough Line Transform:

The hough_transforms function detects lines in the binary image using the Hough line transform.
It rotates the image to align detected lines horizontally.
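
To illustrate the idea, the sketch below estimates the skew angle with a small hand-written Hough accumulator restricted to near-horizontal lines. It is an assumption-laden stand-in for hough_transforms, which may well rely on a library implementation instead; the function name and the angle_range and angle_step parameters are hypothetical.

```python
import numpy as np


def estimate_skew_degrees(binary, angle_range=15.0, angle_step=0.5):
    """Estimate the dominant line angle (degrees) in a black-on-white image.

    For a candidate angle a, every point on a line inclined by a has the
    same value of rho = -x*sin(a) + y*cos(a). The candidate whose rho
    histogram has the tallest peak is returned.
    """
    ys, xs = np.nonzero(binary == 0)          # foreground (text) pixels
    if xs.size == 0:
        return 0.0
    diag = int(np.ceil(np.hypot(*binary.shape)))
    candidates = np.arange(-angle_range, angle_range + angle_step, angle_step)
    best_angle, best_votes = 0.0, -1
    for deg in candidates:
        a = np.deg2rad(deg)
        rho = np.round(-xs * np.sin(a) + ys * np.cos(a)).astype(int) + diag
        votes = np.bincount(rho, minlength=2 * diag + 1).max()
        if votes > best_votes:
            best_angle, best_votes = float(deg), votes
    return best_angle


# Synthetic example: one line inclined by 5 degrees (y grows downward).
skewed = np.full((200, 200), 255, dtype=np.uint8)
cols = np.arange(200)
rows = 40 + np.round(cols * np.tan(np.deg2rad(5.0))).astype(int)
skewed[rows, cols] = 0

level = np.full((200, 200), 255, dtype=np.uint8)
level[50, :] = 0

skew = estimate_skew_degrees(skewed)
```

Once the angle is known, the image would be rotated by the opposite angle with the project's image library so that the detected lines become horizontal.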

Text Orientation Correction:

The pytesseract_orientation function uses Tesseract OCR to detect the text orientation.
It rotates the image based on the detected orientation for proper alignment.
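
A possible shape for this step is shown below. It assumes that pytesseract.image_to_osd is used and that the "Rotate" field of its output is the clockwise correction in degrees; the helper names are hypothetical. correct_orientation needs the Tesseract binary installed, so pytesseract is imported only inside that function.

```python
import re

import numpy as np


def parse_osd_rotation(osd_text):
    """Extract the 'Rotate: N' value (degrees) from Tesseract OSD output."""
    match = re.search(r"Rotate:\s*(\d+)", osd_text)
    return int(match.group(1)) if match else 0


def rotate_clockwise(image, degrees):
    """Rotate an image array clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("only multiples of 90 degrees are supported")
    # np.rot90 rotates counterclockwise for positive k, so negate it.
    return np.rot90(image, k=-(degrees // 90))


def correct_orientation(image):
    """Detect the page orientation with Tesseract and undo it.

    Assumption: the OSD 'Rotate' value is the clockwise correction.
    """
    import pytesseract  # requires the Tesseract binary to be installed

    osd = pytesseract.image_to_osd(image)
    return rotate_clockwise(image, parse_osd_rotation(osd))


# The parsing and rotation helpers can be exercised without Tesseract.
sample_osd = "Page number: 0\nOrientation in degrees: 270\nRotate: 90\n"
angle = parse_osd_rotation(sample_osd)
turned = rotate_clockwise(np.array([[1, 2], [3, 4]]), angle)
```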

Image Preprocessing Pipeline:

The preprocess function orchestrates the entire preprocessing pipeline:
Loads the image.
Detects salt and pepper noise and removes it if present.
Binarizes the image.
Applies the Hough transform to align text horizontally.
Corrects the text orientation.
Saves the preprocessed image.

Segmentation of Text

The find_contours function identifies contours within the thresholded and dilated image.
Contours represent the boundaries of distinct objects or regions within the image.
It filters out small or insignificant contours based on width and height thresholds.
For each remaining contour, it computes a bounding box around the region of interest and saves the cropped region as a separate image file in a specified output directory.
The function returns the count of extracted regions.
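
The filtering and cropping logic can be sketched without an image library by swapping contour tracing for connected-component labeling, which yields the same bounding boxes for solid regions. This is an illustration only: find_contours itself presumably relies on a contour API, the name extract_regions and the min_width and min_height parameters are hypothetical, and the crops are returned rather than written to disk.

```python
from collections import deque

import numpy as np


def extract_regions(mask, min_width=3, min_height=3):
    """Return (x, y, w, h) boxes of connected foreground regions in a
    boolean mask, skipping regions smaller than the given thresholds."""
    height, width = mask.shape
    seen = np.zeros((height, width), dtype=bool)
    boxes = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Breadth-first flood fill over the 8-connected neighborhood.
        queue = deque([(sy, sx)])
        seen[sy, sx] = True
        y0 = y1 = sy
        x0 = x1 = sx
        while queue:
            y, x = queue.popleft()
            y0, y1 = min(y0, y), max(y1, y)
            x0, x1 = min(x0, x), max(x1, x)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
        w, h = x1 - x0 + 1, y1 - y0 + 1
        if w >= min_width and h >= min_height:
            boxes.append((int(x0), int(y0), int(w), int(h)))
    return boxes


# Synthetic example: one word-sized blob and one single-pixel speck.
mask = np.zeros((20, 20), dtype=bool)
mask[3:7, 2:7] = True
mask[15, 15] = True

boxes = extract_regions(mask)
crops = [mask[y:y + h, x:x + w] for x, y, w, h in boxes]
```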

Feature Extraction/Selection Module

We tried the following approaches:

Horizontal and vertical histograms
Entropy
Histogram of Oriented Gradients (HOG)
Scale-Invariant Feature Transform (SIFT)
Local Phase Quantization (LPQ)

After comparing the results of these approaches, we chose LPQ because it yielded the best results.
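
For readers unfamiliar with LPQ, the sketch below computes a basic descriptor: a short-time Fourier transform at four low frequencies over each local window, with the signs of the real and imaginary parts packed into an 8-bit code and collected in a 256-bin histogram. It is a simplified version that omits the optional decorrelation step, and the window size of 3 is an assumption; the project's own parameters are not stated here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def lpq_histogram(gray, win_size=3):
    """Basic LPQ descriptor (no decorrelation): normalized 256-bin histogram."""
    img = np.asarray(gray, dtype=np.float64)
    r = win_size // 2
    x = np.arange(-r, r + 1)
    a = 1.0 / win_size                       # lowest non-zero frequency

    # 1-D basis functions; 2-D kernels are their outer products.
    w0 = np.ones_like(x, dtype=np.complex128)
    w1 = np.exp(-2j * np.pi * a * x)
    w2 = np.conj(w1)
    kernels = [
        np.outer(w0, w1),   # frequency (a, 0)
        np.outer(w1, w0),   # frequency (0, a)
        np.outer(w1, w1),   # frequency (a, a)
        np.outer(w1, w2),   # frequency (a, -a)
    ]

    windows = sliding_window_view(img, (win_size, win_size))
    codes = np.zeros(windows.shape[:2], dtype=np.int64)
    bit = 0
    for kernel in kernels:
        response = np.tensordot(windows, kernel, axes=([2, 3], [0, 1]))
        for part in (response.real, response.imag):
            codes += (part > 0).astype(np.int64) << bit
            bit += 1

    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()


# Synthetic example: a random image standing in for a text crop.
rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(32, 32))
descriptor = lpq_histogram(sample)
```

Because the codes depend only on the sign of each phase component, the descriptor is relatively insensitive to blur, which is the usual motivation for LPQ.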

Model Selection/Training Module

We tried the following approaches:

K-Nearest Neighbors (KNN)
Support Vector Machine (SVM)
Decision Tree
Random Forest

After comparing the results of these approaches, we chose Random Forest because it yielded the best results in combination with LPQ features.
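
A comparison of this kind could look like the scikit-learn sketch below. The features are synthetic stand-ins (well-separated Gaussian clusters, one per font code), not real LPQ descriptors, and the hyperparameters are library defaults rather than the values used in this project, so the scores say nothing about the actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(features, labels, folds=5, seed=0):
    """Return the mean cross-validated accuracy of each candidate model."""
    candidates = {
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100,
                                                random_state=seed),
    }
    return {
        name: float(cross_val_score(model, features, labels, cv=folds).mean())
        for name, model in candidates.items()
    }


# Synthetic stand-in data: 4 classes, 50 samples each, 16 features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
features = rng.normal(size=(200, 16)) + 5.0 * labels[:, None]

scores = compare_models(features, labels)
best_model = max(scores, key=scores.get)
```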