
abdulla20-8 / kurdish-central-handwritten


This is my first research project in machine learning and deep learning. The dataset covers 45 classes with 7,000 images per class, 315,000 images in total.

Home Page: https://www.sciencedirect.com/science/article/pii/S2352340923001324

kurdish-dataset character-recognition dataset digit-recognition handwritten handwritten-digit-recognition kurdish kurdish-handwriting-dataset kurdish-handwritten ocr

kurdish-central-handwritten's Introduction

Central Kurdish Handwritten Character & Digit Recognition

This article presents two large datasets of Central Kurdish handwritten digits and isolated characters, named K-ZHMARA and K-PIT. K-ZHMARA contains 70,000 images of Kurdish digits, 7,000 for each digit, collected on printed A4 sheets with a 10 × 10 grid. K-PIT contains 245,000 images covering all Kurdish characters, 7,000 for each character, collected on printed A4 sheets with a 12 × 10 grid. Together, the two datasets contain 315,000 images. Python was used to process each scanned sheet: the images were segmented, cropped, resized, binarized, and inverted using edge detection and other image processing techniques.
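The processing steps named above (segment, crop, resize, binarize, invert) can be sketched roughly as follows. This is a minimal illustration, not the code used to build the datasets: it cuts the page into equal grid cells instead of using edge detection, it reads "10 × 10" as rows × columns, and the function names, the threshold of 128, and the 32 × 32 output size are all assumptions made for the example.

```python
import numpy as np


def split_grid(page, rows, cols):
    """Cut a grayscale page (2-D array) into rows * cols equal cells, row by row."""
    height, width = page.shape
    cell_h, cell_w = height // rows, width // cols
    cells = []
    for r in range(rows):
        for c in range(cols):
            cells.append(page[r * cell_h:(r + 1) * cell_h,
                              c * cell_w:(c + 1) * cell_w])
    return cells


def binarize_and_invert(cell, threshold=128):
    """Binarize and invert in one step: dark ink becomes 255, light paper becomes 0."""
    return np.where(cell < threshold, 255, 0).astype(np.uint8)


def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2-D image to size x size."""
    h, w = img.shape
    row_idx = np.arange(size) * h // size
    col_idx = np.arange(size) * w // size
    return img[np.ix_(row_idx, col_idx)]


# Synthetic stand-in for a scanned sheet: white paper with one dark mark
# inside the first grid cell (hypothetical data, not from the datasets).
page = np.full((1000, 800), 255, dtype=np.uint8)
page[40:60, 30:50] = 0

cells = split_grid(page, rows=10, cols=10)
samples = [resize_nearest(binarize_and_invert(cell), 32) for cell in cells]
```

Each entry of `samples` is a 32 × 32 `uint8` array with white strokes on a black background. A real pipeline would also need to locate the grid lines on the scan and trim the cell borders, which is where the edge detection mentioned above comes in.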

For more information, see my paper in the Data in Brief journal: https://www.sciencedirect.com/science/article/pii/S2352340923001324

Objective

OCR converts text or text-containing documents, including handwritten, printed, or scanned text images, into a digital format that can be edited and used for further processing. It allows a machine to recognize the text in such materials automatically. Successful automation depends on overcoming a few significant obstacles, for instance the need for a large and reliable dataset.

Little research has been done on automatically recognizing Kurdish handwritten characters and digits, because machine learning and deep learning models need large datasets to achieve high accuracy. The aim of this work is to prepare two large datasets for the Kurdish language: K-PIT (Kurdish characters) and K-ZHMARA (Kurdish digits). These datasets can be used to build models for handwritten character and digit recognition and identification with machine learning and deep learning approaches.

Data Description

Kurdish dialects are spoken across four main nation-states in the Middle East [2]: Turkey, Iraq, Iran, and Syria, where the majority of Kurdish-speaking regions are located. Only one dialect, Sorani, has official status in one of these states. According to estimates, more than 40 million people speak Kurdish overall [3,4]. Central Kurdish (Sorani), one of the two main dialects, is spoken by an estimated 9 to 10 million people [5]. It is mostly written in a modified Arabic/Persian alphabet of 35 characters, which excludes characters that have recently been replaced, such as (ك), now written as (ک) [6,7].

This work provides a large database of isolated handwritten Central Kurdish digit and character images. It totals 315,000 images, 7,000 for each digit and character, handwritten by more than 1,500 native speakers. Table 1 shows the number of images and the percentage of each character in the K-PIT dataset. Table 2 shows the number and proportion of each digit in the K-ZHMARA dataset.

Central Kurdish is written with modified Arabic/Persian (Farsi) characters, and there are numerous large databases of Persian and Arabic handwritten characters for offline character recognition. Some of them even claim to be usable for other languages written in the Arabic script, such as Kurdish [8], [9], [10]. Nevertheless, there are three primary issues. The first is that these databases do not include all of the Kurdish letters, such as V (ڤ), L (ڵ), J (ژ), R (ڕ), and O (ۆ). The second is that the quantity and percentage of Kurdish characters in them are inconsistent. The third is that all of these datasets cover characters only and ignore digits.
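The totals quoted for the two datasets follow directly from the per-class figures, and the arithmetic can be checked with a few lines of Python. The sketch assumes every class holds exactly 7,000 images, as stated above; the constant and variable names are chosen for this example only.

```python
IMAGES_PER_CLASS = 7_000
DIGIT_CLASSES = 10        # K-ZHMARA
CHARACTER_CLASSES = 35    # K-PIT

k_zhmara_total = DIGIT_CLASSES * IMAGES_PER_CLASS      # 70,000
k_pit_total = CHARACTER_CLASSES * IMAGES_PER_CLASS     # 245,000
combined_total = k_zhmara_total + k_pit_total          # 315,000

# With a perfectly balanced dataset, every class has the same share.
share_per_digit = IMAGES_PER_CLASS / k_zhmara_total    # 1/10
share_per_character = IMAGES_PER_CLASS / k_pit_total   # 1/35

print(f"K-ZHMARA: {k_zhmara_total:,} images, {share_per_digit:.2%} per digit")
print(f"K-PIT:    {k_pit_total:,} images, {share_per_character:.2%} per character")
print(f"Combined: {combined_total:,} images")
```

Under that assumption each digit accounts for 10% of K-ZHMARA and each character for about 2.86% of K-PIT.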


