SleukRith Set

Description

Datasets for ancient written text recognition are of fundamental interest, both for training statistics-based recognition methods and for benchmarking existing recognition systems. SleukRith Set is the first dataset specifically created for Khmer palm leaf manuscripts. It consists of annotated data from 657 pages of digitized palm leaf manuscripts, selected arbitrarily from our digitized palm leaf image corpus.

SleukRith Set is composed of three types of data:

  • Isolated Characters: The isolated character dataset is the most important data type in SleukRith Set, since the other types of data are produced from it. To segment and annotate a manuscript page into small image patches, each representing an individual character, a polygon boundary enclosing the character has to be drawn manually. The ground truther places the vertices of the polygon one by one until a proper boundary is formed, and is then prompted to input the correct Unicode character or Unicode sequence as the label for that character. Some samples of character image patches extracted from the annotated character dataset of SleukRith Set are shown below.

char_image_patches

  • Words: After all characters in the page have been manually annotated, they can be combined into words. To form a word, the character components of that word are selected one by one. The selection order matters, because a Khmer Unicode sequence does not follow the left-to-right position of the characters but instead follows a consonant-first, vowel-second order. The ground truther is then prompted to input a Unicode sequence as the label of the formed word. By default, the word label is generated by concatenating the labels of the characters that make up the word. A second label should also be provided by the ground truther when the spelling of the word in the manuscript is found to be erroneous, or when the equivalent word in modern Khmer has a different spelling. The image below illustrates some samples of word image patches extracted from the annotated word dataset of SleukRith Set.

word_image_patches

  • Lines: Similarly, annotated characters can be grouped into lines. To do this efficiently, the ground truther left-clicks and drags over the characters belonging to the same line, and is then asked either to create a new line from the selected characters or to add them to an existing line.
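The point about selection order in the Words item can be illustrated with a minimal Python sketch. It is not taken from the annotation tool: the `CharPatch` class, its `x_left` field, and the two-character example are assumptions made for illustration only. The example relies on the fact that the Khmer vowel sign U+17C1 is drawn to the left of its consonant but follows it in a Unicode sequence.

```python
from dataclasses import dataclass


@dataclass
class CharPatch:
    """Hypothetical stand-in for one annotated character."""
    char_id: int
    label: str   # Unicode label entered by the ground truther
    x_left: int  # assumed left edge of the patch on the page, in pixels


def default_word_label(selected):
    """Concatenate character labels in the order the characters were selected."""
    return "".join(c.label for c in selected)


# Made-up syllable: the vowel sign U+17C1 sits to the LEFT of the consonant
# U+1791 on the leaf, yet it comes AFTER the consonant in the Unicode sequence.
vowel = CharPatch(char_id=0, label="\u17c1", x_left=10)
consonant = CharPatch(char_id=1, label="\u1791", x_left=40)

selection_order = [consonant, vowel]  # consonant first, vowel second
position_order = sorted(selection_order, key=lambda c: c.x_left)

word = default_word_label(selection_order)  # valid Unicode order
naive = default_word_label(position_order)  # left-to-right order, vowel first

print(word == naive)
```

Under these assumptions, sorting the patches by horizontal position puts the vowel sign first, which is why the label is built from the selection order instead.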

After all steps in the annotation scheme are complete, an XML file containing all three types of annotation data can be exported for each manuscript page. The XML file is divided into two sections. The upper section, under the tag name “CharAnno”, is dedicated to annotation at the character level. It contains one child block per annotated character, which stores the coordinates of the character's polygon boundary together with additional attributes: the character id, its label, and the id of the line the character belongs to. The lower section, under the tag name “WordAnno”, describes annotation at the word level. Since a word is a combination of characters, each word block stores only the ids of the annotated characters defined in the first section, along with the id of the annotated word and its two labels.

<CharAnno>
    <Char id="0" label="យ" lineid="0">
        <poly x="406" y="100"/>
        <poly x="406" y="87"/>
        ...
    </Char>
    ...
</CharAnno>
<WordAnno>
    <Word id="0" label="កំលាំង" label2="កម្លាំង">
        <CharInWord id="329"/>
        <CharInWord id="330"/>
        ...
    </Word>
    ...
</WordAnno>
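
As a rough illustration of how such a file might be read, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The tag and attribute names come from the excerpt above. Everything else is an assumption: the excerpt does not show the enclosing root element, so a hypothetical `<Page>` wrapper is used, and the sample characters, polygons, and labels are made up rather than taken from the dataset.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up page. <Page> is a hypothetical root element; the real root is not
# shown in the excerpt. Labels use escapes: U+1791 (consonant), U+17C1 (vowel).
SAMPLE_PAGE = """<Page>
<CharAnno>
    <Char id="329" label="\u1791" lineid="0">
        <poly x="406" y="100"/>
        <poly x="406" y="87"/>
        <poly x="420" y="87"/>
        <poly x="420" y="100"/>
    </Char>
    <Char id="330" label="\u17c1" lineid="0">
        <poly x="390" y="100"/>
        <poly x="390" y="87"/>
        <poly x="405" y="87"/>
        <poly x="405" y="100"/>
    </Char>
</CharAnno>
<WordAnno>
    <Word id="0" label="\u1791\u17c1" label2="\u1791\u17c1">
        <CharInWord id="329"/>
        <CharInWord id="330"/>
    </Word>
</WordAnno>
</Page>"""


def parse_page(xml_text):
    """Return (chars, words, lines) read from one annotation file."""
    root = ET.fromstring(xml_text)
    chars = {}
    lines = defaultdict(list)  # line id -> character ids on that line
    for node in root.iter("Char"):
        polygon = [(int(p.get("x")), int(p.get("y"))) for p in node.findall("poly")]
        chars[node.get("id")] = {"label": node.get("label"),
                                 "lineid": node.get("lineid"),
                                 "polygon": polygon}
        lines[node.get("lineid")].append(node.get("id"))
    words = [{"id": node.get("id"),
              "label": node.get("label"),
              "label2": node.get("label2"),
              "char_ids": [c.get("id") for c in node.findall("CharInWord")]}
             for node in root.iter("Word")]
    return chars, words, dict(lines)


def bounding_box(polygon):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


chars, words, lines = parse_page(SAMPLE_PAGE)
first_word = words[0]
# Rebuild the default word label from the referenced characters, in stored order.
rebuilt = "".join(chars[cid]["label"] for cid in first_word["char_ids"])
box = bounding_box(chars["329"]["polygon"])
print(first_word["char_ids"], box)
```

Ids are kept as strings so that `CharInWord` references can be looked up directly in the character dictionary. The bounding box is only a convenience for cropping a rectangular patch; the polygon itself remains the actual character boundary.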

Download

SleukRith Set

Annotation Tool

Datasets Extracted from SleukRith Set

For more information about SleukRith Set, please refer to our paper: Valy, D., Verleysen, M., Chhun, S., & Burie, J. C. (2017). A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition - SleukRith Set. In 4th International Workshop on Historical Document Imaging and Processing (HIP).

Acknowledgement

We would like to thank the National Library of Cambodia, the EFEO team, and the Buddhist Institute for providing their digital images of palm leaf manuscripts. We would also like to acknowledge the volunteer students from the Institute of Technology of Cambodia (ITC) and the National Institute of Posts, Telecommunications, and ICT (NIPTICT) who helped annotate our dataset.

This research study is supported by ARES-CCD (program AI 2014-2019) under the funding of Belgian university cooperation and the STIC Asia program implemented by the French Ministry of Foreign Affairs and International Development (MAEDI).
