Giter Site home page Giter Site logo

antonsviridenko / interoperable-transliteration Goto Github PK

View Code? Open in Web Editor NEW

This project forked from interscript/interscript-ruby

0.0 0.0 0.0 24.89 MB

Interoperable script conversion systems (ISCS) with the `interscript` gem

Python 14.75% Ruby 84.35% Shell 0.90%

interoperable-transliteration's Introduction

Interscript: Interoperable Script Conversion Systems and a Ruby implementation

Build Status

Introduction

This repository contains a number of transliteration schemes from:

  • BGN/PCGN

  • ICAO

  • ISO

  • UN (by UNGEGN)

The goal is to achieve interoperable transliteration schemes allowing quality comparisons.

interscript screencast

STATUS (work in progress!)

These transliteration systems currently work:

bgnpcgn-rus-Cyrl-Latn-1947

BGN/PCGN Romanization of Russian

iso-rus-Cyrl-Latn-iso9

ISO 9 Romanization of Russian

icao-rus-Cyrl-Latn-9303

ICAO MRZ Romanization of Russian

bas-rus-Cyrl-Latn-bss

Bulgaria Academy of Science Streamlined System for Russian

Installation

Interscript depends on Ruby. Once you manage to install Ruby, it’s easy.

gem install interscript

Usage

Assume you have a file ready in the source script like this:

cat <<EOT > rus-Cyrl.txt
Эх, тройка! птица тройка, кто тебя выдумал? знать, у бойкого народа ты
могла только родиться, в той земле, что не любит шутить, а
ровнем-гладнем разметнулась на полсвета, да и ступай считать версты,
пока не зарябит тебе в очи. И не хитрый, кажись, дорожный снаряд, не
железным схвачен винтом, а наскоро живьём с одним топором да долотом
снарядил и собрал тебя ярославский расторопный мужик. Не в немецких
ботфортах ямщик: борода да рукавицы, и сидит чёрт знает на чём; а
привстал, да замахнулся, да затянул песню — кони вихрем, спицы в
колесах смешались в один гладкий круг, только дрогнула дорога, да
вскрикнул в испуге остановившийся пешеход — и вон она понеслась,
понеслась, понеслась!

Н.В. Гоголь
EOT

You can run interscript on this text using different transliteration systems.

interscript rus-Cyrl.txt \
  --system=bgnpcgn-rus-Cyrl-Latn-1947 \
  --output=bgnpcgn-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=iso-rus-Cyrl-Latn-iso9 \
  --output=iso-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=icao-rus-Cyrl-Latn-9303 \
  --output=icao-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=bas-rus-Cyrl-Latn-bss \
  --output=bas-rus-Latn.txt

It is then easy to see the exact differences in rendering between the systems.

diff bgnpcgn-rus-Latn.txt bas-rus-Latn.txt

Adding transliteration system

Transliteration systems stored in a maps directory as YAML files. You can create a new file and add it to the directory. The file shout be named as <system-code>.yaml.

File structure

authority_id: bgnpcgn
id: 1947
language: rus
source_script: Cyrl
destination_script: Latn
name: ROMANIZATION OF RUSSIAN, BGN/PCGN 1947 System
url: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/807920/ROMANIZATION_OF_RUSSIAN.pdf
creation_date: 1947
confirmation_date: 2019-06
description: The BGN/PCGN system for Russian was adopted ...

notes:
  - The character e should be romanized ye initially, after the vowel ...

tests:
  - source: ДЛИННОЕ ПОКРЫВАЛО
    expected: DLINNOYE POKRYVALO
  - source: Еловая шишка
    expected: Yelovaya shishka

map:
  rules:
    - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
      result: Ye
    - pattern: \b\u0415 # Е initially
      result: Ye

  characters:
    "\u0410": "A"
    "\u0411": "B"
    "\u0412": "V"

Rules

The subsection rules is placed in the map section. All rules are applied in order they are placed before the subsection characters applying. Rules apply to an original text, not to a result of previous rules applying.

Each rule has pattern and result elements.

Pattern is a regex expression. It should be representing as a string without // or %r{} parentheses. For example \b\u0415. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern (?⇐[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.

Result is a replacement a for pattern’s match. It can contain a string, an Unicode characters specified by a hexadecimal number, a captured group reference. String with hexadecimal number or captured group reference should be double quoted. For example "Y\u00eb" or "\\1\u00b7\\2". Captured group are referred by double backslash and group’s number.

Testing transliteration systems

To test all transliteration systems in maps directory run a command:

bundle exec rspec

The command takes source texts from test section, transforma it using rules and charmaps from map section and compare resultat with expected text form text section.

To test specific transliteration system set environment variable TRANSLIT_SYSTEM to code of desired system. The code is name of YAML file without extension:

TRANSLIT_SYSTEM=bgnpcgn-rus-Cyrl-Latn-1947 bundle exec rspec

ISCS system codes

The system code identifying a script conversion system has a few components:

e.g. bgnpcgn-rus-Cyrl-Latn-1947

bgnpcgn

the authority identifier

rus

an ISO 639-2 3-letter language code that this system applies to

Cyrl

an ISO 15924 script code, identifying the source script

Latn

an ISO 15924 script code, identifying the target script

1947

an identifier unit within the authority to identify this system

Covered languages

Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.

License

Copyright Ribose.

interoperable-transliteration's People

Contributors

abunashir avatar andrew2net avatar chaaklau avatar manuelfuenmayor avatar ronaldtse avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.