These transliteration systems currently work:
bgnpcgn-rus-Cyrl-Latn-1947
-
BGN/PCGN Romanization of Russian
iso-rus-Cyrl-Latn-iso9
-
ISO 9 Romanization of Russian
icao-rus-Cyrl-Latn-9303
-
ICAO MRZ Romanization of Russian
bas-rus-Cyrl-Latn-bss
-
Bulgaria Academy of Science Streamlined System for Russian
Interscript depends on Ruby. Once you manage to install Ruby, it’s easy.
gem install interscript
Assume you have a file ready in the source script like this:
cat <<EOT > rus-Cyrl.txt
Эх, тройка! птица тройка, кто тебя выдумал? знать, у бойкого народа ты
могла только родиться, в той земле, что не любит шутить, а
ровнем-гладнем разметнулась на полсвета, да и ступай считать версты,
пока не зарябит тебе в очи. И не хитрый, кажись, дорожный снаряд, не
железным схвачен винтом, а наскоро живьём с одним топором да долотом
снарядил и собрал тебя ярославский расторопный мужик. Не в немецких
ботфортах ямщик: борода да рукавицы, и сидит чёрт знает на чём; а
привстал, да замахнулся, да затянул песню — кони вихрем, спицы в
колесах смешались в один гладкий круг, только дрогнула дорога, да
вскрикнул в испуге остановившийся пешеход — и вон она понеслась,
понеслась, понеслась!
Н.В. Гоголь
EOT
You can run interscript
on this text using different transliteration systems.
interscript rus-Cyrl.txt \
--system=bgnpcgn-rus-Cyrl-Latn-1947 \
--output=bgnpcgn-rus-Latn.txt
interscript rus-Cyrl.txt \
--system=iso-rus-Cyrl-Latn-iso9 \
--output=iso-rus-Latn.txt
interscript rus-Cyrl.txt \
--system=icao-rus-Cyrl-Latn-9303 \
--output=icao-rus-Latn.txt
interscript rus-Cyrl.txt \
--system=bas-rus-Cyrl-Latn-bss \
--output=bas-rus-Latn.txt
It is then easy to see the exact differences in rendering between the systems.
diff bgnpcgn-rus-Latn.txt bas-rus-Latn.txt
Transliteration systems stored in a maps
directory as YAML files. You can create a new file and add it to the directory. The file shout be named as <system-code>.yaml
.
authority_id: bgnpcgn
id: 1947
language: rus
source_script: Cyrl
destination_script: Latn
name: ROMANIZATION OF RUSSIAN, BGN/PCGN 1947 System
url: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/807920/ROMANIZATION_OF_RUSSIAN.pdf
creation_date: 1947
confirmation_date: 2019-06
description: The BGN/PCGN system for Russian was adopted ...
notes:
- The character e should be romanized ye initially, after the vowel ...
tests:
- source: ДЛИННОЕ ПОКРЫВАЛО
expected: DLINNOYE POKRYVALO
- source: Еловая шишка
expected: Yelovaya shishka
map:
rules:
- pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
result: Ye
- pattern: \b\u0415 # Е initially
result: Ye
characters:
"\u0410": "A"
"\u0411": "B"
"\u0412": "V"
The subsection rules
is placed in the map
section. All rules are applied in order they are placed before the subsection characters
applying. Rules apply to an original text, not to a result of previous rules applying.
Each rule has pattern
and result
elements.
Pattern is a regex expression. It should be representing as a string without //
or %r{}
parentheses. For example \b\u0415
. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern (?⇐[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415
find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.
Result is a replacement a for pattern’s match. It can contain a string, an Unicode characters specified by a hexadecimal number, a captured group reference. String with hexadecimal number or captured group reference should be double quoted. For example "Y\u00eb"
or "\\1\u00b7\\2"
. Captured group are referred by double backslash and group’s number.
To test all transliteration systems in maps
directory run a command:
bundle exec rspec
The command takes source
texts from test
section, transforma it using rules
and charmaps
from map
section and compare resultat with expected
text form text
section.
To test specific transliteration system set environment variable TRANSLIT_SYSTEM
to code of desired system. The code is name of YAML file without extension:
TRANSLIT_SYSTEM=bgnpcgn-rus-Cyrl-Latn-1947 bundle exec rspec
The system code identifying a script conversion system has a few components:
e.g. bgnpcgn-rus-Cyrl-Latn-1947
bgnpcgn
-
the authority identifier
rus
-
an ISO 639-2 3-letter language code that this system applies to
Cyrl
-
an ISO 15924 script code, identifying the source script
Latn
-
an ISO 15924 script code, identifying the target script
1947
-
an identifier unit within the authority to identify this system
-
rus-Cyrl-1.txt
: Copied from the XLS output from http://www.primorsk.vybory.izbirkom.ru/region/primorsk?action=show&global=true&root=254017025&tvd=4254017212287&vrn=100100067795849&prver=0&pronetvd=0®ion=25&sub_region=25&type=242&vibid=4254017212287 -
rus-Cyrl-2.txt
: Copied from the XLS output from http://www.yaroslavl.vybory.izbirkom.ru/region/yaroslavl?action=show&root=764013001&tvd=4764013188704&vrn=4764013188693&prver=0&pronetvd=0®ion=76&sub_region=76&type=426&vibid=4764013188704
-
ALA-LC Romanization systems from 1997 are available here: http://catdir.loc.gov/catdir/cpso/roman.html
-
ALA-LC Romanization systems in current use are here: https://www.loc.gov/catdir/cpso/roman.html
-
UN systems are available here: http://www.eki.ee/wgrs/