reynoldsnlp / udar

License: GNU General Public License v3.0

Topics: natural-language-processing, morphological-analysis, morphological-disambiguator, learner-errors, stressed-wordforms, finite-state-machine, finite-state-transducers, finite-state-morphology, russian-language, russian-morphology

udar's Introduction

UDAR(enie)


UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.

A Python wrapper for the Russian finite-state transducer originally described in chapter 2 of my dissertation.

If you use this work in your research, please cite the following:


Reynolds, Robert J. "Russian natural language processing for computer-assisted language learning: capturing the benefits of deep morphological analysis in real-life applications" PhD Diss., UiT–The Arctic University of Norway, 2016. https://hdl.handle.net/10037/9685


Feature requests, issues, and pull requests are welcome!

Dependencies

For all features to be available, you should have hfst and vislcg3 installed as command-line utilities. Specifically, hfst is needed for FST-based tokenization, and vislcg3 is needed for grammatical disambiguation. The versions used to successfully test the code are recorded in each commit. The recommended method for installing these dependencies is as follows:

Debian / Ubuntu

$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install cg3 hfst python3-hfst

macOS (Python 3.6/3.7 only)

On macOS, one of udar's dependencies, the Python package hfst, is not currently available for Python 3.8+. Hopefully, this will be remedied soon.

$ curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash
$ python3 -m pip install hfst

Installation

This package can be installed from PyPI using the usual...

$ python3 -m pip install --user udar

...or directly from this repository using...

$ python3 -m pip install --user git+https://github.com/reynoldsnlp/udar

Introduction

NB! Documentation is currently limited to docstrings. I recommend that you use help() frequently to see how to use classes and methods. For example, to see what options are available for building a Document, try help(Document).

The most common use-case is to use the Document constructor to automatically tokenize and analyze a text. If you print() a Document object, the result is an XFST/HFST stream:

import udar
doc1 = udar.Document('Мы удивились простоте системы.')
print(doc1)
# Мы	мы+Pron+Pers+Pl1+Nom	0.000000
#
# удивились	удивиться+V+Perf+IV+Pst+MFN+Pl	5.078125
#
# простоте	простота+N+Fem+Inan+Sg+Dat	4.210938
# простоте	простота+N+Fem+Inan+Sg+Loc	4.210938
#
# системы	система+N+Fem+Inan+Pl+Acc	5.429688
# системы	система+N+Fem+Inan+Pl+Nom	5.429688
# системы	система+N+Fem+Inan+Sg+Gen	5.429688
#
# .	.+CLB	0.000000

Passing the argument disambiguate=True, or running doc1.disambiguate() after the fact, will run a Constraint Grammar to remove as many ambiguous readings as possible. This grammar is far from complete, so some ambiguous readings will remain.
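
For example, the following two approaches are equivalent (a minimal sketch; exactly which readings get removed depends on the grammar):

doc2 = udar.Document('Мы удивились простоте системы.', disambiguate=True)
# ...is equivalent to...
doc3 = udar.Document('Мы удивились простоте системы.')
doc3.disambiguate()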

Data objects

Document object

Property Type Description
text str Original text of this document
sentences List[Sentence] List of sentences in this document
num_tokens int Number of tokens in this document
features tuple udar.features.FeatureExtractor stores extracted features here

Document objects have convenient methods for adding stress or converting to phonetic transcription.

Method Return type Description
stressed str The original text of the document with stress marks
phonetic str The original text converted to phonetic transcription
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate None Disambiguate readings using the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
from_cg3 Document Create Document from VISL-CG3 format stream
hfst_str str Analysis stream in the XFST/HFST format
from_hfst Document Create Document from XFST/HFST format stream
to_dict list Convert to a complex list object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string

Examples

stressed_doc1 = doc1.stressed()
print(stressed_doc1)
# Мы́ удиви́лись простоте́ систе́мы.

ambig_doc = udar.Document('Твои слова ничего не значат.', disambiguate=True)
print(sorted(ambig_doc[1].stresses()))  # Note that слова is still ambiguous
# ['сло́ва', 'слова́']

print(ambig_doc.stressed(selection='safe'))  # 'safe' skips сло́ва and слова́
# Твои́ слова ничего́ не зна́чат.
print(ambig_doc.stressed(selection='all'))  # 'all' combines сло́ва and слова́
# Твои́ сло́ва́ ничего́ не зна́чат.
print(ambig_doc.stressed(selection='rand') in {'Твои́ сло́ва ничего́ не зна́чат.', 'Твои́ слова́ ничего́ не зна́чат.'})  # 'rand' randomly chooses between сло́ва and слова́
# True


phonetic_doc1 = doc1.phonetic()
print(phonetic_doc1)
# мы́ уд'ив'и́л'ис' пръстʌт'э́ с'ис'т'э́мы.

Sentence object

Property Type Description
doc Document "Back pointer" to the parent document of this sentence
text str Original text of this sentence
tokens List[Token] The list of tokens in this sentence
id str (optional) Sentence id, if assigned at creation

Method Return type Description
stressed str The original text of the sentence with stress marks
phonetic str The original text converted to phonetic transcription
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate None Disambiguate readings using the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
from_cg3 Sentence Create Sentence from VISL-CG3 format stream
hfst_str str Analysis stream in the XFST/HFST format
from_hfst Sentence Create Sentence from XFST/HFST format stream
to_dict list Convert to a complex list object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string
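
Sentences are reached through a Document's sentences property. For example (a minimal sketch, reusing doc1 from above):

sent = doc1.sentences[0]
print(sent.text)
# Мы удивились простоте системы.
print(len(sent.tokens))
# 5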

Token object

Property Type Description
id str The index of this token in the sentence, 1-based
text str The original text of this token
misc str Miscellaneous annotations with regard to this token
lemmas Set[str] All possible lemmas, based on remaining readings
readings List[Reading] List of readings not removed by the Constraint Grammar
removed_readings List[Reading] List of readings removed by the Constraint Grammar
deprel str The dependency relation between this word and its syntactic head. Example: ‘nmod’.

Method Return type Description
stresses Set[str] All possible stressed wordforms, based on remaining readings
stressed str The original text of the sentence with stress marks
phonetic str The original text converted to phonetic transcription
most_likely_reading Reading "Most likely" reading (may be partially random selection)
most_likely_lemmas List[str] List of lemma(s) from the "most likely" reading
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
force_disambiguate None Fully disambiguate readings using methods other than the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
to_dict dict Convert to a dict object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string
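
Individual Tokens can be inspected through a Sentence's tokens list. For example (a minimal sketch, continuing from doc1 above):

tok = doc1.sentences[0].tokens[2]
print(tok.text, tok.lemmas)
# простоте {'простота'}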

Reading object

Property Type Description
subreadings List[Subreading] Usually only one subreading, but multiple subreadings are possible for complex Tokens.
lemmas List[str] Lemmas from all subreadings
grouped_tags List[Tag] The part-of-speech, morphosyntactic, semantic and other tags from all subreadings
weight str Weight indicating the likelihood of the reading, without respect to context
cg_rule str Reference to the rule in the constraint grammar that removed/selected/etc. this reading. If no action has been taken on this reading, then ''.
is_most_likely bool Indicates whether this reading has been selected as the most likely reading of its Token. Note that some selection methods may be at least partially random.

Method Return type Description
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
generate str Generate the wordform from this reading
replace_tag None Replace a tag in this reading
does_not_conflict bool Determine whether reading from external tagset (e.g. Universal Dependencies) conflicts with this reading
to_dict list Convert to a list object
to_json str Convert to a JSON string
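
For example, a Token's readings can be inspected directly (a minimal sketch; note that weights are stored as strings, per the table above, and the output shown is for the first reading of простоте above):

reading = tok.readings[0]
print(reading.lemmas, reading.weight)
# ['простота'] 4.210938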

Subreading object

Property Type Description
lemma str The lemma of the subreading
tags List[Tag] The part-of-speech, morphosyntactic, semantic and other tags
tagset Set[Tag] Same as tags, but for faster membership testing (in Reading)

Method Return type Description
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
replace_tag None Replace a tag in this reading
to_dict dict Convert to a dict object
to_json str Convert to a JSON string

Tag object

Property Type Description
name str The name of this tag
ms_feat str Morphosyntactic feature that this tag is associated with (e.g. Dat has ms_feat CASE)
detail str Description of the tag's purpose or meaning
is_L2_error bool Whether this tag indicates a second-language learner error

Method Return type Description
info str Alias for Tag.detail
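
For example, the Tags of a reading can be inspected via grouped_tags (a minimal sketch, assuming only the attributes listed in the tables above):

for tag in reading.grouped_tags:
    print(tag.name, tag.detail)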

Convenience functions

A number of functions are included, both for convenience and to give concrete examples of using the API.

noun_distractors()

This function generates all six cases of a given noun. If the given noun is singular, then the function generates singular forms. If the given noun is plural, then the function generates plural forms. Such a list can be used in a multiple-choice exercise, hence the name distractors.

sg_paradigm = udar.noun_distractors('словом')
print(sg_paradigm == {'сло́ву', 'сло́ве', 'сло́вом', 'сло́ва', 'сло́во'})
# True

pl_paradigm = udar.noun_distractors('словах')
print(pl_paradigm == {'слова́м', 'слова́', 'слова́х', 'слова́ми', 'сло́в'})
# True

If unstressed forms are desired, simply pass the argument stressed=False.
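
For example (a minimal sketch; the expected result is the same paradigm without stress marks):

unstressed = udar.noun_distractors('словом', stressed=False)
print(unstressed == {'слову', 'слове', 'словом', 'слова', 'слово'})
# True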

diagnose_L2()

This function takes a text string as its argument and returns a dictionary of all the types of L2 errors in the text, along with examples of each error.

diag = udar.diagnose_L2('Етот малчик говорит по-русски.')
print(diag == {'Err/L2_e2je': {'Етот'}, 'Err/L2_NoSS': {'малчик'}})
# True

tag_info()

This function looks up the meaning of any tag used by the analyzer.

print(udar.tag_info('Err/L2_ii'))
# L2 error: Failure to change ending ие to ии in +Sg+Loc or +Sg+Dat, e.g. к Марие, о кафетерие, о знание

Using the transducers manually

The transducers come in two varieties: the Analyzer class and the Generator class. For memory efficiency, I recommend using the get_analyzer and get_generator functions, which ensure that each flavor of the transducers remains a singleton in memory.

Analyzer

The Analyzer can be initialized with or without analyses for second-language learner errors using the keyword L2_errors.

analyzer = udar.get_analyzer()  # by default, L2_errors is False
L2_analyzer = udar.get_analyzer(L2_errors=True)

Analyzers are callable. They take a token str and return a sequence of reading/weight tuples.

raw_readings1 = analyzer('сло́ва')
print(raw_readings1)
# (('слово+N+Neu+Inan+Sg+Gen', 5.9755859375),)

raw_readings2 = analyzer('слова')
print(raw_readings2)
# (('слово+N+Neu+Inan+Pl+Acc', 5.9755859375), ('слово+N+Neu+Inan+Pl+Nom', 5.9755859375), ('слово+N+Neu+Inan+Sg+Gen', 5.9755859375))

Generator

The Generator can be initialized in three varieties: unstressed, stressed, and phonetic.

generator = udar.get_generator()  # unstressed by default
stressed_generator = udar.get_generator(stressed=True)
phonetic_generator = udar.get_generator(phonetic=True)

Generators are callable. They take a Reading or raw reading str and return a surface form.

print(stressed_generator('слово+N+Neu+Inan+Pl+Nom'))
# слова́
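
The phonetic generator works the same way (a minimal sketch; the output uses the same transcription alphabet as Document.phonetic() above):

print(phonetic_generator('слово+N+Neu+Inan+Pl+Nom'))
# (phonetic transcription of слова́)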

Working with Tokens and Readings

You can easily check if a morphosyntactic tag is in a Token, Reading, or Subreading using in:

token2 = udar.Token('слова', analyze=True)
print(token2)
# слова [слово_N_Neu_Inan_Pl_Acc  слово_N_Neu_Inan_Pl_Nom  слово_N_Neu_Inan_Sg_Gen]

print('Gen' in token2)  # do any of the readings include Genitive case?
# True

print('слово' in token2)  # does not work for lemmas; use `in Token.lemmas`
# False

print('слово' in token2.lemmas)
# True

You can make a filtered list of a Token's readings using the following idiom:

pl_readings = [reading for reading in token2 if 'Pl' in reading]
print(pl_readings)
# [Reading(слово+N+Neu+Inan+Pl+Acc, 5.975586, ), Reading(слово+N+Neu+Inan+Pl+Nom, 5.975586, )]
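
You can drill down further into each Reading's Subreadings (a minimal sketch using the properties from the tables above; expected output is based on the readings shown):

for reading in pl_readings:
    for sub in reading.subreadings:
        print(sub.lemma, [tag.name for tag in sub.tags])
# слово ['N', 'Neu', 'Inan', 'Pl', 'Acc']
# слово ['N', 'Neu', 'Inan', 'Pl', 'Nom']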

Related projects

Finite-state tools

Russian morphological analysis

udar's People

Contributors

konnorjp, nsbum, reynoldsnlp


Forkers

nsbum

udar's Issues

collect gold-standard corpora

We need a large collection of gold-standard disambiguated Russian texts for FST/CG testing. One way or another, this will require converting tags and format to udar/CG3. Some possibilities include:

Which hfst-dev package has to be used?

Hi,

I am really interested in trying this program and am currently trying to install it.

I tried following the instructions for Debian/Ubuntu, but I got an error saying that the "hfst-dev" package doesn't exist in the package repository. Is the required library perhaps libhfst-dev?

It would also be nice if the required Python version were specified in the README, because if I understand correctly, hfst doesn't work with newer Pythons.

Have a great day!

add descriptions to each feature

Perhaps make the Feature class access the __doc__ property of self.func.

Also add some kind of summary method to the extractor so that you can print off all of the available features.

Add support for Python 3.8

Currently, the hfst package appears to be incompatible with Python 3.8. Once that dependency is updated, add Python 3.8 to tox.ini and to .github/workflows/pythonpackage.yml

stress on MWEs with multiple stresses

The lexical underlying form needs to have a persistent stress mark that survives the two-level rule that reduces stresses to the right-most one. For example,...

то есть
так как
красно-жёлтых

an example of ambiguity resolving in README

Hi!

I think an example of ambiguity resolution might be helpful. For instance:

import udar

doc1 = udar.Document('Мне недостаточно просто твоего честного слова.')
doc2 = udar.Document('Красивые слова!')
doc3 = udar.Document('Твои слова ничего не значат.')

samples = [doc1, doc2, doc3]

for doc in samples:
  doc.disambiguate()
  print(doc.stressed())

prints out

Мне́ недоста́точно про́сто твоего́ честного сло́ва.
Краси́вые слова́!
Твои́ слова ничего́ не зна́чат.

So, in the first and second sentences the ambiguity was resolved correctly, but ambiguity remains in the third one. It's also not clear that, after calling the disambiguate method, some words may remain unstressed (and no warning message is printed). At first, I tried your code with sentences where the disambiguate method doesn't change anything, and I thought this was a mistake or that the code was incomplete.

And thank you for your work!

Imperatives 1Pl

Reconsider whether to mark 1pl as imperatives. If so, then should imperfectives be marked as well? This is both a linguistic and practical question.

add argument to `Sentence.disambiguate(force=None)`

Make it possible to force disambiguation using any number of methods, such as random, weight, stanza, etc.

Using one of these methods guarantees that each token has only one reading. These methods are already part of the stressed() method, so it would make sense to abstract each one so that it can be used either for disambiguation or for simply generating a stressed wordform while leaving ambiguous readings in place.
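
A hypothetical sketch of the resulting API (the argument values here are illustrative only):

sent.disambiguate(force='weight')  # or force='random', force='stanza', etc.; force=None keeps current behavior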

add alternative output formats

This may not be possible in every case, but where possible, add other common output formats:

  • CoNLL(-X/-U)
  • mystem
  • MULTEXT-East (Sharoff et al.)
  • etc?

Lemmas declared more than once

The following code, using the lexc_parser module...

from os import environ
from sys import stderr

import lexc_parser as lp


# GTPATH is assumed here to be an environment variable pointing to a
# Giella languages checkout (it is not defined in the original snippet)
GTPATH = environ['GTPATH']
filename = GTPATH + '/langs/rus/src/morphology/lexicon.tmp.lexc'

print('Parsing lexc file...', file=stderr)
with open(filename) as f:
    src = f.read()
lexc = lp.Lexc(src)

primary_lexicons = [entry.cc.id for entry in lexc['Root']
                    if entry.cc is not None and entry.cc.id != 'Numeral']
for lex in primary_lexicons:
    lexc[lex].cc_lemmas_dict  # accessing this property warns about duplicate lemmas

...yields the following lists of lemmas that are declared more than once inside the same part of speech's LEXICON:

Parsing lexc file...
ryan.py:17: UserWarning: Lemmas declared more than once within Adverb:
{'коротко', 'наголо', 'верхом', 'чудно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Noun:
{'бронирование', 'пояс', 'колонок', 'кочан', 'ничтожество', 'судзуки', 'лекарство', 'орган', 'рондо', 'видение', 'уголь', 'туника', 'сапожок', 'пресс-релиз', 'артикул', 'соболь', 'огнеупоры', 'кондуктор', 'индустрия', 'чижик', 'вязанка', 'воздвижение', 'недвижимость', 'пулярка', 'призрак', 'козырь', 'флагман', 'цоколь', 'бакан', 'нон-стоп', 'гитлерюгенд', 'сопло', 'ширма', 'предвозвестник', 'провидение', 'болванчик', 'генсовет', 'парилка', 'пугало', 'гигант', 'тягло', 'полиграфия', 'комплекс', 'микрометр', 'мебельщик', 'характерность', 'феномен', 'пристенок', 'хаханьки', 'натура', 'наркоминдел', 'чувиха', 'пергамент', 'водолей', 'сельдь', 'ламповая', 'напряг', 'ферула', 'хиханьки', 'глюк', 'настриг', 'туркменбаши', 'пролог', 'метчик', 'обрезание', 'туфелька', 'розан', 'речушка', 'чабер', 'порсканье', 'судья', 'светоч', 'урка', 'хаос', 'проводка', 'лиганд', 'колосс', 'дочушка', 'маки', 'транспорт', 'замглавы', 'полип', 'ирис', 'угольник', 'проволочка', 'лосось', 'единица', 'червец', 'тотем', 'холодность', 'плёночка', 'картель', 'нуклеокапсид', 'жертва', 'истукан', 'предвестник', 'кашица', 'кредит', 'взрослый', 'опрощение', 'сведение', 'ужин', 'отзыв', 'русло', 'солнечник', 'ход', 'ястребок', 'префикс', 'цитокин', 'ирей', 'синтип', 'бучение', 'книговедение', 'трапезная', 'безобразность', 'край', 'чучело', 'созданьице', 'зайчик', 'рол', 'подволока', 'разлив', 'солнышко', 'креветка', 'консерваторка', 'дядя', 'прототип', 'сметливость', 'гуарани', 'субъект', 'заворот', 'видик', 'катанье', 'ведение', 'создание', 'калига', 'устрица', 'хобот', 'прослушка', 'бодяга', 'зев', 'комроты', 'отчёт', 'фрик', 'конус', 'адрес', 'котик', 'камора', 'дышло', 'плазмодий', 'марионетка', 'отправитель', 'усадьба', 'селище', 'живчик', 'лоцман', 'дублет', 'светило', 'боливар', 'мшанка', 'целение', 'юнкер', 'спутник', 'скакунок', 'дуплет', 'ордер'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Predicative:
{'чудно', 'полно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Pronoun:
{'возле', 'поперёд', 'обок', 'вне', 'внутрь', 'близь', 'помимо', 'посредине', 'напротив', 'поперёк', 'вблизи', 'посреди', 'вперёд', 'наместо', 'спереди', 'наперекор', 'подобно', 'согласно', 'насчёт', 'навроде', 'свыше', 'ниже', 'посередине', 'ради', 'позади', 'вдоль', 'под', 'чрез', 'вроде', 'вследствие', 'посредством', 'выключая', 'у', 'путём', 'касательно', 'превыше', 'накануне', 'относительно', 'вопреки', 'про', 'промежду', 'касаемо', 'около', 'над', 'из-за', 'по', 'сквозь', 'за', 'ввиду', 'соразмерно', 'противу', 'поверх', 'вовнутрь', 'наперерез', 'без', 'позадь', 'вкось', 'вослед', 'пред', 'мимо', 'сообразно', 'из-под', 'опричь', 'внизу', 'между', 'по-над', 'кроме', 'сверху', 'о', 'посередь', 'сверх', 'вкруг', 'внутри', 'промеж', 'через', 'к', 'против', 'от', 'наподобие', 'перед', 'посереди', 'сзади', 'кругом', 'на', 'включая', 'прежде', 'до', 'исключая', 'выше', 'снизу', 'соответственно', 'взамен', 'насупротив', 'для', 'из', 'округ', 'среди', 'меж', 'плюс', 'окрест', 'средь', 'с', 'благодаря', 'спустя', 'вслед', 'при', 'противно²', 'вместо', 'минус', 'вокруг', 'после', 'впереди', 'подле', 'близ', 'по-за', 'изнутри', 'супротив', 'в', 'середь'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Verb:
{'осветить', 'прояснеть', 'отползать', 'запыхаться¹', 'усугубиться', 'тикать', 'усугубить', 'икать'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Propernoun:
{'Мелани', 'Сандро', 'Филатов', 'Зощенко', 'Марго', 'Геркулесович', 'Люси', 'Симонович', 'Фениксович', 'Симон', 'Витольдович', 'Манагуа', 'Якобсон', 'Евтушенко', 'Гордон', 'Исидор', 'Терещенко', 'Геркулесовна', 'Бурденко', 'Исидорович', 'Григоренко', 'Симоновна', 'Фигаро', 'Макаренко', 'Стефанович', 'Филиппов', 'Короленко', 'Геркулес', 'Лонгин', 'Франко', 'Довженко', 'Пегасовна', 'Пегасович', 'Никарагуа', 'Лонгиновна', 'Мартиновна', 'Громыко', 'Элизабет', 'Федотов', 'Павлиновна', 'Лысенко', 'Шевченко', 'Гильфердинг', 'Павлин', 'Шульженко', 'Исаченко', 'Иванов', 'Робинсон', 'Пегас', 'Стефан', 'Мартин', 'Михалков', 'Павлинович', 'Персей', 'Стефановна', 'Семашко', 'Икария', 'Катанга', 'Мемфис', 'Лонгинович', 'Исидоровна', 'Фениксовна', 'Викторович', 'Феникс', 'Стефани', 'Персеевич', 'Новиков', 'Витольдовна', 'Мартинович', 'Любань', 'Витольд', 'Виктор', 'Нестеренко', 'Панченко', 'Гурченко', 'Обухов', 'Персеевна', 'Покров', 'Итака', 'Морган', 'Викторовна'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Punctuation:
{''}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Symbols:
{'%'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within LexicalizedParticiple:
{'положить', 'сложить'}
  lexc[lex].cc_lemmas_dict

Documentation enhancement

Wondering if the documentation might call out a handful of items. It may be obvious, but installation on macOS 10.15.x required the stanza and pexpect dependencies to be installed separately with pip3. And equally obvious, or perhaps not: stanza.download('ru') is required.
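
For reference, the steps described in this report amount to the following (assuming pip3 and Python 3 on macOS):

$ pip3 install stanza pexpect
$ python3 -c "import stanza; stanza.download('ru')"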

improve respacing

Try to find some way to preserve the spacing from the original text.

Implement equality dunders

Currently, none of the custom objects have equality dunders (def __eq__), so the following fails:

>>> import udar
>>> t1 = udar.Text('Мы говорили.')
>>> t2 = udar.Text('Мы говорили.')
>>> t1 == t2
False

Add these for all the objects for which it makes sense.
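
A hypothetical sketch of such a dunder for Token (illustration only, not the project's actual code; nested objects like Reading would need their own __eq__ for this to recurse correctly):

def __eq__(self, other):
    # compare defining attributes rather than object identity
    return (isinstance(other, Token)
            and self.text == other.text
            and self.readings == other.readings)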

add readability formulas

Log problematic tokens

Enable global logging of problematic tokens encountered during analysis.

  • tokens that have no stress information
    • exclude words that are expected to be missing stress marking
      • proper nouns
      • prepositions
  • out-of-lexicon tokens
  • etc.?

HFSTTokenizer chokes on input longer than 550(?) characters

The interactive shell (accessed using pexpect) appears to limit line lengths to roughly 550 characters (the exact number is uncertain). If longer input is given, bell characters (ASCII codepoint 7, displayed as ^G in less) are printed to the logfile, and pexpect hangs because it gets no output.

Token.stressify() sometimes returns None

Happened with им in the following sentence from robot.ref: Хо́чешь быть челове́ком - будь им. (not sure what the parameters were)

negative participles

Participles can generally be negated with не~ as in непрочитанный. The FST does not systematically include such forms.

Make ambiguous transitivity tag (+IT?)

Russian verbs do not inflect for transitivity, so having multiple readings distinguished by transitivity is grammatically inaccurate.

Transitivity tags can be helpful for the CG, so we should specify transitivity when possible, but if the transitivity is ambiguous, there should only be one reading.

Readings with `+` fail

A reading that uses + for something other than a Tag delimiter fails.

For example, trying to turn the reading ++Punct into a Reading fails.

Using a regular expression instead of str.split('+') would be very expensive.

It may be useful to outsource the actual parsing of the reading to _readify(), so that the Reading and MultiReading __init__s just have arguments for preprocessed lemma, tags, and weight.

This is an extreme edge case, so control flow should emphasize speed for typical readings.
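
A hypothetical sketch of such a fast path with an escape hatch for the edge case (names are illustrative, not the project's actual code):

def parse_reading(raw):
    lemma, _, tag_str = raw.partition('+')
    if not lemma:  # edge case like '++Punct', where the lemma itself is '+'
        lemma = '+'
        tag_str = raw[2:]
    return lemma, tag_str.split('+')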

add Corpus object

A collection of Documents. It should have methods for summarization, and possibly for different kinds of experimentation (stress, readability, etc.).

cannot import name 'CASES' from 'udar.tag'

When I try to import this package, I get an import error as follows:

>>> import udar
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jxhou/.local/lib/python3.7/site-packages/udar/__init__.py", line 9, in <module>
    from .convenience import *  # noqa: F401, F403
  File "/home/jxhou/.local/lib/python3.7/site-packages/udar/convenience.py", line 9, in <module>
    from .tag import CASES
ImportError: cannot import name 'CASES' from 'udar.tag' (/home/jxhou/.local/lib/python3.7/site-packages/udar/tag.py)

for the generator +AnIn should accept +Inan or +Anim

Generating a form that has +AnIn should work if you give it +Inan or +Anim.

Current behavior:

$ echo который+Pron+Rel+Neu+Inan+Sg+Acc | hdrus
который+Pron+Rel+Neu+Inan+Sg+Acc	который+Pron+Rel+Neu+Inan+Sg+Acc+?	inf
$ echo который+Pron+Rel+Neu+AnIn+Sg+Acc | hdrus
который+Pron+Rel+Neu+AnIn+Sg+Acc	которое	6.521484

Desired behavior:

$ echo который+Pron+Rel+Neu+Inan+Sg+Acc | hdrus
который+Pron+Rel+Neu+Inan+Sg+Acc	которое	6.521484
