reynoldsnlp / udar

License: GNU General Public License v3.0

Topics: natural-language-processing, morphological-analysis, morphological-disambiguator, learner-errors, stressed-wordforms, finite-state-machine, finite-state-transducers, finite-state-morphology, russian-language, russian-morphology

udar's Introduction

UDAR(enie)


UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.

A Python wrapper for the Russian finite-state transducer originally described in chapter 2 of my dissertation.

If you use this work in your research, please cite the following:


Reynolds, Robert J. "Russian natural language processing for computer-assisted language learning: capturing the benefits of deep morphological analysis in real-life applications" PhD Diss., UiT–The Arctic University of Norway, 2016. https://hdl.handle.net/10037/9685


Feature requests, issues, and pull requests are welcome!

Dependencies

For all features to be available, you should have hfst and vislcg3 installed as command-line utilities. Specifically, hfst is needed for FST-based tokenization, and vislcg3 is needed for grammatical disambiguation. The versions used to successfully test the code are recorded in each commit. The recommended method for installing these dependencies is as follows:

Debian / Ubuntu

$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install cg3 hfst python3-hfst

macOS (Python 3.6/3.7 only)

On macOS, one of udar's dependencies, the Python package hfst, is not currently available for Python 3.8+. Hopefully, this will be remedied soon.

$ curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash
$ python3 -m pip install hfst

Installation

This package can be installed from PyPI using the usual...

$ python3 -m pip install --user udar

...or directly from this repository using...

$ python3 -m pip install --user git+https://github.com/reynoldsnlp/udar

Introduction

NB! Documentation is currently limited to docstrings. I recommend that you use help() frequently to see how to use classes and methods. For example, to see what options are available for building a Document, try help(Document).

The most common use-case is to use the Document constructor to automatically tokenize and analyze a text. If you print() a Document object, the result is an XFST/HFST stream:

import udar
doc1 = udar.Document('Мы удивились простоте системы.')
print(doc1)
# Мы	мы+Pron+Pers+Pl1+Nom	0.000000
#
# удивились	удивиться+V+Perf+IV+Pst+MFN+Pl	5.078125
#
# простоте	простота+N+Fem+Inan+Sg+Dat	4.210938
# простоте	простота+N+Fem+Inan+Sg+Loc	4.210938
#
# системы	система+N+Fem+Inan+Pl+Acc	5.429688
# системы	система+N+Fem+Inan+Pl+Nom	5.429688
# системы	система+N+Fem+Inan+Sg+Gen	5.429688
#
# .	.+CLB	0.000000

Passing the argument disambiguate=True, or running doc1.disambiguate() after the fact, will run a Constraint Grammar to remove as many ambiguous readings as possible. This grammar is far from complete, so some ambiguous readings will remain.
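
For example, the following two approaches are equivalent (a minimal sketch; exactly which readings get removed depends on the grammar):

doc2 = udar.Document('Мы удивились простоте системы.', disambiguate=True)
# ...is equivalent to...
doc3 = udar.Document('Мы удивились простоте системы.')
doc3.disambiguate()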

Data objects

Document object

Property Type Description
text str Original text of this document
sentences List[Sentence] List of sentences in this document
num_tokens int Number of tokens in this document
features tuple udar.features.FeatureExtractor stores extracted features here

Document objects have convenient methods for adding stress or converting to phonetic transcription.

Method Return type Description
stressed str The original text of the document with stress marks
phonetic str The original text converted to phonetic transcription
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate None Disambiguate readings using the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
from_cg3 Document Create Document from VISL-CG3 format stream
hfst_str str Analysis stream in the XFST/HFST format
from_hfst Document Create Document from XFST/HFST format stream
to_dict list Convert to a complex list object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string

Examples

stressed_doc1 = doc1.stressed()
print(stressed_doc1)
# Мы́ удиви́лись простоте́ систе́мы.

ambig_doc = udar.Document('Твои слова ничего не значат.', disambiguate=True)
print(sorted(ambig_doc[1].stresses()))  # Note that слова is still ambiguous
# ['сло́ва', 'слова́']

print(ambig_doc.stressed(selection='safe'))  # 'safe' skips сло́ва and слова́
# Твои́ слова ничего́ не зна́чат.
print(ambig_doc.stressed(selection='all'))  # 'all' combines сло́ва and слова́
# Твои́ сло́ва́ ничего́ не зна́чат.
print(ambig_doc.stressed(selection='rand') in {'Твои́ сло́ва ничего́ не зна́чат.', 'Твои́ слова́ ничего́ не зна́чат.'})  # 'rand' randomly chooses between сло́ва and слова́
# True


phonetic_doc1 = doc1.phonetic()
print(phonetic_doc1)
# мы́ уд'ив'и́л'ис' пръстʌт'э́ с'ис'т'э́мы.

Sentence object

Property Type Description
doc Document "Back pointer" to the parent document of this sentence
text str Original text of this sentence
tokens List[Token] The list of tokens in this sentence
id str (optional) Sentence id, if assigned at creation

Method Return type Description
stressed str The original text of the sentence with stress marks
phonetic str The original text converted to phonetic transcription
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate None Disambiguate readings using the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
from_cg3 Sentence Create Sentence from VISL-CG3 format stream
hfst_str str Analysis stream in the XFST/HFST format
from_hfst Sentence Create Sentence from XFST/HFST format stream
to_dict list Convert to a complex list object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string
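
Sentences are reached through a Document's sentences property. For example (a minimal sketch, reusing doc1 from above):

sent = doc1.sentences[0]
print(sent.text)
# Мы удивились простоте системы.
print(len(sent.tokens))
# 5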

Token object

Property Type Description
id str The index of this token in the sentence, 1-based
text str The original text of this token
misc str Miscellaneous annotations with regard to this token
lemmas Set[str] All possible lemmas, based on remaining readings
readings List[Reading] List of readings not removed by the Constraint Grammar
removed_readings List[Reading] List of readings removed by the Constraint Grammar
deprel str The dependency relation between this word and its syntactic head. Example: ‘nmod’.

Method Return type Description
stresses Set[str] All possible stressed wordforms, based on remaining readings
stressed str The original text of the sentence with stress marks
phonetic str The original text converted to phonetic transcription
most_likely_reading Reading "Most likely" reading (may be partially random selection)
most_likely_lemmas List[str] List of lemma(s) from the "most likely" reading
transliterate str The original text converted to Romanized Cyrillic (default=Scholarly)
force_disambiguate None Fully disambiguate readings using methods other than the Constraint Grammar
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
to_dict dict Convert to a dict object
to_html str Convert to HTML with markup in data- attributes
to_json str Convert to a JSON string
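
Individual Tokens can be inspected through a Sentence's tokens list. For example (a minimal sketch, continuing from doc1 above):

tok = doc1.sentences[0].tokens[2]
print(tok.text, tok.lemmas)
# простоте {'простота'}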

Reading object

Property Type Description
subreadings List[Subreading] Usually only one subreading, but multiple subreadings are possible for complex Tokens.
lemmas List[str] Lemmas from all subreadings
grouped_tags List[Tag] The part-of-speech, morphosyntactic, semantic and other tags from all subreadings
weight str Weight indicating the likelihood of the reading, without respect to context
cg_rule str Reference to the rule in the constraint grammar that removed/selected/etc. this reading. If no action has been taken on this reading, then ''.
is_most_likely bool Indicates whether this reading has been selected as the most likely reading of its Token. Note that some selection methods may be at least partially random.

Method Return type Description
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
generate str Generate the wordform from this reading
replace_tag None Replace a tag in this reading
does_not_conflict bool Determine whether reading from external tagset (e.g. Universal Dependencies) conflicts with this reading
to_dict list Convert to a list object
to_json str Convert to a JSON string
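
For example, a Token's readings can be inspected directly (a minimal sketch; note that weights are stored as strings, per the table above, and the output shown is for the first reading of простоте above):

reading = tok.readings[0]
print(reading.lemmas, reading.weight)
# ['простота'] 4.210938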

Subreading object

Property Type Description
lemma str The lemma of the subreading
tags List[Tag] The part-of-speech, morphosyntactic, semantic and other tags
tagset Set[Tag] Same as tags, but for faster membership testing (in Reading)

Method Return type Description
cg3_str str Analysis stream in the VISL-CG3 format
hfst_str str Analysis stream in the XFST/HFST format
replace_tag None Replace a tag in this reading
to_dict dict Convert to a dict object
to_json str Convert to a JSON string

Tag object

Property Type Description
name str The name of this tag
ms_feat str Morphosyntactic feature that this tag is associated with (e.g. Dat has ms_feat CASE)
detail str Description of the tag's purpose or meaning
is_L2_error bool Whether this tag indicates a second-language learner error

Method Return type Description
info str Alias for Tag.detail
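
For example, the Tags of a reading can be inspected via grouped_tags (a minimal sketch, assuming only the attributes listed in the tables above):

for tag in reading.grouped_tags:
    print(tag.name, tag.detail)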

Convenience functions

A number of functions are included, both for convenience and to give concrete examples of using the API.

noun_distractors()

This function generates all six cases of a given noun. If the given noun is singular, then the function generates singular forms. If the given noun is plural, then the function generates plural forms. Such a list can be used in a multiple-choice exercise, hence the name distractors.

sg_paradigm = udar.noun_distractors('словом')
print(sg_paradigm == {'сло́ву', 'сло́ве', 'сло́вом', 'сло́ва', 'сло́во'})
# True

pl_paradigm = udar.noun_distractors('словах')
print(pl_paradigm == {'слова́м', 'слова́', 'слова́х', 'слова́ми', 'сло́в'})
# True

If unstressed forms are desired, simply pass the argument stressed=False.
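
For example (a minimal sketch; the expected result is the same paradigm without stress marks):

unstressed = udar.noun_distractors('словом', stressed=False)
print(unstressed == {'слову', 'слове', 'словом', 'слова', 'слово'})
# True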

diagnose_L2()

This function takes a text string as its argument and returns a dictionary of all the types of L2 errors in the text, along with examples of each error.

diag = udar.diagnose_L2('Етот малчик говорит по-русски.')
print(diag == {'Err/L2_e2je': {'Етот'}, 'Err/L2_NoSS': {'малчик'}})
# True

tag_info()

This function looks up the meaning of any tag used by the analyzer.

print(udar.tag_info('Err/L2_ii'))
# L2 error: Failure to change ending ие to ии in +Sg+Loc or +Sg+Dat, e.g. к Марие, о кафетерие, о знание

Using the transducers manually

The transducers come in two varieties: the Analyzer class and the Generator class. For memory efficiency, I recommend using the get_analyzer and get_generator functions, which ensure that each flavor of the transducers remains a singleton in memory.

Analyzer

The Analyzer can be initialized with or without analyses for second-language learner errors using the keyword L2_errors.

analyzer = udar.get_analyzer()  # by default, L2_errors is False
L2_analyzer = udar.get_analyzer(L2_errors=True)

Analyzers are callable. They take a token str and return a sequence of reading/weight tuples.

raw_readings1 = analyzer('сло́ва')
print(raw_readings1)
# (('слово+N+Neu+Inan+Sg+Gen', 5.9755859375),)

raw_readings2 = analyzer('слова')
print(raw_readings2)
# (('слово+N+Neu+Inan+Pl+Acc', 5.9755859375), ('слово+N+Neu+Inan+Pl+Nom', 5.9755859375), ('слово+N+Neu+Inan+Sg+Gen', 5.9755859375))

Generator

The Generator can be initialized in three varieties: unstressed, stressed, and phonetic.

generator = udar.get_generator()  # unstressed by default
stressed_generator = udar.get_generator(stressed=True)
phonetic_generator = udar.get_generator(phonetic=True)

Generators are callable. They take a Reading or raw reading str and return a surface form.

print(stressed_generator('слово+N+Neu+Inan+Pl+Nom'))
# слова́
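
The phonetic generator works the same way (a minimal sketch; the output uses the same transcription alphabet as Document.phonetic() above):

print(phonetic_generator('слово+N+Neu+Inan+Pl+Nom'))
# (phonetic transcription of слова́)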

Working with Tokens and Readings

You can easily check if a morphosyntactic tag is in a Token, Reading, or Subreading using in:

token2 = udar.Token('слова', analyze=True)
print(token2)
# слова [слово_N_Neu_Inan_Pl_Acc  слово_N_Neu_Inan_Pl_Nom  слово_N_Neu_Inan_Sg_Gen]

print('Gen' in token2)  # do any of the readings include Genitive case?
# True

print('слово' in token2)  # does not work for lemmas; use `in Token.lemmas`
# False

print('слово' in token2.lemmas)
# True

You can make a filtered list of a Token's readings using the following idiom:

pl_readings = [reading for reading in token2 if 'Pl' in reading]
print(pl_readings)
# [Reading(слово+N+Neu+Inan+Pl+Acc, 5.975586, ), Reading(слово+N+Neu+Inan+Pl+Nom, 5.975586, )]
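
You can drill down further into each Reading's Subreadings (a minimal sketch using the properties from the tables above; expected output is based on the readings shown):

for reading in pl_readings:
    for sub in reading.subreadings:
        print(sub.lemma, [tag.name for tag in sub.tags])
# слово ['N', 'Neu', 'Inan', 'Pl', 'Acc']
# слово ['N', 'Neu', 'Inan', 'Pl', 'Nom']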

Related projects

Finite-state tools

Russian morphological analysis

udar's People

Contributors

konnorjp, nsbum, reynoldsnlp


Forkers

nsbum

udar's Issues

collect gold-standard corpora

We need a large collection of gold-standard disambiguated Russian texts for FST/CG testing. One way or another, this will require converting tags and format to udar/CG3. Some possibilities include:

Which hfst-dev package has to be used?

Hi,

I am really interested in trying this program and am currently trying to install it.

I tried following the instructions for Debian/Ubuntu, but I got an error saying that the "hfst-dev" package doesn't exist in the package repository. Is the required library perhaps libhfst-dev?

It would also be nice if the required Python version were specified in the README, because if I understand correctly, hfst doesn't work with newer Pythons.

Have a great day!

add descriptions to each feature

Perhaps make the Feature class access the __doc__ property of self.func.

Also add some kind of summary method to the extractor so that you can print off all of the available features.

Add support for Python 3.8

Currently, the hfst package appears to be incompatible with Python 3.8. Once that dependency is updated, add Python 3.8 to tox.ini and to .github/workflows/pythonpackage.yml

stress on MWEs with multiple stresses

The lexical underlying form needs to have a persistent stress mark that survives the two-level rule that reduces stresses to the right-most one. For example,...

то есть
так как
красно-жёлтых

an example of ambiguity resolving in README

Hi!

I think an example of ambiguity resolution might be helpful. For instance:

import udar

doc1 = udar.Document('Мне недостаточно просто твоего честного слова.')
doc2 = udar.Document('Красивые слова!')
doc3 = udar.Document('Твои слова ничего не значат.')

samples = [doc1, doc2, doc3]

for doc in samples:
  doc.disambiguate()
  print(doc.stressed())

prints out

Мне́ недоста́точно про́сто твоего́ честного сло́ва.
Краси́вые слова́!
Твои́ слова ничего́ не зна́чат.

So, in the first and second sentences the ambiguity was resolved correctly, but ambiguity remains in the third one. It's also not clear that, after calling the disambiguate method, some words may remain unstressed (and no warning message is printed). At first, I tried your code with sentences where the disambiguate method doesn't change anything, and I thought this was a mistake or that the code was incomplete.

And thank you for your work!

Imperatives 1Pl

Reconsider whether to mark 1pl as imperatives. If so, then should imperfectives be marked as well? This is both a linguistic and practical question.

add argument to `Sentence.disambiguate(force=None)`

Make it possible to force disambiguation using any number of methods, such as random, weight, stanza, etc.

Using one of these methods guarantees that each token has only one reading. These methods are already part of the stressed() method, so it would make sense to abstract each one so that it can be used either for disambiguation or for simply generating a stressed wordform while leaving ambiguous readings in place.
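
A hypothetical sketch of the resulting API (the argument values here are illustrative only):

sent.disambiguate(force='weight')  # or force='random', force='stanza', etc.; force=None keeps current behavior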

add alternative output formats

This may not be possible in every case, but where possible, add other common output formats:

  • CoNLL(-X/-U)
  • mystem
  • MULTEXT-East (Sharoff et al.)
  • etc?

Lemmas declared more than once

The following code, using the lexc_parser module...

from os import environ
from sys import stderr

import lexc_parser as lp


# GTPATH is assumed here to be an environment variable pointing to a
# Giella languages checkout (it is not defined in the original snippet)
GTPATH = environ['GTPATH']
filename = GTPATH + '/langs/rus/src/morphology/lexicon.tmp.lexc'

print('Parsing lexc file...', file=stderr)
with open(filename) as f:
    src = f.read()
lexc = lp.Lexc(src)

primary_lexicons = [entry.cc.id for entry in lexc['Root']
                    if entry.cc is not None and entry.cc.id != 'Numeral']
for lex in primary_lexicons:
    lexc[lex].cc_lemmas_dict  # accessing this property warns about duplicate lemmas

...yields the following lists of lemmas that are declared more than once inside the same part of speech's LEXICON:

Parsing lexc file...
ryan.py:17: UserWarning: Lemmas declared more than once within Adverb:
{'коротко', 'наголо', 'верхом', 'чудно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Noun:
{'бронирование', 'пояс', 'колонок', 'кочан', 'ничтожество', 'судзуки', 'лекарство', 'орган', 'рондо', 'видение', 'уголь', 'туника', 'сапожок', 'пресс-релиз', 'артикул', 'соболь', 'огнеупоры', 'кондуктор', 'индустрия', 'чижик', 'вязанка', 'воздвижение', 'недвижимость', 'пулярка', 'призрак', 'козырь', 'флагман', 'цоколь', 'бакан', 'нон-стоп', 'гитлерюгенд', 'сопло', 'ширма', 'предвозвестник', 'провидение', 'болванчик', 'генсовет', 'парилка', 'пугало', 'гигант', 'тягло', 'полиграфия', 'комплекс', 'микрометр', 'мебельщик', 'характерность', 'феномен', 'пристенок', 'хаханьки', 'натура', 'наркоминдел', 'чувиха', 'пергамент', 'водолей', 'сельдь', 'ламповая', 'напряг', 'ферула', 'хиханьки', 'глюк', 'настриг', 'туркменбаши', 'пролог', 'метчик', 'обрезание', 'туфелька', 'розан', 'речушка', 'чабер', 'порсканье', 'судья', 'светоч', 'урка', 'хаос', 'проводка', 'лиганд', 'колосс', 'дочушка', 'маки', 'транспорт', 'замглавы', 'полип', 'ирис', 'угольник', 'проволочка', 'лосось', 'единица', 'червец', 'тотем', 'холодность', 'плёночка', 'картель', 'нуклеокапсид', 'жертва', 'истукан', 'предвестник', 'кашица', 'кредит', 'взрослый', 'опрощение', 'сведение', 'ужин', 'отзыв', 'русло', 'солнечник', 'ход', 'ястребок', 'префикс', 'цитокин', 'ирей', 'синтип', 'бучение', 'книговедение', 'трапезная', 'безобразность', 'край', 'чучело', 'созданьице', 'зайчик', 'рол', 'подволока', 'разлив', 'солнышко', 'креветка', 'консерваторка', 'дядя', 'прототип', 'сметливость', 'гуарани', 'субъект', 'заворот', 'видик', 'катанье', 'ведение', 'создание', 'калига', 'устрица', 'хобот', 'прослушка', 'бодяга', 'зев', 'комроты', 'отчёт', 'фрик', 'конус', 'адрес', 'котик', 'камора', 'дышло', 'плазмодий', 'марионетка', 'отправитель', 'усадьба', 'селище', 'живчик', 'лоцман', 'дублет', 'светило', 'боливар', 'мшанка', 'целение', 'юнкер', 'спутник', 'скакунок', 'дуплет', 'ордер'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Predicative:
{'чудно', 'полно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Pronoun:
{'возле', 'поперёд', 'обок', 'вне', 'внутрь', 'близь', 'помимо', 'посредине', 'напротив', 'поперёк', 'вблизи', 'посреди', 'вперёд', 'наместо', 'спереди', 'наперекор', 'подобно', 'согласно', 'насчёт', 'навроде', 'свыше', 'ниже', 'посередине', 'ради', 'позади', 'вдоль', 'под', 'чрез', 'вроде', 'вследствие', 'посредством', 'выключая', 'у', 'путём', 'касательно', 'превыше', 'накануне', 'относительно', 'вопреки', 'про', 'промежду', 'касаемо', 'около', 'над', 'из-за', 'по', 'сквозь', 'за', 'ввиду', 'соразмерно', 'противу', 'поверх', 'вовнутрь', 'наперерез', 'без', 'позадь', 'вкось', 'вослед', 'пред', 'мимо', 'сообразно', 'из-под', 'опричь', 'внизу', 'между', 'по-над', 'кроме', 'сверху', 'о', 'посередь', 'сверх', 'вкруг', 'внутри', 'промеж', 'через', 'к', 'против', 'от', 'наподобие', 'перед', 'посереди', 'сзади', 'кругом', 'на', 'включая', 'прежде', 'до', 'исключая', 'выше', 'снизу', 'соответственно', 'взамен', 'насупротив', 'для', 'из', 'округ', 'среди', 'меж', 'плюс', 'окрест', 'средь', 'с', 'благодаря', 'спустя', 'вслед', 'при', 'противно²', 'вместо', 'минус', 'вокруг', 'после', 'впереди', 'подле', 'близ', 'по-за', 'изнутри', 'супротив', 'в', 'середь'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Verb:
{'осветить', 'прояснеть', 'отползать', 'запыхаться¹', 'усугубиться', 'тикать', 'усугубить', 'икать'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Propernoun:
{'Мелани', 'Сандро', 'Филатов', 'Зощенко', 'Марго', 'Геркулесович', 'Люси', 'Симонович', 'Фениксович', 'Симон', 'Витольдович', 'Манагуа', 'Якобсон', 'Евтушенко', 'Гордон', 'Исидор', 'Терещенко', 'Геркулесовна', 'Бурденко', 'Исидорович', 'Григоренко', 'Симоновна', 'Фигаро', 'Макаренко', 'Стефанович', 'Филиппов', 'Короленко', 'Геркулес', 'Лонгин', 'Франко', 'Довженко', 'Пегасовна', 'Пегасович', 'Никарагуа', 'Лонгиновна', 'Мартиновна', 'Громыко', 'Элизабет', 'Федотов', 'Павлиновна', 'Лысенко', 'Шевченко', 'Гильфердинг', 'Павлин', 'Шульженко', 'Исаченко', 'Иванов', 'Робинсон', 'Пегас', 'Стефан', 'Мартин', 'Михалков', 'Павлинович', 'Персей', 'Стефановна', 'Семашко', 'Икария', 'Катанга', 'Мемфис', 'Лонгинович', 'Исидоровна', 'Фениксовна', 'Викторович', 'Феникс', 'Стефани', 'Персеевич', 'Новиков', 'Витольдовна', 'Мартинович', 'Любань', 'Витольд', 'Виктор', 'Нестеренко', 'Панченко', 'Гурченко', 'Обухов', 'Персеевна', 'Покров', 'Итака', 'Морган', 'Викторовна'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Punctuation:
{''}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Symbols:
{'%'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within LexicalizedParticiple:
{'положить', 'сложить'}
  lexc[lex].cc_lemmas_dict

Documentation enhancement

Wondering if the documentation might call out a handful of items. It may be obvious, but installation on macOS 10.15.x required the stanza and pexpect dependencies to be installed separately with pip3. And equally obvious, or perhaps not: stanza.download('ru') is required.
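
For reference, the steps described in this report amount to the following (assuming pip3 and Python 3 on macOS):

$ pip3 install stanza pexpect
$ python3 -c "import stanza; stanza.download('ru')"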

improve respacing

Try to find some way to preserve the spacing from the original text.

Implement equality dunders

Currently, none of the custom objects have equality dunders (def __eq__), so the following fails:

>>> import udar
>>> t1 = udar.Text('Мы говорили.')
>>> t2 = udar.Text('Мы говорили.')
>>> t1 == t2
False

Add these for all the objects for which it makes sense.
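
A hypothetical sketch of such a dunder for Token (illustration only, not the project's actual code; nested objects like Reading would need their own __eq__ for this to recurse correctly):

def __eq__(self, other):
    # compare defining attributes rather than object identity
    return (isinstance(other, Token)
            and self.text == other.text
            and self.readings == other.readings)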

add readability formulas

Log problematic tokens

Enable global logging of problematic tokens encountered during analysis.

  • tokens that have no stress information
    • exclude words that are expected to be missing stress marking
      • proper nouns
      • prepositions
  • out-of-lexicon tokens
  • etc.?

HFSTTokenizer chokes on input longer than 550(?) characters

The interactive shell (accessed using pexpect) appears to limit line lengths to roughly 550 characters (the exact number is uncertain). If longer input is given, bell characters (ASCII codepoint 7, displayed as ^G in less) are printed to the logfile, and pexpect hangs because it gets no output.

Token.stressify() sometimes returns None

Happened with им in the following sentence from robot.ref: Хо́чешь быть челове́ком - будь им. (not sure what the parameters were)

negative participles

Participles can generally be negated with не~ as in непрочитанный. The FST does not systematically include such forms.

Make ambiguous transitivity tag (+IT?)

Russian verbs do not inflect for transitivity, so having multiple readings distinguished by transitivity is grammatically inaccurate.

Transitivity tags can be helpful for the CG, so we should specify transitivity when possible, but if the transitivity is ambiguous, there should only be one reading.

Readings with `+` fail

A reading that uses + for something other than a Tag delimiter fails.

For example, trying to turn the reading ++Punct into a Reading fails.

Using a regular expression instead of str.split('+') would be very expensive.

It may be useful to outsource the actual parsing of the reading to _readify(), so that the Reading and MultiReading __init__s just have arguments for preprocessed lemma, tags, and weight.

This is an extreme edge case, so control flow should emphasize speed for typical readings.
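
A hypothetical sketch of such a fast path with an escape hatch for the edge case (names are illustrative, not the project's actual code):

def parse_reading(raw):
    lemma, _, tag_str = raw.partition('+')
    if not lemma:  # edge case like '++Punct', where the lemma itself is '+'
        lemma = '+'
        tag_str = raw[2:]
    return lemma, tag_str.split('+')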

add Corpus object

A collection of Documents. It should have methods for summarization, and possibly for different kinds of experimentation (stress, readability, etc.).

cannot import name 'CASES' from 'udar.tag'

When I try to import this package, I get an import error as follows:

>>> import udar
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jxhou/.local/lib/python3.7/site-packages/udar/__init__.py", line 9, in <module>
    from .convenience import *  # noqa: F401, F403
  File "/home/jxhou/.local/lib/python3.7/site-packages/udar/convenience.py", line 9, in <module>
    from .tag import CASES
ImportError: cannot import name 'CASES' from 'udar.tag' (/home/jxhou/.local/lib/python3.7/site-packages/udar/tag.py)

for the generator +AnIn should accept +Inan or +Anim

Generating a form that has +AnIn should work if you give it +Inan or +Anim.

Current behavior:

$ echo который+Pron+Rel+Neu+Inan+Sg+Acc | hdrus
который+Pron+Rel+Neu+Inan+Sg+Acc	который+Pron+Rel+Neu+Inan+Sg+Acc+?	inf
$ echo который+Pron+Rel+Neu+AnIn+Sg+Acc | hdrus
который+Pron+Rel+Neu+AnIn+Sg+Acc	которое	6.521484

Desired behavior:

$ echo который+Pron+Rel+Neu+Inan+Sg+Acc | hdrus
который+Pron+Rel+Neu+Inan+Sg+Acc	которое	6.521484
