tsroten / zhon


Constants used in Chinese text processing

License: MIT License

Python 100.00%


zhon's Issues

Ideographic number zero

zhon.hanzi.characters doesn't currently include U+3007, 〇. It's not a CJK Unified Ideograph, but it's present in 《现代汉语词典》and CC-CEDICT. zhon.hanzi.characters already includes some characters that aren't unified and the documentation doesn't claim that only unified characters are allowed. Instead, it's described as containing "pertinent CJK ideograph Unicode blocks". I think it's worth extending it to include U+3007.
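
A minimal sketch of what the proposed extension would change. BASE_CHARACTERS is a simplified stand-in that covers only the main CJK Unified Ideographs block; it is not the actual value of zhon.hanzi.characters, and all names here are hypothetical.

```python
import re
import unicodedata

IDEOGRAPHIC_ZERO = "\u3007"  # 〇

# Assumption: a simplified stand-in covering only U+4E00..U+9FFF,
# not the actual value of zhon.hanzi.characters.
BASE_CHARACTERS = "\u4e00-\u9fff"
EXTENDED_CHARACTERS = BASE_CHARACTERS + IDEOGRAPHIC_ZERO


def find_hanzi(text, char_class):
    """Return every character in text that belongs to char_class."""
    return re.findall("[%s]" % char_class, text)


print(unicodedata.name(IDEOGRAPHIC_ZERO))
print(find_hanzi("二〇二四年", BASE_CHARACTERS))      # 〇 is skipped
print(find_hanzi("二〇二四年", EXTENDED_CHARACTERS))  # 〇 is included
```

With the base class, a year written as 二〇二四年 loses its zero; with the extended class all five characters are found.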

Typo in README.rst - Narrow Python Builds

The sys.maxunicode value should be marked up as inline code, so that it renders as ``sys.maxunicode``.

zhon being used in practice

Found a great place to use zhon's symbol lists: parsing regular expressions out of UNIHAN.

https://github.com/cihai/unihan-etl

https://github.com/cihai/unihan-etl/blob/335441a/unihan_etl/expansion.py

Thanks for the project

Pinyin words should not include numbers

zhon.pinyin.word and its related constants should not include numbers (expressing quantity, not tone) in the regular expression pattern. While Pinyin sentences might have numbers in them, individual words should not.

README typo

In the README, Zhon is described as a "module", instead of a "package", which is a better description.

Add logging to build_string

The build_string function needs logging.

Add examples directory with working code

In order to give users an idea of how to use Zhon's constants, some examples are needed.

Make zhon not import zhon.* modules in __init__.py

Some of the zhon constants are memory intensive (e.g. CC-CEDICT constants). zhon should not automatically import its modules.

Add tests for all zhon.unicode constants

Tests are still needed for:

zhon.unicode.PINYIN
zhon.unicode.ZHUYIN
zhon.unicode.ASCII
zhon.unicode.RADICALS
zhon.unicode.PUNCTUATION

Typo in README.rst - Narrow Python Builds

When referring to the character \U00020000, tilde marks should be put around it so that the backslash is not hidden.

Why does re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.') return an empty list?

When I run re.findall('[%s]' % zhon.hanzi.characters, 'I broke a plate: 我打破了一个盘子.'), I get an empty list: []

I'm using Python 2.7.3 on Ubuntu 12.04.

Pinyin regular expressions don't support Latin alpha for a

zhon.pinyin.vowels includes the Latin alpha that is sometimes used instead of a normal a. The regular expressions should support it as well.

Pinyin.syllable truncates 'beì' (accented_syllable, and word)

Using zhon.pinyin, the Pinyin (UTF-8) syllable 'beì' is not matched correctly by the regular expression. beì is truncated to 'e', even though every other Pinyin syllable I have tried so far (~1,000) matches as expected.

Here's an example from Terminal in Mac OS X 10.11.6 and Python 2.7:

>>> import re
>>> import zhon.pinyin
>>> ln = u'南 ná 無 mó 善 shàn 臂 beì 菩薩 pú sà'
>>> pyo = re.findall(zhon.pinyin.syllable, ln)
>>> pyo
[u'n\xe1', u'm\xf3', u'sh\xe0n', u'e', u'p\xfa', u's\xe0']
                                                  ^ missing 'b' and 'ì'
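
The issue does not state a cause. One thing worth checking is where the tone mark sits: in standard Hanyu Pinyin the mark in "ei" goes on the e (bèi), whereas the input above has it on the i. The sketch below only reports which base vowel carries the mark; marked_vowel is a hypothetical helper, and whether this explains zhon's behaviour is an assumption.

```python
import unicodedata

# Combining macron, acute, caron and grave: the four Pinyin tone marks.
TONE_MARKS = "\u0304\u0301\u030c\u0300"


def marked_vowel(syllable):
    """Return the base letter that carries a tone mark, or None."""
    base = None
    # NFD splits e.g. 'ì' into 'i' followed by a combining grave accent.
    for char in unicodedata.normalize("NFD", syllable):
        if char in TONE_MARKS:
            return base
        if not unicodedata.combining(char):
            base = char
    return None


print(marked_vowel("beì"))  # mark as written in the report
print(marked_vowel("bèi"))  # standard placement
```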

Wrong zhuyin to pinyin for syllables ending with ㄨ

@mthewissen reports:

This happens when the syllable is in the first tone, i.e. it has no tone mark:

from dragonmapper import transcriptions
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄓㄨˋ'))  # works
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄔㄨ'))  # does not work

I traced things down to the following:

def _parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone of a Zhuyin syllable."""
    zhuyin_tone = unparsed_syllable[-1]
    if zhuyin_tone in zhon.zhuyin.characters:
        syllable, tone = unparsed_syllable, '1'
    elif zhuyin_tone in zhon.zhuyin.marks:
        for tone_number, tone_mark in _ZHUYIN_TONES.items():
            if zhuyin_tone == tone_mark:
                syllable, tone = unparsed_syllable[:-1], tone_number
    else:
        raise ValueError("Invalid syllable: %s" % unparsed_syllable)

    return syllable, tone

For some reason, there is no ㄨ in zhon.zhuyin.characters? (also no ㄩ)
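
For comparison, a sketch of a letter set built directly from the Unicode Bopomofo block, which does contain ㄨ (U+3128) and ㄩ (U+3129). ZHUYIN_LETTERS and ends_with_letter are hypothetical names; this is not the actual value of zhon.zhuyin.characters.

```python
# Assumption: the letters ㄅ (U+3105) through ㄩ (U+3129) of the Unicode
# Bopomofo block, not the actual value of zhon.zhuyin.characters.
ZHUYIN_LETTERS = "".join(chr(cp) for cp in range(0x3105, 0x312A))


def ends_with_letter(syllable):
    """True when the last character is a Zhuyin letter (no tone mark)."""
    return syllable[-1] in ZHUYIN_LETTERS


print(ends_with_letter("\u3114\u3128"))        # ㄔㄨ, first tone
print(ends_with_letter("\u3113\u3128\u02cb"))  # ㄓㄨˋ, ends in a tone mark
```

With a set like this, the first branch of _parse_zhuyin_syllable would be taken for ㄔㄨ instead of falling through to the ValueError.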

Typo in README.rst - Bugs/Feature Requests

The link to Zhon's GitHub issues page is not formatted correctly.

Add missing Pinyin unicode code points

There are some Pinyin code points missing that need to be accounted for.

pinyin regex matches junk

re.findall(zhon.pinyin.numbered_syllable, 'foo bar', re.IGNORECASE)
...['fo', 'o', 'ba', 'r']

re.findall(zhon.pinyin.syllable, 'foo bar', re.IGNORECASE)
... ['fo', 'o', 'ba', 'r']

The same happens with accented_syllable.

I was hoping for a way to match only pinyin within mixed English text.
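
One possible workaround, sketched with a toy syllable list rather than zhon's real pattern: wrap the syllable pattern in word boundaries so that only tokens made up entirely of syllables match. TOY_SYLLABLES and WHOLE_TOKEN are hypothetical names.

```python
import re

# Assumption: a toy list of syllables, far smaller than zhon's real pattern.
TOY_SYLLABLES = ["zhong", "hao", "wen", "ni", "fo", "ba", "o"]
SYLLABLE = "(?:%s)" % "|".join(TOY_SYLLABLES)

# Match only whole tokens that consist of one or more syllables.
WHOLE_TOKEN = re.compile(r"\b%s+\b" % SYLLABLE, re.IGNORECASE)

print(WHOLE_TOKEN.findall("foo bar ni hao"))
```

This drops partial matches such as 'ba' inside 'bar', but 'foo' still matches because 'fo' and 'o' are both valid syllables. A regular expression alone cannot tell such English words from Pinyin.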

README.rst zhon.pinyin.RE_NUMBER typo

Under the section zhon.pinyin.RE_NUMBER, "expression" is spelled wrong.

AttributeError: module 'zhon' has no attribute 'hanzi'

I get "no attribute 'hanzi'":

import zhon
zhon.hanzi
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'zhon' has no attribute 'hanzi'

I don't know why; the package seems to be installed correctly.

The same happens with the other modules.

I'm using Windows 10 with Anaconda 3.
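
A likely explanation, assuming the installed release does not import its submodules in __init__.py (as requested in the issue further up): importing a package does not load its submodules, so each one has to be imported explicitly. The sketch below reproduces this with a throwaway package; demo_pkg is a hypothetical name.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package whose __init__.py is empty, i.e. it does not
# import its own submodules.
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_pkg")
os.mkdir(package_dir)
with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
    handle.write("")
with open(os.path.join(package_dir, "hanzi.py"), "w") as handle:
    handle.write("characters = 'placeholder'\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg

before = hasattr(demo_pkg, "hanzi")  # submodule not loaded yet

import demo_pkg.hanzi

after = hasattr(demo_pkg, "hanzi")  # now bound as an attribute

print(before, after)
```

Applied to zhon, this would mean writing import zhon.hanzi (or from zhon import hanzi) instead of a bare import zhon.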

Add Pinyin RE pattern object for numbered/accented

Pinyin is not simply A-Z, a-z, 1-5. It has a certain structure to it. SYL#SYL#... There should be a constant for this.

Add Hyperlink in README.RST

In the description of Zhon, the text "RE pattern object" should be a hyperlink.

AttributeError: module 'zhon' has no attribute 'hanzi'

zhon.pinyin.syl assumes non-combining diacritics

If the string separates e.g. the ǎ in xiǎo into 'a\u030c' rather than as one codepoint '\u01ce', which is rendered identically, the regex fails. These separate diacritics occur if you use Unicode normalization NFKD.

I suppose one solution is to duplicate for each way to represent ǎ (ditto for the others), perhaps programmatically generate the two options. Or maybe this just needs a note in the docs?
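
A third option is to normalize the input to NFC before matching, which recombines the base letter and the combining mark into a single code point. A minimal sketch, with a toy pattern standing in for zhon's syllable regex:

```python
import re
import unicodedata

# Assumption: a toy pattern standing in for zhon's syllable regex; it only
# knows the precomposed form of "xiǎo".
TOY_SYLLABLE = re.compile("xi\u01ceo")

decomposed = "xia\u030co"  # 'a' followed by COMBINING CARON
composed = unicodedata.normalize("NFC", decomposed)

print(TOY_SYLLABLE.findall(decomposed))  # no match on the decomposed form
print(TOY_SYLLABLE.findall(composed))    # matches after normalization
```

This keeps the patterns unchanged and leaves normalization to the caller, which would fit the "note in the docs" approach.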

`DeprecationWarning: invalid escape sequence`

This warning shows up in pytest as of 1.1.5, on Python 3.10:

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40: DeprecationWarning: invalid escape sequence '\]'
    non_stops = """"#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153
  ~/.cache/...0/lib/python3.10/site-packages/zhon/pinyin.py:153: DeprecationWarning: invalid escape sequence '\]'
    """[%(stops)s]['"\]\}\)]*"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154: DeprecationWarning: invalid escape sequence '\-'
    ) % {'word': word, 'non_stops': non_stops.replace('-', '\-'),

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
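
A sketch of two common ways to avoid this warning: write regex escapes in raw strings, or let re.escape() do the escaping. The variable names are illustrative and this is not the actual patch applied to zhon.

```python
import re

# The same characters as in the warning above, written with valid escapes
# only ("\\" for the backslash, "\"" for the double quote).
non_stops = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~"

# A raw string keeps the backslash without triggering the warning.
escaped_hyphen = r"\-"

# re.escape() handles every special character in one go, so no manual
# replace('-', ...) is needed before building a character class.
non_stops_class = re.compile("[%s]" % re.escape(non_stops))

print(non_stops_class.findall("a-b]c\\d"))
```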

Why is \uFF0E missing?

https://github.com/tsroten/zhon/blob/09bf543696277f71de502506984661a60d24494c/zhon/hanzi.py#L30:#L33

It should be in the stops.

Update SIMPLIFIED and TRADITIONAL to allow for mapping between them

Currently, zhon.cedict.TRADITIONAL and zhon.cedict.SIMPLIFIED are strings consisting of each character that occurs in CC-CEDICT. It would be better if they contained a few duplicate characters but the characters in each constant were ordered in a way that allowed mapping between the constants.

>>> zhon.cedict.TRADITIONAL[5689]
'你'
>>> zhon.cedict.SIMPLIFIED[5689]
'你'
>>> zhon.cedict.TRADITIONAL[7899]
'國'
>>> zhon.cedict.SIMPLIFIED[7899]
'国'
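
If the two constants were aligned by index as proposed, a translation table could be built from them directly. A minimal sketch with tiny hand-made samples, not the real zhon.cedict constants:

```python
# Assumption: tiny aligned samples, not the real zhon.cedict constants.
TRADITIONAL = "你國學"
SIMPLIFIED = "你国学"

# Index-aligned strings map character by character.
to_simplified = str.maketrans(TRADITIONAL, SIMPLIFIED)
to_traditional = str.maketrans(SIMPLIFIED, TRADITIONAL)

print("你學國".translate(to_simplified))
print("你学国".translate(to_traditional))
```

Note that one simplified character can correspond to several traditional ones (发 is both 發 and 髮), so the simplified-to-traditional direction of such a table is only an approximation.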

pinyin pattern objects should not match non-pinyin characters (e.g. numbers/punctuation)

The zhon.pinyin constants should not match numbers and punctuation. They should only match valid pinyin syllables. The user can then add punctuation/whitespace constants when compiling if necessary.

《 · 〈〉﹑ ﹔ not in punctuation list of hanzi

《 · 〈〉﹑ ﹔ These symbols are missing from the hanzi punctuation list.

'r' suffix bug

There is a typo in zhon.pinyin that makes the r-suffix (as used in erhua-style Chinese) combine with its previous syllable in the regular expression. Based on phonetics alone, it should be combined to form one syllable. However, because it is represented by an additional character it is unwise to treat it as one syllable. Doing so will create problems when using zhon.pinyin to interact with Chinese characters, which is likely a common scenario for users.

For example, huar is parsed as one syllable. In hua1r5 the r-suffix is ignored altogether.