
zhon's People

Contributors

dependabot[bot], pyeden, tsroten

zhon's Issues

Ideographic number zero

zhon.hanzi.characters doesn't currently include U+3007 (〇). It's not a CJK Unified Ideograph, but it's present in 《现代汉语词典》 and CC-CEDICT. zhon.hanzi.characters already includes some characters that aren't unified, and the documentation doesn't claim that only unified characters are allowed. Instead, it's described as containing "pertinent CJK ideograph Unicode blocks". I think it's worth extending it to include U+3007.
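
As a rough sketch of the proposal, the snippet below adds U+3007 to a character class and compares the matches. The range used for BASE_CHARACTERS is only a stand-in (an assumption for illustration); it does not reproduce the real contents of zhon.hanzi.characters.

```python
import re

# Stand-in for zhon.hanzi.characters (assumption: the real constant covers
# more blocks than this single CJK Unified Ideographs range).
BASE_CHARACTERS = '\u4e00-\u9fff'

# Proposed extension: also accept U+3007 IDEOGRAPHIC NUMBER ZERO.
EXTENDED_CHARACTERS = BASE_CHARACTERS + '\u3007'

# A year written with the ideographic zero between ordinary ideographs.
text = '\u4e8c\u3007\u4e8c\u56db\u5e74'

base_matches = re.findall('[%s]' % BASE_CHARACTERS, text)
extended_matches = re.findall('[%s]' % EXTENDED_CHARACTERS, text)

print(base_matches)      # the zero is skipped
print(extended_matches)  # the zero is included
```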

Pinyin words should not include numbers

zhon.pinyin.word and its related constants should not include numbers (those expressing quantity, not tone) in the regular expression pattern. While Pinyin sentences might have numbers in them, individual words should not.

README typo

The README describes Zhon as a "module"; "package" would be a more accurate description.

Pinyin.syllable truncates 'beì' (accented_syllable, and word)

Using the zhon.pinyin module, the pinyin (UTF-8) syllable 'beì' is not matched completely by the regular expression. In fact, 'beì' gets truncated to 'e', whereas all other pinyin syllables tried so far (~1,000) match correctly.

Here's an example from Terminal in Mac OS X 10.11.6 and Python 2.7:

>>> import re
>>> import zhon.pinyin
>>> ln = u'南 ná 無 mó 善 shàn 臂 beì 菩薩 pú sà'
>>> pyo = re.findall(zhon.pinyin.syllable, ln)
>>> pyo
[u'n\xe1', u'm\xf3', u'sh\xe0n', u'e', u'p\xfa', u's\xe0']
                                                  ^ missing 'b' and 'ì'

Wrong zhuyin to pinyin for syllables ending with ㄨ

@mthewissen reports:

That is, when the syllable has the first tone:

from dragonmapper import transcriptions
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄓㄨˋ'))  # works
print(transcriptions.zhuyin_syllable_to_pinyin(u'ㄔㄨ'))  # does not work

I traced things down to the following:

def _parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone of a Zhuyin syllable."""
    zhuyin_tone = unparsed_syllable[-1]
    if zhuyin_tone in zhon.zhuyin.characters:
        syllable, tone = unparsed_syllable, '1'
    elif zhuyin_tone in zhon.zhuyin.marks:
        for tone_number, tone_mark in _ZHUYIN_TONES.items():
            if zhuyin_tone == tone_mark:
                syllable, tone = unparsed_syllable[:-1], tone_number
    else:
        raise ValueError("Invalid syllable: %s" % unparsed_syllable)

    return syllable, tone

For some reason, there is no ㄨ in zhon.zhuyin.characters? (also no ㄩ)
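
One possible workaround, sketched below, is to stop depending on the contents of zhon.zhuyin.characters and check the final character against the basic Bopomofo letter range (U+3105 to U+3129) instead. The names ZHUYIN_TONE_MARKS and parse_zhuyin_syllable are hypothetical and written for illustration; this is not dragonmapper's code, and the tone table is an assumption.

```python
# Hypothetical tone-mark table for illustration (assumption, not copied
# from dragonmapper): second, third, fourth and neutral tone marks.
ZHUYIN_TONE_MARKS = {
    '\u02ca': '2',
    '\u02c7': '3',
    '\u02cb': '4',
    '\u02d9': '5',
}


def parse_zhuyin_syllable(unparsed_syllable):
    """Return (syllable, tone) for a Zhuyin syllable with a trailing tone mark."""
    if not unparsed_syllable:
        raise ValueError("Empty syllable")
    last = unparsed_syllable[-1]
    if last in ZHUYIN_TONE_MARKS:
        # Explicit tone mark: strip it and look up its number.
        return unparsed_syllable[:-1], ZHUYIN_TONE_MARKS[last]
    if '\u3105' <= last <= '\u3129':
        # Ends in a Bopomofo letter (this range includes U+3128 and U+3129),
        # so there is no tone mark and the syllable is first tone.
        return unparsed_syllable, '1'
    raise ValueError("Invalid syllable: %s" % unparsed_syllable)


print(parse_zhuyin_syllable('\u3114\u3128'))        # first tone, no mark
print(parse_zhuyin_syllable('\u3113\u3128\u02cb'))  # fourth tone
```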

pinyin regex matches junk

re.findall(zhon.pinyin.numbered_syllable, 'foo bar', re.IGNORECASE)
...['fo', 'o', 'ba', 'r']

re.findall(zhon.pinyin.syllable, 'foo bar', re.IGNORECASE)
... ['fo', 'o', 'ba', 'r']

The same happens with accented_syllable.

I was hoping for a way to match only pinyin within mixed English text.

AttributeError: module 'zhon' has no attribute 'hanzi'

After importing zhon, accessing zhon.hanzi fails with an AttributeError:

import zhon
zhon.hanzi
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'zhon' has no attribute 'hanzi'

I don't know why; the package seems to have installed correctly.

The same happens with the others.

I am using Windows 10 with Anaconda 3.
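
A likely explanation, offered here as an assumption rather than a confirmed diagnosis, is that `import zhon` loads only the top-level package, and Python does not load submodules unless they are imported explicitly (for example with `import zhon.hanzi`). The sketch below shows explicit submodule loading through importlib; the helper name import_submodule is hypothetical, and a standard-library package is used so the snippet runs without zhon installed.

```python
import importlib


def import_submodule(package, name):
    """Import package.name explicitly and return the submodule object."""
    return importlib.import_module(package + "." + name)


# With zhon installed, the equivalent call would be:
#     hanzi = import_submodule("zhon", "hanzi")
# Demonstrated with a standard-library package so the sketch runs anywhere.
decoder = import_submodule("json", "decoder")
print(decoder.__name__)
```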

zhon.pinyin.syl assumes non-combining diacritics

If a string represents, e.g., the ǎ in xiǎo as 'a\u030c' (base letter plus combining caron) rather than as the single code point '\u01ce', which renders identically, the regex fails. Such combining diacritics appear when text has been normalized with NFKD (or NFD).

I suppose one solution is to duplicate each pattern so that it covers both representations of ǎ (and likewise for the other accented vowels), perhaps generating the two options programmatically. Or maybe this just needs a note in the docs?
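
A third option on the caller's side is to normalize input to NFC before matching, so that combining sequences are folded back into single code points. The sketch below uses only the standard library; the helper name to_composed is hypothetical.

```python
import unicodedata


def to_composed(text):
    """Fold base-letter-plus-combining-mark sequences into single code points."""
    return unicodedata.normalize('NFC', text)


decomposed = 'xia\u030co'          # 'a' followed by U+030C COMBINING CARON
composed = to_composed(decomposed)

print(len(decomposed), len(composed))
print(composed == 'xi\u01ceo')
```

After this step, a pattern written against precomposed characters such as '\u01ce' can be applied to the normalized string.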

`DeprecationWarning: invalid escape sequence`

pytest reports the following warnings as of zhon 1.1.5 on Python 3.10:

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40: DeprecationWarning: invalid escape sequence '\]'
    non_stops = """"#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153
  ~/.cache/...0/lib/python3.10/site-packages/zhon/pinyin.py:153: DeprecationWarning: invalid escape sequence '\]'
    """[%(stops)s]['"\]\}\)]*"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154: DeprecationWarning: invalid escape sequence '\-'
    ) % {'word': word, 'non_stops': non_stops.replace('-', '\-'),

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
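
The usual remedy for this class of warning is to write such literals as raw strings, which keeps the backslashes literally and leaves the resulting values unchanged. The sketch below applies that to the two simpler literals quoted in the warnings; it is an illustration of the technique, not the project's actual patch.

```python
# Raw-string version of the literal from pinyin.py line 40. The value is the
# same as before: the backslash before ']' was already kept literally.
non_stops = r""""#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

# Raw-string version of the replacement from line 154: escape the hyphen so
# it is not read as a range when placed inside a character class.
escaped_non_stops = non_stops.replace('-', r'\-')

print(non_stops)
print(escaped_non_stops)
```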

Update SIMPLIFIED and TRADITIONAL to allow for mapping between them

Currently, zhon.cedict.TRADITIONAL and zhon.cedict.SIMPLIFIED are strings consisting of each character that occurs in CC-CEDICT. It would be better if they contained a few duplicate characters but the characters in each constant were ordered in a way that allowed mapping between the constants.

>>> zhon.cedict.TRADITIONAL[5689]
'你'
>>> zhon.cedict.SIMPLIFIED[5689]
'你'
>>> zhon.cedict.TRADITIONAL[7899]
'國'
>>> zhon.cedict.SIMPLIFIED[7899]
'国'
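
If the two constants were index-aligned as proposed, a character-level mapping could be built directly from them. The sketch below uses short made-up strings in place of the real constants (an assumption for illustration) and the built-in str.maketrans.

```python
# Tiny index-aligned stand-ins for the proposed constants (hypothetical
# values; the real zhon.cedict constants are far longer).
TRADITIONAL = '\u4f60\u570b\u5b78'   # 你 國 學
SIMPLIFIED = '\u4f60\u56fd\u5b66'    # 你 国 学

# Because position i in one string corresponds to position i in the other,
# translation tables can be built in both directions.
to_simplified = str.maketrans(TRADITIONAL, SIMPLIFIED)
to_traditional = str.maketrans(SIMPLIFIED, TRADITIONAL)

print('\u5b78\u570b'.translate(to_simplified))
print('\u5b66\u56fd'.translate(to_traditional))
```

A one-to-one table like this cannot express cases where one simplified character corresponds to several traditional ones, so it only illustrates the aligned-index idea.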

'r' suffix bug

There is a typo in zhon.pinyin that makes the r-suffix (as used in erhua-style Chinese) combine with its previous syllable in the regular expression. Based on phonetics alone, it should be combined to form one syllable. However, because it is represented by an additional character it is unwise to treat it as one syllable. Doing so will create problems when using zhon.pinyin to interact with Chinese characters, which is likely a common scenario for users.

For example, huar is parsed as one syllable. In hua1r5 the r-suffix is ignored altogether.
