tsroten / dragonmapper

Identification and conversion functions for Chinese text processing

License: MIT License

Python 100.00%

dragonmapper's Issues

AttributeError: 'module' object has no attribute 'hanzi'

Hey! I'm trying to use dragonmapper and I'm getting a confusing error. I installed it on Ubuntu 16.04 with the following command:

pip install dragonmapper --user

Then tried to use it:

import dragonmapper
dragonmapper.hanzi.is_simplified(u'你好')

but get the following error:
AttributeError: 'module' object has no attribute 'hanzi'

python --version outputs:
Python 2.7.12

What's up?
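
One possible explanation, offered as an assumption because the report does not show the package's `__init__.py`: in Python, `import package` does not load the package's submodules unless `__init__.py` imports them itself. The sketch below reproduces that behaviour with a throwaway package; the name `demo_pkg` and its contents are hypothetical and are not part of dragonmapper.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway package with an empty __init__.py and one submodule.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg_dir, "hanzi.py"), "w") as f:
    f.write("def is_simplified(s):\n    return True\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

demo_pkg = importlib.import_module("demo_pkg")
# The submodule has not been imported yet, so it is not an attribute.
available_before = hasattr(demo_pkg, "hanzi")

importlib.import_module("demo_pkg.hanzi")
# Importing the submodule binds it as an attribute of the package.
available_after = hasattr(demo_pkg, "hanzi")

print(available_before, available_after)
```

If that is what is happening here, importing the submodule explicitly with `from dragonmapper import hanzi`, the form used in other reports on this page, should avoid the error.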

Wrong to_pinyin

print hanzi.to_pinyin(u'绿')

outputs l̈ù
but should be lǜ

Numbered Pinyin issues encountered in CEDICT

Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.

Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are correct. Many of my suggested cleanup edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave them up for discussion.

Issues

Numbered Pinyin do not convert to Accented
Accented pinyin which do not convert to zhuyin fuhao
Already noted in issue 27
Taiwanese pronunciation exceptions

Numbered Pinyin do not convert to Accented Pinyin

More than 2000 entries in CEDICT contain 'u:' combinations, and a further 5 entries in total use 'yo1' or 'yo5'. Dragonmapper cannot convert any of these from numbered pinyin to accented pinyin. I found it necessary to loop through the following replacements in this order:

'u:4', 'ǜ'
'u:3', 'ǚ'
'u:2', 'ǘ'
'u:1', 'ǖ'
'u:', 'ü'
'yo1', 'yō'
'yo5', 'yo'

These items raise 'ValueError: Not a valid syllable:' exceptions.
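
The workaround described above can be sketched roughly as follows. The function name `preprocess_cedict_pinyin` is made up for illustration; only the replacement pairs and their order come from the list in this report.

```python
# Replacement pairs from the list above. Order matters: the bare 'u:' must come
# after the toned forms, otherwise 'u:4' would be rewritten to 'ü4' before the
# toned replacement could match.
REPLACEMENTS = [
    ("u:4", "\u01dc"),   # ǜ
    ("u:3", "\u01da"),   # ǚ
    ("u:2", "\u01d8"),   # ǘ
    ("u:1", "\u01d6"),   # ǖ
    ("u:", "\u00fc"),    # ü
    ("yo1", "y\u014d"),  # yō
    ("yo5", "yo"),
]


def preprocess_cedict_pinyin(text):
    """Apply the replacements in order (hypothetical helper, not a dragonmapper API)."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


print(preprocess_cedict_pinyin("lu:4"))
print(preprocess_cedict_pinyin("lu:e4"))
```

Syllables such as 'lu:e4' keep their tone number after this step, because only the 'u:' part is rewritten.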

Accented pinyin which do not convert to zhuyin fuhao

I also encountered the following items which do not convert correctly:

'ó':'ㄛˊ' # 哦哦 [o2] /oh (interjection indicating doubt or surprise)/
'ò':'ㄛˋ' # 哦哦 [o4] /oh (interjection indicating that one has just learned sth)/
'ō':'ㄛ'
'ǒ':'ㄛˇ'
'yō':'ㄧㄛ'
'yo':'ㄧㄛ˙'
'dia3':'ㄉㄧㄚˇ' # diǎ 嗲嗲 [dia3] /coy/childish/
'm2':'ㄇˊ'
'm4':'ㄇˋ'

Already noted in issue 27

#27

'tēi':'ㄊㄨㄟ' # Workaround for 忒忒 [tei1] /(dialect) too/very/also pr. [tui1]/
'eng1':'ㄥ' # Work around for ēng 鞥鞥 [eng1] /reins/

Taiwanese Pronunciation Exceptions

I found it necessary to skip items which contained Taiwanese pronunciations of ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this with Dragonmapper.
dragonmapper.hanzi.to_zhuyin('goá')
Results in a 'ValueError: Not a valid syllable: o5'

ERROR on "shai"'s IPA

Every other Pinyin syllable containing "ai" has its IPA mapped to "aɪ"; only "shai" does not.

accented_to_numbered fails with uppercase letters

accented_to_numbered() returns the original syllable with a '5' added to the end instead of actually deciphering the tone if the vowel with the diacritic is uppercase. This requires a simple fix that involves using _lower_case().
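
As an illustration of case-insensitive tone detection, here is a minimal sketch that uses Unicode decomposition instead of the `_lower_case()` fix mentioned above. It is a different technique from the one the library uses, and `accented_syllable_to_numbered` is a hypothetical stand-alone function, not dragonmapper's implementation.

```python
import unicodedata

# Combining marks used for Pinyin tones 1-4 (macron, acute, caron, grave).
COMBINING_TONES = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def accented_syllable_to_numbered(syllable):
    """Strip the tone mark from one syllable and append the tone number."""
    # NFD splits 'Ǒ' into 'O' + caron, so upper and lower case behave the same.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "5"  # no tone mark found means neutral tone
    kept = []
    for ch in decomposed:
        if ch in COMBINING_TONES:
            tone = COMBINING_TONES[ch]
        else:
            kept.append(ch)
    # NFC recombines what is left, e.g. 'u' + diaeresis back into 'ü'.
    return unicodedata.normalize("NFC", "".join(kept)) + tone


print(accented_syllable_to_numbered("\u00c0n"))  # input is 'Àn'
```

Because the tone is read from the combining mark rather than from a table of lowercase vowels, an uppercase accented vowel needs no special handling.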

Tests in the test suite fail

When packaging this package for openSUSE/Tumbleweed (and I have to admit I know almost nothing about the Chinese alphabet, being a Czech myself) the test suite started to fail:

[  103s] =================================== FAILURES ===================================
[  103s] _____________________ TestIdentifyFunctions.test_identify ______________________
[  103s] 
[  103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_identify>
[  103s] 
[  103s]     def test_identify(self):
[  103s] >       self.assertEqual(trans.identify(self.numbered_pinyin), trans.PINYIN)
[  103s] E       AssertionError: 0 != 1
[  103s] 
[  103s] dragonmapper/tests/test_transcriptions.py:19: AssertionError
[  103s] ______________________ TestIdentifyFunctions.test_is_ipa _______________________
[  103s] 
[  103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_is_ipa>
[  103s] 
[  103s]     def test_is_ipa(self):
[  103s] >       self.assertTrue(trans.is_ipa(self.ipa))
[  103s] E       AssertionError: False is not true
[  103s] 
[  103s] dragonmapper/tests/test_transcriptions.py:40: AssertionError
[  103s] _____________________ TestIdentifyFunctions.test_is_pinyin _____________________
[  103s] 
[  103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_is_pinyin>
[  103s] 
[  103s]     def test_is_pinyin(self):
[  103s] >       self.assertTrue(trans.is_pinyin(self.numbered_pinyin))
[  103s] E       AssertionError: False is not true
[  103s] 
[  103s] dragonmapper/tests/test_transcriptions.py:26: AssertionError
[  103s] _______________ TestIdentifyFunctions.test_is_pinyin_compatible ________________
[  103s] 
[  103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_is_pinyin_compatible>
[  103s] 
[  103s]     def test_is_pinyin_compatible(self):
[  103s]         self.assertFalse(trans.is_pinyin_compatible(self.ipa))
[  103s] >       self.assertTrue(trans.is_pinyin_compatible(self.numbered_pinyin))
[  103s] E       AssertionError: False is not true
[  103s] 
[  103s] dragonmapper/tests/test_transcriptions.py:48: AssertionError
[  103s] ________________ TestConvertFunctions.test_accented_to_numbered ________________
[  103s] 
[  103s] self = <dragonmapper.tests.test_transcriptions.TestConvertFunctions testMethod=test_accented_to_numbered>
[  103s] 
[  103s]     def test_accented_to_numbered(self):
[  103s] >       numbered_pinyin = trans.to_pinyin(self.accented_pinyin, accented=False)
[  103s] 
[  103s] dragonmapper/tests/test_transcriptions.py:75: 
[  103s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[  103s] 
[  103s] s = 'Wǒ shì yīgè měiguórén.', accented = False
[  103s] 
[  103s]     def to_pinyin(s, accented=True):
[  103s]         """Convert *s* to Pinyin.
[  103s]     
[  103s]         If *accented* is ``True``, diacritics are added to the Pinyin syllables. If
[  103s]         it's ``False``, numbers are used to indicate tone.
[  103s]     
[  103s]         """
[  103s]         identity = identify(s)
[  103s]         if identity == PINYIN:
[  103s]             if _has_accented_vowels(s):
[  103s]                 return s if accented else accented_to_numbered(s)
[  103s]             else:
[  103s]                 return numbered_to_accented(s) if accented else s
[  103s]         elif identity == ZHUYIN:
[  103s]             return zhuyin_to_pinyin(s, accented=accented)
[  103s]         elif identity == IPA:
[  103s]             return ipa_to_pinyin(s, accented=accented)
[  103s]         else:
[  103s] >           raise ValueError("String is not a valid Chinese transcription.")
[  103s] E           ValueError: String is not a valid Chinese transcription.
[  103s] 
[  103s] dragonmapper/transcriptions.py:435: ValueError
[  103s] ________________ TestConvertFunctions.test_numbered_to_accented ________________
[  103s] 
[  103s] self = <dragonmapper.tests.test_transcriptions.TestConvertFunctions testMethod=test_numbered_to_accented>
[  103s] 
[  103s]     def test_numbered_to_accented(self):
[  103s] >       accented_pinyin = trans.to_pinyin(self.numbered_pinyin)
[  103s] 
[  103s] dragonmapper/tests/test_transcriptions.py:71: 
[  103s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[  103s] 
[  103s] s = 'Wo3 shi4 yi1ge4 mei3guo2ren2.', accented = True
[  103s] 
[  103s]     def to_pinyin(s, accented=True):
[  103s]         """Convert *s* to Pinyin.
[  103s]     
[  103s]         If *accented* is ``True``, diacritics are added to the Pinyin syllables. If
[  103s]         it's ``False``, numbers are used to indicate tone.
[  103s]     
[  103s]         """
[  103s]         identity = identify(s)
[  103s]         if identity == PINYIN:
[  103s]             if _has_accented_vowels(s):
[  103s]                 return s if accented else accented_to_numbered(s)
[  103s]             else:
[  103s]                 return numbered_to_accented(s) if accented else s
[  103s]         elif identity == ZHUYIN:
[  103s]             return zhuyin_to_pinyin(s, accented=accented)
[  103s]         elif identity == IPA:
[  103s]             return ipa_to_pinyin(s, accented=accented)
[  103s]         else:
[  103s] >           raise ValueError("String is not a valid Chinese transcription.")
[  103s] E           ValueError: String is not a valid Chinese transcription.
[  103s] 
[  103s] dragonmapper/transcriptions.py:435: ValueError
[  103s] =============================== warnings summary ===============================
[  103s] dragonmapper/transcriptions.py:493
[  103s]   /home/abuild/rpmbuild/BUILD/dragonmapper-0.2.6/dragonmapper/transcriptions.py:493: DeprecationWarning: invalid escape sequence \s
[  103s]     re_pattern = '(?:%(syllable)s|\s)+' % {'syllable': zhon.zhuyin.syl}
[  103s] 
[  103s] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
[  103s] =========================== short test summary info ============================
[  103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_identify
[  103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_is_ipa
[  103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_is_pinyin
[  103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_is_pinyin_compatible
[  103s] FAILED dragonmapper/tests/test_transcriptions.py::TestConvertFunctions::test_accented_to_numbered
[  103s] FAILED dragonmapper/tests/test_transcriptions.py::TestConvertFunctions::test_numbered_to_accented
[  103s] =================== 6 failed, 24 passed, 1 warning in 0.54s ====================

Complete build log with all packages installed and steps taken to reproduce.

The difference may be caused by upgrades to the packages this one depends on: hanzidentifier-1.1.0 and zhon-2.0.2.
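
Separately from the six failures, the DeprecationWarning in the log points at a non-raw string containing `\s`. A minimal sketch of the usual remedy, a raw string literal, is below. `SYLLABLE` is a simplified stand-in for `zhon.zhuyin.syl` (an assumption, since that pattern is not reproduced in this report), and this change is not expected to fix the failing tests.

```python
import re

# Simplified stand-in for zhon.zhuyin.syl (assumption): one or more Bopomofo
# letters followed by an optional tone mark.
SYLLABLE = "[\u3105-\u3129]+[\u02ca\u02c7\u02cb\u02d9]?"

# Raw string: the backslash in \s reaches the regex engine unchanged and no
# invalid-escape warning is emitted.
re_pattern = r'(?:%(syllable)s|\s)+' % {'syllable': SYLLABLE}

# Two Zhuyin syllables separated by a space should match the whole pattern.
matches = re.fullmatch(re_pattern, "\u310b\u3127\u02c7 \u310f\u3120\u02c7") is not None
print(matches)
```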

numbered_to_accented doesn't add apostrophes

numbered_to_accented() should add apostrophes before syllables that start with vowels. Apostrophes aren't needed with numbered pinyin (even though they probably technically should have them), but when converting to diacritic pinyin, they should definitely be there.
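
A rough sketch of the apostrophe rule, assuming the input has already been split into syllables. In standard Hanyu Pinyin orthography the apostrophe goes before a non-initial syllable that begins with a, e, or o. The helper names below are hypothetical and not part of dragonmapper.

```python
import unicodedata


def starts_with_a_e_o(syllable):
    """True if the syllable begins with a, e, or o, with or without a tone mark."""
    # NFD exposes the base letter of an accented vowel such as 'ā'.
    base = unicodedata.normalize("NFD", syllable)[0].lower()
    return base in ("a", "e", "o")


def join_syllables(syllables):
    """Join non-empty syllables into one word, adding apostrophes where needed."""
    word = ""
    for index, syllable in enumerate(syllables):
        if index > 0 and starts_with_a_e_o(syllable):
            word += "'"
        word += syllable
    return word


print(join_syllables(["xi", "an"]))
print(join_syllables(["ni", "hao"]))
```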

accented_to_numbered() not working for 'ai' syllable

print(accented_to_numbered("aī"))
print(accented_to_numbered("aí"))
print(accented_to_numbered("aǐ"))
print(accented_to_numbered("aì"))
print(accented_to_numbered("ai"))

produces:

a5ī
a5í
a5ǐ
a5ì
ai5

empty string returned from hanzi.to_pinyin()

I'm trying to use dragonmapper to convert characters to pinyin, and I'm trying the tutorial but I'm stuck.
http://dragonmapper.readthedocs.org/en/latest/tutorial.html

from dragonmapper import hanzi
s = '这个字怎么念？' 
pinyin = hanzi.to_pinyin(s)

At this point pinyin is an empty string u''. What am I doing wrong?

Wrong Pinyin/ No Zhuyin for 奧地利

# coding: UTF-8

from dragonmapper import hanzi, transcriptions, __version__
print __version__

au = u'奧'
print hanzi.to_pinyin(au)
print hanzi.to_pinyin(au, accented=False)
print hanzi.to_zhuyin(au)

austria = u'奧地利'
print hanzi.to_pinyin(austria, accented=True)
print hanzi.to_pinyin(austria, accented=False)
print hanzi.to_zhuyin(austria)

Outputs:

0.2.3
ào
ao4
ㄠˋ
Àodìlì
Ào5di4li4
Traceback (most recent call last):
  File "<filepath>", line 13, in <module>
    print hanzi.to_zhuyin(austria)
  File "build/bdist.linux-x86_64/egg/dragonmapper/hanzi.py", line 190, in to_zhuyin
  File "build/bdist.linux-x86_64/egg/dragonmapper/transcriptions.py", line 365, in pinyin_to_zhuyin
  File "build/bdist.linux-x86_64/egg/dragonmapper/transcriptions.py", line 341, in _convert
  File "build/bdist.linux-x86_64/egg/dragonmapper/transcriptions.py", line 229, in pinyin_syllable_to_zhuyin
ValueError: Not a valid syllable: o5

(using the developer branch)

hanzi.to_pinyin delimiter is ignored

Summary

The delimiter parameter to to_pinyin() has no effect

Example:

hanzi.to_pinyin("我猕猴桃过敏。", delimiter='.')
# ACTUAL OUTPUT:
#     'wǒmíhóutáoguòmǐn。'

# EXPECTED OUTPUT:
#     'wǒ.míhóutáo.guòmǐn。'

The default delimiter, a single space ' ', is not applied either:

hanzi.to_pinyin("我猕猴桃过敏。")
# ACTUAL OUTPUT:
#     'wǒmíhóutáoguòmǐn。'

# EXPECTED OUTPUT:
#     'wǒ míhóutáo guòmǐn。'

"eng" and "tei" missing in transcriptions.csv

The following entries are missing in transcriptions.csv:

eng,ㄥ,ŋ
tei,ㄊㄟ,tʰeɪ

I discovered them by trying to convert all of CC-CEDICT to zhuyin. I'm also not certain whether I used the correct IPA to transcribe the pinyin and zhuyin.

Example entries from CEDICT using those syllables:
忒忒 [tei1] /(dialect) too/very/also pr. [tui1]/
鞥鞥 [eng1] /reins/

Using Pleco I verified that those are not mistakes in CEDICT and that these should probably be added to dragonmapper.

Typo in try statements: IndexError

Multiple functions in dragonmapper.transcriptions handle exceptions for IndexError instead of KeyError like they should. It's a simple typo that can be easily fixed.
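
To illustrate why the exception type matters, here is a small sketch with a hypothetical one-entry mapping standing in for the library's transcription table. A failed dictionary lookup raises KeyError, so a handler written for IndexError never runs.

```python
# Hypothetical miniature of a syllable table; the real data lives in the library.
PINYIN_MAP = {"ni": {"Zhuyin": "\u310b\u3127"}}


def lookup_buggy(syllable):
    try:
        return PINYIN_MAP[syllable.lower()]["Zhuyin"]
    except IndexError:
        # Never reached for a missing key: dict lookups raise KeyError.
        raise ValueError("Not a valid syllable: %s" % syllable)


def lookup_fixed(syllable):
    try:
        return PINYIN_MAP[syllable.lower()]["Zhuyin"]
    except KeyError:
        raise ValueError("Not a valid syllable: %s" % syllable)


for func in (lookup_buggy, lookup_fixed):
    try:
        func("xx")
    except Exception as exc:
        print(func.__name__, "raised", type(exc).__name__)
```

With the wrong exception type the caller sees a raw KeyError instead of the intended ValueError.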

Pinyin/Zhuyin/IPA syllable `o` is missing.

>>> print(transcriptions.pinyin_to_zhuyin(ó))
Traceback (most recent call last):
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 100, in <module>
    print(printBopomofo(eachLine)+"\n"*3)
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 42, in printBopomofo
    bopomofoDictionary=makeToneDictionary(hanzistring2)
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 35, in makeToneDictionary
    bopomofoList=listBopomofo(hanzi)
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 11, in listBopomofo
    print(transcriptions.pinyin_to_zhuyin(ó))
NameError: name 'ó' is not defined

also the following string returns an error

 (1 4 事情是這樣的 ， 父親讀到也看到許多偉大而奇妙的事時 ， 他向主高呼許多事 ， 諸如 ： 哦 ， 主神全能者 ， 您的事工多麼偉大而奇妙 ！ 您的寶座在高天之上 ， 您的大能、良善和慈悲廣被世上全民 ， 而且 ， 由於您的慈悲 ， 您不會讓歸向您的人滅亡 ！ )

>>> pinyin=hanzi.to_zhuyin(hanzistring)
Traceback (most recent call last):
  File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 227, in pinyin_syllable_to_zhuyin
    zhuyin_syllable = _PINYIN_MAP[pinyin_syllable.lower()]['Zhuyin']
KeyError: 'o'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 100, in <module>
    print(printBopomofo(eachLine)+"\n"*3)
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 42, in printBopomofo
    bopomofoDictionary=makeToneDictionary(hanzistring2)
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 35, in makeToneDictionary
    bopomofoList=listBopomofo(hanzi)
  File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 12, in listBopomofo
    pinyin=hanzi.to_zhuyin(hanzistring)
  File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\hanzi.py", line 190, in to_zhuyin
    zhuyin = pinyin_to_zhuyin(numbered_pinyin)
  File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 365, in pinyin_to_zhuyin
    remove_apostrophes=True, separate_syllables=True)
  File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 341, in _convert
    new += syllable_function(match.group())
  File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 229, in pinyin_syllable_to_zhuyin
    raise ValueError('Not a valid syllable: %s' % s)
ValueError: Not a valid syllable: o2

Wrong IPA for yong & you in transcriptions.csv

Correct IPA transcriptions:

yong  -> iʊŋ
you   -> ioʊ

numbered_to_accented incorrectly converts 'v' vowel

numbered_to_accented() leaves the 'v' vowel untouched because it automatically returns it without processing it. The function that replaces 'v' with '\u00fc' needs to come before the check for unaccented vowels.

Pinyin to Zhuyin/IPA conversion fails with -r suffix

When converting a Pinyin syllable that ends with the -r suffix to Zhuyin/IPA, a ValueError is raised because the transcription mapping data is lacking the 'r' syllable. In IPA 'ɻ' should be used for the -r suffix. In Zhuyin, 'ㄦ' should be used.

Wrong zhuyin to pinyin for syllables ending with ㄨ

That is, when they are in the first tone:

from dragonmapper import hanzi, transcriptions
print transcriptions.zhuyin_syllable_to_pinyin(u'ㄓㄨˋ') # works
print transcriptions.zhuyin_syllable_to_pinyin(u'ㄔㄨ') # does not work

I traced things down to the following:

def _parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone of a Zhuyin syllable."""
    zhuyin_tone = unparsed_syllable[-1]
    if zhuyin_tone in zhon.zhuyin.characters:
        syllable, tone = unparsed_syllable, '1'
    elif zhuyin_tone in zhon.zhuyin.marks:
        for tone_number, tone_mark in _ZHUYIN_TONES.items():
            if zhuyin_tone == tone_mark:
                syllable, tone = unparsed_syllable[:-1], tone_number
    else:
        raise ValueError("Invalid syllable: %s" % unparsed_syllable)

    return syllable, tone

For some reason, there is no ㄨ in zhon.zhuyin.characters? (also no ㄩ)
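
One way to sidestep the dependency on `zhon.zhuyin.characters` might be to check for a tone mark first and otherwise test the last character against the basic Bopomofo code point range. This is only a sketch: the tone table and the range check are assumptions written for this example, not the library's code, and it assumes the neutral-tone dot is written after the syllable.

```python
# Zhuyin tone marks mapped to tone numbers (tone 1 is unmarked).
ZHUYIN_TONE_MARKS = {"\u02ca": "2", "\u02c7": "3", "\u02cb": "4", "\u02d9": "5"}


def is_bopomofo_letter(char):
    """True for the basic Bopomofo letters U+3105 (ㄅ) through U+3129 (ㄩ)."""
    return "\u3105" <= char <= "\u3129"


def parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone number of a Zhuyin syllable."""
    if not unparsed_syllable:
        raise ValueError("Invalid syllable: %r" % unparsed_syllable)
    last = unparsed_syllable[-1]
    if last in ZHUYIN_TONE_MARKS:
        return unparsed_syllable[:-1], ZHUYIN_TONE_MARKS[last]
    if is_bopomofo_letter(last):
        # No tone mark at the end means first tone.
        return unparsed_syllable, "1"
    raise ValueError("Invalid syllable: %s" % unparsed_syllable)


print(parse_zhuyin_syllable("\u3114\u3128"))  # input is 'ㄔㄨ'
```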

dragonmapper.hanzi.to_pinyin('战略') = zhànlüÈ, not zhànlüè

Seems to be a problem with all 'lüè'. Don't know if it's just my code but it seems to be a dragonmapper issue.

‘嗲’ causes error "ValueError: Not a valid syllable: dia3"

hanzi.to_zhuyin('嗲')
ValueError: Not a valid syllable: dia3

I have it fixed in my HTML branch, but I can submit a separate PR if you want :-)
Just let me know.

yo is missing in transcriptions.csv

哟，他怎么来了？
Yó，tā zěnme lái le？
Oh, how did he get here?

Transcription of yo

yo,ㄧㄛ,iɔ

UnicodeDecodeError when installing on Python 3 under Windows

No problems with Python 2 on Windows, but installing for 3.4 gives the following traceback:

Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "C:\Users\Gal\AppData\Local\Temp\pip_build_Gal\dragonmapper\setup.py", line 24, in <module>
    readme = open_file('README.rst')
  File "C:\Users\Gal\AppData\Local\Temp\pip_build_Gal\dragonmapper\setup.py", line 22, in open_file
    return f.read()
  File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1282: character maps to <undefined>
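
The traceback shows README.rst being decoded with cp1252, the platform default on that machine. Assuming the file is actually UTF-8, a helper along these lines would read it the same way on every platform. The body of `open_file` here is a guess at a fix, not the project's actual setup.py, and the demo at the end uses a temporary file rather than the real README.

```python
import io
import os
import tempfile


def open_file(filename):
    """Read a text file as UTF-8 regardless of the platform's default encoding."""
    with io.open(filename, encoding="utf-8") as f:
        return f.read()


# Demo: write UTF-8 bytes that cp1252 cannot decode, then read them back.
sample = "\u4e2d\u6587 text"
path = os.path.join(tempfile.mkdtemp(), "README.rst")
with io.open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

roundtrip = open_file(path)
print(roundtrip == sample)
```

`io.open` accepts the `encoding` argument on both Python 2 and Python 3, which matters because the report says Python 2 is also in use.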

weird output from to_pinyin

The following code

print dragonmapper.hanzi.to_pinyin(u'女')
print dragonmapper.hanzi.to_pinyin(u'女人')

outputs:

n̈ǔ
nǚren

where I think it should be:

nǚ
nǚren

Invalid pinyin for 手 (shǒu) ...

import dragonmapper.hanzi as hz

In [1]: hz.to_pinyin('收')
Out[1]: 'shoū'

In [2]: hz.to_pinyin('手')
Out[2]: 'shoǔ'

In [62]: dragonmapper.__version__
Out[62]: '0.2.3'

python 3.4

Virtualenv in Ubuntu Linux Trusty
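
For reference, the usual Pinyin rule puts the tone mark on a or e if present, on the o in "ou", and otherwise on the last vowel, which gives shōu and shǒu rather than the output above. A minimal sketch of that rule for a single numbered syllable follows; the function names are hypothetical and this is not how dragonmapper is implemented.

```python
import unicodedata

# Combining marks for tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}
VOWELS = "aeiou\u00fc"


def tone_mark_index(body):
    """Return the index of the vowel that should carry the tone mark."""
    lower = body.lower()
    for vowel in "ae":
        if vowel in lower:
            return lower.index(vowel)
    if "ou" in lower:
        return lower.index("ou")
    for index in range(len(lower) - 1, -1, -1):
        if lower[index] in VOWELS:
            return index
    raise ValueError("No vowel in syllable: %s" % body)


def numbered_syllable_to_accented(syllable):
    """Convert one numbered syllable such as 'shou3' to accented Pinyin."""
    if syllable and syllable[-1] in "12345":
        body, tone = syllable[:-1], syllable[-1]
    else:
        body, tone = syllable, "5"
    # Normalise the two common ASCII spellings of ü before placing the mark.
    body = body.replace("u:", "\u00fc").replace("v", "\u00fc")
    if tone not in TONE_MARKS:
        return body  # neutral tone carries no mark
    index = tone_mark_index(body)
    marked = body[:index + 1] + TONE_MARKS[tone] + body[index + 1:]
    # NFC merges the vowel and the combining mark into one character.
    return unicodedata.normalize("NFC", marked)


print(numbered_syllable_to_accented("shou1"))
print(numbered_syllable_to_accented("shou3"))
```

Replacing 'v' and 'u:' before choosing the vowel, and composing with NFC afterwards, also keeps the diaeresis and the tone mark on the same letter in syllables such as 'nu:3' and 'lu:e4'.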

Typo in setup.py

setup.py has a typo in the description that is displayed on PyPI: "Chinesetext" should be "Chinese text".