tsroten / dragonmapper
Identification and conversion functions for Chinese text processing
License: MIT License
Hey! I'm trying to use dragonmapper and I'm getting a confusing error. I installed it on Ubuntu 16.04 with the following command:
pip install dragonmapper --user
Then tried to use it:
import dragonmapper
dragonmapper.hanzi.is_simplified(u'你好')
but get the following error:
AttributeError: 'module' object has no attribute 'hanzi'
python --version outputs:
Python 2.7.12
What's up?
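One possible explanation, offered as an assumption rather than a confirmed diagnosis: in Python, importing a package does not bind its submodules as attributes unless the package's own `__init__.py` imports them. Whether dragonmapper's `__init__.py` imports `hanzi` is not confirmed here. The sketch below shows the general behaviour, using the standard-library `xml` package purely as a stand-in.

```python
import importlib

# Importing a submodule explicitly binds it as an attribute on its parent
# package. The standard-library package `xml` is only a stand-in here.
package = importlib.import_module("xml")
submodule = importlib.import_module("xml.dom")

# After the explicit import, attribute access on the package works.
submodule_is_bound = getattr(package, "dom") is submodule
print(submodule_is_bound)

# The analogous explicit imports for dragonmapper, as used in other snippets
# on this page, would be:
#   from dragonmapper import hanzi
#   import dragonmapper.hanzi as hz
```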
print hanzi.to_pinyin(u'绿')
outputs l̈ù
but should be lǜ
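The reported output appears to consist of "l", a combining diaeresis, and "ù", whereas the expected output carries both the diaeresis and the tone mark on the same vowel. A minimal sketch of composing a tone mark onto "ü" with the standard `unicodedata` module follows; the names `TONE_MARKS` and `add_tone_mark` are hypothetical and this is not dragonmapper's implementation.

```python
import unicodedata

# Combining marks for tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def add_tone_mark(vowel, tone):
    """Attach a tone mark to *vowel* and compose the result (NFC)."""
    mark = TONE_MARKS.get(tone)
    if mark is None:  # tone 5 (neutral) carries no mark
        return vowel
    return unicodedata.normalize("NFC", vowel + mark)


expected = "l" + add_tone_mark("\u00fc", 4)  # l + u-umlaut with grave
reported = "l\u0308\u00f9"                   # l, combining diaeresis, u-grave

# The two strings differ in length and content, and NFC normalization
# alone cannot turn the reported form into the expected one.
print(len(expected), len(reported))
```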
Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.
Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. A lot of my suggested cleanup edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave it up to discussion.
Issues
More than 2000 entries in CEDICT contain 'u:' combinations. 'yo1' and 'yo5' also account for a combined 5 entries in CEDICT. Dragonmapper cannot convert these items from numbered pinyin to accented pinyin. I found it necessary to loop through in this order:
These items raise 'ValueError: Not a valid syllable:' exceptions.
I also encountered the following items which do not convert correctly:
I found it necessary to skip items which contained the Taiwanese pronunciations ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this with Dragonmapper.
dragonmapper.hanzi.to_zhuyin('goá')
Results in a 'ValueError: Not a valid syllable: o5'
All other Pinyin syllables containing "ai" map to the IPA "aɪ"; only "shai" does not.
accented_to_numbered() returns the original syllable with a '5' added to the end, instead of actually deciphering the tone, if the vowel with the diacritic is uppercase. This requires a simple fix that involves using _lower_case().
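As an illustration of case-insensitive tone detection, here is a minimal sketch that decomposes a syllable with `unicodedata` and reads the tone from the combining mark, so uppercase and lowercase vowels are handled the same way. This is an alternative approach under stated assumptions, not the fix described above; `syllable_to_numbered` and `_TONE_BY_MARK` are hypothetical names.

```python
import unicodedata

# Tone number for each combining mark (macron, acute, caron, grave).
_TONE_BY_MARK = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_numbered(syllable):
    """Convert one accented Pinyin syllable to numbered Pinyin."""
    tone = "5"  # neutral tone when no tone mark is found
    kept = []
    # NFD splits each accented letter into a base letter plus marks,
    # which works identically for uppercase and lowercase letters.
    for char in unicodedata.normalize("NFD", syllable):
        if char in _TONE_BY_MARK:
            tone = _TONE_BY_MARK[char]
        else:
            kept.append(char)
    # Recompose so that a remaining diaeresis stays attached to its vowel.
    return unicodedata.normalize("NFC", "".join(kept)) + tone


print(syllable_to_numbered("\u00c0o"))  # uppercase A with grave, then o
```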
When packaging this package for openSUSE/Tumbleweed (and I have to admit I know almost nothing about the Chinese alphabet, being a Czech myself) the test suite started to fail:
[ 103s] =================================== FAILURES ===================================
[ 103s] _____________________ TestIdentifyFunctions.test_identify ______________________
[ 103s]
[ 103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_identify>
[ 103s]
[ 103s] def test_identify(self):
[ 103s] > self.assertEqual(trans.identify(self.numbered_pinyin), trans.PINYIN)
[ 103s] E AssertionError: 0 != 1
[ 103s]
[ 103s] dragonmapper/tests/test_transcriptions.py:19: AssertionError
[ 103s] ______________________ TestIdentifyFunctions.test_is_ipa _______________________
[ 103s]
[ 103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_is_ipa>
[ 103s]
[ 103s] def test_is_ipa(self):
[ 103s] > self.assertTrue(trans.is_ipa(self.ipa))
[ 103s] E AssertionError: False is not true
[ 103s]
[ 103s] dragonmapper/tests/test_transcriptions.py:40: AssertionError
[ 103s] _____________________ TestIdentifyFunctions.test_is_pinyin _____________________
[ 103s]
[ 103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_is_pinyin>
[ 103s]
[ 103s] def test_is_pinyin(self):
[ 103s] > self.assertTrue(trans.is_pinyin(self.numbered_pinyin))
[ 103s] E AssertionError: False is not true
[ 103s]
[ 103s] dragonmapper/tests/test_transcriptions.py:26: AssertionError
[ 103s] _______________ TestIdentifyFunctions.test_is_pinyin_compatible ________________
[ 103s]
[ 103s] self = <dragonmapper.tests.test_transcriptions.TestIdentifyFunctions testMethod=test_is_pinyin_compatible>
[ 103s]
[ 103s] def test_is_pinyin_compatible(self):
[ 103s] self.assertFalse(trans.is_pinyin_compatible(self.ipa))
[ 103s] > self.assertTrue(trans.is_pinyin_compatible(self.numbered_pinyin))
[ 103s] E AssertionError: False is not true
[ 103s]
[ 103s] dragonmapper/tests/test_transcriptions.py:48: AssertionError
[ 103s] ________________ TestConvertFunctions.test_accented_to_numbered ________________
[ 103s]
[ 103s] self = <dragonmapper.tests.test_transcriptions.TestConvertFunctions testMethod=test_accented_to_numbered>
[ 103s]
[ 103s] def test_accented_to_numbered(self):
[ 103s] > numbered_pinyin = trans.to_pinyin(self.accented_pinyin, accented=False)
[ 103s]
[ 103s] dragonmapper/tests/test_transcriptions.py:75:
[ 103s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[ 103s]
[ 103s] s = 'Wǒ shì yīgè měiguórén.', accented = False
[ 103s]
[ 103s] def to_pinyin(s, accented=True):
[ 103s] """Convert *s* to Pinyin.
[ 103s]
[ 103s] If *accented* is ``True``, diacritics are added to the Pinyin syllables. If
[ 103s] it's ``False``, numbers are used to indicate tone.
[ 103s]
[ 103s] """
[ 103s] identity = identify(s)
[ 103s] if identity == PINYIN:
[ 103s] if _has_accented_vowels(s):
[ 103s] return s if accented else accented_to_numbered(s)
[ 103s] else:
[ 103s] return numbered_to_accented(s) if accented else s
[ 103s] elif identity == ZHUYIN:
[ 103s] return zhuyin_to_pinyin(s, accented=accented)
[ 103s] elif identity == IPA:
[ 103s] return ipa_to_pinyin(s, accented=accented)
[ 103s] else:
[ 103s] > raise ValueError("String is not a valid Chinese transcription.")
[ 103s] E ValueError: String is not a valid Chinese transcription.
[ 103s]
[ 103s] dragonmapper/transcriptions.py:435: ValueError
[ 103s] ________________ TestConvertFunctions.test_numbered_to_accented ________________
[ 103s]
[ 103s] self = <dragonmapper.tests.test_transcriptions.TestConvertFunctions testMethod=test_numbered_to_accented>
[ 103s]
[ 103s] def test_numbered_to_accented(self):
[ 103s] > accented_pinyin = trans.to_pinyin(self.numbered_pinyin)
[ 103s]
[ 103s] dragonmapper/tests/test_transcriptions.py:71:
[ 103s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[ 103s]
[ 103s] s = 'Wo3 shi4 yi1ge4 mei3guo2ren2.', accented = True
[ 103s]
[ 103s] def to_pinyin(s, accented=True):
[ 103s] """Convert *s* to Pinyin.
[ 103s]
[ 103s] If *accented* is ``True``, diacritics are added to the Pinyin syllables. If
[ 103s] it's ``False``, numbers are used to indicate tone.
[ 103s]
[ 103s] """
[ 103s] identity = identify(s)
[ 103s] if identity == PINYIN:
[ 103s] if _has_accented_vowels(s):
[ 103s] return s if accented else accented_to_numbered(s)
[ 103s] else:
[ 103s] return numbered_to_accented(s) if accented else s
[ 103s] elif identity == ZHUYIN:
[ 103s] return zhuyin_to_pinyin(s, accented=accented)
[ 103s] elif identity == IPA:
[ 103s] return ipa_to_pinyin(s, accented=accented)
[ 103s] else:
[ 103s] > raise ValueError("String is not a valid Chinese transcription.")
[ 103s] E ValueError: String is not a valid Chinese transcription.
[ 103s]
[ 103s] dragonmapper/transcriptions.py:435: ValueError
[ 103s] =============================== warnings summary ===============================
[ 103s] dragonmapper/transcriptions.py:493
[ 103s] /home/abuild/rpmbuild/BUILD/dragonmapper-0.2.6/dragonmapper/transcriptions.py:493: DeprecationWarning: invalid escape sequence \s
[ 103s] re_pattern = '(?:%(syllable)s|\s)+' % {'syllable': zhon.zhuyin.syl}
[ 103s]
[ 103s] -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
[ 103s] =========================== short test summary info ============================
[ 103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_identify
[ 103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_is_ipa
[ 103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_is_pinyin
[ 103s] FAILED dragonmapper/tests/test_transcriptions.py::TestIdentifyFunctions::test_is_pinyin_compatible
[ 103s] FAILED dragonmapper/tests/test_transcriptions.py::TestConvertFunctions::test_accented_to_numbered
[ 103s] FAILED dragonmapper/tests/test_transcriptions.py::TestConvertFunctions::test_numbered_to_accented
[ 103s] =================== 6 failed, 24 passed, 1 warning in 0.54s ====================
Complete build log with all packages installed and steps taken to reproduce.
It is possible that the difference comes from upgrades of the packages this one depends on: hanzidentifier-1.1.0 and zhon-2.0.2.
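Separately from the six failures, the DeprecationWarning in the log points at a non-raw string containing `\s`. A minimal sketch of the usual remedy, a raw string literal, is shown below. The `syllable` pattern is a hypothetical stand-in for `zhon.zhuyin.syl`, whose actual value is not reproduced here, and this change is not claimed to affect the test failures.

```python
import re

# Hypothetical stand-in for zhon.zhuyin.syl: up to three Bopomofo letters
# followed by an optional tone mark.
syllable = "[\u3105-\u3129]{1,3}[\u02ca\u02c7\u02cb\u02d9]?"

# The r'' prefix passes the backslash in \s through to the regex engine,
# so Python no longer reports an invalid escape sequence.
re_pattern = r'(?:%(syllable)s|\s)+' % {'syllable': syllable}

sample = "\u3128\u311b\u02c7 \u3115\u02cb"  # two Zhuyin syllables and a space
print(re.fullmatch(re_pattern, sample) is not None)
```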
numbered_to_accented() should add apostrophes before syllables that start with vowels. Apostrophes aren't needed with numbered pinyin (even though they probably technically should have them), but when converting to diacritic pinyin, they should definitely be there.
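A minimal sketch of that rule, assuming the syllables have already been split and accented: a non-initial syllable that begins with a, o, or e gets an apostrophe in front of it. The function name `join_accented_syllables` is hypothetical and is not part of dragonmapper.

```python
import unicodedata


def join_accented_syllables(syllables):
    """Join accented Pinyin syllables, inserting apostrophes where needed."""
    parts = []
    for index, syllable in enumerate(syllables):
        # Strip any tone mark from the first letter before checking it.
        first = unicodedata.normalize("NFD", syllable[0])[0].lower()
        if index > 0 and first in "aoe":
            parts.append("'")
        parts.append(syllable)
    return "".join(parts)


print(join_accented_syllables(["x\u012b", "\u0101n"]))  # the syllables of Xi'an
```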
print(accented_to_numbered("aī"))
print(accented_to_numbered("aí"))
print(accented_to_numbered("aǐ"))
print(accented_to_numbered("aì"))
print(accented_to_numbered("ai"))
produces:
a5ī
a5í
a5ǐ
a5ì
ai5
I'm trying to use dragonmapper to convert characters to pinyin, and I'm trying the tutorial but I'm stuck.
http://dragonmapper.readthedocs.org/en/latest/tutorial.html
from dragonmapper import hanzi
s = '这个字怎么念?'
pinyin = hanzi.to_pinyin(s)
At this point pinyin is an empty string u''. What am I doing wrong?
# coding: UTF-8
from dragonmapper import hanzi, transcriptions, __version__
print __version__
au = u'奧'
print hanzi.to_pinyin(au)
print hanzi.to_pinyin(au, accented=False)
print hanzi.to_zhuyin(au)
austria = u'奧地利'
print hanzi.to_pinyin(austria, accented=True)
print hanzi.to_pinyin(austria, accented=False)
print hanzi.to_zhuyin(austria)
Outputs:
0.2.3
ào
ao4
ㄠˋ
Àodìlì
Ào5di4li4
Traceback (most recent call last):
File "<filepath>", line 13, in <module>
print hanzi.to_zhuyin(austria)
File "build/bdist.linux-x86_64/egg/dragonmapper/hanzi.py", line 190, in to_zhuyin
File "build/bdist.linux-x86_64/egg/dragonmapper/transcriptions.py", line 365, in pinyin_to_zhuyin
File "build/bdist.linux-x86_64/egg/dragonmapper/transcriptions.py", line 341, in _convert
File "build/bdist.linux-x86_64/egg/dragonmapper/transcriptions.py", line 229, in pinyin_syllable_to_zhuyin
ValueError: Not a valid syllable: o5
(using the developer branch)
The delimiter parameter to to_pinyin() has no effect:
hanzi.to_pinyin("我猕猴桃过敏。", delimiter='.')
# ACTUAL OUTPUT:
# 'wǒmíhóutáoguòmǐn。'
# EXPECTED OUTPUT:
# 'wǒ.míhóutáo.guòmǐn。'
The default delimiter, a single space ' ', is not applied either:
hanzi.to_pinyin("我猕猴桃过敏。")
# ACTUAL OUTPUT:
# 'wǒmíhóutáoguòmǐn。'
# EXPECTED OUTPUT:
# 'wǒ míhóutáo guòmǐn。'
The following entries are missing in transcriptions.csv:
eng,ㄥ,ŋ
tei,ㄊㄟ,tʰeɪ
I discovered them by trying to convert all of CC-CEDICT to zhuyin. I'm also not certain whether I used the correct IPA to transcribe the pinyin and zhuyin.
Example entries from CEDICT using those syllables:
忒 忒 [tei1] /(dialect) too/very/also pr. [tui1]/
鞥 鞥 [eng1] /reins/
Using Pleco I verified that those are not mistakes in CEDICT and that these should probably be added to dragonmapper.
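A check like the one described could be sketched as follows: parse each CEDICT line, take the syllables from the bracketed Pinyin field, and report any that are absent from a set of known syllables. The names `parse_cedict_line` and `missing_syllables`, as well as the `known_syllables` argument, are hypothetical; in practice the known set would come from transcriptions.csv.

```python
import re

# traditional, simplified, [pinyin], /definition/definition/
_ENTRY = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Return a dict for one CEDICT entry, or None for non-entry lines."""
    match = _ENTRY.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, definitions = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": definitions.split("/"),
    }


def missing_syllables(lines, known_syllables):
    """Collect toneless syllables that are not in *known_syllables*."""
    missing = set()
    for line in lines:
        entry = parse_cedict_line(line)
        if entry is None:
            continue
        for syllable in entry["pinyin"].split():
            base = syllable.rstrip("12345").lower()
            if base not in known_syllables:
                missing.add(base)
    return missing


sample_lines = [
    "忒 忒 [tei1] /(dialect) too/very/also pr. [tui1]/",
    "鞥 鞥 [eng1] /reins/",
]
print(sorted(missing_syllables(sample_lines, {"tui"})))
```

Only the bracketed Pinyin field is inspected, so readings mentioned inside a definition (such as the "[tui1]" note above) are ignored.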
Multiple functions in dragonmapper.transcriptions handle exceptions for IndexError instead of KeyError like they should. It's a simple typo that can be easily fixed.
>>> print(transcriptions.pinyin_to_zhuyin(ó))
Traceback (most recent call last):
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 100, in <module>
print(printBopomofo(eachLine)+"\n"*3)
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 42, in printBopomofo
bopomofoDictionary=makeToneDictionary(hanzistring2)
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 35, in makeToneDictionary
bopomofoList=listBopomofo(hanzi)
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 11, in listBopomofo
print(transcriptions.pinyin_to_zhuyin(ó))
NameError: name 'ó' is not defined
Also, the following string raises an error:
(1 4 事情是這樣的 , 父親讀到也看到許多偉大而奇妙的事時 , 他向主高呼許多事 , 諸如 : 哦 , 主神全能者 , 您的事工多麼偉大而奇妙 ! 您的寶座在高天之上 , 您的大能、良善和慈悲廣被世上全民 , 而且 , 由於您的慈悲 , 您不會讓歸向您的人滅亡 ! )
>>> pinyin=hanzi.to_zhuyin(hanzistring)
Traceback (most recent call last):
File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 227, in pinyin_syllable_to_zhuyin
zhuyin_syllable = _PINYIN_MAP[pinyin_syllable.lower()]['Zhuyin']
KeyError: 'o'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 100, in <module>
print(printBopomofo(eachLine)+"\n"*3)
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 42, in printBopomofo
bopomofoDictionary=makeToneDictionary(hanzistring2)
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 35, in makeToneDictionary
bopomofoList=listBopomofo(hanzi)
File "D:\OneDrive\My Programs\zhuyin converter\Convert2BopomofoPunctuation.py", line 12, in listBopomofo
pinyin=hanzi.to_zhuyin(hanzistring)
File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\hanzi.py", line 190, in to_zhuyin
zhuyin = pinyin_to_zhuyin(numbered_pinyin)
File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 365, in pinyin_to_zhuyin
remove_apostrophes=True, separate_syllables=True)
File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 341, in _convert
new += syllable_function(match.group())
File "C:\Users\Kei\AppData\Local\Programs\Python\Python35\lib\site-packages\dragonmapper\transcriptions.py", line 229, in pinyin_syllable_to_zhuyin
raise ValueError('Not a valid syllable: %s' % s)
ValueError: Not a valid syllable: o2
Correct IPA transcriptions:
yong -> iʊŋ
you -> ioʊ
numbered_to_accented() leaves the 'v' vowel untouched because it automatically returns it without processing it. The function that replaces 'v' with '\u00fc' needs to come before the check for unaccented vowels.
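The ordering described above can be sketched as a normalization step that runs before any other processing. The helper name `normalize_u_umlaut` is hypothetical; it also handles the 'u:' spelling mentioned in the CEDICT report on this page.

```python
def normalize_u_umlaut(numbered_pinyin):
    """Rewrite the 'v' and 'u:' spellings as u-umlaut (U+00FC / U+00DC)."""
    replacements = (
        ("u:", "\u00fc"),
        ("U:", "\u00dc"),
        ("v", "\u00fc"),
        ("V", "\u00dc"),
    )
    for old, new in replacements:
        numbered_pinyin = numbered_pinyin.replace(old, new)
    return numbered_pinyin


# Run the normalization first, then hand the result to the tone conversion.
print(normalize_u_umlaut("nv3") == normalize_u_umlaut("nu:3"))
```

Replacing every 'v' is reasonable here because the letter is not otherwise used in standard Pinyin spelling.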
When converting a Pinyin syllable that ends with the -r suffix to Zhuyin/IPA, a ValueError is raised because the transcription mapping data is lacking the 'r' syllable. In IPA, 'ɻ' should be used for the -r suffix. In Zhuyin, 'ㄦ' should be used.
That is, when they have the first intonation:
from dragonmapper import hanzi, transcriptions
print transcriptions.zhuyin_syllable_to_pinyin(u'ㄓㄨˋ') # works
print transcriptions.zhuyin_syllable_to_pinyin(u'ㄔㄨ') # does not work
I traced things down to the following:
def _parse_zhuyin_syllable(unparsed_syllable):
    """Return the syllable and tone of a Zhuyin syllable."""
    zhuyin_tone = unparsed_syllable[-1]
    if zhuyin_tone in zhon.zhuyin.characters:
        syllable, tone = unparsed_syllable, '1'
    elif zhuyin_tone in zhon.zhuyin.marks:
        for tone_number, tone_mark in _ZHUYIN_TONES.items():
            if zhuyin_tone == tone_mark:
                syllable, tone = unparsed_syllable[:-1], tone_number
    else:
        raise ValueError("Invalid syllable: %s" % unparsed_syllable)
    return syllable, tone
For some reason, there is no ㄨ in zhon.zhuyin.characters? (also no ㄩ)
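One way to avoid depending on the contents of `zhon.zhuyin.characters` is to test only whether the last character is a tone mark and treat everything else as first tone. The sketch below takes that approach under an assumed tone-mark table with trailing marks; it is not the library's code, and unlike the function above it does not validate the letters themselves.

```python
# Assumed tone-mark table: tone number keyed by the trailing mark.
# First tone is unmarked.
_TONE_BY_MARK = {"\u02ca": "2", "\u02c7": "3", "\u02cb": "4", "\u02d9": "5"}


def parse_zhuyin_syllable(unparsed_syllable):
    """Return (syllable, tone) for a Zhuyin syllable with a trailing mark."""
    last = unparsed_syllable[-1]
    if last in _TONE_BY_MARK:
        return unparsed_syllable[:-1], _TONE_BY_MARK[last]
    # No tone mark at the end: treat the whole string as a first-tone syllable.
    return unparsed_syllable, "1"


print(parse_zhuyin_syllable("\u3114\u3128"))  # the unmarked syllable from above
```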
Seems to be a problem with all 'lüè'. Don't know if it's just my code but it seems to be a dragonmapper issue.
hanzi.to_zhuyin('嗲')
ValueError: Not a valid Syllable: dia3
I have it fixed in my HTML branch, but I can submit a separate PR if you want :-)
Just let me know.
哟,他怎么来了?
Yó,tā zěnme lái le?
Oh, how did he get here?
Transcription of yo
yo,ㄧㄛ,iɔ
No problems with python 2 on windows, but installing for 3.4 gives the following traceback:
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "C:\Users\Gal\AppData\Local\Temp\pip_build_Gal\dragonmapper\setup.py", line 24, in <module>
readme = open_file('README.rst')
File "C:\Users\Gal\AppData\Local\Temp\pip_build_Gal\dragonmapper\setup.py", line 22, in open_file
return f.read()
File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1282: character maps to <undefined>
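The traceback shows the README being decoded with cp1252, the platform default on that Windows install. A common remedy is to pass the encoding explicitly when opening the file. The sketch below assumes README.rst is UTF-8 encoded and that `open_file` is a small read helper; the real body of that function in setup.py is not shown in the report.

```python
import io


def open_file(filename):
    """Read *filename* as UTF-8 regardless of the platform's default codec."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(filename, encoding="utf-8") as f:
        return f.read()
```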
The following code
print dragonmapper.hanzi.to_pinyin(u'女')
print dragonmapper.hanzi.to_pinyin(u'女人')
outputs:
n̈ǔ
nǚren
where i think it should be:
nǚ
nǚren
import dragonmapper.hanzi as hz
In [1]: hz.to_pinyin('收')
Out[1]: 'shoū'
In [2]: hz.to_pinyin('手')
Out[2]: 'shoǔ'
In [62]: dragonmapper.__version__
Out[62]: '0.2.3'
python 3.4
Virtualenv in Ubuntu Linux Trusty
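For reference, the conventional Pinyin placement rule puts the tone mark on "a" or "e" if present, on the "o" of "ou", and otherwise on the last vowel. A minimal sketch of that rule follows; `place_tone` is a hypothetical name, the input is assumed to be a single NFC-normalized syllable without a tone mark, and this is not dragonmapper's implementation.

```python
import unicodedata

# Combining marks for tones 1-4 (macron, acute, caron, grave).
_TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}
_VOWELS = "aeiou\u00fc"


def place_tone(syllable, tone):
    """Add the tone mark for *tone* to the right vowel of *syllable*."""
    mark = _TONE_MARKS.get(tone)
    if mark is None:  # neutral tone: nothing to add
        return syllable
    lower = syllable.lower()
    if "a" in lower:
        index = lower.index("a")
    elif "e" in lower:
        index = lower.index("e")
    elif "ou" in lower:
        index = lower.index("o")
    else:
        positions = [i for i, char in enumerate(lower) if char in _VOWELS]
        if not positions:
            return syllable
        index = positions[-1]
    # NFC merges the vowel and the combining mark into one character.
    marked = unicodedata.normalize("NFC", syllable[index] + mark)
    return syllable[:index] + marked + syllable[index + 1:]


print(place_tone("shou", "1"), place_tone("shou", "3"))
```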
setup.py has a typo in the description that is displayed on PyPI: "Chinesetext" should be "Chinese text".