cburgmer / cjklib

Han character library for CJKV languages

License: Other

PHP 1.99% Python 98.01%

cjklib's Issues

Use SQLAlchemy Tables/Schemas for installing data?

After seeing #4, I noticed that some SQL dialects handle table creation a bit differently.

SQLAlchemy has its own way of creating tables. @cburgmer, would SQLAlchemy tables/schemas make sense for handling the creation of /data's schemas?

If this is something that could help improve robustness by using SQLAlchemy's dialects, I'd like to take a bite. Anything you can think of that would block this?
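
A minimal sketch of what the proposal could look like, assuming SQLAlchemy is installed. The table and column names below are hypothetical and are not taken from cjklib's /data files; the point is only that the schema is declared once and SQLAlchemy emits the dialect-specific CREATE TABLE statements.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, inspect)

metadata = MetaData()

# Hypothetical table definition; names are illustrative only.
stroke_count = Table(
    "StrokeCount",
    metadata,
    Column("ChineseCharacter", String(1), primary_key=True),
    Column("StrokeCount", Integer, nullable=False),
)


def create_schema(url="sqlite://"):
    """Create every table registered on `metadata` for the given database URL."""
    engine = create_engine(url)
    # create_all() generates CREATE TABLE statements for the engine's dialect
    # and skips tables that already exist.
    metadata.create_all(engine)
    return engine


if __name__ == "__main__":
    engine = create_schema()
    print(inspect(engine).get_table_names())
```

Swapping the URL for another backend would leave the table definition unchanged, which is the robustness gain the question is about.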

Get Yale readings

Since Yale encodes the difference between the high level and high falling tones but Jyutping doesn't, would it be possible to get the Yale readings directly?
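
To illustrate why a Jyutping reading cannot simply be converted back, here is a small sketch of the tone correspondence between the two systems. It assumes the common six-number Jyutping scheme; the names in the code are illustrative and not part of cjklib.

```python
# Yale tone categories mapped to Jyutping tone numbers (six-tone numbering).
# Two Yale categories share Jyutping tone 1, so the mapping is many-to-one.
YALE_TO_JYUTPING_TONE = {
    "high level": 1,
    "high falling": 1,
    "high rising": 2,
    "mid level": 3,
    "low falling": 4,
    "low rising": 5,
    "low level": 6,
}


def yale_candidates(jyutping_tone):
    """Return every Yale tone category that a Jyutping tone number could stand for."""
    return sorted(
        name
        for name, number in YALE_TO_JYUTPING_TONE.items()
        if number == jyutping_tone
    )


if __name__ == "__main__":
    for tone in range(1, 7):
        print(tone, yale_candidates(tone))
```

Only tone 1 yields two candidates, which is exactly the information that is lost when Yale is derived from Jyutping.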

Update cjklib to be compatible with SQLAlchemy >=0.7

I'm currently working on a new project that uses Flask and cjklib. However, Flask-SQLAlchemy seems to error out when I'm using SQLAlchemy 0.6.9 (despite stating that it's compatible with 0.6 or higher). This means I've got a problem, since I need the lower version of SQLAlchemy for cjklib, and the higher version for Flask-SQLAlchemy.

It would be nice if cjklib could be updated, or if at least some information could be posted on what's currently preventing 0.7 from being usable.

make it run with Python3

It would be great if this library could work with Python3, and, by extension, with a recent version of SQLAlchemy.

Character has no stroke count information

Hi @cburgmer, thanks so much for this library. After a bit of fiddling, I was able to get everything going! Everything I've tried so far works: translation, pinyin, etc. I can't, however, seem to figure out getStrokeCount():

from cjklib.characterlookup import CharacterLookup
cjk = CharacterLookup('C')
print(cjk.getStrokeCount(u'说'))

When I run the above, I get the following:

Traceback (most recent call last):
  File "/Users/user/Documents/GitHub/chinese/hanzi.csv/generate.py", line 7, in <module>
    print(cjk.getStrokeCount(u'说'))
  File "/Users/user/Documents/GitHub/chinese/hanzi.csv/cjklib/characterlookup.py", line 644, in getStrokeCount
    "Character has no stroke count information")
cjklib.exception.NoInformationError: Character has no stroke count information

I've tried rebuilding the databases, reinstalling, etc. but no luck. I was wondering if you had any suggestions?
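
One way to narrow this down is to check whether only this character is missing or whether no character has stroke data at all. The sketch below is a hedged helper, not part of cjklib: the function names are made up, and the lookup object and exception type are passed in so the helper does not assume anything beyond the getStrokeCount() call and the NoInformationError shown in the traceback.

```python
def stroke_count_or_none(lookup, char, missing_error=Exception):
    """Return the stroke count for `char`, or None if the lookup raises `missing_error`."""
    try:
        return lookup.getStrokeCount(char)
    except missing_error:
        return None


def chars_without_stroke_count(lookup, chars, missing_error=Exception):
    """List the characters in `chars` for which no stroke count is available."""
    return [
        char
        for char in chars
        if stroke_count_or_none(lookup, char, missing_error) is None
    ]


# Intended use with cjklib (not executed here):
#   from cjklib.characterlookup import CharacterLookup
#   from cjklib.exception import NoInformationError
#   cjk = CharacterLookup('C')
#   print(chars_without_stroke_count(cjk, u'说话人', NoInformationError))
```

If every character comes back as missing, the stroke data was probably never built or loaded; if only some do, the gap is in the data itself.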

have a test suite

It would be nice if this software had a test suite. It's quite difficult to develop without one.

LICENSE change, things alive here?

@cburgmer

Can we change the license at the software-level to BSD, MIT or Apache?

My reasons are the ones stated here: ScottDuckworth/python-anyvcs#32 (comment).

Because cjklib is Python and its data libraries are useful in pieces, a simpler license would be more helpful at this point.

I'm going to cross-post this to the google group (https://code.google.com/p/cjklib/issues/detail?id=23&thanks=23&ts=1386125345)

State of the cjklib / understanding our datasets

I think it'd be good to take stock of where cjklib's current codebase stands. Do we want to use it? As it stands, I'm not sure whether I'm failing to grasp the complexities of commingling our data, or whether there are architectural mistakes that would be best fixed by a rewrite.

If that is the case, I wonder if you could take some time to document what is what from a data perspective. Here are a few questions that would be helpful to have answers to:

In cjklib.data's CSV and SQL files: what are these datasets? How are they used? Are they used in the same way? What data do/can they hold?

More specifically, what is the following:

edict
cedict
cedictgr
handedict
cfdict
unihan
kanjidic2

and

cantoneseipainitialfinal
cantoneseipainitialfinal
cantoneseyaleinitialnucleuscoda
cantoneseyalesyllables
characterdecomposition
charactershanghaineseipa
grabbreviation
grrhotacisedfinals
grsyllables
jyutpinginitialfinal
jyutpingipamapping
jyutpingsyllables
jyutpingyalemapping
kangxiradical
localecharacterglyph
mandarinipainitialfinal
pinyinbraillefinalmapping
pinyinbrailleinitialmapping
pinyingrmapping
pinyininitialfinal
pinyinipamapping
pinyinsyllables
radicalequivalentcharacter
shanghaineseipasyllables
strokeorder
strokes
Unihan.zip (is this downloaded to here?)
wadegilesinitialfinal
wadegilespinyinmapping
wadegilessyllables

What are the above? Why are some included while others are downloaded remotely? Can we package any/all of the remote data in cjklib? Is it a matter of licensing, or of ensuring that fresh data is downloaded?

What data in the above datasets intersect, and where?

Where the data intersects, I assume we often massage it in some way so we can match it to a lookup? Maybe it'd help to have a spreadsheet / table on this?

I think that if we mapped the data we have to a spreadsheet, it'd offer us all a better view of the picture, imo. Then we can step back from legacy assumptions and be in a better position to make pull requests for larger architecture changes.
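
As a starting point for such a spreadsheet, here is a rough sketch that lists the CSV and SQL files in a data directory and writes one row per file. It assumes the files are UTF-8 text and that a line count is a useful first column; the function names and the output columns are illustrative, not something cjklib provides.

```python
import csv
from pathlib import Path


def inventory(data_dir):
    """Return one dict per .csv/.sql file in `data_dir`: name, format, line count."""
    rows = []
    for path in sorted(Path(data_dir).iterdir()):
        if path.suffix not in (".csv", ".sql"):
            continue
        # Assumption: the data files are UTF-8 encoded text.
        with path.open(encoding="utf-8", newline="") as handle:
            line_count = sum(1 for _ in handle)
        rows.append(
            {"name": path.stem, "format": path.suffix[1:], "lines": line_count}
        )
    return rows


def write_inventory(rows, out_path):
    """Write the inventory rows to a CSV file that opens in any spreadsheet tool."""
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "format", "lines"])
        writer.writeheader()
        writer.writerows(rows)
```

Columns for source, license, and "which lookups use this" would still have to be filled in by hand, since that knowledge is not in the files.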

I realize the above is pretty time-consuming. Do you think you could take a bite at it, though?

Traditional Hanzi to Kanji conversion

I would like to have a Traditional Hanzi to Kanji converter.
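
A toy sketch of what such a converter could look like, using a character-level translation table. The five pairs below are well-known traditional forms and their modern Japanese counterparts; the table is an illustration only, not a dataset cjklib ships, and a real converter would need a complete mapping plus handling for characters with more than one Japanese form.

```python
# Tiny illustrative mapping from traditional Hanzi to modern Japanese Kanji forms.
TRADITIONAL_TO_KANJI = str.maketrans(
    {
        "學": "学",
        "國": "国",
        "體": "体",
        "氣": "気",
        "廣": "広",
    }
)


def to_kanji(text):
    """Replace every mapped traditional character; leave all other characters as-is."""
    return text.translate(TRADITIONAL_TO_KANJI)


if __name__ == "__main__":
    print(to_kanji("學國體氣廣"))
```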

Pinyin to MandarinIPA bugs

Thanks for your wonderful cjklib and cjknife command-line tool. While making system calls to cjknife to produce IPA for some Pinyin (I'm writing a command-line pinyin drilling program in R), I noticed some bugs in the production of MandarinIPA with the following system call:

cjknife -s Pinyin -t MandarinIPA -m pinyin_to_convert_to_ipa

cjknife throws an error when asked to convert the legitimate pinyin yo, m, n, ng, hng, and hm. I've seen yo (final io without an initial) rendered in IPA as [jo] or [jɔ]. Sometimes an i with a tilde underneath is used instead of a j. According to Wikipedia's syllabic consonant page, you should be able to use [m̩], [n̩], [ŋ̍], [xŋ̍], and [xm̩] for those Mandarin syllabic consonant interjections (IPA adds a little line above or below to mark a syllabic consonant).
cjknife gives 'o' IPA for Pinyin (u)o after b, p, m, f, where it should have a 'wo' sound, e.g. po = [pʰwo], not [p‘o]. Although written with an 'o', bo, po, mo, fo (and wo) all in fact have "uo" finals. The only examples of pure "o" finals are the interjection "o" and the rather rare particle "lo" (yo being the only example of the "io" final).
cjknife gives incorrect IPA for erhua, e.g. dianr3 should be tjɐɚ̯, not tiɛn.ər.
Even if we restrict erhua to what a candidate is expected to know to pass the 普通话水平测试 (Putonghua Proficiency Test) exam (i.e. someone with standard Mandarin pronunciation), we still have a lot of erhua syllables. For comparison, I've compiled my own Mandarin syllable to IPA mapping:

https://u14129277.dl.dropboxusercontent.com/u/14129277/pinyin_ipa.csv

which I built from the following tables I compiled (the final and initial tables mainly from the Pinyin and Erhua pages on Wikipedia, but also from other sources; the pinyin-to-initial-and-final table I decomposed by hand from all the pinyin examples I could find):

https://u14129277.dl.dropboxusercontent.com/u/14129277/initial.csv

https://u14129277.dl.dropboxusercontent.com/u/14129277/final.csv

https://u14129277.dl.dropboxusercontent.com/u/14129277/pinyin_initial_final.csv
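
For the syllabic consonant interjections described above, a caller could work around the error with a small fallback table until the conversion is fixed. This is only a sketch: the IPA values are the ones suggested in this report, the function name is made up, and tone handling assumes numbered-tone Pinyin input such as "ng2".

```python
# Fallback IPA for the syllabic consonant interjections listed in this report.
SYLLABIC_CONSONANT_IPA = {
    "m": "m̩",
    "n": "n̩",
    "ng": "ŋ̍",
    "hng": "xŋ̍",
    "hm": "xm̩",
}


def ipa_for_interjection(pinyin):
    """Return fallback IPA for a syllabic consonant syllable, or None if not one.

    A trailing tone digit (assumed numbered-tone notation) is ignored.
    """
    base = pinyin.rstrip("012345").lower()
    return SYLLABIC_CONSONANT_IPA.get(base)


if __name__ == "__main__":
    for syllable in ("m", "ng2", "hng", "ma1"):
        print(syllable, ipa_for_interjection(syllable))
```

Syllables that are not in the table return None, so the caller can still hand everything else to cjknife.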

Thanks!

Respect cjklib.conf url setting when installing dictionaries

See #1.

We want to configure where dictionaries are stored.
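
A rough sketch of how an installer could pick up that setting, using Python's standard configparser. The section name "Connection", the option name "url", and the sample value are assumptions for illustration; the actual layout of cjklib.conf should be checked before relying on them.

```python
import configparser

# Hypothetical cjklib.conf contents; section and option names are assumptions.
SAMPLE_CONFIG = """
[Connection]
url = sqlite:////tmp/cjklib-dictionaries.db
"""


def database_url(config_text, default=None):
    """Return the configured database URL, or `default` if it is not set."""
    # Interpolation is disabled so '%' characters in URLs are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return parser.get("Connection", "url", fallback=default)


if __name__ == "__main__":
    print(database_url(SAMPLE_CONFIG))
```

The dictionary installer would then write to the returned URL and fall back to its built-in location only when the option is absent.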

cjklib.org is down
