
gumblex / zhconv


Simple conversion and localization between simplified and traditional Chinese using tables from MediaWiki.

Home Page: https://pypi.python.org/pypi/zhconv

License: MIT License

Python 100.00%
chinese-simplified chinese-traditional mediawiki

zhconv's Introduction

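Simple Simplified/Traditional Chinese Conversion

Documentation

zhconv provides Simplified/Traditional Chinese conversion by maximum forward matching, based on the MediaWiki and OpenCC vocabulary tables. It supports regional vocabulary conversion: zh-cn, zh-tw, zh-hk, zh-sg, zh-hans, zh-hant. It works on both Python 2 and 3.

If you need high accuracy, see OpenCC and opencc-python.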

>>> print(convert(u'我幹什麼不干你事。', 'zh-cn'))
我干什么不干你事。
>>> print(convert(u'人体内存在很多微生物', 'zh-tw'))
人體內存在很多微生物

Note that zh-hans and zh-hant only convert between Simplified and Traditional characters; they do not convert regional vocabulary.
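Maximum forward matching can be sketched as follows. This is a minimal illustration, not zhconv's actual implementation: the function name `forward_max_match` and the three-entry `toy_table` are made up for the example, and the real dictionary is far larger.

```python
def forward_max_match(text, table):
    """Convert text by always taking the longest table key at each position."""
    if not table:
        return text
    max_len = max(len(key) for key in table)
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a key matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in table:
                out.append(table[chunk])
                pos += size
                break
        else:
            # No entry starts here: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return ''.join(out)


# Toy table (illustrative entries only, not the real zhconv dictionary).
toy_table = {'幹': '干', '什麼': '什么', '麼': '么'}
print(forward_max_match('我幹什麼不干你事。', toy_table))
```

Because the two-character key is tried before the one-character key, multi-character words win over single-character mappings whenever both could apply at the same position.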

Full support for the MediaWiki manual conversion syntax:

>>> print(convert_for_mw(u'在现代,机械计算-{}-机的应用已经完全被电子计算-{}-机所取代', 'zh-hk'))
在現代,機械計算機的應用已經完全被電子計算機所取代
>>> print(convert_for_mw(u'-{zh-hant:資訊工程;zh-hans:计算机工程学;}-是电子工程的一个分支,主要研究计算机软硬件和二者间的彼此联系。', 'zh-tw'))
資訊工程是電子工程的一個分支,主要研究計算機軟硬體和二者間的彼此聯繫。
>>> print(convert_for_mw(u'張國榮曾在英國-{zh:利兹;zh-hans:利兹;zh-hk:列斯;zh-tw:里茲}-大学學習。', 'zh-sg'))
张国荣曾在英国利兹大学学习。
>>> print(convert_for_mw('毫米(毫公分),符號mm,是長度單位和降雨量單位,-{zh-hans:**作-{公釐}-或-{公厘}-;zh-hant:港澳和大陸稱為-{毫米}-(台灣亦有使用,但較常使用名稱為毫公分);zh-mo:台灣作-{公釐}-或-{公厘}-;zh-hk:台灣作-{公釐}-或-{公厘}-;}-。', 'zh-cn'))
毫米(毫公分),符号mm,是长度单位和降雨量单位,**作公釐或公厘。

as well as the other advanced word conversion syntax.
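The variant-selection part of the `-{...}-` markup can be sketched roughly as below. This is a simplified illustration under stated assumptions, not how `convert_for_mw` is implemented: the names `FALLBACKS`, `pick_variant` and `apply_manual_rules` are hypothetical, the fallback order is assumed, nested markup is not handled, and the text outside the markup is left unconverted.

```python
import re

# Assumed fallback order per target locale (illustrative only).
FALLBACKS = {
    'zh-cn': ('zh-cn', 'zh-hans', 'zh'),
    'zh-sg': ('zh-sg', 'zh-hans', 'zh'),
    'zh-tw': ('zh-tw', 'zh-hant', 'zh'),
    'zh-hk': ('zh-hk', 'zh-hant', 'zh'),
}

# Matches one non-nested -{ ... }- span.
MARKUP = re.compile(r'-\{(.*?)\}-')


def pick_variant(body, locale):
    """Choose the text for `locale` from the body of one -{...}- span."""
    rules = {}
    for part in body.split(';'):
        variant, sep, value = part.partition(':')
        if sep and variant.strip().startswith('zh'):
            rules[variant.strip()] = value
    if not rules:
        # -{text}- or -{}-: keep the protected text exactly as written.
        return body
    for candidate in FALLBACKS.get(locale, (locale,)):
        if candidate in rules:
            return rules[candidate]
    # No rule fits: fall back to the first rule listed.
    return next(iter(rules.values()))


def apply_manual_rules(text, locale):
    """Replace every non-nested -{...}- span with the chosen variant."""
    return MARKUP.sub(lambda m: pick_variant(m.group(1), locale), text)


print(apply_manual_rules('-{zh:利兹;zh-hans:利兹;zh-hk:列斯;zh-tw:里茲}-', 'zh-hk'))
```

An empty span such as `-{}-` simply disappears, which is why it can be used to stop two adjacent characters from being matched as one word.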

The conversion dictionary can be obtained from includes/ZhConversion.php in the MediaWiki source package; convmwdict.py converts it to JSON format.
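The idea of turning the PHP tables into JSON can be sketched as follows. This is not the code of convmwdict.py: the function `php_tables_to_dict` is hypothetical, the `sample` string is hand-written, and the sketch assumes the tables are PHP arrays of single-quoted `'from' => 'to'` pairs and does not unescape PHP escape sequences.

```python
import json
import re

# Start of a table, e.g. "$zh2Hant = [" or "$zh2Hant = array(".
TABLE_START = re.compile(r"\$(\w+)\s*=\s*(?:array\s*\(|\[)")
# One single-quoted 'from' => 'to' pair.
PAIR = re.compile(r"'((?:[^'\\]|\\.)*)'\s*=>\s*'((?:[^'\\]|\\.)*)'")


def php_tables_to_dict(php_source):
    """Collect every 'from' => 'to' pair, grouped by PHP variable name."""
    tables = {}
    current = None
    for line in php_source.splitlines():
        start = TABLE_START.search(line)
        if start:
            current = tables.setdefault(start.group(1), {})
        if current is not None:
            for src, dst in PAIR.findall(line):
                current[src] = dst
    return tables


# Hand-written sample in the assumed layout, not real MediaWiki data.
sample = """
public static $zh2Hant = [
'么' => '麼',
'体' => '體',
];
public static $zh2Hans = [
'麼' => '么',
];
"""
print(json.dumps(php_tables_to_dict(sample), ensure_ascii=False, indent=1))
```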

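The code is licensed under the MIT License; the conversion table comes from MediaWiki and is therefore licensed under GPLv2+.

Using this project in a Spark cluster

In a distributed cluster, environment restrictions may make it inconvenient to install this project on every machine. In that case you can upload the project's egg file on its own from the driver machine; it does not depend on any other project.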

# python setup.py bdist_egg

# ls dist
zhconv-1.2.2-py2.7.egg

Locally, you can run sys.path.append('PATH_TO_ZHCONV/zhconv-1.2.2-py2.7.egg') and then use the library directly.
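Appending an egg to `sys.path` works because an egg is a zip archive and Python can import modules straight from zip files. The sketch below shows that mechanism with a throwaway archive; the archive name `demo_pkg-0.1-py3.egg` and the module `demo_zhmod` are invented for the demonstration and stand in for the real zhconv egg. On a cluster, PySpark's `SparkContext.addPyFile` is the usual way to ship such an archive to the executors.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_archive(directory):
    """Create a tiny zip archive holding one module (stand-in for a real egg)."""
    archive = os.path.join(directory, 'demo_pkg-0.1-py3.egg')
    with zipfile.ZipFile(archive, 'w') as zf:
        zf.writestr('demo_zhmod.py',
                    "def greet():\n    return 'imported from archive'\n")
    return archive


workdir = tempfile.mkdtemp()
egg_path = build_demo_archive(workdir)

# Same step as appending the zhconv egg: put the archive on sys.path.
sys.path.append(egg_path)
importlib.invalidate_caches()

demo = importlib.import_module('demo_zhmod')
print(demo.greet())
```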

Tools

EPUB e-book Simplified/Traditional conversion: python3 epubzhconv.py input.epub output.epub zh-{cn,tw}


Simple Chinese Conversion Library

zhconv converts between Simplified and Traditional Chinese using maximum forward matching. The conversion table is based on MediaWiki and OpenCC. Supports regional vocabulary: zh-cn, zh-tw, zh-hk, zh-sg, zh-hans, zh-hant. Supports both Python 2 and 3.

Example:

>>> print(convert(u'我幹什麼不干你事。', 'zh-cn'))
我干什么不干你事。
>>> print(convert(u'人体内存在很多微生物', 'zh-tw'))
人體內存在很多微生物

If zh-hans or zh-hant is used, then regional vocabulary conversion will be disabled.

Documentation is available in Chinese.

The code is licensed under MIT, while the conversion table is licensed under GPLv2+.

zhconv's People

Contributors

guangyi-z, gumblex


zhconv's Issues

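Classical Chinese translation

Lao Wang, could you expose an interface for your Classical Chinese translation feature?

Command-line usage is unclear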

https://pypi.org/project/zhconv/1.4.0/#description

I spent a long time on this and could not work out how to enter the command.

python -mzhconv [-w] {zh-cn|zh-tw|zh-hk|zh-sg|zh-hans|zh-hant|zh} < input > output

I tried the following commands, and none of them worked:

python3 -mzhconv -w zh-cn ./tc/ ./out
python3 -mzhconv -zh-cn ./tc ./out
python3 -mzhconv zh-cn ./tc ./out
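The usage line above uses shell redirection (`< input > output`), so the command reads one text stream from standard input and writes the result to standard output; directory arguments such as ./tc and ./out do not fit that form. For a whole directory, one option is a small script that walks the tree and converts each file. The sketch below is a suggestion, not part of zhconv: `convert_tree` is a hypothetical helper that takes the conversion function as a parameter and assumes UTF-8 text files and an output directory outside the input directory.

```python
import os


def convert_tree(src_dir, dst_dir, convert_text, encoding='utf-8'):
    """Apply convert_text to every file under src_dir, mirroring the layout in dst_dir."""
    written = []
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        out_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(out_root, exist_ok=True)
        for name in sorted(files):
            with open(os.path.join(root, name), encoding=encoding) as fh:
                text = fh.read()
            target = os.path.join(out_root, name)
            with open(target, 'w', encoding=encoding) as fh:
                fh.write(convert_text(text))
            written.append(target)
    return written


# With zhconv installed, usage might look like this (not run here):
#   import zhconv
#   convert_tree('./tc', './out', lambda text: zhconv.convert(text, 'zh-cn'))
```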

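Request to make a pure-browser JavaScript version

I find this project very interesting and would like to build a pure-browser version, adding some paragraph-level heuristic templates. Does your open-source license allow me to make such a derivative version?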

pkg_resources is deprecated as an API

  C:\pv\tts\lib\site-packages\zhconv\zhconv.py:33: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import resource_stream

..\..\..\..\..\pv\tts\lib\site-packages\pkg_resources\__init__.py:2868
..\..\..\..\..\pv\tts\lib\site-packages\pkg_resources\__init__.py:2868
..\..\..\..\..\pv\tts\lib\site-packages\pkg_resources\__init__.py:2868
..\..\..\..\..\pv\tts\lib\site-packages\pkg_resources\__init__.py:2868
  C:\pv\tts\lib\site-packages\pkg_resources\__init__.py:2868: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)
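One common way to avoid the first warning is to load package data through `importlib.resources` instead of `pkg_resources.resource_stream`. The sketch below is an assumption about how that could look, not a patch to zhconv: `open_package_resource` is a hypothetical helper, it targets Python 3 only, and it is demonstrated on a standard-library package rather than on zhconv's own data file.

```python
import io
import pkgutil

try:
    from importlib.resources import files  # available on Python 3.9+
except ImportError:
    files = None


def open_package_resource(package, resource_name):
    """Return a binary file object for a file shipped inside a package."""
    if files is not None:
        return files(package).joinpath(resource_name).open('rb')
    # Older interpreters: pkgutil.get_data returns the bytes directly.
    data = pkgutil.get_data(package, resource_name)
    return io.BytesIO(data)


# Demonstration on a standard-library package.
with open_package_resource('json', '__init__.py') as stream:
    header = stream.read(16)
print(len(header))
```

The remaining `declare_namespace('google')` warnings come from another installed package's namespace declaration, so a change in zhconv would not remove them.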

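Copyright issue with the MediaWiki conversion dictionary

MediaWiki's code is under the GPL. Has its conversion dictionary been released under any other license? If not, an MIT-licensed project should not be able to use its dictionary.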

Raising a KeyError may be a better way to handle an unsupported locale

Returning the unconverted sentence may confuse users who are careless about the supported locales.
I also think raising an error early can prevent bigger errors later in user code.

zhconv/zhconv/zhconv.py

Lines 235 to 254 in 078838e

def convert(s, locale, update=None):
    """
    Main convert function.
    :param s: must be `unicode` (Python 2) or `str` (Python 3).
    :param locale: should be one of ``('zh-hans', 'zh-hant', 'zh-cn', 'zh-sg'
                                       'zh-tw', 'zh-hk', 'zh-my', 'zh-mo')``.
    :param update: a dict which updates the conversion table, eg.
        ``{'from1': 'to1', 'from2': 'to2'}``
    >>> print(convert('我幹什麼不干你事。', 'zh-cn'))
    我干什么不干你事。
    >>> print(convert('我幹什麼不干你事。', 'zh-cn', {'不干': '不幹'}))
    我干什么不幹你事。
    >>> print(convert('人体内存在很多微生物', 'zh-tw'))
    人體內存在很多微生物
    """
    if locale == 'zh' or locale not in Locales:
        # "no conversion"
        return s

Here is a mistake that may confuse users when passing the locale:

inputs = "我幹什麼不干你事"
print(convert(inputs, locale='zh_cn'))
# the output is the same as the input, with no warnings or errors

Many users like me may sometimes write zh_cn by mistake instead of the correct zh-cn.
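The proposal could be sketched as a thin validation layer in user code. Everything here is illustrative: `SUPPORTED_LOCALES` is a stand-in copied from the docstring above plus 'zh', and `check_locale` and `strict_convert` are hypothetical helpers, not part of zhconv.

```python
# Stand-in for the supported locales (from the docstring above, plus 'zh').
SUPPORTED_LOCALES = frozenset([
    'zh', 'zh-hans', 'zh-hant', 'zh-cn', 'zh-sg',
    'zh-tw', 'zh-hk', 'zh-my', 'zh-mo',
])


def check_locale(locale):
    """Return locale unchanged if supported, otherwise raise KeyError."""
    if locale in SUPPORTED_LOCALES:
        return locale
    # Offer a hint for the common underscore or upper-case spelling.
    hint = locale.lower().replace('_', '-')
    if hint in SUPPORTED_LOCALES:
        raise KeyError('unsupported locale %r; did you mean %r?' % (locale, hint))
    raise KeyError('unsupported locale %r' % (locale,))


def strict_convert(convert, text, locale):
    """Call convert(text, locale) only after the locale has been validated."""
    return convert(text, check_locale(locale))


try:
    check_locale('zh_cn')
except KeyError as exc:
    print(exc.args[0])
```

Passing the conversion function in as `convert` keeps the sketch independent of any installed package; with zhconv available, `strict_convert(zhconv.convert, text, 'zh-cn')` would be the intended call.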
