somiao-pinyin's Introduction

Somiao Pinyin: Train your own Chinese Input Method with Seq2seq Model

Personalized Chinese Pinyin Input Method with Seq2seq model

The original code comes from https://github.com/Kyubyong/neural_chinese_transliterator and was written for research purposes.

This repository experiments with different training data and interactive user input, and may develop towards a real Pinyin input product with personalized data and a locally run model.

Requirements

  • Python (>=3.5)

  • TensorFlow (>=r1.2)

  • xpinyin (for Chinese pinyin annotation)

  • distance (for calculating the similarity score between two strings)

  • tqdm
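
As a rough illustration of what a string similarity score can look like, the sketch below computes Levenshtein edit distance in plain Python. That the project's scripts use exactly this measure from the `distance` package is an assumption; the `levenshtein` function here is a stand-in written for illustration, and the reference sentence is supplied by hand against one of the sample outputs shown later in this README.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


# Hand-written reference versus the model output from the sample results.
expected = "美国轰炸我们在南斯拉夫的大使馆"
got = "美国轰炸我们在南斯拉夫的大事馆"

errors = levenshtein(expected, got)
print("character errors:", errors, "out of", len(expected))
```

Dividing the error count by the reference length gives a character error rate, which is one simple way to compare models trained on different corpora.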

Usage

Training:

  • STEP 1. Download Leipzig Chinese Corpus

    Extract it and copy zho_news_2007-2009_1M-sentences.txt to data/ folder.

    Or use your own Chinese Corpus with the same format.

  • STEP 2. Build a Pinyin-Chinese parallel corpus.

    python3 build_corpus.py

  • STEP 3. Run prepro.py to build the vocabulary and training data.

    python3 prepro.py

  • STEP 4. Adjust hyperparameters in hyperparams.py if necessary.

  • STEP 5. Train the model

    python3 train.py
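
To make the vocabulary step more concrete, here is a minimal sketch of how token-to-index tables might be built and applied. It is not the code in prepro.py. The names `build_vocab`, `encode`, and `min_count` are hypothetical, and reserving index 0 for padding is an assumption; index 1 for out-of-vocabulary tokens follows the `# 1: OOV` comment in the project's lookup code.

```python
from collections import Counter

PAD, OOV = 0, 1  # 1 as OOV follows the project's lookups; 0 as padding is an assumption


def build_vocab(sentences, min_count=1):
    """Map each token to an integer index, reserving the PAD and OOV slots."""
    counts = Counter(token for sent in sentences for token in sent)
    # Most frequent tokens first; ties are broken by code point for reproducibility.
    kept = sorted((t for t, c in counts.items() if c >= min_count),
                  key=lambda t: (-counts[t], t))
    token2idx = {"<PAD>": PAD, "<OOV>": OOV}
    for token in kept:
        token2idx[token] = len(token2idx)
    return token2idx


def encode(sentence, token2idx):
    """Convert a token sequence to indices, falling back to OOV for unseen tokens."""
    return [token2idx.get(token, OOV) for token in sentence]


# Toy corpus: each sentence is a list of Chinese characters.
hanzi_sents = [list("你好"), list("你好吗")]
hanzi2idx = build_vocab(hanzi_sents)

# "啊" never appears in the toy corpus, so it maps to the OOV index.
print(encode(list("你好啊"), hanzi2idx))
```

Raising `min_count` trims rare characters from the table, which is one way a corpus swap in STEP 1 can change the vocabulary size that the hyperparameters in STEP 4 need to accommodate.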

Inference with command line input:

For command line input testing, run:

python3 eval.py

To evaluate on the original test data instead, change which function eval.py calls as its main function.

Testing with pre-trained models:

Download the pre-trained model linked from the blog and unzip it; this creates the /log and /data directories.

Remember to overwrite the pickle files in /data with the ones shipped with the pre-trained model.

Then run the command-line input test:

python3 eval.py

Sample Results

The model was trained on Chinese news text from 2007-2009, so many expressions that are common today were never learned. In the transcript below, the prompt 请输入测试拼音 means "Enter pinyin to test".

请输入测试拼音:nihao
你好

请输入测试拼音:chenggongle
成功了

请输入测试拼音:wolegequ
我了个曲

请输入测试拼音:taibangla
太棒啦

请输入测试拼音:dacolehuizenmeyang
打破了会怎么样

请输入测试拼音:pujinghehujintaotongdianhua
普京和胡锦涛通电话

请输入测试拼音:xiangbuqilaishinianqianfashengleshenme
想不起来十年前发生了什么

请输入测试拼音:meiguohongzhawomenzainansilafudedashiguan
美国轰炸我们在南斯拉夫的大事馆

请输入测试拼音:liudehuanageshihouhaonianqing
刘德华那个时候好年轻

请输入测试拼音:shishihouxunlianyixiabilibilideyuliaole
是时候训练一下比例比例的预料了

TODOLIST

  • Pretrained models on different contexts

  • Model selection, so that different models are used for different kinds of input (chatting, writing scientific papers, etc.)

  • A function to record LOCALLY what the user has typed, for use as a personalized corpus

  • User Interface

  • ...

somiao-pinyin's People

Contributors

crownpku

somiao-pinyin's Issues

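Pre-trained model link is dead

Hello! Could you post the pre-trained model link again? The previous one has been taken down.

Question about the input variables

Hello! I would like to ask about this code: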
x = [pnyn2idx.get(pnyn, 1) for pnyn in pnyn_sent] # 1: OOV
y = [hanzi2idx.get(hanzi, 1) for hanzi in hanzi_sent] # 1: OOV

xs.append(np.array(x, np.int32).tostring())
ys.append(np.array(y, np.int32).tostring())
Why are x and y converted to strings here?
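
The thread has no reply, so the following is only a plausible explanation, not the author's confirmed reasoning. Serialising each index sequence to raw bytes turns sequences of different lengths into single byte-string items, which can be stored in a flat list (or, in TensorFlow 1.x, carried as a string tensor and decoded later, for example with `tf.decode_raw`). The sketch below shows the round trip with NumPy only. The helper names `encode` and `decode` and the sample index values are made up for illustration, and `tobytes()` is used because `tostring()` is its deprecated alias and is no longer available in NumPy 2.

```python
import numpy as np

# Hypothetical index sequences of different lengths (1 is the OOV index).
x_short = [5, 9, 1]
x_long = [7, 2, 2, 4, 8, 3]


def encode(indices):
    # tobytes() returns the same raw bytes that the deprecated tostring() did.
    return np.array(indices, np.int32).tobytes()


def decode(raw):
    # Reinterpret the raw bytes as int32 values again.
    return np.frombuffer(raw, dtype=np.int32).tolist()


xs = [encode(x_short), encode(x_long)]

# Each item is one bytes object, so the list is uniform even though
# the underlying sequences have different lengths.
print([len(item) for item in xs])  # 4 bytes per int32 value
print(decode(xs[0]), decode(xs[1]))
```

Despite the method name, nothing is converted to text: the result is the array's binary buffer, and decoding it with the same dtype recovers the original integers.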

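Error while training the model

I loaded the corpus, but training fails with the error 'Tensor' object has no attribute 'to_proto'. Could you share a link to your trained model? The network-drive link no longer works. Many thanks.

Why does training always end with "Killed"?

Have you run into this?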

totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-11-05 15:36:50.359665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-11-05 15:36:50.360769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 15:36:50.360781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1
2019-11-05 15:36:50.360786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y
2019-11-05 15:36:50.360789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N
2019-11-05 15:36:50.360856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-11-05 15:36:50.361110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
0%| | 0/16733 [00:00<?, ?b/s]2019-11-05 15:38:21.301871: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Killed

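Model training fails

Hello, I loaded the corpus, but the model does not train; it shows the error 'Tensor' object has no attribute 'to_proto'. Could you post a link to the trained model? The network-drive link has expired. Thank you.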