paddlepaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.

License: Apache License 2.0
I am trying to train DS2 on THCHS30, which is a Mandarin dataset. One phenomenon is that we easily encounter a cost explosion when batch_size is large, such as 64, 128 or 256. I try to clip the cost to 1000 when an inf appears. The clipping operation is very tricky: I catch the inf and clip it in paddle/v2/trainer.py as follows:
```python
cost_sum = out_args.sum()
import math
# clip the cost when it overflows to inf
if math.isinf(cost_sum):
    cost_sum = 1000.0
cost = cost_sum / len(data_batch)
```
We can train DS2 normally after adding the clipping operation. I have tried to train the model with the batch_shuffle_clipped provider and the instance_shuffle provider. The learning curves are as follows:

As we can see, instance_shuffle converges badly, and I abandoned this configuration after training for several iterations. The batch_shuffle_clipped configuration seems to converge very slowly once the training cost goes down to around 170. The key settings of the above experiments are:
setting | value |
---|---|
batch_size | 64 |
trainer_count | 4 |
num_conv_layers | 2 |
adam_learning_rate | 0.0005 |
I also conducted another experiment on a tiny dataset which contains only 128 samples (the first 128). The training data and validation data both use this tiny dataset. The training settings are the same as above. The learning curves are as follows:

From the figure above we can see that the convergence is very unstable and slow. There is an unreasonable gap between the training curve and the validation curve. Since the training data and validation data are the same, the difference between the two curves should be minor. I will figure out the reason for these anomalies.
Maybe the idea of external memory originates from NTM.
In this example, we would like to show how to add external memory to an NMT model. Please refer to this configuration file written in the old PaddlePaddle API.
The example must include the training and generating process.
Please directly use the wmt14 dataset.
Please put your code and docs into the mt_with_external_memory directory.
@lcy-seso
I don't understand how to add the prediction logic for this task.
In this example, we would like to show how to do an LTR (learning to rank) task in PaddlePaddle. Usually, there are three ways:

The example must also include how to predict. Note that in an LTR task, the training and testing networks are different.

I suggest using the MovieLens dataset.

Please put your code and docs into the ltr directory.
The goals of example configuration files are twofold:

1. Example configuration files are a kind of documentation to some extent; they show, through examples, how to use layers and pre-defined small networks in PaddlePaddle. They are expected to be more lightweight than PaddleBook.
2. They serve as a backup of configuration files for newly developed layers/modules. Whenever anyone adds a new layer to PaddlePaddle, or adds a new pre-defined network to the paddle Python package, he must have written a test configuration file. This test configuration file is also a good example to show other users how to use the new layer/network.
In this example, we would like to show how to use NCE in PaddlePaddle, which is useful when training a language model with a large vocabulary.
The example must include a training and a predicting process.
Please consider using the following data for training and adding these datasets into the paddle.dataset package.

Note that the dataset is also used in this task; please consider sharing the data processing code.

Please put your code and docs into the nce_cost directory.
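A minimal, hedged sketch of how the NCE cost might be wired up in the v2 API. It assumes paddle.layer.nce mirrors the v1 nce_layer interface (input, label, num_classes, num_neg_samples); the layer names and sizes below are illustrative only, not part of this task:

```python
import paddle.v2 as paddle

dict_size = 100000  # hypothetical vocabulary size
emb_size = 256

context = paddle.layer.data(
    name='context', type=paddle.data_type.integer_value(dict_size))
next_word = paddle.layer.data(
    name='next_word', type=paddle.data_type.integer_value(dict_size))

context_emb = paddle.layer.embedding(input=context, size=emb_size)
hidden = paddle.layer.fc(
    input=context_emb, size=emb_size, act=paddle.activation.Tanh())

# NCE cost: only a handful of negative samples are scored per example,
# instead of computing a full softmax over the whole vocabulary.
cost = paddle.layer.nce(
    input=hidden,
    label=next_word,
    num_classes=dict_size,
    num_neg_samples=25)
```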
In this example, we would like to show how to do an image classification task in PaddlePaddle. We need to provide several classic configurations: AlexNet, VGG, GoogLeNet, ResNet. It would be better to also provide GoogLeNet-v3/v4 and identity-mapping ResNet, but it doesn't matter if they are not provided in the first version.
Reference configuration:
DataSet:
If the training texts end with a white space, sentences generated by the decoder should end with a white space too. In that situation, we need post-processing logic to re-calculate the LM score after replacing the white space with the end token.
Add a data augmentation part for DeepSpeech2 (including noise_speech, impulse_response, etc.; a minimal sketch of speed perturbation follows this list):

- noise_speech, impulse_response, resampler, speed_perturb, online_bayesian_normalization
- convolve, add_noise, normalizer
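One of the simplest augmentations listed above, speed perturbation, can be approximated by linear-interpolation resampling. This is only an illustrative sketch, not the project's actual implementation:

```python
import numpy as np


def speed_perturb(samples, speed_rate):
    """Resample a 1-D array of audio samples to change its apparent speed.

    speed_rate > 1.0 speeds the audio up (fewer samples);
    speed_rate < 1.0 slows it down (more samples).
    """
    old_length = samples.shape[0]
    new_length = int(old_length / speed_rate)
    old_indices = np.arange(old_length)
    new_indices = np.linspace(0, old_length - 1, new_length)
    return np.interp(new_indices, old_indices, samples)


# Example: perturb the speed by a random factor in [0.9, 1.1].
rng = np.random.RandomState(0)
audio = rng.randn(16000).astype('float32')  # one second of fake 16 kHz audio
perturbed = speed_perturb(audio, rng.uniform(0.9, 1.1))
```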
The computation of probabilities in CTC beam search involves the addition and multiplication of very small numbers. To ensure numerical stability, many other implementations first convert the probabilities into log form and then carry out the operations.
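For instance, adding two probabilities that are stored in log form can be done stably with the log-sum-exp trick. A minimal sketch, illustrative only, not the decoder's actual code:

```python
import math


def log_sum_exp(log_a, log_b):
    # Stable computation of log(exp(log_a) + exp(log_b)):
    # factor out the larger term so the exponentiation never overflows.
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))


# Two tiny probabilities that would quickly underflow to zero
# if kept in the linear domain.
log_p1 = math.log(1e-300)
log_p2 = math.log(2e-300)
print(log_sum_exp(log_p1, log_p2))  # ~= log(3e-300), computed without underflow
```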
In the Deep Speech 2 project, we implement two versions of the beam search decoder, computing probabilities in the original and log forms respectively. Currently, we use the former, which is found to be slightly more efficient. But we also care about numerical stability, so we have an independent test that compares the two decoders with the CTC beam search decoder in TensorFlow.
Running test_ctc_beam_search_decoder.py, the outputs look like:
When the length of the input probability list is limited to several hundred, the two decoders get almost the same scores and decoding results. Hence we believe that numerical stability may not be a problem in the decoder right now, but we will keep an eye on it.
GAN is one of the old examples that haven't been merged into PaddlePaddle yet. Here is the original pull request: PaddlePaddle/book#30. We are going to merge it into PaddlePaddle first.
GAN will be a good example to help enhance PaddlePaddle's control flow over sub-graphs of the entire network during training.
In this example, we would like to show how to use scheduled sampling for a seq2seq task.
Please refer to this configuration file written in the old PaddlePaddle API.
The example must include the training and the generating process.
Please put your code and docs into the scheduled_sampling directory.
@lcy-seso huber_cost raises an error when used; related to issue #10.

## Example
```python
import paddle.v2 as paddle
import paddle.v2.dataset.uci_housing as uci_housing


def main():
    # init
    paddle.init(use_gpu=False, trainer_count=1)

    # network config
    x = paddle.layer.data(
        name='x', type=paddle.data_type.dense_vector(13))
    y_predict = paddle.layer.fc(
        input=x, size=1, act=paddle.activation.Linear())
    y = paddle.layer.data(
        name='y', type=paddle.data_type.dense_vector(1))
    cost = paddle.layer.huber_cost(input=y_predict, label=y)

    # create parameters
    parameters = paddle.parameters.create(cost)

    # create optimizer
    optimizer = paddle.optimizer.Momentum(momentum=0)

    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=optimizer)

    feeding = {'x': 0, 'y': 1}

    # event_handler to print training and testing info
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print "Pass %d, Batch %d, Cost %f" % (
                    event.pass_id, event.batch_id, event.cost)
        if isinstance(event, paddle.event.EndPass):
            result = trainer.test(
                reader=paddle.batch(uci_housing.test(), batch_size=2),
                feeding=feeding)
            print "Test %d, Cost %f" % (event.pass_id, result.cost)

    # training
    trainer.train(
        reader=paddle.batch(
            paddle.reader.shuffle(uci_housing.train(), buf_size=500),
            batch_size=2),
        feeding=feeding,
        event_handler=event_handler,
        num_passes=30)


if __name__ == '__main__':
    main()
```
python train.py
prepare vocab...
Segmentation fault
train.txt: 1.8 GB, Chinese UTF-8 Unicode text
cat /proc/meminfo |grep MemTotal
MemTotal: 4048124 kB
Currently, dependency installation always fails in CI, since some complex requirements lack a well-tested installation process. I will try to settle this problem with the following solutions:
In this example, we would like to show how to implement the three addressing mechanisms in NTM.
Please refer to this configuration file written in the old PaddlePaddle API.
The example must include the training and the generating process.
Please put your code and docs into the ntm_addressing_mechanism directory.
1. Add a data augmentor class for DeepSpeech2 (including speed_change, resample, online bayesian normalization).
2. NoiseAugmentor and ImpulseResponseAugmentor will be added later.
There are several new Python modules added in the Deep Speech project. These modules should be tested before merging. Since some scripts need a PaddlePaddle running environment, I extended .travis.yml to support Docker. There are several rules when writing unit test scripts:

- Test scripts are named test*.py.
The CTC beam search in DS2, or prefix beam search, consists of appending candidate characters to prefixes and repeatedly looking up the n-gram language model. Both processes are time-consuming and make parameter tuning and deployment difficult.

A proven effective way is to prune the beam search. Specifically, when extending a prefix, only the fewest most-probable characters whose cumulative probability is at least p need to be considered, instead of all the characters in the vocabulary. When p is set to 0.99, as recommended by the DS2 paper, about a 20x speedup is achieved for English transcription compared with no pruning, with very little loss in accuracy. And for Mandarin, the speedup ratio is reported to be up to 150x. A minimal sketch of the pruning step follows.
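Illustrative only (not the actual DS2 decoder code): given the probability distribution over the vocabulary at one time step, keep only the most probable characters whose cumulative probability reaches p.

```python
def prune_vocab(probs, p=0.99):
    """Return the smallest set of (char, prob) pairs, ordered from most to
    least probable, whose cumulative probability is at least p.

    probs: dict mapping each character to its probability at this time step.
    """
    pruned = []
    cumulative = 0.0
    for char, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True):
        pruned.append((char, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return pruned


# Example: only 'a' and 'b' are kept for prefix extension; 'c' and 'd' are pruned.
print(prune_vocab({'a': 0.9, 'b': 0.095, 'c': 0.004, 'd': 0.001}, p=0.99))
```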
Due to pruning, parameter tuning becomes more efficient. There are two important parameters in beam search, alpha and beta, associated with the language model and word insertion respectively. With a more acceptable decoding speed, alpha and beta can be searched more thoroughly. The relation between WER and the two parameters turns out to be:

With the optimal parameters alpha=0.26 and beta=0.1, as shown in the figure above, the beam search decoder has currently decreased the WER to 13%, compared with 22% for best-path decoding.
How can I find the models code & documentation? I cannot find them in the PaddlePaddle documentation.
In this example, we would like to show how to do a regression task in PaddlePaddle.

Please refer to this configuration file written in the old PaddlePaddle API.

Note that currently PaddlePaddle provides two regression costs: MSE and Huber loss. Please consider adding an example for Huber loss.

Please put your code and docs into the regression directory.
We are now working on a CTR model trained by PaddlePaddle.
The model structure mainly follows the paper Wide & Deep Learning for Recommender Systems.
Add Deep Structured Semantic Model (DSSM), to learn the relevance between two sentences.
After merging PR #74, we have seen the following abnormal learning curve:

The figure plots the training cost. Notice that in the tail of the curve there are many spikes, located exactly at the first batch of each epoch.
Besides, it is not easy to reproduce the phenomenon in a small dataset.
Not supported
Refactor the whole data preprocessor for DeepSpeech2 (e.g. re-design classes, re-organize directories, add augmentation interfaces, etc.):

- AudioSegment, SpeechSegment, TextFeaturizer, AudioFeaturizer, SpeechFeaturizer, etc.
- AugmentorBase, AugmentationPipeline, VolumePerturbAugmentor, etc., to make it easier to add more data augmentation models (a hypothetical interface sketch follows this list).
- DataGenerator. Add FeatureNormalizer.
- Add compute_mean_std.py for users to create the mean_std file before training.
- Split the data directory into datasets and data_utils.
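A hypothetical sketch of what such an augmentor interface might look like. The class names come from the list above, but the method names and signatures are assumptions, not the project's actual API:

```python
import random


class AugmentorBase(object):
    """Base class: every augmentor transforms an audio segment in place."""

    def transform_audio(self, audio_segment):
        raise NotImplementedError


class VolumePerturbAugmentor(AugmentorBase):
    """Randomly scales the gain within [min_gain_db, max_gain_db]."""

    def __init__(self, rng, min_gain_db=-10.0, max_gain_db=10.0):
        self._rng = rng
        self._min_gain_db = min_gain_db
        self._max_gain_db = max_gain_db

    def transform_audio(self, audio_segment):
        gain = self._rng.uniform(self._min_gain_db, self._max_gain_db)
        audio_segment.apply_gain(gain)  # assumed AudioSegment method


class AugmentationPipeline(object):
    """Applies each augmentor to a sample with its configured probability."""

    def __init__(self, augmentors_and_probs, rng=None):
        self._augmentors_and_probs = augmentors_and_probs
        self._rng = rng or random.Random()

    def transform_audio(self, audio_segment):
        for augmentor, prob in self._augmentors_and_probs:
            if self._rng.uniform(0.0, 1.0) < prob:
                augmentor.transform_audio(audio_segment)
```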
We are planning to build Deep Speech 2 (DS2) [1], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:
Intensive system optimization and low-latency inference library (details in [1]) are not yet covered in this first-stage plan.
We roughly break down the project into 14 tasks:
- DenseScanner in dataprovider_converter.py, etc.
- Extend ctc_error_evaluator (CER) to support WER.

Tasks parallelizable within phases:
Roadmap | Description | Parallelizable Tasks |
---|---|---|
Phase I | Basic model & components | Task 1 ~ Task 8 |
Phase II | Standard model & benchmarking & profiling | Task 9 ~ Task 12 |
Phase III | Documentations | Task 13 ~ Task 14 |
An issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcome!
Improve the audio featurizer by adding e.g. resampling, db_normalization, and random shift, as suggested in the speech_dl code.

Add multi-threading support for the DS2 data generator, to accelerate training.
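A minimal sketch of one possible approach (hypothetical, not the project's actual implementation): a background thread that pre-fetches batches from a reader into a bounded queue while the trainer consumes them.

```python
import threading
from Queue import Queue  # Python 2, matching the rest of this project; `queue` in Python 3


def threaded_reader(reader, capacity=32):
    """Wrap a batch reader so batches are produced in a background thread."""

    def wrapped():
        q = Queue(maxsize=capacity)
        end_token = object()  # sentinel marking the end of one pass

        def producer():
            for batch in reader():
                q.put(batch)   # blocks when the queue is full
            q.put(end_token)

        t = threading.Thread(target=producer)
        t.daemon = True
        t.start()

        while True:
            batch = q.get()
            if batch is end_token:
                break
            yield batch

    return wrapped
```

It could then wrap an existing batch reader, e.g. trainer.train(reader=threaded_reader(paddle.batch(train_reader, batch_size=32)), ...); truly parallel feature extraction would need multiple worker threads or processes.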
In this example, we would like to show how to use hsigmoid to train word embeddings for a large vocabulary.

Please refer to this example provided by @reyoung. Note that this example is not exactly for word embedding training; you may want to study it first and then modify it.

The example must include the training and the predicting process.

Please put your code and docs into the word_embedding directory.

In the future, we may add more configurations for models with a large vocabulary.
Part of the model code:
```python
firstword = paddle.layer.data(
    name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
    name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
    name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
    name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
    name="fifthw", type=paddle.data_type.integer_value(dict_size))

Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)

contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
hidden1 = paddle.layer.fc(
    input=contextemb,
    size=hiddensize,
    act=paddle.activation.Sigmoid(),
    layer_attr=paddle.attr.Extra(drop_rate=0.5),
    bias_attr=paddle.attr.Param(learning_rate=1),
    param_attr=paddle.attr.Param(
        initial_std=1. / math.sqrt(embsize * 8),
        learning_rate=1))
predictword = paddle.layer.fc(
    input=hidden1,
    size=dict_size,
    bias_attr=paddle.attr.Param(learning_rate=1),
    act=paddle.activation.Softmax())
```
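Since this issue is about hsigmoid, the Softmax output layer above is the part that would change. A minimal, hedged sketch, assuming paddle.layer.hsigmoid takes input, label and num_classes as the v1 hsigmoid layer does; check the current API before use:

```python
# Hypothetical replacement for the Softmax output layer plus classification
# cost: a hierarchical sigmoid cost over the vocabulary, which reduces the
# per-example cost from O(|V|) to O(log |V|).
cost = paddle.layer.hsigmoid(
    input=hidden1,          # hidden representation of the 4-word context
    label=nextword,         # the next word to predict
    num_classes=dict_size)  # vocabulary size
```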
Part of the inference code:
```python
import gzip

import paddle.v2 as paddle

embsize = 32
hiddensize = 256
N = 5


def main():
    paddle.init(use_gpu=False, trainer_count=3)
    word_dict = paddle.dataset.imikolov.build_dict()
    dict_size = len(word_dict)
    # nnlm() builds the word-prediction network (not shown here in full)
    _, prediction, _ = nnlm(hiddensize, embsize, dict_size)

    # load the trained parameters
    # with gzip.open("model_params.tar.gz", 'r') as f:
    #     parameters.from_tar(f)
    parameters = paddle.parameters.Parameters.from_tar(
        gzip.open("model_params.tar.gz", 'r'))

    infer_data = []
    infer_label_data = []
    cnt = 0
    for item in paddle.dataset.imikolov.test(word_dict, N)():
        infer_data.append(item[:4])
        infer_label_data.append(item[4])
        cnt += 1
        if cnt == 100:
            break

    predictions = paddle.infer(
        output_layer=prediction,
        parameters=parameters,
        input=infer_data)

    for i, prob in enumerate(predictions):
        print prob, infer_label_data[i]


if __name__ == '__main__':
    main()
```
The first 10 elements of infer_data (each element is item[:4]):
[(2, 1063, 95, 353), (1063, 95, 353, 5), (95, 353, 5, 335), (353, 5, 335, 51), (5, 335, 51, 2072), (335, 51, 2072, 6), (51, 2072, 6, 319), (2072, 6, 319, 2072), (6, 319, 20, 5), (319, 2072, 5, 0)]
In this example, we would like to show how to use nested sequence, which is one of the most amazing things in PaddlePaddle.
We would like to show how to use the nested sequence in the following task:
About dataset:
Please put your code and docs into the nested_sequence directory.
caffe2paddle.py, which can be used to convert most Caffe models.

In this example, we would like to show how to train:
Besides, we would like to show how to generate sequences from:
In the generation process, please consider the following two situations:
Please consider using the following data for training and adding these datasets into the paddle.dataset package.

Please put your code and docs into the language_model directory.