
text-cnn's Introduction

Text classification with CNN and Word2vec

This project builds on gaussic's "text-classification-cnn-rnn": using the same dataset, it runs a word-level CNN text-classification experiment (gaussic's version is character-level);

Version 2 updates: 1. multiple convolution kernel sizes; 2. regularization; 3. words restricted to Chinese or English only, dropping digits, symbols, and similar tokens; 4. words of length 1 removed;

Results improved over version 1: validation accuracy rose from 96.5% to 97.1%, and test accuracy from 96.7% to 97.2%.

The main goal of this experiment is to examine how embedding Word2vec-trained word vectors into a CNN affects the model. The resulting model reaches 97.1% on the validation set, versus 94.12% for gaussic's;

For more details, see gaussic's blog post: text-classification-cnn-rnn

1 Environment

python3
tensorflow >= 1.3 (CPU)
gensim
jieba
scipy
numpy
scikit-learn

2 CNN (convolutional neural network)

The CNN hyperparameters are configured in text_model.py:

(figure: CNN configuration parameters)
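As a rough, hypothetical sketch of what such a configuration class looks like (every name and number below is an assumption drawn from this README and the issues, not a copy of text_model.py):

    class TextConfig(object):
        """Hypothetical hyperparameters; the real values are in text_model.py."""
        seq_length = 600          # maximum words per document (assumed)
        num_classes = 10          # the 10 news categories
        vocab_size = 6000         # vocabulary size (an issue below mentions ~6,000 words)
        embedding_dim = 128       # Word2vec dimensionality (an issue below mentions 128)
        filter_sizes = [2, 3, 4]  # multiple kernel sizes, a version-2 change (sizes assumed)
        num_filters = 128         # feature maps per kernel size (assumed)
        keep_prob = 0.5           # dropout keep probability (assumed)
        lr = 1e-3                 # learning rate (assumed)
        batch_size = 64           # (assumed)
        num_epochs = 10           # training stopped early after 6 epochs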

The overall structure of the CNN model:

(figure: CNN model architecture)
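As a sketch only (TensorFlow 1.x idioms; names and shapes are assumptions consistent with the config sketch above, not the repository's actual graph):

    import tensorflow as tf

    # Word ids in, logits out: embedding -> parallel conv/max-pool branches -> dense
    input_x = tf.placeholder(tf.int32, [None, 600], name='input_x')
    embedding = tf.get_variable('embedding', shape=[6000, 128])  # can be initialized from Word2vec
    embedded = tf.nn.embedding_lookup(embedding, input_x)        # [batch, 600, 128]
    embedded = tf.expand_dims(embedded, -1)                      # [batch, 600, 128, 1]

    pooled = []
    for k in [2, 3, 4]:  # the different kernel sizes added in version 2
        conv = tf.layers.conv2d(embedded, filters=128, kernel_size=[k, 128],
                                activation=tf.nn.relu)           # [batch, 600-k+1, 1, 128]
        pooled.append(tf.reshape(tf.reduce_max(conv, axis=1), [-1, 128]))  # max over time

    features = tf.concat(pooled, axis=1)                         # [batch, 3*128]
    logits = tf.layers.dense(features, 10)                       # one score per category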

3 Dataset

This experiment likewise trains and tests on a subset of THUCNews. Please download the dataset yourself from THUCTC (an efficient Chinese text-classification toolkit) and observe the data provider's open-source license;

The task covers 10 categories: categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐'] (sports, finance, real estate, home, education, technology, fashion, politics, games, entertainment), with 6,500 samples per category:

cnews.train.txt: training set (5000 × 10)

cnews.val.txt: validation set (500 × 10)

cnews.test.txt: test set (1000 × 10)

The training data and the pre-trained word vectors can be downloaded here: https://pan.baidu.com/s/1DOgxlY42roBpOKAMKPPKWA (password: up9d)
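For orientation, each line of the cnews files pairs a category label with the document text. Assuming the usual tab-separated cnews layout (label, then content; an assumption, not stated above), the per-category counts can be sanity-checked like this:

    from collections import Counter

    # Count samples per category, assuming each line is "label\tcontent"
    counts = Counter()
    with open('cnews.train.txt', encoding='utf-8') as f:
        for line in f:
            label, _, _ = line.partition('\t')
            counts[label] += 1
    print(counts)  # expect 5000 per category over the 10 categories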

4 Preprocessing

The main preprocessing step is segmenting the training text into words: segmentation is needed both to train the word vectors and because the model's input is a sequence of word vectors;

In addition, only Chinese or English words of length greater than 1 are kept;

All of the preprocessing code is in loader.py;
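As a minimal sketch of the rules just described (jieba segmentation, keep only purely Chinese or English tokens, drop length-1 tokens; the real implementation in loader.py may differ in detail):

    import re
    import jieba

    # Accept tokens made solely of Chinese characters or ASCII letters
    WORD_RE = re.compile(r'^[\u4e00-\u9fa5a-zA-Z]+$')

    def segment(text):
        """Segment text and apply the Chinese/English-only, length>1 filter."""
        return [w for w in jieba.cut(text) if len(w) > 1 and WORD_RE.match(w)]

    print(segment('体育新闻:精彩的比赛2018'))  # roughly ['体育', '新闻', '精彩', '比赛']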

5 How to run

python train_word2vec.py: segments the training data and trains the Word2vec word vectors (vector_word.txt); a sketch of this step follows the list below

python text_train.py: trains the model

python text_test.py: evaluates the model on the test set

python text_predict.py: runs predictions with the trained model
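For the first step, a minimal gensim sketch of what train_word2vec.py does (gensim < 4.0 API; the corpus filename and most settings here are assumptions, the 128-dimension figure comes from an issue below):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Train on a file of pre-segmented, space-separated sentences (hypothetical name)
    sentences = LineSentence('cnews_train_segmented.txt')
    model = Word2Vec(sentences, size=128, window=5, min_count=5, workers=4)

    # Save in text format as vector_word.txt, the file the other scripts load
    model.wv.save_word2vec_format('vector_word.txt', binary=False)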

6 Training results

Run: python text_train.py

Training ran for 6 epochs before the early-stopping condition was met; the best validation accuracy, 97.1%, was reached at global_step=2000.

(figure: training output)

7 Test results

Run: python text_test.py

On the test set, test_loss=0.1 and test_accuracy=97.23%; the '体育' (sports) category is classified with 100% accuracy, and overall precision = recall = F1 = 97%.

(figure: per-class test metrics)
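Per-class precision/recall/F1 figures like those above are what scikit-learn's classification_report produces; a self-contained sketch of the call (with dummy labels standing in for the model's real test-set predictions):

    from sklearn import metrics

    # In text_test.py, y_true/y_pred would come from running the model on cnews.test.txt;
    # tiny dummy arrays here just to show the call.
    y_true = [0, 1, 2, 2, 0]
    y_pred = [0, 1, 2, 0, 0]
    print(metrics.classification_report(y_true, y_pred))
    print(metrics.confusion_matrix(y_true, y_pred))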

8 Prediction results

Run: python text_predict.py

Five samples were drawn at random from the test data; the script prints each original text together with its true label and the predicted label. All five predictions in the figure below are correct;

(figure: prediction examples)

9 References

  1. Convolutional Neural Networks for Sentence Classification
  2. gaussic/text-classification-cnn-rnn
  3. YCG09/tf-text-classification



text-cnn's Issues

2-D convolution

May I ask why the author uses 2-D convolution for text? Isn't text normally processed with 1-D convolution?
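For context: when the 2-D kernel spans the full embedding width, the convolution only slides along the time axis, so the two formulations compute the same thing. An illustrative sketch (TF 1.x; not the repository's actual code):

    import tensorflow as tf

    x = tf.random_normal([8, 600, 128])                  # [batch, seq_len, embed_dim]
    c1 = tf.layers.conv1d(x, filters=64, kernel_size=3)  # 1-D convolution over time
    c2 = tf.layers.conv2d(tf.expand_dims(x, -1), filters=64,
                          kernel_size=[3, 128])          # 2-D kernel covering the embedding
    # c1: [8, 598, 64]; c2: [8, 598, 1, 64] -- squeezing axis 2 gives c1's shape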

Problems running train_word2vec.py and text_train.py

Hello, I ran gaussic's version a while ago and have since been looking for a model that embeds Word2vec-trained word vectors into a CNN, so I was glad to find your project.
However, running train_word2vec.py on the THUCNews text gives:
RuntimeError: you must first build vocabulary before training the model
After downloading your pre-trained vector_word.txt, running text_train.py gives:
ValueError: zero-size array to reduction operation maximum which has no identity
How can I resolve these problems?

Accuracy drops after removing stopwords

Can stopwords be removed when this model builds the vocabulary? After training with the Baidu and HIT stopword lists, my test-set accuracy went down. What could be the reason?

Error when running text_train.py

Hello, thank you very much for sharing~
After switching to my own dataset and re-training the word vectors with train_word2vec.py, text_train.py fails with:
ValueError: Too many elements provided. Needed at most 512000, but received 800000
I then set vocab_size in text_model.py and build_vocab(filenames, vocab_dir, vocab_size=5000) in loader.py both to 5000, re-trained the word vectors, and ran text_train.py again; this time it fails with:
ValueError: Too many elements provided. Needed at most 320000, but received 500000
How can I fix this?

Dataset counts on the official site don't match the description

Hi, the readme says there are 6,500 samples per category, but the dataset I downloaded from the official site has far more per category, on the order of 90k, 50k, 130k, and so on. Is this because the official dataset keeps growing, or did you select only 6,500 samples per category for the experiment? (I'm a beginner.)

Error when running text_test.py

Running text_test.py produces the following error output:

Configuring CNN model...
Loading test data...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.712 seconds.
Prefix dict has been built succesfully.
2018-10-16 01:15:17.947491: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-10-16 01:15:17.947539: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-10-16 01:15:17.947550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Testing...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Terminated (core dumped)

Error running text_test.py under pytest

============================= test session starts =============================
platform win32 -- Python 3.6.2, pytest-4.4.0, py-1.8.0, pluggy-0.9.0
rootdir: D:\lc_bs\text-cnn-master
collected 1 item

text_test.py FLoading test data...

text_test.py:38 (test)
def test():
print("Loading test data...")
t1=time.time()

  x_test,y_test=process_file(config.test_filename,word_to_id,cat_to_id,config.seq_length)

E NameError: name 'config' is not defined

text_test.py:42: NameError
[100%]

TCNN

In fact, when training a Text-CNN the convolution layer can use several kernels of different sizes, concatenate the results into one intermediate layer, and then feed one or more dense layers. You could try something like this:

    from keras.layers import Input, Embedding, Convolution1D, MaxPooling1D, Flatten, Dense, concatenate
    from keras.models import Model

    # Hypothetical input and embedding so the snippet is self-contained
    inputs = Input(shape=(600,), dtype='int32')
    embedding = Embedding(input_dim=6000, output_dim=128)(inputs)

    # First branch: convolution + pooling + flatten (kernel size 4)
    conv_4_layer = Convolution1D(200, 4, activation='tanh')(embedding)
    max_pool_4_layer = MaxPooling1D(4)(conv_4_layer)
    flat_4_layer = Flatten()(max_pool_4_layer)

    # Second branch (kernel size 5)
    conv_5_layer = Convolution1D(200, 5, activation='tanh')(embedding)
    max_pool_5_layer = MaxPooling1D(5)(conv_5_layer)
    flat_5_layer = Flatten()(max_pool_5_layer)

    # Third branch (kernel size 6)
    conv_6_layer = Convolution1D(200, 6, activation='tanh')(embedding)
    max_pool_6_layer = MaxPooling1D(6)(conv_6_layer)
    flat_6_layer = Flatten()(max_pool_6_layer)

    # Concatenate the three branches into one feature vector
    CNNs = concatenate([flat_4_layer, flat_5_layer, flat_6_layer])

    # ...then feed dense layers, e.g.:
    outputs = Dense(10, activation='softmax')(Dense(128, activation='relu')(CNNs))
    model = Model(inputs=inputs, outputs=outputs)

Why build a separate vocabulary after generating the word vectors? It seems words get lost

Word2vec training produces about 200,000 word vectors of dimension 128, but afterwards you build a vocabulary of only 6,000 words and generate the training data from those 6,000. Doesn't that lose information from the original sentences? For example, if a word like '工商' is important in the word vectors and the raw data but is missing from the vocabulary, it will never appear in the model input, which should hurt training. Could you explain? I'm quite puzzled.

Problem when running text_test.py

The following error occurs together with an OOM:
ResourceExhaustedError: OOM when allocating tensor with shape[10000,256,1,596] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bf
Please take a look.


Are the trained Word2Vec vectors actually used during training?

In text_train.py:

x_train, y_train = process_file(config.train_filename, word_to_id, cat_to_id, config.seq_length)
x_val, y_val = process_file(config.val_filename, word_to_id, cat_to_id, config.seq_length)

The training data is built with process_file; the training inputs x use word_to_id (obtained from read_vocab()), i.e. position indices into the generated vocabulary.

So how are the pre-trained Word2Vec vectors used?
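A common way to wire pre-trained vectors in (not necessarily what this repository does; loader.py may already provide an equivalent helper) is to build an embedding matrix aligned with word_to_id and use it to initialize the embedding variable:

    import numpy as np
    from gensim.models import KeyedVectors

    word_to_id = {'<PAD>': 0, '中国': 1, '比赛': 2}   # in practice from loader.read_vocab()

    # Align a [vocab_size, dim] matrix with word_to_id; out-of-vocabulary rows stay zero
    wv = KeyedVectors.load_word2vec_format('vector_word.txt', binary=False)
    embedding_matrix = np.zeros((len(word_to_id), wv.vector_size), dtype=np.float32)
    for word, idx in word_to_id.items():
        if word in wv:
            embedding_matrix[idx] = wv[word]

    # In the graph: tf.get_variable('embedding', initializer=embedding_matrix)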

After switching to a training set with only three labels, text_predict fails to load the model

After switching the training set, I changed num_classes in text_model.py to 3 and updated the labels in loader.py.
Training completes, but loading the model in text_predict fails at
saver.restore(sess=session, save_path=save_path).
I have checked that save_path is correct...
Why could this be? I can't figure it out QAQ
