
text-cnn's Introduction

Text classification with CNN and Word2vec

This project builds on gaussic's "text-classification-cnn-rnn": using the same dataset, it runs a word-level CNN text-classification experiment (gaussic's version is character-level);

Version 2 updates: 1. multiple convolution kernel sizes; 2. regularization; 3. words restricted to Chinese or English only, dropping digits, symbols, and similar tokens; 4. words of length 1 removed;

Results improved over version 1: validation accuracy rose from 96.5% to 97.1%, and test accuracy from 96.7% to 97.2%.

The main goal of this experiment is to examine how embedding Word2vec-trained word vectors into a CNN affects the model. The resulting model reaches 97.1% on the validation set, versus 94.12% for gaussic's;

For more details, see gaussic's blog post: text-classification-cnn-rnn

1 Environment

python3
tensorflow >= 1.3 (CPU)
gensim
jieba
scipy
numpy
scikit-learn

2 CNN (convolutional neural network)

The CNN hyperparameters are configured in text_model.py:

(figure: CNN configuration parameters)
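As a rough, hypothetical sketch of what such a configuration class looks like (every name and number below is an assumption drawn from this README and the issues, not a copy of text_model.py):

    class TextConfig(object):
        """Hypothetical hyperparameters; the real values are in text_model.py."""
        seq_length = 600          # maximum words per document (assumed)
        num_classes = 10          # the 10 news categories
        vocab_size = 6000         # vocabulary size (an issue below mentions ~6,000 words)
        embedding_dim = 128       # Word2vec dimensionality (an issue below mentions 128)
        filter_sizes = [2, 3, 4]  # multiple kernel sizes, a version-2 change (sizes assumed)
        num_filters = 128         # feature maps per kernel size (assumed)
        keep_prob = 0.5           # dropout keep probability (assumed)
        lr = 1e-3                 # learning rate (assumed)
        batch_size = 64           # (assumed)
        num_epochs = 10           # training stopped early after 6 epochs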

The overall structure of the CNN model:

(figure: CNN model architecture)
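As a sketch only (TensorFlow 1.x idioms; names and shapes are assumptions consistent with the config sketch above, not the repository's actual graph):

    import tensorflow as tf

    # Word ids in, logits out: embedding -> parallel conv/max-pool branches -> dense
    input_x = tf.placeholder(tf.int32, [None, 600], name='input_x')
    embedding = tf.get_variable('embedding', shape=[6000, 128])  # can be initialized from Word2vec
    embedded = tf.nn.embedding_lookup(embedding, input_x)        # [batch, 600, 128]
    embedded = tf.expand_dims(embedded, -1)                      # [batch, 600, 128, 1]

    pooled = []
    for k in [2, 3, 4]:  # the different kernel sizes added in version 2
        conv = tf.layers.conv2d(embedded, filters=128, kernel_size=[k, 128],
                                activation=tf.nn.relu)           # [batch, 600-k+1, 1, 128]
        pooled.append(tf.reshape(tf.reduce_max(conv, axis=1), [-1, 128]))  # max over time

    features = tf.concat(pooled, axis=1)                         # [batch, 3*128]
    logits = tf.layers.dense(features, 10)                       # one score per category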

3 Dataset

This experiment likewise trains and tests on a subset of THUCNews. Please download the dataset yourself from THUCTC (an efficient Chinese text-classification toolkit) and observe the data provider's open-source license;

The task covers 10 categories: categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐'] (sports, finance, real estate, home, education, technology, fashion, politics, games, entertainment), with 6,500 samples per category:

cnews.train.txt: training set (5000 × 10)

cnews.val.txt: validation set (500 × 10)

cnews.test.txt: test set (1000 × 10)

The training data and the pre-trained word vectors can be downloaded here: https://pan.baidu.com/s/1DOgxlY42roBpOKAMKPPKWA (password: up9d)
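For orientation, each line of the cnews files pairs a category label with the document text. Assuming the usual tab-separated cnews layout (label, then content; an assumption, not stated above), the per-category counts can be sanity-checked like this:

    from collections import Counter

    # Count samples per category, assuming each line is "label\tcontent"
    counts = Counter()
    with open('cnews.train.txt', encoding='utf-8') as f:
        for line in f:
            label, _, _ = line.partition('\t')
            counts[label] += 1
    print(counts)  # expect 5000 per category over the 10 categories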

4 Preprocessing

The main preprocessing step is segmenting the training text into words: segmentation is needed both to train the word vectors and because the model's input is a sequence of word vectors;

In addition, only Chinese or English words of length greater than 1 are kept;

All of the preprocessing code is in loader.py;
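As a minimal sketch of the rules just described (jieba segmentation, keep only purely Chinese or English tokens, drop length-1 tokens; the real implementation in loader.py may differ in detail):

    import re
    import jieba

    # Accept tokens made solely of Chinese characters or ASCII letters
    WORD_RE = re.compile(r'^[\u4e00-\u9fa5a-zA-Z]+$')

    def segment(text):
        """Segment text and apply the Chinese/English-only, length>1 filter."""
        return [w for w in jieba.cut(text) if len(w) > 1 and WORD_RE.match(w)]

    print(segment('体育新闻:精彩的比赛2018'))  # roughly ['体育', '新闻', '精彩', '比赛']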

5 How to run

python train_word2vec.py: segments the training data and trains the Word2vec word vectors (vector_word.txt); a sketch of this step follows the list below

python text_train.py: trains the model

python text_test.py: evaluates the model on the test set

python text_predict.py: runs predictions with the trained model
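For the first step, a minimal gensim sketch of what train_word2vec.py does (gensim < 4.0 API; the corpus filename and most settings here are assumptions, the 128-dimension figure comes from an issue below):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Train on a file of pre-segmented, space-separated sentences (hypothetical name)
    sentences = LineSentence('cnews_train_segmented.txt')
    model = Word2Vec(sentences, size=128, window=5, min_count=5, workers=4)

    # Save in text format as vector_word.txt, the file the other scripts load
    model.wv.save_word2vec_format('vector_word.txt', binary=False)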

6 Training results

Run: python text_train.py

Training ran for 6 epochs before the early-stopping condition was met; the best validation accuracy, 97.1%, was reached at global_step=2000.

(figure: training output)

7 Test results

Run: python text_test.py

On the test set, test_loss=0.1 and test_accuracy=97.23%; the '体育' (sports) category is classified with 100% accuracy, and overall precision = recall = F1 = 97%.

(figure: per-class test metrics)
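Per-class precision/recall/F1 figures like those above are what scikit-learn's classification_report produces; a self-contained sketch of the call (with dummy labels standing in for the model's real test-set predictions):

    from sklearn import metrics

    # In text_test.py, y_true/y_pred would come from running the model on cnews.test.txt;
    # tiny dummy arrays here just to show the call.
    y_true = [0, 1, 2, 2, 0]
    y_pred = [0, 1, 2, 0, 0]
    print(metrics.classification_report(y_true, y_pred))
    print(metrics.confusion_matrix(y_true, y_pred))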

8 Prediction results

Run: python text_predict.py

Five samples were drawn at random from the test data; the script prints each original text together with its true label and the predicted label. All five predictions in the figure below are correct;

(figure: prediction examples)

9 References

  1. Convolutional Neural Networks for Sentence Classification
  2. gaussic/text-classification-cnn-rnn
  3. YCG09/tf-text-classification



text-cnn's Issues

2-D convolution

May I ask why the author uses 2-D convolution for text? Isn't text normally processed with 1-D convolution?
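For context: when the 2-D kernel spans the full embedding width, the convolution only slides along the time axis, so the two formulations compute the same thing. An illustrative sketch (TF 1.x; not the repository's actual code):

    import tensorflow as tf

    x = tf.random_normal([8, 600, 128])                  # [batch, seq_len, embed_dim]
    c1 = tf.layers.conv1d(x, filters=64, kernel_size=3)  # 1-D convolution over time
    c2 = tf.layers.conv2d(tf.expand_dims(x, -1), filters=64,
                          kernel_size=[3, 128])          # 2-D kernel covering the embedding
    # c1: [8, 598, 64]; c2: [8, 598, 1, 64] -- squeezing axis 2 gives c1's shape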

Problems running train_word2vec.py and text_train.py

Hello, I ran gaussic's version a while ago and have since been looking for a model that embeds Word2vec-trained word vectors into a CNN, so I was glad to find your project.
However, running train_word2vec.py on the THUCNews text gives:
RuntimeError: you must first build vocabulary before training the model
After downloading your pre-trained vector_word.txt, running text_train.py gives:
ValueError: zero-size array to reduction operation maximum which has no identity
How can I resolve these problems?

Accuracy drops after removing stopwords

Can stopwords be removed when this model builds the vocabulary? After training with the Baidu and HIT stopword lists, my test-set accuracy went down. What could be the reason?

Error when running text_train.py

Hello, thank you very much for sharing~
After switching to my own dataset and re-training the word vectors with train_word2vec.py, text_train.py fails with:
ValueError: Too many elements provided. Needed at most 512000, but received 800000
I then set vocab_size in text_model.py and build_vocab(filenames, vocab_dir, vocab_size=5000) in loader.py both to 5000, re-trained the word vectors, and ran text_train.py again; this time it fails with:
ValueError: Too many elements provided. Needed at most 320000, but received 500000
How can I fix this?

Dataset counts on the official site don't match the description

Hi, the readme says there are 6,500 samples per category, but the dataset I downloaded from the official site has far more per category, on the order of 90k, 50k, 130k, and so on. Is this because the official dataset keeps growing, or did you select only 6,500 samples per category for the experiment? (I'm a beginner.)

Error when running text_test.py

Running text_test.py produces the following error output:

Configuring CNN model...
Loading test data...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.712 seconds.
Prefix dict has been built succesfully.
2018-10-16 01:15:17.947491: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-10-16 01:15:17.947539: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-10-16 01:15:17.947550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Testing...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Terminated (core dumped)

Error running text_test.py under pytest

============================= test session starts =============================
platform win32 -- Python 3.6.2, pytest-4.4.0, py-1.8.0, pluggy-0.9.0
rootdir: D:\lc_bs\text-cnn-master
collected 1 item

text_test.py FLoading test data...

text_test.py:38 (test)
def test():
print("Loading test data...")
t1=time.time()

  x_test,y_test=process_file(config.test_filename,word_to_id,cat_to_id,config.seq_length)

E NameError: name 'config' is not defined

text_test.py:42: NameError
[100%]

TCNN

In fact, when training a Text-CNN the convolution layer can use several kernels of different sizes, concatenate the results into one intermediate layer, and then feed one or more dense layers. You could try something like this:

    from keras.layers import Input, Embedding, Convolution1D, MaxPooling1D, Flatten, Dense, concatenate
    from keras.models import Model

    # Hypothetical input and embedding so the snippet is self-contained
    inputs = Input(shape=(600,), dtype='int32')
    embedding = Embedding(input_dim=6000, output_dim=128)(inputs)

    # First branch: convolution + pooling + flatten (kernel size 4)
    conv_4_layer = Convolution1D(200, 4, activation='tanh')(embedding)
    max_pool_4_layer = MaxPooling1D(4)(conv_4_layer)
    flat_4_layer = Flatten()(max_pool_4_layer)

    # Second branch (kernel size 5)
    conv_5_layer = Convolution1D(200, 5, activation='tanh')(embedding)
    max_pool_5_layer = MaxPooling1D(5)(conv_5_layer)
    flat_5_layer = Flatten()(max_pool_5_layer)

    # Third branch (kernel size 6)
    conv_6_layer = Convolution1D(200, 6, activation='tanh')(embedding)
    max_pool_6_layer = MaxPooling1D(6)(conv_6_layer)
    flat_6_layer = Flatten()(max_pool_6_layer)

    # Concatenate the three branches into one feature vector
    CNNs = concatenate([flat_4_layer, flat_5_layer, flat_6_layer])

    # ...then feed dense layers, e.g.:
    outputs = Dense(10, activation='softmax')(Dense(128, activation='relu')(CNNs))
    model = Model(inputs=inputs, outputs=outputs)

Why build a separate vocabulary after generating the word vectors? It seems words get lost

Word2vec training produces about 200,000 word vectors of dimension 128, but afterwards you build a vocabulary of only 6,000 words and generate the training data from those 6,000. Doesn't that lose information from the original sentences? For example, if a word like '工商' is important in the word vectors and the raw data but is missing from the vocabulary, it will never appear in the model input, which should hurt training. Could you explain? I'm quite puzzled.

Problem when running text_test.py

The following error occurs together with an OOM:
ResourceExhaustedError: OOM when allocating tensor with shape[10000,256,1,596] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bf
Please take a look.


Are the trained Word2Vec vectors actually used during training?

In text_train.py:

x_train, y_train = process_file(config.train_filename, word_to_id, cat_to_id, config.seq_length)
x_val, y_val = process_file(config.val_filename, word_to_id, cat_to_id, config.seq_length)

The training data is built with process_file; the training inputs x use word_to_id (obtained from read_vocab()), i.e. position indices into the generated vocabulary.

So how are the pre-trained Word2Vec vectors used?
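A common way to wire pre-trained vectors in (not necessarily what this repository does; loader.py may already provide an equivalent helper) is to build an embedding matrix aligned with word_to_id and use it to initialize the embedding variable:

    import numpy as np
    from gensim.models import KeyedVectors

    word_to_id = {'<PAD>': 0, '中国': 1, '比赛': 2}   # in practice from loader.read_vocab()

    # Align a [vocab_size, dim] matrix with word_to_id; out-of-vocabulary rows stay zero
    wv = KeyedVectors.load_word2vec_format('vector_word.txt', binary=False)
    embedding_matrix = np.zeros((len(word_to_id), wv.vector_size), dtype=np.float32)
    for word, idx in word_to_id.items():
        if word in wv:
            embedding_matrix[idx] = wv[word]

    # In the graph: tf.get_variable('embedding', initializer=embedding_matrix)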

After switching to a training set with only three labels, text_predict fails to load the model

After switching the training set, I changed num_classes in text_model.py to 3 and updated the labels in loader.py.
Training completes, but loading the model in text_predict fails at
saver.restore(sess=session, save_path=save_path).
I have checked that save_path is correct...
Why could this be? I can't figure it out QAQ
