shawn1993 / cnn-text-classification-pytorch

1.0K 15.0 279.0 13.06 MB

CNNs for Sentence Classification in PyTorch

License: Apache License 2.0

Python 100.00%

cnn-model pytorch

cnn-text-classification-pytorch's Introduction

Introduction

This is the implementation of Kim's Convolutional Neural Networks for Sentence Classification paper in PyTorch.

Kim's implementation of the model in Theano: https://github.com/yoonkim/CNN_sentence
Denny Britz has an implementation in Tensorflow: https://github.com/dennybritz/cnn-text-classification-tf
Alexander Rakhlin's implementation in Keras: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

Requirement

python 3
pytorch > 0.1
torchtext > 0.1
numpy

Result

I have only tried two datasets, MR and SST.

Dataset	Class Size	Best Result	Kim's Paper Result
MR	2	77.5%(CNN-rand-static)	76.1%(CNN-rand-nostatic)
SST	5	37.2%(CNN-rand-static)	45.0%(CNN-rand-nostatic)

I haven't adjusted the hyper-parameters for SST seriously.

Usage

./main.py -h

or

python3 main.py -h

You will get:

CNN text classificer

optional arguments:
  -h, --help            show this help message and exit
  -batch-size N         batch size for training [default: 50]
  -lr LR                initial learning rate [default: 0.01]
  -epochs N             number of epochs for train [default: 10]
  -dropout              the probability for dropout [default: 0.5]
  -max_norm MAX_NORM    l2 constraint of parameters
  -cpu                  disable the gpu
  -device DEVICE        device to use for iterate data
  -embed-dim EMBED_DIM
  -static               fix the embedding
  -kernel-sizes KERNEL_SIZES
                        Comma-separated kernel size to use for convolution
  -kernel-num KERNEL_NUM
                        number of each kind of kernel
  -class-num CLASS_NUM  number of class
  -shuffle              shuffle the data every epoch
  -num-workers NUM_WORKERS
                        how many subprocesses to use for data loading
                        [default: 0]
  -log-interval LOG_INTERVAL
                        how many batches to wait before logging training
                        status
  -test-interval TEST_INTERVAL
                        how many epochs to wait before testing
  -save-interval SAVE_INTERVAL
                        how many epochs to wait before saving
  -predict PREDICT      predict the sentence given
  -snapshot SNAPSHOT    filename of model snapshot [default: None]
  -save-dir SAVE_DIR    where to save the checkpoint

Train

./main.py

You will get:

Batch[100] - loss: 0.655424  acc: 59.3750%
Evaluation - loss: 0.672396  acc: 57.6923%(615/1066)

Test

If you have constructed your test set, you can run testing like this:

./main.py -test -snapshot="./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt"

The snapshot option specifies where to load your model from. If you don't set it, the model will start from scratch.

Predict

Example1

 ./main.py -predict="Hello my dear , I love you so much ." \
           -snapshot="./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt"

You will get:

 Loading model from [./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt]...
 
 [Text]  Hello my dear , I love you so much .
 [Label] positive

Example2

 ./main.py -predict="You just make me so sad and I have to leave you ."\
           -snapshot="./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt"

You will get:

 Loading model from [./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt]...
 
 [Text]  You just make me so sad and I have to leave you .
 [Label] negative

Your text must be separated by spaces, including around punctuation. Also, your text should be longer than the max kernel size.

Reference

Convolutional Neural Networks for Sentence Classification


cnn-text-classification-pytorch's Issues

Performance on MR dataset

Hi there, I see that the best reported accuracy for this repo for the MR dataset is 77.5% using CNN-rand-static. When I run this, using ./main.py -device=0 -static, I get much lower numbers (~70%). Two questions:

What training settings are you using to get 77.5%?
How are you evaluating on the MR dataset to get 77.5%?

Thanks!

Sequence length

Why is the sequence length not fixed in this program? Or is it fixed somewhere that I have not found? Please advise, thanks.

prediction error

Training seems to work fine when I run "python main.py".

But when I run prediction with python main.py -predict="I feel bad" -snapshot="snapshot/2018-01-08_13-10-29/snapshot_steps9000.pt", it gives me this error:

RuntimeError: Given input size: (1, 3, 128). Calculated output size: (1, 0, 1). Output size is too small.

I haven't made any changes to the code.
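A likely explanation, consistent with the README note that the text should be longer than the max kernel size: "I feel bad" has only three tokens, so a wider convolution kernel has no valid output position. Below is a minimal sketch of one possible workaround, padding the token list before it is converted to indices. The helper `pad_tokens`, the `<pad>` token, and the kernel sizes 3, 4, 5 are assumptions for illustration, not code from this repository.

```python
def pad_tokens(tokens, kernel_sizes, pad_token="<pad>"):
    """Right-pad a token list so it is at least as long as the widest kernel."""
    min_len = max(kernel_sizes)
    missing = max(0, min_len - len(tokens))
    return tokens + [pad_token] * missing


tokens = "I feel bad".split()
# Assumed kernel sizes; use whatever was passed via -kernel-sizes.
padded = pad_tokens(tokens, kernel_sizes=[3, 4, 5])
print(padded)
```

Sentences that are already long enough pass through unchanged, so the helper could be applied to every input before prediction.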

why the acc is so high?

Data load error in Python2

Trying to run the model in python2.7 (can't upgrade my system python so I have to make do). I am getting the following error:

Loading data...
Traceback (most recent call last):
  File "./main.py", line 73, in <module>
    train_iter, dev_iter = mr(text_field, label_field, device=-1, repeat=False)
  File "./main.py", line 59, in mr
    train_data, dev_data = mydatasets.MR.splits(text_field, label_field)
  File "/home/cnn-text-classification-pytorch/mydatasets.py", line 105, in splits
    examples = cls(text_field, label_field, path=path, **kwargs).examples
  File "/home/cnn-text-classification-pytorch/mydatasets.py", line 82, in __init__
    data.Example.fromlist([line, 'negative'], fields) for line in f]
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/example.py", line 52, in fromlist
    setattr(ex, name, field.preprocess(val))
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/field.py", line 166, in preprocess
    x = Pipeline(lambda s: six.text_type(s, encoding='utf-8'))(x)
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/pipeline.py", line 53, in call
    return self.convert_token(x, *args)
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/field.py", line 166, in <lambda>
    x = Pipeline(lambda s: six.text_type(s, encoding='utf-8'))(x)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x97 in position 110: invalid start byte

I've decoded all the data in rt-polarity.neg and rt-polarity.pos to UTF-8, ignoring errors (to remove the non-decodable characters), but no luck. Any help?
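One option to try is decoding leniently at read time rather than rewriting the data files. The sketch below uses `io.open`, which accepts `encoding` and `errors` on both Python 2 and Python 3. The helper `read_lines` and the demo file are hypothetical; where exactly mydatasets.py opens the files is not shown in the traceback, so treat this as an assumption about where the change would go.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read text lines, silently dropping bytes that are invalid in `encoding`."""
    with io.open(path, encoding=encoding, errors="ignore") as f:
        return [line.rstrip("\n") for line in f]


# Demo on a throwaway file that contains the offending byte 0x97.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"a good film \x97 mostly\n")

lines = read_lines(demo_path)
os.remove(demo_path)
print(lines)
```

The invalid byte is dropped, so the two spaces around it remain in the decoded line.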

Output raw input sentences

Is there an easy way to get the original raw sentences instead of data objects during the test dataset evaluation?

def eval(data_iter, model, args):
    model.eval()
    corrects, avg_loss = 0, 0
    for batch in data_iter:
        feature, target = batch.text, batch.label
        **print (feature.original_sentence)**
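A batch holds vocabulary indices rather than strings, so one way to approximate the original sentences is to map the indices back through the vocabulary's `itos` list (the same attribute the repository's predict function uses on the label field). The sketch below uses a made-up vocabulary and batch; `indices_to_sentence` is a hypothetical helper, and the result is only the tokenized, vocabulary-mapped text, so casing and out-of-vocabulary words are not recovered.

```python
def indices_to_sentence(indices, itos, pad_token="<pad>"):
    """Map one row of vocabulary indices back to a space-joined string."""
    return " ".join(itos[i] for i in indices if itos[i] != pad_token)


# Made-up vocabulary and batch (batch-first layout), for illustration only.
itos = ["<unk>", "<pad>", "the", "movie", "is", "great"]
batch = [[2, 3, 4, 5], [2, 3, 1, 1]]

sentences = [indices_to_sentence(row, itos) for row in batch]
print(sentences)
```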

feature.data.t_(), target.data.sub_(1)

I know feature.data.t_() transposes the matrix, but what does target.data.sub_(1) do? Is it also necessary if the label starts from 0?
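`sub_(1)` subtracts 1 from every element in place. The probable reason, assuming the label field's vocabulary reserves index 0 for `<unk>`, is that real labels start at index 1 while the loss expects class indices starting at 0. The sketch below illustrates that alignment in plain Python; the vocabulary contents are an assumption, not values read from the repository.

```python
# Assumed label vocabulary: index 0 is the unknown token, real labels follow.
itos = ["<unk>", "negative", "positive"]
stoi = {label: i for i, label in enumerate(itos)}

raw_targets = [stoi[label] for label in ["positive", "negative", "positive"]]
# Shift by one so that class indices start at 0, as the loss expects.
targets = [t - 1 for t in raw_targets]

num_classes = len(itos) - 1
assert all(0 <= t < num_classes for t in targets)

# Going back from a predicted class index to a label adds the offset again.
predicted = targets[0]
label = itos[predicted + 1]
print(raw_targets, targets, label)
```

If the labels already start from 0, with no reserved entry in front of them, the subtraction would not be needed.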

'Field' object has no attribute 'vocab'

Hello, there was no problem during training, but this error occurred during prediction. I'm extracting predict into a separate function:

PATH = './snapshot/best_steps_8600.pt'
args = confog_args()
text_field = data.Field(lower=True)
label_field = data.Field(sequential=False)
args.vocabulary_size = len(text_field.vocab)
args.cuda = args.device != -1 and torch.cuda.is_available()

Also, does the training data need to be loaded when predicting?

Looking forward to your reply. Thank you
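In the torchtext API used in this snippet, a Field only gets its `vocab` attribute once `build_vocab` has been called on it, so a freshly constructed `text_field` has none. Prediction needs the same vocabulary as training, either rebuilt from the training data or saved at training time and reloaded. Below is a minimal sketch of the second option using plain JSON; `save_vocab` and `load_vocab` are hypothetical helpers, not part of the repository or of torchtext.

```python
import json
import os
import tempfile


def save_vocab(itos, path):
    """Persist the index-to-token list produced at training time."""
    with open(path, "w") as f:
        json.dump(itos, f)


def load_vocab(path):
    """Reload the list and rebuild the token-to-index mapping."""
    with open(path) as f:
        itos = json.load(f)
    stoi = {token: i for i, token in enumerate(itos)}
    return itos, stoi


# Round trip with a made-up vocabulary.
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.json")
save_vocab(["<unk>", "<pad>", "good", "bad"], vocab_path)
itos, stoi = load_vocab(vocab_path)
print(len(itos), stoi["good"])
```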

l2 norm

Is there any code showing the usage of l2 norm?
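For reference, the l2 constraint described in Kim's paper rescales a weight vector whenever its l2 norm exceeds a threshold. The sketch below shows that rule in plain Python. It is an illustration of the technique only, not the repository's code, and `clip_l2_norm` is a hypothetical helper.

```python
import math


def clip_l2_norm(weights, max_norm):
    """Rescale `weights` so that its l2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


# A vector of norm 5 is scaled down to norm 3; a short vector is left alone.
clipped = clip_l2_norm([3.0, 4.0], max_norm=3.0)
unchanged = clip_l2_norm([1.0, 1.0], max_norm=3.0)
print(clipped, unchanged)
```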

RuntimeError: set_storage_offset is not allowed on Tensor created from .data or .detach()

Problem 1:

Traceback (most recent call last):
  File "/cnn-text-classification-pytorch/main.py", line 112, in <module>
    train.train(train_iter, dev_iter, cnn, args)
  File "/cnn-text-classification-pytorch/train.py", line 25, in train
    feature.data.t_(), target.data.sub_(1)  # batch first, index align
RuntimeError: set_storage_offset is not allowed on Tensor created from .data or .detach()

Process finished with exit code 1

Solution to problem 1: replace feature.data.t_(), target.data.sub_(1) (2 occurrences) with:

 feature = feature.data.t()
 target = target.data.sub(1)

Problem 2:

Traceback (most recent call last):
  File "/cnn-text-classification-pytorch/main.py", line 112, in <module>
    train.train(train_iter, dev_iter, cnn, args)
  File "/cnn-text-classification-pytorch/train.py", line 43, in train
    loss.data[0],
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Process finished with exit code 1

Solution to problem 2: replace loss.data[0] (2 occurrences) with: loss.item()

How to make the embedding changeable in backpropagation

It seems that the word embeddings are kept static during training.
How to make the embedding changeable in backpropagation?

Preprocessing issue in mydatasets.py

I was reading the documentation for the Torchtext Field object and I noticed that preprocessing happens after tokenization. This seems to conflict with the intention of the clean_str function, as adding it to the text field's preprocessing will split contractions, etc. on individual tokens (causing tokens with spaces in them) rather than an entire sentence. To fix this, the following statement on line 74:

text_field.preprocessing = data.Pipeline(clean_str)

can be replaced with something like this:

text_field.tokenize = lambda x: clean_str(x).split()

which will apply clean_str before tokenization (str.split() is the default tokenizer used by the Field object).
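The difference can be seen with a small stand-in for `clean_str`. The version below handles only the "n't" contraction and is a simplification for illustration, not the function from mydatasets.py.

```python
import re


def clean_str(string):
    """Tiny stand-in for clean_str: split off "n't" and lowercase."""
    string = re.sub(r"n't", " n't", string)
    return string.strip().lower()


sentence = "I don't like it"

# Applied per token after tokenization: one token ends up containing a space.
per_token = [clean_str(tok) for tok in sentence.split()]

# Applied to the whole sentence before tokenization: the contraction is split.
before_split = clean_str(sentence).split()

print(per_token)
print(before_split)
```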

Issue with running code on SST Dataset

Hi @Shawn1993

I have an issue running the code on the SST dataset. I commented out line 73 and uncommented line 74 in main.py.

It seems that the code uses the torchtext dataset package to download the SST dataset directly and then runs. However, this raises the error RuntimeError: Given input size: (1, 4, 128). Calculated output size: (64, 0, 1). Output size is too small. at line 44 of model.py: x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] # [(N, Co), ...]*len(Ks).

Any hint on the possible reason? Thanks in advance.

For your information, I am running on PyTorch '0.3.1.post2' and torchtext '0.2.1'.
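One reading of the error: the input size (1, 4, 128) suggests a batch whose sentences are 4 tokens long with 128-dimensional embeddings, and a convolution without padding over a sequence of length L with kernel size K yields L - K + 1 positions. The sketch below works through that arithmetic; the kernel sizes 3, 4, 5 are an assumption here.

```python
def conv_output_length(seq_len, kernel_size):
    """Number of valid positions for an unpadded, stride-1 convolution."""
    return seq_len - kernel_size + 1


# Sequence length 4 taken from the reported input size (1, 4, 128).
lengths = {k: conv_output_length(4, k) for k in (3, 4, 5)}
print(lengths)
```

A kernel wider than the sentence leaves no valid position, which matches the 0 in the calculated output size. Padding short sentences up to the largest kernel size, or using smaller kernels, would avoid it.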

Pre-trained word embeddings

Has the loading of pre-trained word embedding (i.e., the one trained from Google News) been included in the code?

Embedding Static

I think the way this code implements static embeddings is wrong. Am I right?
The code uses x=variable(x) when it wants to make the embedding static, while it should be something like: self.embed.weight.requires_grad=False

I got some mistakes

OSError: libtorch_cpu.so: cannot open shared object file: No such file or directory

Could you tell me how I can fix this error?

Issue with the prediction function

I've been playing with the CNN text classification lately, and it seems it trains fine, but when it comes to predictions I get this error:

Traceback (most recent call last):
  File "main.py", line 89, in <module>
    label = train.predict(args.predict, cnn, text_field, label_field)
  File "train.py", line 88, in predict
    return label_feild.vocab.itos[predicted.data[0][0]+1]
TypeError: 'int' object has no attribute '__getitem__'

I changed return label_feild.vocab.itos[predicted.data[0][0]+1] to return label_feild.vocab.itos[predicted.data[0]+1], which bypassed the error, but most predictions are now inaccurate. Can you please let me know what I'm missing here?

why * 1000?

In data.py

    def __next__(self):
        self._fill_buffer(self._batch_size * 1000)

Why is _batch_size multiplied by 1000 here?

hi, errors when call cuda() on this model

Regarding this line: I find that when cuda() is called on this model, this list of modules is not moved to the GPU. Any idea how to solve this bug?
