cnn-text-classification-pytorch's Introduction

This is a PyTorch implementation of Kim's paper, Convolutional Neural Networks for Sentence Classification. Other implementations of the same model:

  1. Kim's implementation in Theano: https://github.com/yoonkim/CNN_sentence
  2. Denny Britz's implementation in TensorFlow: https://github.com/dennybritz/cnn-text-classification-tf
  3. Alexander Rakhlin's implementation in Keras: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

Requirements

  • python 3
  • pytorch > 0.1
  • torchtext > 0.1
  • numpy

Result

I have only tried two datasets, MR and SST.

Dataset   Class Size   Best Result               Kim's Paper Result
MR        2            77.5% (CNN-rand-static)   76.1% (CNN-rand-nostatic)
SST       5            37.2% (CNN-rand-static)   45.0% (CNN-rand-nostatic)

I haven't adjusted the hyper-parameters for SST seriously.

Usage

./main.py -h

or

python3 main.py -h

You will get:

CNN text classificer

optional arguments:
  -h, --help            show this help message and exit
  -batch-size N         batch size for training [default: 50]
  -lr LR                initial learning rate [default: 0.01]
  -epochs N             number of epochs for train [default: 10]
  -dropout              the probability for dropout [default: 0.5]
  -max_norm MAX_NORM    l2 constraint of parameters
  -cpu                  disable the gpu
  -device DEVICE        device to use for iterate data
  -embed-dim EMBED_DIM
  -static               fix the embedding
  -kernel-sizes KERNEL_SIZES
                        Comma-separated kernel size to use for convolution
  -kernel-num KERNEL_NUM
                        number of each kind of kernel
  -class-num CLASS_NUM  number of class
  -shuffle              shuffle the data every epoch
  -num-workers NUM_WORKERS
                        how many subprocesses to use for data loading
                        [default: 0]
  -log-interval LOG_INTERVAL
                        how many batches to wait before logging training
                        status
  -test-interval TEST_INTERVAL
                        how many epochs to wait before testing
  -save-interval SAVE_INTERVAL
                        how many epochs to wait before saving
  -predict PREDICT      predict the sentence given
  -snapshot SNAPSHOT    filename of model snapshot [default: None]
  -save-dir SAVE_DIR    where to save the checkpoint

Train

./main.py

You will get:

Batch[100] - loss: 0.655424  acc: 59.3750%
Evaluation - loss: 0.672396  acc: 57.6923%(615/1066) 

Test

If you have constructed your own test set, you can run testing like this:

./main.py -test -snapshot="./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt"

The -snapshot option specifies the file to load the model from. If you don't set it, the model starts from scratch.

Predict

  • Example1

     ./main.py -predict="Hello my dear , I love you so much ." \
               -snapshot="./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt" 
    

    You will get:

     Loading model from [./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt]...
     
     [Text]  Hello my dear , I love you so much .
     [Label] positive
    
  • Example2

     ./main.py -predict="You just make me so sad and I have to leave you ."\
               -snapshot="./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt" 
    

    You will get:

     Loading model from [./snapshot/2017-02-11_15-50-53/snapshot_steps1500.pt]...
     
     [Text]  You just make me so sad and I have to leave you .
     [Label] negative
    

Your text must be separated by spaces, including around punctuation. Your text should also be longer than the maximum kernel size.
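A minimal sketch of that input constraint, assuming whitespace tokenization. The `prepare_text` helper, the `<pad>` token, and the kernel sizes 3, 4 and 5 are all illustrative assumptions rather than values taken from this repository. It splits the text on spaces and pads it up to the largest kernel size, which is the shortest length at which an unpadded convolution still produces an output.

```python
def prepare_text(text, kernel_sizes=(3, 4, 5), pad_token="<pad>"):
    """Split on whitespace and pad up to the largest kernel size.

    `kernel_sizes` and `pad_token` are illustrative assumptions, not values
    read from this repository.
    """
    tokens = text.split()
    shortfall = max(kernel_sizes) - len(tokens)
    if shortfall > 0:
        # Too short for the widest kernel: append padding tokens.
        tokens = tokens + [pad_token] * shortfall
    return tokens


print(prepare_text("I feel bad"))
print(prepare_text("Hello my dear , I love you so much ."))
```

Whether the padding token exists in the trained vocabulary depends on how the model was trained, so treat this as a starting point only.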

cnn-text-classification-pytorch's People

Contributors

eriche2016, jadore801120, maxxbw, onetaken, phonism, ritvikshrivastava, rohan-b, rriva002, seanrosario, shawn1993, srviest

cnn-text-classification-pytorch's Issues

Performance on MR dataset

Hi there, I see that the best reported accuracy for this repo for the MR dataset is 77.5% using CNN-rand-static. When I run this, using ./main.py -device=0 -static, I get much lower numbers (~70%). Two questions:

  1. What training settings are you using to get 77.5%?
  2. How are you evaluating on the MR dataset to get 77.5%?

Thanks!

Sequence length

Why is the sequence length not fixed in this program? Or is it fixed somewhere that I have not found? Please advise, thank you.
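One likely reason, judging from the `F.max_pool1d(i, i.size(2))` call quoted in another issue on this page, is that pooling spans the whole time axis, so each filter contributes a single value whatever the sentence length. The toy sketch below illustrates the idea with a hypothetical hand-written "filter"; it is not the repository's model code.

```python
def toy_filter(tokens, width=3):
    """Hypothetical filter: total character count in each window of `width` tokens."""
    return [sum(len(t) for t in tokens[i:i + width])
            for i in range(len(tokens) - width + 1)]


def max_over_time(features):
    """Collapse a variable-length list of feature values to one number."""
    return max(features)


short = "a very short one".split()
longer = "a much much longer sentence than the short one".split()

# The feature lists differ in length, but each pools down to a single value.
print(len(toy_filter(short)), max_over_time(toy_filter(short)))
print(len(toy_filter(longer)), max_over_time(toy_filter(longer)))
```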

prediction error

Training seems to work fine when I run python main.py.

But when I run prediction with python main.py -predict="I feel bad" -snapshot="snapshot/2018-01-08_13-10-29/snapshot_steps9000.pt", I get this error:

RuntimeError: Given input size: (1, 3, 128). Calculated output size: (1, 0, 1). Output size is too small.

I haven't made any changes to the code.
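The shapes in this error are consistent with a sentence that is shorter than one of the convolution kernels. As a rough check, assuming an unpadded, stride-1 convolution and kernel heights of 3, 4 and 5 (an assumption; the issue does not state them), the output length is `length - kernel + 1`:

```python
def conv_output_length(seq_len, kernel_size):
    """Output length of an unpadded, stride-1 convolution over a sequence."""
    return seq_len - kernel_size + 1


# "I feel bad" has 3 tokens, matching the 3 in the reported input size.
for k in (3, 4, 5):  # assumed kernel heights
    print(k, conv_output_length(3, k))
```

An output length of zero or less leaves nothing to pool over, which matches the zero in the reported output size.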

Data load error in Python2

Trying to run the model in python2.7 (can't upgrade my system python so I have to make do). I am getting the following error:

Loading data...
Traceback (most recent call last):
  File "./main.py", line 73, in <module>
    train_iter, dev_iter = mr(text_field, label_field, device=-1, repeat=False)
  File "./main.py", line 59, in mr
    train_data, dev_data = mydatasets.MR.splits(text_field, label_field)
  File "/home/cnn-text-classification-pytorch/mydatasets.py", line 105, in splits
    examples = cls(text_field, label_field, path=path, **kwargs).examples
  File "/home/cnn-text-classification-pytorch/mydatasets.py", line 82, in __init__
    data.Example.fromlist([line, 'negative'], fields) for line in f]
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/example.py", line 52, in fromlist
    setattr(ex, name, field.preprocess(val))
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/field.py", line 166, in preprocess
    x = Pipeline(lambda s: six.text_type(s, encoding='utf-8'))(x)
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/pipeline.py", line 53, in call
    return self.convert_token(x, *args)
  File "/home/venv/local/lib/python2.7/site-packages/torchtext/data/field.py", line 166, in <lambda>
    x = Pipeline(lambda s: six.text_type(s, encoding='utf-8'))(x)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x97 in position 110: invalid start byte

I've decoded all the data in rt-polarity.neg and rt-polarity.pos to UTF-8, ignoring errors (to remove the non-decodable characters), but no luck. Any help?
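Byte 0x97 is not a valid UTF-8 start byte, but it is an em dash in Windows-1252, so one thing worth trying is re-encoding the files. The sketch below assumes the data is Windows-1252, which is not confirmed in this thread, and `to_utf8` is a hypothetical helper name. It is written for Python 3.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return UTF-8 bytes, decoding `raw` as UTF-8 first and `fallback` second."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: undecodable input is Windows-1252, where 0x97 is U+2014.
        text = raw.decode(fallback)
    return text.encode("utf-8")


sample = b"slow \x97 but moving"
print(to_utf8(sample))
```

Applying this line by line to rt-polarity.neg and rt-polarity.pos, and writing the result back, would leave files that decode cleanly as UTF-8.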

Output raw input sentences

Is there an easy way to get the original raw sentences, instead of data objects, during evaluation on the test dataset?

def eval(data_iter, model, args):
    model.eval()
    corrects, avg_loss = 0, 0
    for batch in data_iter:
        feature, target = batch.text, batch.label
        print(feature.original_sentence)  # what I would like to be able to do here

'Field' object has no attribute 'vocab'

Hello, there was no problem during training, but this error occurred during prediction. I have extracted the predict step into a function:

PATH = './snapshot/best_steps_8600.pt'
args = confog_args()
text_field = data.Field(lower=True)
label_field = data.Field(sequential=False)
args.vocabulary_size = len(text_field.vocab)
args.cuda = args.device != -1 and torch.cuda.is_available()

In addition, does the training data also need to be loaded when predicting?

Looking forward to your reply. Thank you.

l2 norm

Is there any code showing the usage of l2 norm?
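The `-max_norm` option is described in the usage text as an l2 constraint of parameters. As an illustration of what such a constraint usually means (a generic sketch with a hypothetical helper name, not code from this repository), a weight vector whose l2 norm exceeds the limit is rescaled so that its norm equals the limit:

```python
import math


def clip_l2_norm(weights, max_norm):
    """Rescale `weights` so that its l2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm or norm == 0.0:
        # Already within the constraint: return an unchanged copy.
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]


print(clip_l2_norm([3.0, 4.0], max_norm=3.0))
```

In a training loop this kind of rescaling is typically applied to the constrained weights after each optimizer step.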

RuntimeError: set_storage_offset is not allowed on Tensor created from .data or .detach()

  • Problem 1:
Traceback (most recent call last):
  File "/cnn-text-classification-pytorch/main.py", line 112, in <module>
    train.train(train_iter, dev_iter, cnn, args)
  File "/cnn-text-classification-pytorch/train.py", line 25, in train
    feature.data.t_(), target.data.sub_(1)  # batch first, index align
RuntimeError: set_storage_offset is not allowed on Tensor created from .data or .detach()

Process finished with exit code 1
  • Fix for problem 1: replace both occurrences of feature.data.t_(), target.data.sub_(1) with:
 feature = feature.data.t()
 target = target.data.sub(1) 
  • Problem 2:
Traceback (most recent call last):
  File "/cnn-text-classification-pytorch/main.py", line 112, in <module>
    train.train(train_iter, dev_iter, cnn, args)
  File "/cnn-text-classification-pytorch/train.py", line 43, in train
    loss.data[0],
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Process finished with exit code 1
  • Fix for problem 2: replace both occurrences of loss.data[0] with loss.item()

Preprocessing issue in mydatasets.py

I was reading the documentation for the Torchtext Field object and I noticed that preprocessing happens after tokenization. This seems to conflict with the intention of the clean_str function, as adding it to the text field's preprocessing will split contractions, etc. on individual tokens (causing tokens with spaces in them) rather than an entire sentence. To fix this, the following statement on line 74:

text_field.preprocessing = data.Pipeline(clean_str)

can be replaced with something like this:

text_field.tokenize = lambda x: clean_str(x).split()

which will apply clean_str before tokenization (str.split() is the default tokenizer used by the Field object).
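The difference between the two orderings can be seen without torchtext. The sketch below uses a simplified stand-in for `clean_str` (an assumption; the real function in mydatasets.py has more rules) and compares cleaning each token after splitting with cleaning the whole sentence before splitting.

```python
import re


def clean_str(string):
    """Simplified stand-in for the repository's clean_str (assumption)."""
    string = re.sub(r"'s", " 's", string)
    string = re.sub(r"n't", " n't", string)
    return string.strip().lower()


sentence = "It's good but I don't care"

# Cleaning each token after tokenization leaves spaces inside tokens.
per_token = [clean_str(tok) for tok in sentence.split()]

# Cleaning the whole sentence first, then splitting, yields clean tokens.
per_sentence = clean_str(sentence).split()

print(per_token)
print(per_sentence)
```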

Issue with running code on SST Dataset

Hi @Shawn1993

I have an issue running the code on the SST dataset. I commented out line 73 and uncommented line 74 in main.py.

It seems that the code uses the torchtext dataset package to download the SST dataset directly and then runs. However, this raises the following error at line 44 of model.py, x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] # [(N, Co), ...]*len(Ks):

RuntimeError: Given input size: (1, 4, 128). Calculated output size: (64, 0, 1). Output size is too small.

Any hint on the possible reason? Thanks in advance.

For your information, I am running on PyTorch '0.3.1.post2' and torchtext '0.2.1'.

Pre-trained word embeddings

Has the loading of pre-trained word embedding (i.e., the one trained from Google News) been included in the code?

Embedding Static

I think the way this code implements static embeddings is wrong. Am I right?
To make the embedding static, the code uses x = Variable(x), whereas it should be something like: self.embed.weight.requires_grad = False

I got some mistakes

OSError: libtorch_cpu.so: cannot open shared object file: No such file or directory

Could you tell me how I can fix this error?

Issue with the prediction function

I've been playing with the CNN text classification lately, and it seems it trains fine, but when it comes to predictions I get this error:

Traceback (most recent call last):
  File "main.py", line 89, in <module>
    label = train.predict(args.predict, cnn, text_field, label_field)
  File "train.py", line 88, in predict
    return label_feild.vocab.itos[predicted.data[0][0]+1]
TypeError: 'int' object has no attribute '__getitem__'

I changed return label_feild.vocab.itos[predicted.data[0][0]+1] to return label_feild.vocab.itos[predicted.data[0]+1], which bypassed the error, but most predictions are now inaccurate. Can you please let me know what I'm missing here?

why * 1000?

In data.py

    def __next__(self):
        self._fill_buffer(self._batch_size * 1000)

Why is _batch_size multiplied by 1000 here?
