chenyuntc / pytorchtext

1st place solution for the Zhihu Machine Learning Challenge (知乎看山杯): implementations of various text-classification models.

Home Page: https://biendata.com/competition/zhihu/

License: MIT License

Languages: Python 58.63%, Shell 4.24%, Jupyter Notebook 37.13%
Topics: pytorch, nlp, textcnn, textrnn, fasttext, textrcnn, lstm

pytorchtext's Introduction

Chinese readers: please see readme-zh.md.

This is the solution for the Zhihu Machine Learning Challenge 2017. We won first place out of 963 teams.

1. Setup

  • install PyTorch from pytorch.org (Python 2, CUDA)
  • install other dependencies:
    pip2 install -r requirements.txt

You may need tf.contrib.keras.preprocessing.sequence.pad_sequences for data preprocessing.
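For reference, a minimal sketch of how pad_sequences turns variable-length id sequences into a fixed-size matrix; the maxlen value and padding options here are illustrative, not the repository's actual settings:

import tensorflow as tf

pad_sequences = tf.contrib.keras.preprocessing.sequence.pad_sequences

# Toy data: variable-length sequences of word ids.
seqs = [[3, 7, 12], [5, 2], [9, 1, 4, 8, 6]]

# Pad (or truncate) every sequence to length 4; 0 is the padding value.
padded = pad_sequences(seqs, maxlen=4, padding='post', truncating='post', value=0)
print(padded.shape)  # (3, 4)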

  • start visdom for visualization:
    python2 -m visdom.server

2. Data Preprocessing

Modify the data paths in the related files.

2.1 wordvector file -> numpy file

python scripts/data_process/embedding2matrix.py main char_embedding.txt char_embedding.npz 
python scripts/data_process/embedding2matrix.py main word_embedding.txt word_embedding.npz 
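The resulting .npz stores the embedding matrix under the key vector and the token-to-row mapping under word2id (as saved by embedding2matrix.py, quoted in the issues below). A quick way to load it back:

import numpy as np

# allow_pickle is needed on newer NumPy because word2id is a pickled dict.
data = np.load('word_embedding.npz', allow_pickle=True)
vec = data['vector']              # the embedding matrix
word2id = data['word2id'].item()  # savez stores the dict as a 0-d object array
print(vec.shape)
print(len(word2id))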

2.2 question set -> numpy file

This step is memory-intensive; make sure you have more than 32 GB of RAM.

python scripts/data_process/question2array.py main question_train_set.txt train.npz
python scripts/data_process/question2array.py main question_eval_set.txt test.npz
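The exact array names inside these archives are not documented in this README; you can list them to sanity-check the output:

import numpy as np

d = np.load('train.npz')
print(d.files)  # names of the arrays saved by question2array.py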

2.3 label -> json

python scripts/data_process/label2id.py main question_topic_train_set.txt labels.json
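The mapping's exact structure is not documented here, but loading it for inspection is straightforward:

import json

with open('labels.json') as f:
    labels = json.load(f)  # mapping produced by label2id.py
print(len(labels))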

2.4 validation data

python scripts/data_process/get_val.py 

3. Training

Modify config.py to set the model paths.

Paths to the models we used:

  • CNN: models/MultiCNNTextBNDeep.py
  • RNN (LSTM): models/LSTMText.py
  • RCNN: models/RCNN.py
  • Inception: models/CNNText_inception.py
  • FastText: models/FastText3.py

3.1 Train models without data augmentation

# LSTM char
python2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_char' --weight=1 --model='LSTMText'  --batch-size=128  --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=10000  --type_='char'   --zhuge=True --linear-hidden-size=2000 --hidden-size=256 --kmax-pooling=3   --num-layers=3  --augument=False

# LSTM word
python2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_word' --weight=1 --model='LSTMText'  --batch-size=128  --lr=0.001 --lr2=0.0000 --lr_decay=0.5 --decay_every=10000  --type_='word'   --zhuge=True --linear-hidden-size=2000 --hidden-size=320 --kmax-pooling=2  --augument=False

#  RCNN char
python2 main.py main --max_epoch=5 --plot_every=100 --env='rcnn_char' --weight=1 --model='RCNN'  --batch-size=128  --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=5000  --title-dim=1024 --content-dim=1024  --type_='char' --zhuge=True --kernel-size=3 --kmax-pooling=2 --linear-hidden-size=2000 --debug-file='/tmp/debugrcnn' --hidden-size=256 --num-layers=3 --augument=False

# RCNN word
python2 main.py main --max_epoch=5 --plot_every=100 --env='RCNN-word' --weight=1 --model='RCNN'  --zhuge=True --num-workers=4 --batch-size=128 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=5000  --title-dim=1024 --content-dim=512  --kernel-size=3 --debug-file='/tmp/debugrc'  --kmax-pooling=1 --type_='word' --augument=False
# CNN word
python2 main.py main --max_epoch=5 --plot_every=100 --env='MultiCNNText' --weight=1 --model='MultiCNNTextBNDeep'  --batch-size=64  --lr=0.001 --lr2=0.000 --lr_decay=0.8 --decay_every=10000  --title-dim=250 --content-dim=250    --weight-decay=0 --type_='word' --debug-file='/tmp/debug'  --linear-hidden-size=2000 --zhuge=True  --augument=False

# inception word
python2 main.py main --max_epoch=5 --plot_every=100 --env='inception-word' --weight=1 --model='CNNText_inception'  --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='word' --augument=False                                                   
# inception char
python2 main.py main --max_epoch=5 --plot_every=100 --env='inception-char' --weight=1 --model='CNNText_inception'  --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='char'   --augument=False

# FastText3 word
python2 main.py main --max_epoch=5 --plot_every=100 --env='fasttext3-word' --weight=5 --model='FastText3' --zhuge=True --num-workers=4 --batch-size=512  --lr2=1e-4 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --linear_hidden_size=2000 --type_='word'  --debug-file=/tmp/debugf --augument=False                           
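All of these flags are parsed by python-fire (main.py ends with fire.Fire(), as its tracebacks in the issues below show), so each --flag=value becomes a keyword argument of main(). A minimal sketch of this pattern, with illustrative option names rather than the repository's actual config:

import fire

class Config(object):
    # Illustrative defaults; the real defaults live in config.py.
    model = 'LSTMText'
    lr = 1e-3
    batch_size = 128

def main(**kwargs):
    opt = Config()
    for k, v in kwargs.items():
        setattr(opt, k, v)  # command-line flags override the defaults
    print('%s lr=%s batch_size=%s' % (opt.model, opt.lr, opt.batch_size))

if __name__ == '__main__':
    fire.Fire()

Running python2 main.py main --lr=0.01 --model='RCNN' would then print the overridden values.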

In most cases, the score can be boosted by fine-tuning. For example:

python2 main.py main --max_epoch=2 --plot_every=100 --env='LSTMText-word-ft' --model='LSTMText'  --zhuge=True --num-workers=4 --batch-size=256 --lr2=5e-5 --lr=5e-5 --decay-every=5000 --type_='word'  --model-path='checkpoints/LSTMText_word_0.409196378421'

3.2 Train models with data augmentation

Set --augument=True in the training command (the flag keeps this spelling in the code).
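This README does not describe the augmentation itself. Offered only as an assumption about what token-level augmentation for this task might look like, a plausible sketch is random token dropout plus shuffling:

import random

def augment(tokens, drop_prob=0.1, do_shuffle=True):
    # Hypothetical augmentation: randomly drop tokens, then shuffle the rest.
    kept = [tok for tok in tokens if random.random() > drop_prob]
    if do_shuffle:
        random.shuffle(kept)
    return kept if kept else tokens  # never return an empty sequence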

3.3 scores

Model                Score
CNN_word             0.4103
RNN_word             0.4119
RCNN_word            0.4115
Inception_word       0.4109
FastText_word        0.4091
RNN_char             0.4031
RCNN_char            0.4037
Inception_char       0.4024
RCNN_word_aug        0.41344
CNN_word_aug         0.41051
RNN_word_aug         0.41368
Inception_word_aug   0.41254
FastText3_word_aug   0.40853
CNN_char_aug         0.38738
RCNN_char_aug        0.39854

With model ensembling, the score reaches up to 0.433.

4 Test and Submit

4.1 Test

  • model: one of LSTMText, RCNN, MultiCNNTextBNDeep, FastText3, CNNText_inception
  • model-path: path to the pretrained model
  • result-path: where to save the predicted scores
  • val: whether to run on the validation set (True) or the test set (False)
# LSTM
python2 test.1.py main --model='LSTMText'  --batch-size=512  --model-path='checkpoints/LSTMText_word_0.411994005382' --result-path='/data_ssd/zhihu/result/LSTMText0.4119_word_test.pth'  --val=False --zhuge=True

python2 test.1.py main --model='LSTMText'  --batch-size=256 --type_=char --model-path='checkpoints/LSTMText_char_0.403192339135' --result-path='/data_ssd/zhihu/result/LSTMText0.4031_char_test.pth'  --val=False --zhuge=True
 
#RCNN
python2 test.1.py main --model='RCNN'  --batch-size=512  --model-path='checkpoints/RCNN_word_0.411511574999' --result-path='/data_ssd/zhihu/result/RCNN_0.4115_word_test.pth'  --val=False --zhuge=True

python2 test.1.py main --model='RCNN'  --batch-size=512  --model-path='checkpoints/RCNN_char_0.403710422571' --result-path='/data_ssd/zhihu/result/RCNN_0.4037_char_test.pth'  --val=False --zhuge=True

# DeepText

python2 test.1.py main --model='MultiCNNTextBNDeep'  --batch-size=512  --model-path='checkpoints/MultiCNNTextBNDeep_word_0.410330780091' --result-path='/data_ssd/zhihu/result/DeepText0.4103_word_test.pth'  --val=False --zhuge=True
# more to go ...

4.2 Ensemble

See notebooks/val_ensemble.ipynb and notebooks/test_ensemble.ipynb for more details.
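At a high level, ensembling amounts to a weighted average of the per-model score tensors saved by test.1.py, followed by the top-5 decoding the task requires. A minimal sketch only — the file names and weights are illustrative, and the notebooks hold the actual recipe:

import torch as t

# Hypothetical: each result file holds a (num_questions, 1999) score tensor.
files = ['LSTMText0.4119_word_test.pth', 'RCNN_0.4115_word_test.pth']
weights = [1.0, 1.0]

total = None
for path, w in zip(files, weights):
    scores = t.load(path)
    total = w * scores if total is None else total + w * scores

top5 = total.topk(5, dim=1)[1]  # indices of the 5 predicted topics per question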

5 Main files

  • main.py: entry point for training
  • config.py: configuration file
  • test.1.py: for testing / generating submissions
  • data/: data loaders
  • scripts/: data preprocessing
  • utils/: score calculation and visualization wrappers
  • models/: models
    • models/BasicModel: base class for all models
    • models/MultiCNNTextBNDeep: CNN
    • models/LSTMText: RNN
    • models/RCNN: RCNN
    • models/CNNText_inception: Inception
    • models/MultiModelAll, models/MultiModelAll2: model ensembles
    • other models
  • rep.py: code for reproducing the results
  • del/: failed or unused approaches
  • notebooks/: notebooks

Pretrained model

https://pan.baidu.com/s/1mjVtJGs (password: tayb)

pytorchtext's People

Contributors: chenyuntc

pytorchtext's Issues

It is not that each class is treated as an independent 0/1 classification; all 1999 classes are scored in a single pass, and topk takes the 5 highest. Since the top 5 are always taken, there is no need to decide how many labels to predict.
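For reference, a minimal sketch of that decoding step (the batch size is toy data):

import torch as t

scores = t.randn(4, 1999)        # one score per topic, for 4 questions
top5 = scores.topk(5, dim=1)[1]  # indices of the 5 highest-scoring topics
print(top5.size())               # (4, 5)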

where is your word2vec module?

in embedding2matrix.py there is the code

import word2vec
import numpy as np

def main(em_file, em_result):
    '''
    embedding -> numpy
    '''
    em = word2vec.load(em_file)
    vec = em.vectors
    word2id = em.vocab_hash
    # d = dict(vector=vec, word2id=word2id)
    # t.save(d, em_result)
    np.savez_compressed(em_result, vector=vec, word2id=word2id)

if __name__ == '__main__':
    import fire
    fire.Fire()

but I cannot find any module named 'word2vec'.

Thank you.
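(For what it's worth, this appears to be the word2vec package on PyPI — installable with pip install word2vec — whose load() returns an object exposing .vectors and .vocab_hash, matching the snippet above.)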

A question about the labels

Is each topic label converted into a multi-hot vector of the form [0,1,0,0,1,......,0], or into some other format? Thanks!
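For reference, building such a multi-hot target over the 1999 topics would look like this (the variable names are illustrative):

import torch as t

num_classes = 1999
label_ids = [3, 17, 256]              # gold topic indices for one question (toy)

target = t.zeros(num_classes)
target[t.LongTensor(label_ids)] = 1   # 1 at each gold topic, 0 elsewhere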

Why is "RuntimeError: expected 3D tensor" raised?

# Imports implied by the snippet (torch is aliased as `t`, matching t.cat below):
import torch as t
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CNNText(nn.Module):
    def __init__(self):
        super(CNNText, self).__init__()
        self.encoder_tit = nn.Embedding(3281, 64)
        self.encoder_con = nn.Embedding(496037, 512)
        
        self.title_conv_1 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (1, 64)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
        )
        
        self.title_conv_2 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (2, 64)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
        )

        self.content_conv_3 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (3, 512)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size = 50)
        )
        
        self.content_conv_4 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (3, 512)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size = 50)
        )
            
        self.content_conv_5 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (3, 512)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size = 50)
        )
        
        
            
        self.fc = nn.Linear(5, 9)

    def forward(self, title, content):
        title = self.encoder_tit(title)
        print(title.size())
        title_out_1 = self.title_conv_1(title)
        title_out_2 = self.title_conv_2(title)
        
        content = self.encoder_con(content)
        content_out_3 = self.content_conv_3(content)
        content_out_4 = self.content_conv_4(content)
        content_out_5 = self.content_conv_5(content)
            
        conv_out = t.cat((title_out_1,title_out_2,content_out_3,content_out_4,content_out_5),dim=1)
        logits = self.fc(conv_out)
        return F.log_softmax(logits)
cnnt = CNNText()

optimizer = optim.Adam(cnnt.parameters(), lr=.001)
Loss = nn.NLLLoss()

for epoch in range(50):
    loss = 0
    
    t = ''.join(title[epoch])
    c = ''.join(content[epoch])
    T, C = variables_from_pair(t, c)
#     print(T.squeeze(1).unsqueeze(0))
    T = T.squeeze(1).unsqueeze(0)
    C = C.squeeze(1).unsqueeze(0)
    optimizer.zero_grad()
    
    out = cnnt(T, C)
    target = cla[epoch]
    loss += Loss(out, target)
    
    loss.backward()
    optimizer.step()
    
print("Loss is {} at {} epoch".format(loss, epoch))

Error:

torch.Size([1, 3, 64])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-328d44896eef> in <module>()
     15     optimizer.zero_grad()
     16 
---> 17     out = cnnt(T, C)
     18     target = cla[epoch]
     19     loss += Loss(out, target)

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

<ipython-input-31-fe95ab78725e> in forward(self, title, content)
     52         title = self.encoder_tit(title)
     53         print(title.size())
---> 54         title_out_1 = self.title_conv_1(title)
     55         title_out_2 = self.title_conv_2(title)
     56 

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
     65     def forward(self, input):
     66         for module in self._modules.values():
---> 67             input = module(input)
     68         return input
     69 

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    152     def forward(self, input):
    153         return F.conv1d(input, self.weight, self.bias, self.stride,
--> 154                         self.padding, self.dilation, self.groups)
    155 
    156 

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/functional.py in conv1d(input, weight, bias, stride, padding, dilation, groups)
     81     f = ConvNd(_single(stride), _single(padding), _single(dilation), False,
     82                _single(0), groups, torch.backends.cudnn.benchmark, torch.backends.cudnn.enabled)
---> 83     return f(input, weight, bias)
     84 
     85 

RuntimeError: expected 3D tensor

The title is already a 3D tensor, so why does the RuntimeError say "expected 3D tensor"?
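A likely cause, offered as an assumption since the thread has no answer: nn.Conv1d expects an integer kernel_size and input of shape (batch, channels, length); passing a 2-tuple such as (1, 64) creates a 4D weight, which the conv1d backend rejects with this error. The conventional TextCNN setup convolves over the sequence dimension with the embedding size as the channel count, e.g. (in current PyTorch syntax):

import torch
import torch.nn as nn

embed = nn.Embedding(3281, 64)
conv = nn.Conv1d(in_channels=64, out_channels=100, kernel_size=2)  # int kernel

ids = torch.ones(1, 3, dtype=torch.long)  # (batch, seq_len) of token ids
e = embed(ids)                            # (1, 3, 64)
e = e.permute(0, 2, 1)                    # Conv1d wants (batch, channels, length)
out = conv(e)                             # (1, 100, 2)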

Hi, when I run main.py there is an error. Do you know why?

Traceback (most recent call last):
  File "main.py", line 159, in <module>
    fire.Fire()
  File "/usr/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/usr/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/usr/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 102, in main
    for ii, ((title, content), label) in tqdm.tqdm(enumerate(dataloader)):
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 417, in __iter__
    return DataLoaderIter(self)
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 242, in __init__
    self._put_indices()
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 290, in _put_indices
    indices = next(self.sample_iter, None)
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 119, in __iter__
    for idx in self.sampler:
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 50, in __iter__
    return iter(torch.randperm(len(self.data_source)).long())
RuntimeError: invalid argument 1: must be strictly positive at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/TH/generic/THTensorMath.c:2247
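(For reference, though the thread has no answer: this torch.randperm error means len(self.data_source) is 0, i.e. the Dataset is empty — most often because the preprocessed data files were not found at the paths set in config.py.)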
