chenyuntc / pytorchtext

1st place solution for the Zhihu Machine Learning Challenge (知乎看山杯): implementations of various text-classification models.

Home Page: https://biendata.com/competition/zhihu/

License: MIT License

Languages: Python 58.63%, Shell 4.24%, Jupyter Notebook 37.13%
Topics: pytorch, nlp, textcnn, textrnn, fasttext, textrcnn, lstm

pytorchtext's Introduction

Chinese readers: please see readme-zh.md.

This is the solution for the Zhihu Machine Learning Challenge 2017. We won first place out of 963 teams.

1. Setup

  • install PyTorch from pytorch.org (Python 2, CUDA)
  • install other dependencies:
    pip2 install -r requirements.txt

You may need tf.contrib.keras.preprocessing.sequence.pad_sequences for data preprocessing.
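For reference, a minimal sketch of how pad_sequences turns variable-length id sequences into a fixed-size matrix; the maxlen value and padding options here are illustrative, not the repository's actual settings:

import tensorflow as tf

pad_sequences = tf.contrib.keras.preprocessing.sequence.pad_sequences

# Toy data: variable-length sequences of word ids.
seqs = [[3, 7, 12], [5, 2], [9, 1, 4, 8, 6]]

# Pad (or truncate) every sequence to length 4; 0 is the padding value.
padded = pad_sequences(seqs, maxlen=4, padding='post', truncating='post', value=0)
print(padded.shape)  # (3, 4)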

  • start visdom for visualization:
    python2 -m visdom.server

2. Data Preprocessing

Modify the data paths in the related files.

2.1 wordvector file -> numpy file

python scripts/data_process/embedding2matrix.py main char_embedding.txt char_embedding.npz 
python scripts/data_process/embedding2matrix.py main word_embedding.txt word_embedding.npz 
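The resulting .npz stores the embedding matrix under the key vector and the token-to-row mapping under word2id (as saved by embedding2matrix.py, quoted in the issues below). A quick way to load it back:

import numpy as np

# allow_pickle is needed on newer NumPy because word2id is a pickled dict.
data = np.load('word_embedding.npz', allow_pickle=True)
vec = data['vector']              # the embedding matrix
word2id = data['word2id'].item()  # savez stores the dict as a 0-d object array
print(vec.shape)
print(len(word2id))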

2.2 question set -> numpy file

This step is memory-intensive; make sure you have more than 32 GB of RAM.

python scripts/data_process/question2array.py main question_train_set.txt train.npz
python scripts/data_process/question2array.py main question_eval_set.txt test.npz
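The exact array names inside these archives are not documented in this README; you can list them to sanity-check the output:

import numpy as np

d = np.load('train.npz')
print(d.files)  # names of the arrays saved by question2array.py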

2.3 label -> json

python scripts/data_process/label2id.py main question_topic_train_set.txt labels.json
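The mapping's exact structure is not documented here, but loading it for inspection is straightforward:

import json

with open('labels.json') as f:
    labels = json.load(f)  # mapping produced by label2id.py
print(len(labels))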

2.4 validation data

python scripts/data_process/get_val.py 

3. Training

Modify config.py to set the model paths.

Paths to the models we used:

  • CNN: models/MultiCNNTextBNDeep.py
  • RNN (LSTM): models/LSTMText.py
  • RCNN: models/RCNN.py
  • Inception: models/CNNText_inception.py
  • FastText: models/FastText3.py

3.1 Train models without data augmentation

# LSTM char
python2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_char' --weight=1 --model='LSTMText'  --batch-size=128  --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=10000  --type_='char'   --zhuge=True --linear-hidden-size=2000 --hidden-size=256 --kmax-pooling=3   --num-layers=3  --augument=False

# LSTM word
python2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_word' --weight=1 --model='LSTMText'  --batch-size=128  --lr=0.001 --lr2=0.0000 --lr_decay=0.5 --decay_every=10000  --type_='word'   --zhuge=True --linear-hidden-size=2000 --hidden-size=320 --kmax-pooling=2  --augument=False

#  RCNN char
python2 main.py main --max_epoch=5 --plot_every=100 --env='rcnn_char' --weight=1 --model='RCNN'  --batch-size=128  --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=5000  --title-dim=1024 --content-dim=1024  --type_='char' --zhuge=True --kernel-size=3 --kmax-pooling=2 --linear-hidden-size=2000 --debug-file='/tmp/debugrcnn' --hidden-size=256 --num-layers=3 --augument=False

# RCNN word
python2 main.py main --max_epoch=5 --plot_every=100 --env='RCNN-word' --weight=1 --model='RCNN'  --zhuge=True --num-workers=4 --batch-size=128 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=5000  --title-dim=1024 --content-dim=512  --kernel-size=3 --debug-file='/tmp/debugrc'  --kmax-pooling=1 --type_='word' --augument=False
# CNN word
python2 main.py main --max_epoch=5 --plot_every=100 --env='MultiCNNText' --weight=1 --model='MultiCNNTextBNDeep'  --batch-size=64  --lr=0.001 --lr2=0.000 --lr_decay=0.8 --decay_every=10000  --title-dim=250 --content-dim=250    --weight-decay=0 --type_='word' --debug-file='/tmp/debug'  --linear-hidden-size=2000 --zhuge=True  --augument=False

# inception word
python2 main.py main --max_epoch=5 --plot_every=100 --env='inception-word' --weight=1 --model='CNNText_inception'  --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='word' --augument=False                                                   
# inception char
python2 main.py main --max_epoch=5 --plot_every=100 --env='inception-char' --weight=1 --model='CNNText_inception'  --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='char'   --augument=False

# FastText3 word
python2 main.py main --max_epoch=5 --plot_every=100 --env='fasttext3-word' --weight=5 --model='FastText3' --zhuge=True --num-workers=4 --batch-size=512  --lr2=1e-4 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --linear_hidden_size=2000 --type_='word'  --debug-file=/tmp/debugf --augument=False                           
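All of these flags are parsed by python-fire (main.py ends with fire.Fire(), as its tracebacks in the issues below show), so each --flag=value becomes a keyword argument of main(). A minimal sketch of this pattern, with illustrative option names rather than the repository's actual config:

import fire

class Config(object):
    # Illustrative defaults; the real defaults live in config.py.
    model = 'LSTMText'
    lr = 1e-3
    batch_size = 128

def main(**kwargs):
    opt = Config()
    for k, v in kwargs.items():
        setattr(opt, k, v)  # command-line flags override the defaults
    print('%s lr=%s batch_size=%s' % (opt.model, opt.lr, opt.batch_size))

if __name__ == '__main__':
    fire.Fire()

Running python2 main.py main --lr=0.01 --model='RCNN' would then print the overridden values.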

In most cases, the score can be boosted by fine-tuning. For example:

python2 main.py main --max_epoch=2 --plot_every=100 --env='LSTMText-word-ft' --model='LSTMText'  --zhuge=True --num-workers=4 --batch-size=256 --lr2=5e-5 --lr=5e-5 --decay-every=5000 --type_='word'  --model-path='checkpoints/LSTMText_word_0.409196378421'

3.2 Train models with data augmentation

Set --augument=True in the training command (the flag keeps this spelling in the code).
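This README does not describe the augmentation itself. Offered only as an assumption about what token-level augmentation for this task might look like, a plausible sketch is random token dropout plus shuffling:

import random

def augment(tokens, drop_prob=0.1, do_shuffle=True):
    # Hypothetical augmentation: randomly drop tokens, then shuffle the rest.
    kept = [tok for tok in tokens if random.random() > drop_prob]
    if do_shuffle:
        random.shuffle(kept)
    return kept if kept else tokens  # never return an empty sequence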

3.3 scores

Model                Score
CNN_word             0.4103
RNN_word             0.4119
RCNN_word            0.4115
Inception_word       0.4109
FastText_word        0.4091
RNN_char             0.4031
RCNN_char            0.4037
Inception_char       0.4024
RCNN_word_aug        0.41344
CNN_word_aug         0.41051
RNN_word_aug         0.41368
Inception_word_aug   0.41254
FastText3_word_aug   0.40853
CNN_char_aug         0.38738
RCNN_char_aug        0.39854

With model ensembling, the score reaches up to 0.433.

4 Test and Submit

4.1 Test

  • model: one of LSTMText, RCNN, MultiCNNTextBNDeep, FastText3, CNNText_inception
  • model-path: path to the pretrained model
  • result-path: where to save the predicted scores
  • val: whether to run on the validation set (True) or the test set (False)
# LSTM
python2 test.1.py main --model='LSTMText'  --batch-size=512  --model-path='checkpoints/LSTMText_word_0.411994005382' --result-path='/data_ssd/zhihu/result/LSTMText0.4119_word_test.pth'  --val=False --zhuge=True

python2 test.1.py main --model='LSTMText'  --batch-size=256 --type_=char --model-path='checkpoints/LSTMText_char_0.403192339135' --result-path='/data_ssd/zhihu/result/LSTMText0.4031_char_test.pth'  --val=False --zhuge=True
 
#RCNN
python2 test.1.py main --model='RCNN'  --batch-size=512  --model-path='checkpoints/RCNN_word_0.411511574999' --result-path='/data_ssd/zhihu/result/RCNN_0.4115_word_test.pth'  --val=False --zhuge=True

python2 test.1.py main --model='RCNN'  --batch-size=512  --model-path='checkpoints/RCNN_char_0.403710422571' --result-path='/data_ssd/zhihu/result/RCNN_0.4037_char_test.pth'  --val=False --zhuge=True

# DeepText

python2 test.1.py main --model='MultiCNNTextBNDeep'  --batch-size=512  --model-path='checkpoints/MultiCNNTextBNDeep_word_0.410330780091' --result-path='/data_ssd/zhihu/result/DeepText0.4103_word_test.pth'  --val=False --zhuge=True
# more to go ...

4.2 Ensemble

See notebooks/val_ensemble.ipynb and notebooks/test_ensemble.ipynb for more details.
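At a high level, ensembling amounts to a weighted average of the per-model score tensors saved by test.1.py, followed by the top-5 decoding the task requires. A minimal sketch only — the file names and weights are illustrative, and the notebooks hold the actual recipe:

import torch as t

# Hypothetical: each result file holds a (num_questions, 1999) score tensor.
files = ['LSTMText0.4119_word_test.pth', 'RCNN_0.4115_word_test.pth']
weights = [1.0, 1.0]

total = None
for path, w in zip(files, weights):
    scores = t.load(path)
    total = w * scores if total is None else total + w * scores

top5 = total.topk(5, dim=1)[1]  # indices of the 5 predicted topics per question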

5 Main files

  • main.py: entry point for training
  • config.py: configuration file
  • test.1.py: for testing / generating submissions
  • data/: data loaders
  • scripts/: data preprocessing
  • utils/: score calculation and visualization wrappers
  • models/: models
    • models/BasicModel: base class for all models
    • models/MultiCNNTextBNDeep: CNN
    • models/LSTMText: RNN
    • models/RCNN: RCNN
    • models/CNNText_inception: Inception
    • models/MultiModelAll, models/MultiModelAll2: model ensembles
    • other models
  • rep.py: code for reproducing the results
  • del/: failed or unused approaches
  • notebooks/: notebooks

Pretrained model

https://pan.baidu.com/s/1mjVtJGs (password: tayb)

pytorchtext's People

Contributors: chenyuntc

pytorchtext's Issues

It is not that each class is treated as an independent 0/1 classification; all 1999 classes are scored in a single pass, and topk takes the 5 highest. Since the top 5 are always taken, there is no need to decide how many labels to predict.
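For reference, a minimal sketch of that decoding step (the batch size is toy data):

import torch as t

scores = t.randn(4, 1999)        # one score per topic, for 4 questions
top5 = scores.topk(5, dim=1)[1]  # indices of the 5 highest-scoring topics
print(top5.size())               # (4, 5)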

where is your word2vec module?

in embedding2matrix.py there is the code

import word2vec
import numpy as np

def main(em_file, em_result):
    '''
    embedding -> numpy
    '''
    em = word2vec.load(em_file)
    vec = em.vectors
    word2id = em.vocab_hash
    # d = dict(vector=vec, word2id=word2id)
    # t.save(d, em_result)
    np.savez_compressed(em_result, vector=vec, word2id=word2id)

if __name__ == '__main__':
    import fire
    fire.Fire()

but I cannot find any module named 'word2vec'.

Thank you.
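(For what it's worth, this appears to be the word2vec package on PyPI — installable with pip install word2vec — whose load() returns an object exposing .vectors and .vocab_hash, matching the snippet above.)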

A question about the labels

Is each topic label converted into a multi-hot vector of the form [0,1,0,0,1,......,0], or into some other format? Thanks!
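For reference, building such a multi-hot target over the 1999 topics would look like this (the variable names are illustrative):

import torch as t

num_classes = 1999
label_ids = [3, 17, 256]              # gold topic indices for one question (toy)

target = t.zeros(num_classes)
target[t.LongTensor(label_ids)] = 1   # 1 at each gold topic, 0 elsewhere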

Why is "RuntimeError: expected 3D tensor" raised?

# Imports implied by the snippet (torch is aliased as `t`, matching t.cat below):
import torch as t
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CNNText(nn.Module):
    def __init__(self):
        super(CNNText, self).__init__()
        self.encoder_tit = nn.Embedding(3281, 64)
        self.encoder_con = nn.Embedding(496037, 512)
        
        self.title_conv_1 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (1, 64)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
        )
        
        self.title_conv_2 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (2, 64)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=1),
        )

        self.content_conv_3 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (3, 512)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size = 50)
        )
        
        self.content_conv_4 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (3, 512)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size = 50)
        )
            
        self.content_conv_5 = nn.Sequential(
            nn.Conv1d(in_channels = 1,
                      out_channels = 1,
                      kernel_size = (3, 512)),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size = 50)
        )
        
        
            
        self.fc = nn.Linear(5, 9)

    def forward(self, title, content):
        title = self.encoder_tit(title)
        print(title.size())
        title_out_1 = self.title_conv_1(title)
        title_out_2 = self.title_conv_2(title)
        
        content = self.encoder_con(content)
        content_out_3 = self.content_conv_3(content)
        content_out_4 = self.content_conv_4(content)
        content_out_5 = self.content_conv_5(content)
            
        conv_out = t.cat((title_out_1,title_out_2,content_out_3,content_out_4,content_out_5),dim=1)
        logits = self.fc(conv_out)
        return F.log_softmax(logits)
cnnt = CNNText()

optimizer = optim.Adam(cnnt.parameters(), lr=.001)
Loss = nn.NLLLoss()

for epoch in range(50):
    loss = 0
    
    t = ''.join(title[epoch])
    c = ''.join(content[epoch])
    T, C = variables_from_pair(t, c)
#     print(T.squeeze(1).unsqueeze(0))
    T = T.squeeze(1).unsqueeze(0)
    C = C.squeeze(1).unsqueeze(0)
    optimizer.zero_grad()
    
    out = cnnt(T, C)
    target = cla[epoch]
    loss += Loss(out, target)
    
    loss.backward()
    optimizer.step()
    
print("Loss is {} at {} epoch".format(loss, epoch))

Error:

torch.Size([1, 3, 64])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-328d44896eef> in <module>()
     15     optimizer.zero_grad()
     16 
---> 17     out = cnnt(T, C)
     18     target = cla[epoch]
     19     loss += Loss(out, target)

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

<ipython-input-31-fe95ab78725e> in forward(self, title, content)
     52         title = self.encoder_tit(title)
     53         print(title.size())
---> 54         title_out_1 = self.title_conv_1(title)
     55         title_out_2 = self.title_conv_2(title)
     56 

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
     65     def forward(self, input):
     66         for module in self._modules.values():
---> 67             input = module(input)
     68         return input
     69 

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    152     def forward(self, input):
    153         return F.conv1d(input, self.weight, self.bias, self.stride,
--> 154                         self.padding, self.dilation, self.groups)
    155 
    156 

/home/quoniammm/anaconda3/envs/py3Tfgpu/lib/python3.6/site-packages/torch/nn/functional.py in conv1d(input, weight, bias, stride, padding, dilation, groups)
     81     f = ConvNd(_single(stride), _single(padding), _single(dilation), False,
     82                _single(0), groups, torch.backends.cudnn.benchmark, torch.backends.cudnn.enabled)
---> 83     return f(input, weight, bias)
     84 
     85 

RuntimeError: expected 3D tensor

The title is already a 3D tensor, so why does the RuntimeError say "expected 3D tensor"?
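A likely cause, offered as an assumption since the thread has no answer: nn.Conv1d expects an integer kernel_size and input of shape (batch, channels, length); passing a 2-tuple such as (1, 64) creates a 4D weight, which the conv1d backend rejects with this error. The conventional TextCNN setup convolves over the sequence dimension with the embedding size as the channel count, e.g. (in current PyTorch syntax):

import torch
import torch.nn as nn

embed = nn.Embedding(3281, 64)
conv = nn.Conv1d(in_channels=64, out_channels=100, kernel_size=2)  # int kernel

ids = torch.ones(1, 3, dtype=torch.long)  # (batch, seq_len) of token ids
e = embed(ids)                            # (1, 3, 64)
e = e.permute(0, 2, 1)                    # Conv1d wants (batch, channels, length)
out = conv(e)                             # (1, 100, 2)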

Hi, when I run main.py there is an error. Do you know why?

Traceback (most recent call last):
  File "main.py", line 159, in <module>
    fire.Fire()
  File "/usr/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/usr/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/usr/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 102, in main
    for ii, ((title, content), label) in tqdm.tqdm(enumerate(dataloader)):
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 417, in __iter__
    return DataLoaderIter(self)
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 242, in __init__
    self._put_indices()
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 290, in _put_indices
    indices = next(self.sample_iter, None)
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 119, in __iter__
    for idx in self.sampler:
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 50, in __iter__
    return iter(torch.randperm(len(self.data_source)).long())
RuntimeError: invalid argument 1: must be strictly positive at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/TH/generic/THTensorMath.c:2247
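(For reference, though the thread has no answer: this torch.randperm error means len(self.data_source) is 0, i.e. the Dataset is empty — most often because the preprocessed data files were not found at the paths set in config.py.)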
