
bidaf-pytorch's Introduction

BiDAF-pytorch

Re-implementation of BiDAF (Bidirectional Attention Flow for Machine Comprehension, Minjoon Seo et al., ICLR 2017) in PyTorch.

Results

Dataset: SQuAD v1.1

Model (single)       EM (%, dev)   F1 (%, dev)
Re-implementation    64.8          75.7
Baseline (paper)     67.7          77.3

Development Environment

  • OS: Ubuntu 16.04 LTS (64bit)
  • GPU: Nvidia Titan Xp
  • Language: Python 3.6.2
  • PyTorch: 0.4.0

Requirements

Please install the library requirements specified in requirements.txt first.

torch==0.4.0
nltk==3.2.4
tensorboardX==0.8
torchtext==0.2.3

Execution

python run.py --help

usage: run.py [-h] [--char-dim CHAR_DIM]
          [--char-channel-width CHAR_CHANNEL_WIDTH]
          [--char-channel-size CHAR_CHANNEL_SIZE]
          [--dev-batch-size DEV_BATCH_SIZE] [--dev-file DEV_FILE]
          [--dropout DROPOUT] [--epoch EPOCH] [--gpu GPU]
          [--hidden-size HIDDEN_SIZE] [--learning-rate LEARNING_RATE]
          [--print-freq PRINT_FREQ] [--train-batch-size TRAIN_BATCH_SIZE]
          [--train-file TRAIN_FILE] [--word-dim WORD_DIM]

optional arguments:
  -h, --help            show this help message and exit
  --char-dim CHAR_DIM
  --char-channel-width CHAR_CHANNEL_WIDTH
  --char-channel-size CHAR_CHANNEL_SIZE
  --dev-batch-size DEV_BATCH_SIZE
  --dev-file DEV_FILE
  --dropout DROPOUT
  --epoch EPOCH
  --gpu GPU
  --hidden-size HIDDEN_SIZE
  --learning-rate LEARNING_RATE
  --print-freq PRINT_FREQ
  --train-batch-size TRAIN_BATCH_SIZE
  --train-file TRAIN_FILE
  --word-dim WORD_DIM
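
For example, a training run with non-default settings might look like the following (the flag names come from the help output above; the values are only illustrative, not recommended settings):

python run.py --epoch 12 --train-batch-size 60 --dev-batch-size 100 --gpu 0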

bidaf-pytorch's People

Contributors

dependabot[bot], galsang, mottled233, saravsak


bidaf-pytorch's Issues

GPU memory issues

First of all, fantastic work on this implementation of BiDAF, very compact and readable!

I am, however, having strange trouble with GPU memory consumption. With a train batch size of 10, a dev batch size of 50, and a context threshold of 400, it uses up to 10 GB of memory during training. Full disclosure: I'm using a version of SQuAD machine-translated (via Google Translate) into a different language, but with the context threshold set to 400 I don't expect this to make a significant difference. There is also no linear relationship at all between batch size and memory consumption; for example, I can almost fit a train batch size of 20 without running out of memory (my card has 12 GB of memory). Any idea what might cause this behaviour? Were you able to train the network at batch sizes 60/100 with 12 GB of GPU RAM?
EDIT: For reference, I have been able to train BiDAF on the same dataset and the same hardware using the authors' TF implementation at batch size 50.

I also noticed a couple of minor issues:

  1. Not specifying a dimension for torch.squeeze() in the forward function of the model causes a crash with batch size 1. I am far from a PyTorch expert, so I can't say what best practice is, but to me it seems good to always specify the dimension argument to avoid these kinds of issues. :)
  2. If the maximum word length across all questions in a batch is less than the char channel width, the forward function crashes. I solved this by adding the following function as a postprocessing function to the CHAR_NESTING field in the SQuAD data module, which simply pads the characters (a sketch of how to hook it up follows the function):
def char_postprocessing(batch, vocab):
    # Pad every word's character ids on both sides so that each word is long
    # enough for the char-CNN kernel (pad_length = 2 assumes a kernel width of 5).
    padded_batch = []
    pad_length = 2

    for chars in batch:
        pad_front = [vocab.stoi['<pad>'] for _ in range(pad_length)]
        pad_behind = [vocab.stoi['<pad>'] for _ in range(pad_length)]
        padded_batch.append(pad_front + chars + pad_behind)

    return padded_batch
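
For completeness, attaching it to the field might look roughly like this (a sketch assuming the legacy torchtext Field API; the other field arguments are illustrative, and older torchtext versions pass a third train argument to postprocessing):

CHAR_NESTING = data.Field(batch_first=True,
                          tokenize=list,
                          postprocessing=char_postprocessing)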

Some questions about the implementation in the utils package

Hi!
I'm a beginner in PyTorch and deep learning, and I'm now considering implementing the BiDAF model myself. I think your code is very clean, and I love it!
Still, I don't understand why we need to re-implement the Linear module and the LSTM layer ourselves. Are the modules implemented in torch.nn not good enough? Or is it just more convenient for modifying the hyper-parameters?
Thanks a lot!

In the att_flow_layer of the BiDAF model

Hi, I have just started learning QA models, and thank you so much for sharing this.
I found that the attention you wrote is a little different from the original paper:
on line 141 of model.py
s = self.att_weight_c(c).expand(-1, -1, q_len) + \
    self.att_weight_q(q).permute(0, 2, 1).expand(-1, c_len, -1) + \
    cq
However, the paper uses [h; u; h ◦ u], which is 6d after concatenation; that looks different from your formulation above.
Does it make a difference?
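
For what it's worth, the two forms are algebraically equivalent: splitting the paper's 6d weight vector w into [w_c; w_q; w_cq] turns w^T [h; u; h ◦ u] into w_c^T h + w_q^T u + w_cq^T (h ◦ u), which is exactly the three-term sum above (with cq supplying the w_cq^T (h ◦ u) part). A tiny numerical sanity check, with illustrative sizes:

import torch

d2 = 10                          # 2d, the size of the contextual embeddings
h = torch.randn(d2)              # one context position
u = torch.randn(d2)              # one question position
w = torch.randn(3 * d2)          # the paper's 6d similarity weight vector
w_c, w_q, w_cq = w.split(d2)     # the three slices used by the implementation

paper = w @ torch.cat([h, u, h * u])
implementation = w_c @ h + w_q @ u + w_cq @ (h * u)
print(torch.allclose(paper, implementation))   # True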

What is the purpose of the file ema.py?

Hi, thank you for this clear implementation of BiDAF. I am a beginner in PyTorch, so I am confused about the purpose of ema.py. My guess is that one function is to save the parameters that are trainable during training, but I don't understand the update method. Could you please explain why you use this in the implementation? Thank you again.

def update(self, name, x):
    assert name in self.shadow
    new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]
    self.shadow[name] = new_average.clone()
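
For context, the update formula above is an exponential moving average (EMA) of the weights; such a class is typically used along these lines (a minimal sketch of the usual pattern; the register method and the loops are assumptions, not necessarily the exact API in ema.py):

# Register each trainable parameter once, before training starts.
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# After every optimizer step, move the shadow copy toward the new weights.
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.update(name, param.data)

# At evaluation time, the shadow (averaged) weights are copied into a copy of
# the model, which usually generalises slightly better than the raw weights.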

Why not use PyTorch LSTM and Linear

Hi Taeuk! Thank you for implementing BiDAF in such an elegant way! Why do you use your own versions of LSTM and Linear in utils.nn instead of the PyTorch LSTM and Linear?

Some confusion about the Contextual Embedding Layer

Hi, thanks for sharing this reproduction code. I am a little confused about the Contextual Embedding Layer. There seem to be two possible designs: one uses the same BiLSTM for both the passage and the question, and the other uses separate BiLSTMs for the passage and the question. You use the first design, sharing the same BiLSTM, is that right?

Some confusion about the Char Emb Layer in the BiDAF class

Hi. Thank you for sharing your code. It has been a great help and I have learnt a lot from it.

I'm a little confused about padding_idx=1 in self.char_emb. In my understanding, this parameter means that when the embedding layer receives a 1, it outputs a zero vector.
Is there any special effect for the character corresponding to index 1?
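
For reference, index 1 is what torchtext assigns to the <pad> token by default (index 0 is <unk>), and nn.Embedding's padding_idx simply fixes that embedding row at zero and excludes it from gradient updates. A quick check:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=1)
print(emb(torch.tensor([1])))   # the row for index 1 is all zeros
# That row also never receives gradient updates during training.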

The other question is about the CharCNN. I glanced at the official code, which is implemented in TF, and found that there is a ReLU after the CNN. I tried applying the ReLU in my code, but the result decreased by 5%. So why didn't you use it? Was this an oversight, or is there another reason?

Error: RawField object has no attribute 'is_target'

Thanks for your sharing; your code is readable! When I cloned the project and tried to run it, I got the error "RawField object has no attribute 'is_target'". I'm a noob; can you tell me what the problem might be?
Thank you!

can't optimize a non-leaf Tensor

Thanks for your sharing, but I encounter the following error when putting the parameters into the optimizer.
Could you help me resolve this problem? Thanks!

NLTK error


Resource 'tokenizers/punkt/PY3/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/sunshen/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I am connected to the server remotely; how can I solve this problem?
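
For reference, the punkt tokenizer data can be fetched non-interactively from a Python shell, which also works over SSH on a headless server (the default download directory is shown in the comment; this is a sketch, not project code):

import nltk

# Downloads to the default ~/nltk_data; pass download_dir to choose another path.
nltk.download('punkt')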

RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index' (a small bug when running on GPU)

There is a small bug when running this code on the GPU, caused by the torchtext version!

When I run this code on the GPU, I unfortunately get:

RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'

So I guess the tensor may not be on the GPU, and there is also this warning:

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to CPU.

I changed the device parameter as follows, and it works now.

# Build a torch.device so the iterators place batches directly on the GPU.
device = torch.device(f"cuda:{args.gpu}" if torch.cuda.is_available() else "cpu")
self.train_iter, self.dev_iter = \
    data.BucketIterator.splits((self.train, self.dev),
                               batch_sizes=[args.train_batch_size, args.dev_batch_size],
                               device=device,
                               sort_key=lambda x: len(x.c_word))

Maybe you could change it accordingly (●'◡'●)

CUDA out of memory will be fixed when PyTorch is updated to 1.0

Thanks for your code, it's very clear!
When I first cloned the code and ran it on my GTX 1060 6 GB GPU, it ran out of memory almost immediately. I tried deleting some intermediate variables in the BiDAF model, but it had no effect. Finally, I updated PyTorch to version 1.0 and ran it again: nvidia-smi showed only about 4 GB of GPU memory in use, even after raising train_batch_size to 128.
So cool!

Why change the char_dim and word_len dimensions and then use Conv2d?

Around lines 82~88 in model.py:
# (batch * seq_len, 1, char_dim, word_len)
x = x.view(-1, self.args.char_dim, x.size(2)).unsqueeze(1)
# (batch * seq_len, char_channel_size, 1, conv_len) -> (batch * seq_len, char_channel_size, conv_len)
x = self.char_conv(x).squeeze()
Why do we need to change the dimensions first? Why not use Conv1d directly?
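
For comparison, the Conv2d trick above (a single input channel with a kernel spanning the full char_dim height) is mathematically equivalent to a Conv1d that treats char_dim as the channel dimension; a minimal sketch with illustrative sizes:

import torch
import torch.nn as nn

batch_seq, char_dim, word_len = 32, 8, 16        # batch * seq_len, char embedding dim, word length
char_channel_size, char_channel_width = 100, 5   # number of filters, kernel width

x = torch.randn(batch_seq, char_dim, word_len)

# Conv2d formulation as in model.py: a 1-channel "image" of height char_dim,
# with a kernel that covers the full height.
conv2d = nn.Conv2d(1, char_channel_size, (char_dim, char_channel_width))
out2d = conv2d(x.unsqueeze(1)).squeeze(2)        # (batch_seq, char_channel_size, conv_len)

# Conv1d formulation: the char embedding dimensions act as input channels.
conv1d = nn.Conv1d(char_dim, char_channel_size, char_channel_width)
out1d = conv1d(x)                                # (batch_seq, char_channel_size, conv_len)

print(out2d.shape, out1d.shape)  # identical shapes; with shared weights the values match too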

Predictor?

Thank you for your nice implementation. The training went well.
I have been trying to build a predictor using your model, but I keep running into problems that I cannot resolve on my own.

Can you make a predictor out of your current code, to achieve something like
answer = predictor(context, question)

Thanks in advance
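
In case it helps, decoding an answer span from start/end probability distributions is usually done roughly like this (a sketch only; p1, p2, and max_span_len are assumptions about the interface, not code from this repository):

import torch

def decode_best_span(p1, p2, max_span_len=15):
    # p1, p2: (c_len,) start/end probability distributions for one example.
    # Pick (i, j) maximising p1[i] * p2[j] subject to i <= j < i + max_span_len.
    score = torch.ger(p1, p2)                                   # outer product, (c_len, c_len)
    score = torch.triu(score) - torch.triu(score, diagonal=max_span_len)
    start = score.max(dim=1)[0].argmax().item()
    end = score[start].argmax().item()
    return start, end                                           # token indices of the answer span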
