abisee / pointer-generator
Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
License: Other
I have a problem with the attn_dists_projected part of the function _calc_final_dist (model.py, line 174).
https://github.com/abisee/pointer-generator/blob/master/model.py#L174
I thought that we should add all attention probabilities to the corresponding word entries.
For example, if the ith, jth and kth encoder words are the same word w, then we should add a_i, a_j and a_k to the same entry (the index of w in the vocabulary).
But tf.scatter_nd only updates the value rather than adding to it. I think this should use tf.scatter_nd_add. Please correct me if I'm misunderstanding something. Thanks!
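For what it's worth, this is easy to check directly. Per the TensorFlow documentation, tf.scatter_nd accumulates (sums) values at duplicate indices, while tf.scatter_nd_add is for mutating variables in place. A minimal toy sketch (not the repo's code; shapes and values are made up):

import tensorflow as tf

attn_dist = tf.constant([0.2, 0.3, 0.5])  # attention over 3 encoder positions
word_ids = tf.constant([[7], [2], [7]])   # positions 0 and 2 hold the same word
projected = tf.scatter_nd(word_ids, attn_dist, shape=[10])

with tf.Session() as sess:
    print(sess.run(projected))  # entry 7 comes out as 0.2 + 0.5 = 0.7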
I'm trying to run your code on another dataset, tokenized into .bin files as in your CNN/DM repo.
I'm running it with 32 GB RAM and an Nvidia 1080.
It looks memory-related, so I tried setting small values for some parameters and still got the issue:
python run_summarization.py --mode train --data_path ./data/train.bin --vocab_path ./data/vocab --log_root ./ --exp_name cnndm_summ --batch_size 4 --vocab_size 25000 --max_enc_steps 200 --max_dec_steps 100
[...]
INFO:tensorflow:Adding attention_decoder timestep 80 of 100
INFO:tensorflow:Adding attention_decoder timestep 81 of 100
INFO:tensorflow:Adding attention_decoder timestep 82 of 100
INFO:tensorflow:Adding attention_decoder timestep 83 of 100
INFO:tensorflow:Adding attention_decoder timestep 84 of 100
INFO:tensorflow:Adding attention_decoder timestep 85 of 100
INFO:tensorflow:Adding attention_decoder timestep 86 of 100
INFO:tensorflow:Adding attention_decoder timestep 87 of 100
INFO:tensorflow:Adding attention_decoder timestep 88 of 100
INFO:tensorflow:Adding attention_decoder timestep 89 of 100
INFO:tensorflow:Adding attention_decoder timestep 90 of 100
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
max_size of vocab was specified as 25000; we now have 25000 words. Stopping reading.
Finished constructing vocabulary of 25000 total words. Last word added: pencils
creating model...
Writing word embedding metadata file to ./cnndm_summ/train/vocab_metadata.tsv...
If you have any clue/advice, I'd appreciate it.
Thanks for releasing the code, paper & blog post, by the way.
How do folks typically generate summaries given i) new text files (only content, no headlines/abstracts/urls) and ii) the pretrained model?
It seems that make_datafiles.py from the cnn-dailymail repo needs to be modified (e.g. removing much of the hardcoding). After these modifications, make_datafiles.py can be used to tokenize and chunk the new data. From there, we "decode" using the pretrained model to generate new summaries.
Is the above generally correct or is there a more efficient method?
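A hedged sketch of the expected input format, based on my reading of make_datafiles.py and data.py (the helper name below is my own): each example in a .bin file appears to be a serialized tf.train.Example with 'article' and 'abstract' byte features, prefixed with an 8-byte length, and abstracts are wrapped in <s> ... </s> sentence tags:

import struct
import tensorflow as tf

def write_bin_file(pairs, out_path):
    """pairs: list of (article, abstract) strings, already tokenized and lowercased."""
    with open(out_path, 'wb') as writer:
        for article, abstract in pairs:
            ex = tf.train.Example()
            ex.features.feature['article'].bytes_list.value.extend([article.encode()])
            ex.features.feature['abstract'].bytes_list.value.extend([abstract.encode()])
            ex_str = ex.SerializeToString()
            writer.write(struct.pack('q', len(ex_str)))             # 8-byte length prefix
            writer.write(struct.pack('%ds' % len(ex_str), ex_str))  # the example itself

For decode-only use on new text, the abstract could presumably be a dummy string such as '<s> placeholder </s>'.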
As this is based on the TextSum code, is there an option for multi-GPU training?
I have trained the model for about 80,000 iterations and the loss has decreased to about 0.000004, which is really low. But when I run the model in decode mode, the output is just "what what what what ..." or "why why why why ...". I am using the Quora dataset for training; there are about 150,000 duplicate question pairs. I wonder, have you had a similar experience in your training, and how did you fix it? Thanks!
Sorry, I misunderstood something in the code; I found everything I needed. Thank you for this awesome model!
Hello,
After running the code for about half an hour in train mode, the following error stopped the training process:
TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
The error appeared when the code was run with the following parameters:
CUDA_VISIBLE_DEVICES=5 python run_summarization.py --mode=train --data_path=/path-to-train/train.bin --vocab_path=/path-to-vocab/vocab --log_root=/path-to-log/log --exp_name=myexperiment2 --max_enc_steps=100 --max_dec_steps=50 --lr=0.01
The error also occurred after some hours when the learning rate was not specified.
The full traceback is provided below:
Caused by op u'gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3', defined at:
File "run_summarization.py", line 264, in <module>
tf.app.run()
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "run_summarization.py", line 250, in main
setup_training(model, batcher)
File "run_summarization.py", line 111, in setup_training
model.build_graph() # build the graph
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 298, in build_graph
self._add_train_op()
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 274, in _add_train_op
gradients = tf.gradients(loss_to_minimize, tvars, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 368, in _MaybeCompile
return grad_fn() # Exit early
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_grad.py", line 186, in _TensorArrayScatterGrad
grad = g.gather(indices)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 328, in gather
element_shape=element_shape)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2244, in _tensor_array_gather_v3
element_shape=element_shape, name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in init
self._traceback = _extract_stack()
...which was originally created as op u'seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3', defined at:
File "run_summarization.py", line 264, in
tf.app.run()
[elided 2 identical lines from previous traceback]
File "run_summarization.py", line 111, in setup_training
model.build_graph() # build the graph
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 295, in build_graph
self._add_seq2seq()
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 199, in _add_seq2seq
enc_outputs, fw_st, bw_st = self._add_encoder(emb_enc_inputs, self._enc_lens)
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 74, in add_encoder
(encoder_outputs, (fw_st, bw_st)) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, encoder_inputs, dtype=tf.float32, sequence_length=seq_len, swap_memory=True)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 350, in bidirectional_dynamic_rnn
time_major=time_major, scope=fw_scope)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 553, in dynamic_rnn
dtype=dtype)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 671, in dynamic_rnn_loop
for ta, input in zip(input_ta, flat_input))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 671, in
for ta, input in zip(input_ta, flat_input))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 381, in unstack
indices=math_ops.range(0, num_elements), value=value, name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 409, in scatter
name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2510, in _tensor_array_scatter_v3
name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
UnimplementedError (see above for traceback): TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArray_1"], dtype=DT_FLOAT, element_shape=<unknown>, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/TensorArrayGradV3, seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/range, gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/gradient_flow)]]
[[Node: seq2seq/output_projection/Softmax_24/_3175 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_5326_seq2seq/output_projection/Softmax_24", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
It looks like I'm able to run training and eval concurrently, and I'm seeing a generated vocab_metadata.tsv, but when I go to "Embeddings" in TensorBoard I get:
log_root/testrun/train/vocab_metadata.tsv is not a file
which is strange, since that file exists. Will the visualization work only when I run decode?
Hi,
In my case, the attention distribution, the indices on the encoder side, and the vocab distribution are all tensors. Their shapes are:
attn_dist -> [batch_size, num_decoder_steps, num_encoder_steps]: contains the attention distribution over all encoder time steps for each decoder time step.
indices -> [batch_size, num_encoder_steps]: contains the word indices at each encoder time step.
vocab_dist -> [batch_size, num_decoder_steps, vocab_size].
I need to project attn_dist onto vocab_dist using the indices to get the final_dist, following the same logic you have used here. The major difference is that I can't unpack the tensors into a list, as num_encoder_steps and num_decoder_steps are unknown in my case (inferred from the batch). I am looking for the correct use of scatter_nd for my case. Can you please help me?
I see that you add the encoder attention scores to the corresponding word's existing score in the vocabulary distribution at each decoder time step. But how, after this addition, is the final distribution still a probability distribution? You use this distribution directly for the loss computation, applying log to it without any softmax.
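Not an authoritative answer, but the arithmetic that seems to make this work: the final distribution is a convex combination of two already-normalized distributions (eq. 9 of the paper), so it sums to 1 without any further softmax:

\[ \sum_w P(w) \;=\; p_{\mathrm{gen}} \sum_w P_{\mathrm{vocab}}(w) \;+\; (1 - p_{\mathrm{gen}}) \sum_i a_i \;=\; p_{\mathrm{gen}} + (1 - p_{\mathrm{gen}}) \;=\; 1 \]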
Hi,
is there a maximum-training-step setting? For example, I want the maximum training step to be 10k, so that training stops after that.
Thanks.
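To my knowledge there is no such flag in the repo, but the training loop in run_summarization.py could be capped with one. A minimal sketch under that assumption (the function signature and the 'global_step' result key follow my reading of the code; treat them as assumptions):

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_integer('max_train_steps', 10000,
                            'Stop training after this many steps (0 = no limit).')

def run_training(model, batcher, sess_context_manager, sv, summary_writer):
    with sess_context_manager as sess:
        while True:  # the repo's loop otherwise runs until interrupted (assumed)
            batch = batcher.next_batch()
            results = model.run_train_step(sess, batch)
            if FLAGS.max_train_steps and results['global_step'] >= FLAGS.max_train_steps:
                break  # reached the requested step budget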
Hi~
Using the CNN/Daily Mail dataset, this system works well.
Then I just changed the dataset, creating .bin files with make_datafiles.py and keeping the other parameters, but the training log shows the loss decreasing sharply. So I tried setting the learning rate to 0.001 and trained again. The loss decreased from 12 to 11 in 547 steps. At step 1000 the loss was 8.17; at step 1200 it became 1.16; and it decreased to 0.0005 by step 4000.
There is no doubt that the model will not produce correct output.
I have checked batch.enc_batch_extend_vocab; it prints as below:
[[ 334 1382 6242 103 24 3296 241 296 62 150000
64 1114 221 58 16 15 93 5 154 1015
2707 59 101 177 235 4 5387 150000 4 221
1204 58 16 1212 5 84537 14 10 277 150001
5830 186 59 4 150002 5830 16287 184 396]
[ 4485 15 173 9 136 972 5124 41986 5382 939
6361 6728 26273 316 7177 3164 2650 674 309 230
6326 11 348 12 390 114 9415 5 11653 774
2840 21 309 230 4 364 1299 87 41986 5382
939 2074 358 258 110 4051 2761 7177 25514]
[ 1566 39 621 6 121 8799 938 1606 292 1606
939 3582 8799 21 8983 26 1057 20809 40671 20809
68 1606 7928 26 939 3582 21 192 236 20809
68 1606 26 729 1650 4 939 3582 8799 2238
5583 4 81 5009 505 568 48 4 1263]
[ 456 16 29598 4 3803 5274 20 5 1710 99
3088 4 6316 176 19774 62 10862 64 5 70
566 392 96 2879 7705 501 3507 16 3088 5694
5 9442 10862 8560 8 5967 2879 384 15864 1050
4 6416 9533 2300 9533 6416 59298 356 47838]
[ 14514 452 320 12 5783 1257 3639 6 357 35592
32166 23 4 19297 29786 3848 1078 1135 36 301
685 161 1544 4 1257 211 20 793 31 479
4 40736 517 12 10463 1164 127 2891 388 1715
1164 8408 5210 762 4600 1569 4444 150000 301]
[ 1019 1063 434 681 66 4 2255 487 1401 248
4487 59742 1063 441 3686 7584 226 59742 1063 441
4 1130 20 7 10823 710 77964 1063 17 4198
4 2935 3047 3472 175 177 676 140 11373 139
68 23 1019 1063 1019 236 434 8777 3803]
[ 12389 4032 17556 1450 19 38773 2259 2646 482 260
19 98 8161 4383 6 28 1209 31204 186 1194
8161 42849 38502 78383 5 19652 32 51 1319 8282
303 642 278 439 69 21486 12801 2181 6 245
27 511 50 8161 20 10496 818 6951 1698]
[ 1925 350 34 113 35 33334 2750 17948 6 150000
4 12936 42 86 17948 4 381 11519 23592 3947
11599 4 7147 2348 65 8162 699 2056 2310 1925
3402 508 10814 4392 1652 2675 611 1246 113 38183
10 70 27439 4 235 34064 6406 429 6338]
[ 176 16 1623 11907 3379 159 295 6207 10247 7443
738 150000 398 25 56 2890 1944 154 37 7497
489 423 214 5483 4443 324 5111 302 295 3339
1508 4018 150001 5368 1918 388 122 295 7443 9
12074 26006 340 39 1155 5368 1135 505 23609]
[ 8136 56 1623 3089 10 123 5348 359 238 5151
18 27 8136 150000 202 10 219 2369 1653 15
66 55 18 8136 591 8 2704 597 1296 18
27 8136 1783 8 2383 902 971 208 8719 503
18 8136 2424 8 3268 1033 4440 11870 18]
[ 5 6855 5112 14 4 55 11585 23 977 198
3255 1288 15077 516 474 666 884 4 386 150000
977 1519 607 330 4050 57 15077 516 22281 531
789 2474 12443 7899 5668 21 7070 889 659 10469
203 977 303 3492 15077 516 4 757 99748]
[ 348 12 1734 1541 719 56 5737 1494 723 3970
4090 58354 6589 23431 150000 2523 31 38817 338 9
3923 150001 11233 10 37245 2009 137 22433 33885 1237
1141 6148 137 1827 278 13465 17452 338 19 919
5737 880 21829 2203 99 201 4833 90369 11937]
[ 589 266 1447 787 959 192 109 1818 102 583
101 1176 589 1447 313 3346 360 890 2971 192
548 62 19867 474 64 4 589 2022 43774 2022
35 67 15800 2333 7477 143086 25228 87109 1889 92349
87639 5926 1142 23269 21 139 3194 67 10848]
[ 8189 12 1453 244 10 223 7012 998 1113 316
37 7497 75 159 4 135 146 4712 5114 1013
1627 3021 88 34247 47 8426 1696 414 4079 27565
2953 23567 169 159 29 1602 26 10247 1141 1173
51509 21 973 49 2724 312 127 4045 74211]
[ 181 174 19 139373 35 3444 10828 12818 1756 8959
5 12315 13 1325 1151 4796 7656 1693 170 808
1151 5013 39 1353 665 77126 978 895 4796 258
113 2029 25221 5232 343 11900 587 4725 333 20412
1101 98 73789 4 4796 604 81 1153 28]
[ 75 136 150000 120 469 434 5 438 4879 924
11788 4714 3530 4 121 120 11836 438 4 42018
9 13785 4 1948 89 9 2861 561 8419 438
1948 3971 42 10 65627 6520 6 12615 3246 987
3246 5 391 1163 3006 2743 1196 136 5460]]
where the vocab size is 150k.
So maybe the inputs are fine?
I guess changing only the learning rate will not help to train a correct model.
Any suggestions would be appreciated.
I'm interested in implementing something similar in a different framework, and I was wondering how out-of-vocabulary word vectors are handled. From the code I can see that for each example there's a vocab size of 50k plus the number of unknown words in the article. I'm not super familiar with TensorFlow, and I was wondering what the out-of-vocabulary word vectors look like. Is the vocab size really 50k + the maximum number of unknown words in an article? Or are the word vectors for the unknowns set to zero / randomized for every example?
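As I read data.py, the temporary OOV ids are never used for embedding lookup at all: the encoder/decoder inputs replace OOVs with UNK (so there is one shared UNK vector), and the ids above vocab_size only index the extended copy distribution. A hedged paraphrase of the relevant helper (the name follows data.py, but treat the details as assumptions):

def article2ids(article_words, vocab):
    """Map article words to ids; OOVs get temporary ids vocab_size + oov_index."""
    ids, oovs = [], []
    unk_id = vocab.word2id('[UNK]')
    for w in article_words:
        i = vocab.word2id(w)
        if i == unk_id:                      # out-of-vocabulary word
            if w not in oovs:
                oovs.append(w)
            ids.append(vocab.size() + oovs.index(w))
        else:
            ids.append(i)
    return ids, oovs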
Recently I used your project on another dataset, in particular the coverage loss, but I can't reproduce its ability to avoid repetition. I'm sure I followed the instructions in your paper, so I want to know whether there is anything else I need to be aware of.
As we can see in the coverage loss, sum(min(a, c)): if the decoding is long enough, c may be a vector whose values are all bigger than 1, and then sum(min(a, c)) may stay at 1 forever. So is there some improvement?
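For reference, the coverage vector and coverage loss as defined in the paper:

\[ c^t = \sum_{t'=0}^{t-1} a^{t'}, \qquad \mathrm{covloss}_t = \sum_i \min\left(a_i^t, c_i^t\right) \]

Since \(\sum_i a_i^t = 1\), the loss is bounded above by 1, and once \(c_i^t \ge a_i^t\) for every i it equals exactly 1, which matches the saturation described above.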
Hi,
Thanks for releasing the code!
In computing p_gen in this line, your input to the linear projection includes the decoder cell state and the decoder prev_output (eq. 8 of the paper).
Is there any reason why both the cell state and prev_output are used? Wouldn't that be redundant, since the output subsumes the cell state (via the output gate)? Have you tried using just the output?
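For reference, eq. 8 of the paper computes the generation probability from the context vector \(h_t^*\), the decoder state \(s_t\), and the decoder input \(x_t\):

\[ p_{\mathrm{gen}} = \sigma\left( w_{h^*}^{T} h_t^{*} + w_s^{T} s_t + w_x^{T} x_t + b_{\mathrm{ptr}} \right) \]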
Hi,
Sorry for bothering you. In enc_batch [hps.batch_size, None], the length of axis 1 is the max number of article words over the examples in the batch. In dec_batch [hps.batch_size, None], the length of axis 1 is max_dec_steps. Can I replace it with the max number of abstract words per example, and is there any difference between them?
Thanks a lot.
Thanks for your impressive work.
I have trained the model and run this command:
python run_summarization.py --mode=decode --data_path=/path/to/val.bin --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment
but the screen prints like this:
INFO:tensorflow:ARTICLE: -lrb- cnn -rrb- a grand jury in clark county , nevada , has indicted a 19-year-old man accused of fatally shooting his neighbor in front of her house last month . erich nowsch jr. faces charges of murder with a deadly weapon , attempted murder and firing a gun from within a car . police say nowsch shot tammy meyers , 44 , in front of her home after the car he was riding in followed her home february 12 . nowsch 's attorney , conrad claus , has said his client will argue self-defense . the meyers family told police that tammy meyers was giving her daughter a driving lesson when there was a confrontation with the driver of another car . tammy meyers drove home and sent her inside to get her brother , brandon , who allegedly brought a 9mm handgun . tammy meyers and her son then went back out , police said . they encountered the other car again , and there was gunfire , police said . investigators found casings from six .45 - caliber rounds at that scene . nowsch 's lawyer said after his client 's first court appearance that brandon meyers pointed a gun before anyone started shooting . he said the family 's story about a road-rage incident and what reportedly followed do n't add up . after zipping away from the first shooting , tammy meyers drove home and the other car , a silver audi , went there also . police said nowsch shot at both tammy and brandon meyers . tammy meyers was hit in the head and died two days later at a hospital . brandon meyers , who police said returned fire at the home , was not injured . the driver of the silver audi has yet to be found by authorities . that suspect was n't named in thursday 's indictment . nowsch was arrested five days after the killing in his family 's house , just one block away from the meyers ' home . he is due in court tuesday for a preliminary hearing .
INFO:tensorflow:REFERENCE SUMMARY: erich nowsch will face three charges , including first-degree murder . he is accused of killing tammy meyers in front of her home . the two lived !!withing!! walking distance of each other .
INFO:tensorflow:GENERATED SUMMARY: . arrest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The GENERATED SUMMARY is ". arrest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .", and it is the same for the other test articles.
How can I fix it?
I have fixed the bug in beam_search.py following your latest commit, but that doesn't solve this problem.
When data.py encounters a word ID that is not in the vocabulary, the program terminates because FLAGS is not defined:
assert FLAGS.pointer_gen, "Error: model produced a word ID that isn't in the vocabulary"
Adding the lines below solves the problem:
import tensorflow as tf
FLAGS = tf.app.flags.FLAGS
Strange fact: I've been trying to train summarization with the default parameters and the default dataset.
It turns out that the GPU is detected (2017-06-01 17:57:05.548650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)) and some GPU memory is allocated for the process, but nothing is running on it (load: 0%). It sometimes peaks at around 27% for a very short time.
The GPU is a single 1080 with 8 GB RAM (4617 MiB allocated by the process).
Any idea?
I tried to train and test the system on our small-scale dev set, but it does not overfit, i.e. it does not show very high performance on the data it was trained on.
In addition, its performance on the dev set seems better when it is trained on our large-scale training set.
Has anyone observed similar results? It's really strange.
Hi,
Is it possible to add an open source license? Something like MIT License?
Thanks
Has anyone been able to successfully replicate the model from the paper? I've been training for about two weeks (over 240k iterations) using the published parameters, plus an additional ~3k iterations with coverage. Here is what my training loss looks like: [training-loss plot not reproduced here]
It's unclear what caused the increase starting around iteration 180k, but even before that the output was not looking great.
Here are some REF (gold) and DEC (system) summaries. As you can see, they are qualitatively bad. Unfortunately, at the moment I can't figure out how to get pyrouge to run, so I can't quantify the performance relative to the published results.
000000_reference.txt
REF: marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .
DEC: robin 's comments are aware of any video footage , german paris match . he 's accused into the crash of germanwings flight 9525 flight . prosecutor : `` it is a very disturbing scene ''

000001_reference.txt
REF: membership gives the icc jurisdiction over alleged crimes committed in palestinian territories since last june . israel and the united states opposed the move , which could open the door to war crimes investigations against israelis .
DEC: palestinians signed icc 's founding rome statute of alleged crimes in palestinian territories . israel says `` in the occupied palestinian territory to immediately end and injustice , she says . it 's founding rome .

000002_reference.txt
REF: amnesty 's annual death penalty report catalogs encouraging signs , but setbacks in numbers of those sentenced to death . organization claims that governments around the world are using the threat of terrorism to advance executions . the number of executions worldwide has gone down by almost 22 % compared with 2013 , but death sentences up by 28 % .
DEC: it 's death sentences and executions 2014 '' is some we are closer to abolition , to advance executions '' number of deterrence , `` a number of countries are abolitionist '' amnesty says he would not be used for the death penalty .

000003_reference.txt
REF: amnesty international releases its annual review of the death penalty worldwide ; much of it makes for grim reading . salil shetty : countries that use executions to deal with problems are on the wrong side of history .
DEC: soldiers who a china agreed to tackle a surge in death sentences to death . jordan ended china 's public mass sentencing is part a china 's northwestern xinjiang region . a sharp spike in december 2006 , 2014 , 2014 .

000004_reference.txt
REF: museum : anne frank died earlier than previously believed . researchers re-examined archives and testimonies of survivors . anne and older sister margot frank are believed to have died in february 1945 .
DEC: bergen-belsen concentration camp is believed death on march 31 , anne frank says . four the jewish diarist concentration camp , margot , margot , violent , died at the age of 15 . `` i am no more than a skeleton camp , '' witnesses say .
If anyone has had success reproducing the published model I would love to hear how you did it. I'm stumped.
Hello,
My question is about the paper rather than the code. Can I conclude that the coverage mechanism is an alternative to the input-feeding approach of "Effective Approaches to Attention-based Neural Machine Translation", since we use both to keep some memory of past alignments and avoid repetition? Please correct me if this is wrong. Thanks.
Hi,
I am getting too many errors while preparing the dataset; I will post them on the other GitHub page, but can you show me an example of the input to your model?
Hi, Abigail See!
I have read the papers "Get To The Point: Summarization with Pointer-Generator Networks" and "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond". I found that the experimental results on the CNN/Daily Mail dataset in your paper are not the same as in Nallapati's.
So I wonder whether you have the source code from Nallapati, or perhaps you implemented Nallapati's model according to his paper.
Can you share the code of Nallapati et al. (2016)?
Thanks a lot!
While decoding, the return statement in run_beam_search in beam_search.py will raise an IndexError when the hyps array is empty.
Hi,
In attention_decoder.py, you compute attention scores over the encoder states in L92-L103.
I can't seem to find where you've masked the encoder features of size (batch_size, attn_length, attention_vec_size). Without masking, the computed attention scores will be incorrect: upon taking the softmax, probability mass will be assigned to encoder states that correspond to PAD tokens.
Also, the word probability over the extended vocabulary is computed using tf.scatter_nd. This does not work in the case of duplicate indices, as the output will be non-deterministic and not summed. So if the source text has more than one occurrence of an OOV word, it will become a problem.
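On the masking point, a common pattern (which I believe the repo applies in pointer_gen mode, though treat this sketch and its names as assumptions) is to mask after the softmax and renormalize:

import tensorflow as tf

def masked_attention(scores, enc_padding_mask):
    """scores: [batch, enc_steps]; enc_padding_mask: 1.0 for real tokens, 0.0 for PAD."""
    attn_dist = tf.nn.softmax(scores)               # may put mass on PAD tokens
    attn_dist *= enc_padding_mask                   # zero out PAD positions
    masked_sums = tf.reduce_sum(attn_dist, axis=1)  # per-example renormalizers
    return attn_dist / tf.reshape(masked_sums, [-1, 1])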
Thanks for sharing the awesome implementation.
When I try to increase the lengths of the document and summary from 400/100 to 800/200, I get the error 'Resource exhausted: OOM when allocating tensor'. The GPU has 12 GB of memory. How can I deal with this?
Thanks a lot.
Could you please make your trained model available?
Hi @abisee, I've implemented your model and looked at your test results. It seems that the number of decoded sentences is quite fixed; in my results the number is roughly 3. The weird thing is that the program even replicates the second generated sentence to obtain the third one.
Is there any constraint on the number of decoded sentences in the code? Or do you just let the program determine where to stop the whole summary, which would mean it actually learns to place commas and periods properly?
I am trying to run mode=train and concurrently run another instance with mode=eval.
However, on the eval side I am getting the following errors:
2017-07-20 12:19:30.099613: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match.
lhs shape= [50000,128] rhs shape= [4,128]
[[Node: save/Assign_13 = Assign[T=DT_FLOAT, _class=["loc:@seq2seq/embedding/embedding"], use_locking=true, validate_shape=true,
_device="/job:localhost/replica:0/task:0/gpu:0"](seq2seq/embedding/embedding, save/RestoreV2_13/_49)]]
INFO:tensorflow:Failed to load checkpoint from ./log/testexp/train. Sleeping for 10 secs...
File "util.py", line 41, in load_ckpt
Any idea what could be the reason?
When I run the model in train mode, this problem happens after several training steps (about 300); I have no idea what causes it.
This is the error:
feed_dict = self._make_feed_dict(batch)
File "/users2/ddning/parapharse_generation/pointer-generator/pointer-generator-master/model.py", line 49, in _make_feed_dict
feed_dict[self._enc_batch] = batch.enc_batch
AttributeError: 'NoneType' object has no attribute 'enc_batch'
See here.
Thanks for sharing the awesome implementation.
In batcher.py, the padding function below seems to pad the decoder input to max_len.
Is your model able to generate variable-length summaries eventually?
def pad_decoder_inp_targ(self, max_len, pad_id):
"""Pad decoder input and target sequences with pad_id up to max_len."""
while len(self.dec_input) < max_len:
self.dec_input.append(pad_id)
while len(self.target) < max_len:
self.target.append(pad_id)
In this line it seems that the text_format function is not defined. Could anyone please tell me what it is?
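A guess, since the linked line isn't shown here: text_format is usually the protobuf text-format module, which TensorFlow code imports explicitly. If it is missing, the following import (an assumption about this particular case) would define it:

from google.protobuf import text_format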
Hi, I have some difficulty understanding your code, in particular the word embedding of the decoder input. As for emb_enc_inputs, it uses the source input with all the OOV words replaced by UNK tokens, and that makes sense to me. But emb_dec_inputs uses the target input in which the OOV words have temporary ids. Why do you use that for embedding_lookup? Please correct me if I have made any mistake. Thanks in advance!
After much work I was finally able to get the model to train successfully (220k iterations, 400/100 max enc/dec steps, then 3k iterations with coverage, bringing the coverage loss to 0.2).
I'm noticing that when running the model (with beam size 4) on my own data (financial news), the decoded summaries are almost entirely extractive. What could be causing this? I've left the vocab size at the default of 50k. Is it possible that this is a result of having too small a vocab?
Example (bold added to the source to show extracted regions):
Decoded
air line pilots association said friday that 79 % of voting aviators approved a deal running through january 2019 that the union said provided industry-leading pay and benefits .
investors are closely watching the outcome of contract talks involving other united labor groups and at other carriers , concerned about a repeat of previous industry cycles when market conditions deteriorated .
pilots at delta air lines inc and southwest airlines co. both rejected proposed deals last year .
flight attendants have been unable to reach agreement on a joint contract even though they have been bargaining since 2012 .
Source
Pilots at United Continental Holdings Inc. overwhelmingly approved a two-year contract extension, continuing the momentum of the airline's efforts to restore labor peace and complete the integration of staff following its creation in a 2010 merger.
The Air Line Pilots Association said Friday that 79% of voting aviators approved a deal running through January 2019 that the union said provided industry-leading pay and benefits.
Investors are closely watching the outcome of contract talks involving other United labor groups and at other carriers, concerned about a repeat of previous industry cycles when record profits resulted in more generous deals that hobbled carriers' finances when market conditions deteriorated. Pilots at Delta Air Lines Inc and Southwest Airlines Co. both rejected proposed deals last year.
The United contract provides for higher pay, restores benefits for previously furloughed pilots, and enhances scheduling rules for long-haul flights, according to the pilot union. Both sides declined to provide contract details, though a person familiar with the situation said it included a 13% pay rise this year followed by a 3% increase in 2017 and 2% in 2018.
United executives said this week that the pilot contract and a new deal being considered by its technicians would raise its unit costs excluding fuel by 2.5 percentage points this year compared with 2015. The airline forecast its unit costs this year excluding fuel and the two labor deals would rise by between 0.5% and 1.5%.
The airline has suffered from rocky labor relations since its 2010 merger with Continental Airlines and has been trying to smooth that friction under new management led by Chief Executive Oscar Munoz that took over in September.
United has fresh deals or tentative agreements with the majority of its unionized staff, with the results on a new contract for its mechanics due to be revealed next week, but the toughest challenge remains securing a pact with flight attendants.
Flight attendants have been unable to reach agreement on a joint contract even though they have been bargaining since 2012. United presented fresh proposals covering pay and work practices last week in talks brokered by federal mediators. Flight attendants this week held a world-wide protest in pursuit of a joint contract.
The airline in November reached a deal to start negotiations with the International Association of Machinists union more than a year before the contract covering 30,000 ramp workers, customer-service agents and reservation staff opens for renewal at the end of 2016. It also reached a new proposed joint labor agreement for its 9,000 mechanics, with the International Brotherhood of Teamsters union due to issue the results on Jan. 25.
Pilots at Delta, its only union-represented labor group, resumed contract talks last month after rejecting a deal endorsed by union leaders.
Southwest pilots turned down a proposed deal in November, with flight attendants having rejected a tentative new pact earlier in the year.
Without pointer_gen mode, the loss is calculated by tf.contrib.seq2seq.sequence_loss. With pointer_gen mode, however, the loss minimizes the negative log-likelihood of the target words. The loss makes sense to me, but I wonder: couldn't we also minimize the negative log-likelihood of the target words without pointer_gen mode?
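For concreteness, here is a hedged paraphrase of the pointer_gen loss (names follow my reading of model.py; treat the details as assumptions). It gathers the probability of each gold word from the final distribution and averages the masked negative logs; in principle the same computation could be run on the plain vocab distributions, which is essentially what sequence_loss computes from logits:

import tensorflow as tf

def nll_loss(final_dists, target_batch, dec_padding_mask, batch_size):
    """final_dists: list (one per decoder step) of [batch, extended_vsize] tensors."""
    loss_per_step = []
    for dec_step, dist in enumerate(final_dists):
        targets = target_batch[:, dec_step]                        # gold ids, [batch]
        indices = tf.stack([tf.range(batch_size), targets], axis=1)
        gold_probs = tf.gather_nd(dist, indices)                   # P(gold word), [batch]
        loss_per_step.append(-tf.log(gold_probs))
    losses = tf.stack(loss_per_step, axis=1) * dec_padding_mask    # zero out padding steps
    return tf.reduce_mean(tf.reduce_sum(losses, axis=1) /
                          tf.reduce_sum(dec_padding_mask, axis=1))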
Thanks for this awesome implementation!
I have a question regarding the dimension of p_gens. Why is it a list (with length equal to the number of decoder steps) of scalars, and not a list of [batch_size] tensors?
Hi @abisee,
I'm training the network and the TensorFlow loss is going down, but really, really slowly. Frankly, I have always worked with CNNs, and they're much faster to train. I just want to get a sense of what a good value for the training loss is, at which point I can say the network has learnt something meaningful. I read the issues, and you mentioned your model took 230k iterations to reach the results of the paper, and that you used a schedule where you changed some hyperparameters to speed up the initial iterations. I'm going for more of a brute-force approach right now, as I'm trying to get a proof of concept ready for my project rather than final results.
I'm currently at a TensorFlow loss of about 2.2 after 2-3 days of training; it started at over 6. What is a good value to stop at, or what was your model's value for the results in the paper?
Thanks :)
I trained the model with the default parameters and evaluated at step 27847. (I didn't do the coverage training, just launched training with the default parameters.)
These are the results I got:
ROUGE-1:
rouge_1_f_score: 0.6100 with confidence interval (0.6087, 0.6113)
rouge_1_recall: 0.5746 with confidence interval (0.5726, 0.5767)
rouge_1_precision: 0.6785 with confidence interval (0.6767, 0.6803)
ROUGE-2:
rouge_2_f_score: 0.4800 with confidence interval (0.4788, 0.4813)
rouge_2_recall: 0.4533 with confidence interval (0.4515, 0.4552)
rouge_2_precision: 0.5325 with confidence interval (0.5311, 0.5339)
ROUGE-l:
rouge_l_f_score: 0.5986 with confidence interval (0.5973, 0.5999)
rouge_l_recall: 0.5638 with confidence interval (0.5618, 0.5659)
rouge_l_precision: 0.6658 with confidence interval (0.6640, 0.6676)
How should I interpret these with respect to the results in the paper? Where could the problem be, given that the paper reports a ROUGE-1 of 0.3953 and the above is 0.61?
From the paper:
"When a word appears multiple times in the source text, we sum probability mass from all corresponding parts of the attention distribution."
However, I wasn't able to find the part of the code that sums up the probability mass when an index appears multiple times in the encoder input. As used in this line, tf.scatter_nd doesn't apply a summation when there are repeated indices, and I can't see an earlier point in the code where this operation is performed. I only ask such a detailed question because I'm trying to implement the copy mechanism, and this is a particularly tricky part to do without messing up the autograd mechanics in PyTorch.
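In case it helps with the PyTorch port: a hypothetical sketch (toy shapes, not this repo's code) of summing attention mass for repeated source words with scatter_add, which accumulates at duplicate indices and is differentiable with respect to the attention weights:

import torch

batch, enc_steps, ext_vsize = 2, 4, 10
attn_dist = torch.softmax(torch.randn(batch, enc_steps), dim=1)  # copy weights
enc_ids = torch.randint(0, ext_vsize, (batch, enc_steps))        # may contain repeats
copy_dist = torch.zeros(batch, ext_vsize).scatter_add(1, enc_ids, attn_dist)
# repeated ids in enc_ids have their attention weights summed in copy_dist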
Hi Abigail. I was trying to run the code using the already-uploaded pretrained model, as I do not have a powerful enough machine to train. I believe the vocab size is set to 50000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive; I didn't really encounter any situation where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?
Here is the error message:
Caused by op u'save/RestoreV2_19', defined at:
File "run_summarization.py", line 451, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "run_summarization.py", line 430, in main
setup_training(model, batcher)
File "run_summarization.py", line 216, in setup_training
saver = tf.train.Saver(max_to_keep=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1040, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1070, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 675, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key seq2seq/decoder/attention_decoder/coverage/w_c not found in checkpoint
[[Node: save/RestoreV2_19 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_19/tensor_names, save/RestoreV2_19/shape_and_slices)]]
Dear repository maintainer,
I would like to test the results on my side. Would it be possible for you to share the trained TensorFlow model on which you tested the results? I have checked the output and found the results interesting. Kindly let me know whether you can share the trained model, so that if I need to train it further I can start from where you stopped, i.e. from your model. Do let me know.
These are the results included with the pretrained model:
ROUGE-1:
rouge_1_f_score: 0.3882 with confidence interval (0.3860, 0.3904)
rouge_1_recall: 0.4146 with confidence interval (0.4120, 0.4174)
rouge_1_precision: 0.3884 with confidence interval (0.3857, 0.3911)
ROUGE-2:
rouge_2_f_score: 0.1681 with confidence interval (0.1658, 0.1703)
rouge_2_recall: 0.1792 with confidence interval (0.1768, 0.1816)
rouge_2_precision: 0.1688 with confidence interval (0.1663, 0.1713)
ROUGE-l:
rouge_l_f_score: 0.3571 with confidence interval (0.3548, 0.3592)
rouge_l_recall: 0.3811 with confidence interval (0.3787, 0.3837)
rouge_l_precision: 0.3576 with confidence interval (0.3549, 0.3602)
I have downloaded the updated code and rerun the decoding using the pretrained model on my machine, and I got:
ROUGE-1:
rouge_1_f_score: 0.3551 with confidence interval (0.3529, 0.3574)
rouge_1_recall: 0.3857 with confidence interval (0.3829, 0.3885)
rouge_1_precision: 0.3512 with confidence interval (0.3486, 0.3539)
ROUGE-2:
rouge_2_f_score: 0.1501 with confidence interval (0.1480, 0.1522)
rouge_2_recall: 0.1631 with confidence interval (0.1608, 0.1655)
rouge_2_precision: 0.1486 with confidence interval (0.1464, 0.1508)
ROUGE-l:
rouge_l_f_score: 0.3241 with confidence interval (0.3219, 0.3263)
rouge_l_recall: 0.3518 with confidence interval (0.3491, 0.3545)
rouge_l_precision: 0.3208 with confidence interval (0.3182, 0.3234)
Is it because it was run on a different machine? Thanks!
Hello!
When training, the loss is NaN and the coverage_loss is NaN.
What is wrong?
Hello, thank you for your work.
With the default settings on a 1080 and TF 1.0, I'm getting about 13 seconds per batch of size 16, which would mean one epoch takes about 3 days; this is clearly off. Do you have any idea what may be causing the slowdown?
Hi,
In the paper, P(w) = p_gen * P_vocab(w) + (1 - p_gen) * P_copy(w), where P_copy(w) is the sum of all the attention weights a_i over positions i with w_i = w.
But I did not find the code block that calculates this summation of the a_i; could someone point it out to me?
Thanks~