abisee / pointer-generator
Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"
License: Other
I have a problem with the attn_dists_projected part of the function _calc_final_dist (model.py, line 174).
https://github.com/abisee/pointer-generator/blob/master/model.py#L174
I thought that we should add all attention probabilities to the corresponding word entries.
For example, if the ith, jth and kth encoder words are the same word w, then we should add a_i, a_j and a_k to the same entry (the index of w in the vocabulary).
But tf.scatter_nd only updates the value rather than adding to it. I think this should use tf.scatter_nd_add. Please correct me if I'm misunderstanding something. Thanks!
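For what it's worth, this is easy to check directly. Per the TensorFlow documentation, tf.scatter_nd accumulates (sums) values at duplicate indices, while tf.scatter_nd_add is for mutating variables in place. A minimal toy sketch (not the repo's code; shapes and values are made up):

import tensorflow as tf

attn_dist = tf.constant([0.2, 0.3, 0.5])  # attention over 3 encoder positions
word_ids = tf.constant([[7], [2], [7]])   # positions 0 and 2 hold the same word
projected = tf.scatter_nd(word_ids, attn_dist, shape=[10])

with tf.Session() as sess:
    print(sess.run(projected))  # entry 7 comes out as 0.2 + 0.5 = 0.7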
I'm trying to run your code on another dataset, tokenized into .bin files as in your CNN/DM repo.
I'm running it with 32 GB RAM and an Nvidia 1080.
It looks memory-related, so I tried setting small values for some parameters and still got the issue:
python run_summarization.py --mode train --data_path ./data/train.bin --vocab_path ./data/vocab --log_root ./ --exp_name cnndm_summ --batch_size 4 --vocab_size 25000 --max_enc_steps 200 --max_dec_steps 100
[...]
INFO:tensorflow:Adding attention_decoder timestep 80 of 100
INFO:tensorflow:Adding attention_decoder timestep 81 of 100
INFO:tensorflow:Adding attention_decoder timestep 82 of 100
INFO:tensorflow:Adding attention_decoder timestep 83 of 100
INFO:tensorflow:Adding attention_decoder timestep 84 of 100
INFO:tensorflow:Adding attention_decoder timestep 85 of 100
INFO:tensorflow:Adding attention_decoder timestep 86 of 100
INFO:tensorflow:Adding attention_decoder timestep 87 of 100
INFO:tensorflow:Adding attention_decoder timestep 88 of 100
INFO:tensorflow:Adding attention_decoder timestep 89 of 100
INFO:tensorflow:Adding attention_decoder timestep 90 of 100
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
max_size of vocab was specified as 25000; we now have 25000 words. Stopping reading.
Finished constructing vocabulary of 25000 total words. Last word added: pencils
creating model...
Writing word embedding metadata file to ./cnndm_summ/train/vocab_metadata.tsv...
If you have any clue/advice, I'd appreciate it.
Thanks for releasing the code, paper & blog post, by the way.
How do folks typically generate summaries given i) new text files (only content, no headlines/abstracts/urls) and ii) the pretrained model?
It seems that make_datafiles.py from the cnn-dailymail repo needs to be modified (e.g. removing much of the hardcoding). After these modifications, make_datafiles.py can be used to tokenize and chunk the new data. From there, we "decode" using the pretrained model to generate new summaries.
Is the above generally correct or is there a more efficient method?
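A hedged sketch of the expected input format, based on my reading of make_datafiles.py and data.py (the helper name below is my own): each example in a .bin file appears to be a serialized tf.train.Example with 'article' and 'abstract' byte features, prefixed with an 8-byte length, and abstracts are wrapped in <s> ... </s> sentence tags:

import struct
import tensorflow as tf

def write_bin_file(pairs, out_path):
    """pairs: list of (article, abstract) strings, already tokenized and lowercased."""
    with open(out_path, 'wb') as writer:
        for article, abstract in pairs:
            ex = tf.train.Example()
            ex.features.feature['article'].bytes_list.value.extend([article.encode()])
            ex.features.feature['abstract'].bytes_list.value.extend([abstract.encode()])
            ex_str = ex.SerializeToString()
            writer.write(struct.pack('q', len(ex_str)))             # 8-byte length prefix
            writer.write(struct.pack('%ds' % len(ex_str), ex_str))  # the example itself

For decode-only use on new text, the abstract could presumably be a dummy string such as '<s> placeholder </s>'.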
As this is based on the TextSum code, is there an option for multi-GPU training?
I have trained the model for about 80,000 iterations and the loss has decreased to about 0.000004, which is really low. But when I run the model in decode mode, the output is just "what what what what ..." or "why why why why ...". I am using the Quora dataset for training; there are about 150,000 duplicate question pairs. I wonder, have you had a similar experience in your training, and how did you fix it? Thanks!
Sorry, I misunderstood something in the code; I found everything I needed. Thank you for this awesome model!
Hello,
After running the code for about half an hour in train mode, the following error stopped the training process:
TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
The error appeared when the code was run with the following parameters:
CUDA_VISIBLE_DEVICES=5 python run_summarization.py --mode=train --data_path=/path-to-train/train.bin --vocab_path=/path-to-vocab/vocab --log_root=/path-to-log/log --exp_name=myexperiment2 --max_enc_steps=100 --max_dec_steps=50 --lr=0.01
The error also occurred after some hours when the learning rate was not specified.
The full traceback is provided below:
Caused by op u'gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3', defined at:
File "run_summarization.py", line 264, in <module>
tf.app.run()
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "run_summarization.py", line 250, in main
setup_training(model, batcher)
File "run_summarization.py", line 111, in setup_training
model.build_graph() # build the graph
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 298, in build_graph
self._add_train_op()
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 274, in _add_train_op
gradients = tf.gradients(loss_to_minimize, tvars, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 368, in _MaybeCompile
return grad_fn() # Exit early
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 560, in
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_grad.py", line 186, in _TensorArrayScatterGrad
grad = g.gather(indices)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 328, in gather
element_shape=element_shape)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2244, in _tensor_array_gather_v3
element_shape=element_shape, name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in init
self._traceback = _extract_stack()
...which was originally created as op u'seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3', defined at:
File "run_summarization.py", line 264, in
tf.app.run()
[elided 2 identical lines from previous traceback]
File "run_summarization.py", line 111, in setup_training
model.build_graph() # build the graph
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 295, in build_graph
self._add_seq2seq()
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 199, in _add_seq2seq
enc_outputs, fw_st, bw_st = self._add_encoder(emb_enc_inputs, self._enc_lens)
File "/local/s1742159/neuralnetworks2017/pointer-generator/model.py", line 74, in add_encoder
(encoder_outputs, (fw_st, bw_st)) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, encoder_inputs, dtype=tf.float32, sequence_length=seq_len, swap_memory=True)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 350, in bidirectional_dynamic_rnn
time_major=time_major, scope=fw_scope)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 553, in dynamic_rnn
dtype=dtype)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 671, in dynamic_rnn_loop
for ta, input in zip(input_ta, flat_input))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 671, in
for ta, input in zip(input_ta, flat_input))
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 381, in unstack
indices=math_ops.range(0, num_elements), value=value, name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 409, in scatter
name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2510, in _tensor_array_scatter_v3
name=name)
File "/opt/neuralnetworks/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
UnimplementedError (see above for traceback): TensorArray has size zero, but element shape is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArray_1"], dtype=DT_FLOAT, element_shape=<unknown>, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/TensorArrayGradV3, seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/range, gradients/seq2seq/encoder/bidirectional_rnn/fw/fw/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3_grad/TensorArrayGrad/gradient_flow)]]
[[Node: seq2seq/output_projection/Softmax_24/_3175 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_5326_seq2seq/output_projection/Softmax_24", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
It looks like I'm able to run training and eval concurrently, and I'm seeing a generated vocab_metadata.tsv, but when I go to "Embeddings" in TensorBoard I get:
log_root/testrun/train/vocab_metadata.tsv is not a file
which is strange, since that file exists. Will the visualization work only when I run decode?
Hi,
In my case, the attention distribution, the indices on the encoder side, and the vocab distribution are all tensors. Their shapes are:
attn_dist -> [batch_size, num_decoder_steps, num_encoder_steps]: contains the attention distribution over all encoder time steps for each decoder time step.
indices -> [batch_size, num_encoder_steps]: contains the word indices at each encoder time step.
vocab_dist -> [batch_size, num_decoder_steps, vocab_size].
I need to project attn_dist onto vocab_dist using the indices to get the final_dist, following the same logic you have used here. The major difference is that I can't unpack the tensors into a list, as num_encoder_steps and num_decoder_steps are unknown in my case (inferred from the batch). I am looking for the correct use of scatter_nd for my case. Can you please help me?
I see that you add the encoder attention scores to the corresponding word's existing score in the vocabulary distribution at each decoder time step. But how, after this addition, is the final distribution still a probability distribution? You use this distribution directly for the loss computation, applying log to it without any softmax.
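Not an authoritative answer, but the arithmetic that seems to make this work: the final distribution is a convex combination of two already-normalized distributions (eq. 9 of the paper), so it sums to 1 without any further softmax:

\[ \sum_w P(w) \;=\; p_{\mathrm{gen}} \sum_w P_{\mathrm{vocab}}(w) \;+\; (1 - p_{\mathrm{gen}}) \sum_i a_i \;=\; p_{\mathrm{gen}} + (1 - p_{\mathrm{gen}}) \;=\; 1 \]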
Hi,
is there a maximum-training-step setting? For example, I want the maximum training step to be 10k, so that training stops after that.
Thanks.
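To my knowledge there is no such flag in the repo, but the training loop in run_summarization.py could be capped with one. A minimal sketch under that assumption (the function signature and the 'global_step' result key follow my reading of the code; treat them as assumptions):

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_integer('max_train_steps', 10000,
                            'Stop training after this many steps (0 = no limit).')

def run_training(model, batcher, sess_context_manager, sv, summary_writer):
    with sess_context_manager as sess:
        while True:  # the repo's loop otherwise runs until interrupted (assumed)
            batch = batcher.next_batch()
            results = model.run_train_step(sess, batch)
            if FLAGS.max_train_steps and results['global_step'] >= FLAGS.max_train_steps:
                break  # reached the requested step budget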
Hi~
Using the CNN/Daily Mail dataset, this system works well.
Then I just changed the dataset, creating .bin files with make_datafiles.py and keeping the other parameters, but the training log shows the loss decreasing sharply. So I tried setting the learning rate to 0.001 and trained again. The loss decreased from 12 to 11 in 547 steps. At step 1000 the loss was 8.17; at step 1200 it became 1.16; and it decreased to 0.0005 by step 4000.
There is no doubt that the model will not produce correct output.
I have checked batch.enc_batch_extend_vocab; it prints as below:
[[ 334 1382 6242 103 24 3296 241 296 62 150000
64 1114 221 58 16 15 93 5 154 1015
2707 59 101 177 235 4 5387 150000 4 221
1204 58 16 1212 5 84537 14 10 277 150001
5830 186 59 4 150002 5830 16287 184 396]
[ 4485 15 173 9 136 972 5124 41986 5382 939
6361 6728 26273 316 7177 3164 2650 674 309 230
6326 11 348 12 390 114 9415 5 11653 774
2840 21 309 230 4 364 1299 87 41986 5382
939 2074 358 258 110 4051 2761 7177 25514]
[ 1566 39 621 6 121 8799 938 1606 292 1606
939 3582 8799 21 8983 26 1057 20809 40671 20809
68 1606 7928 26 939 3582 21 192 236 20809
68 1606 26 729 1650 4 939 3582 8799 2238
5583 4 81 5009 505 568 48 4 1263]
[ 456 16 29598 4 3803 5274 20 5 1710 99
3088 4 6316 176 19774 62 10862 64 5 70
566 392 96 2879 7705 501 3507 16 3088 5694
5 9442 10862 8560 8 5967 2879 384 15864 1050
4 6416 9533 2300 9533 6416 59298 356 47838]
[ 14514 452 320 12 5783 1257 3639 6 357 35592
32166 23 4 19297 29786 3848 1078 1135 36 301
685 161 1544 4 1257 211 20 793 31 479
4 40736 517 12 10463 1164 127 2891 388 1715
1164 8408 5210 762 4600 1569 4444 150000 301]
[ 1019 1063 434 681 66 4 2255 487 1401 248
4487 59742 1063 441 3686 7584 226 59742 1063 441
4 1130 20 7 10823 710 77964 1063 17 4198
4 2935 3047 3472 175 177 676 140 11373 139
68 23 1019 1063 1019 236 434 8777 3803]
[ 12389 4032 17556 1450 19 38773 2259 2646 482 260
19 98 8161 4383 6 28 1209 31204 186 1194
8161 42849 38502 78383 5 19652 32 51 1319 8282
303 642 278 439 69 21486 12801 2181 6 245
27 511 50 8161 20 10496 818 6951 1698]
[ 1925 350 34 113 35 33334 2750 17948 6 150000
4 12936 42 86 17948 4 381 11519 23592 3947
11599 4 7147 2348 65 8162 699 2056 2310 1925
3402 508 10814 4392 1652 2675 611 1246 113 38183
10 70 27439 4 235 34064 6406 429 6338]
[ 176 16 1623 11907 3379 159 295 6207 10247 7443
738 150000 398 25 56 2890 1944 154 37 7497
489 423 214 5483 4443 324 5111 302 295 3339
1508 4018 150001 5368 1918 388 122 295 7443 9
12074 26006 340 39 1155 5368 1135 505 23609]
[ 8136 56 1623 3089 10 123 5348 359 238 5151
18 27 8136 150000 202 10 219 2369 1653 15
66 55 18 8136 591 8 2704 597 1296 18
27 8136 1783 8 2383 902 971 208 8719 503
18 8136 2424 8 3268 1033 4440 11870 18]
[ 5 6855 5112 14 4 55 11585 23 977 198
3255 1288 15077 516 474 666 884 4 386 150000
977 1519 607 330 4050 57 15077 516 22281 531
789 2474 12443 7899 5668 21 7070 889 659 10469
203 977 303 3492 15077 516 4 757 99748]
[ 348 12 1734 1541 719 56 5737 1494 723 3970
4090 58354 6589 23431 150000 2523 31 38817 338 9
3923 150001 11233 10 37245 2009 137 22433 33885 1237
1141 6148 137 1827 278 13465 17452 338 19 919
5737 880 21829 2203 99 201 4833 90369 11937]
[ 589 266 1447 787 959 192 109 1818 102 583
101 1176 589 1447 313 3346 360 890 2971 192
548 62 19867 474 64 4 589 2022 43774 2022
35 67 15800 2333 7477 143086 25228 87109 1889 92349
87639 5926 1142 23269 21 139 3194 67 10848]
[ 8189 12 1453 244 10 223 7012 998 1113 316
37 7497 75 159 4 135 146 4712 5114 1013
1627 3021 88 34247 47 8426 1696 414 4079 27565
2953 23567 169 159 29 1602 26 10247 1141 1173
51509 21 973 49 2724 312 127 4045 74211]
[ 181 174 19 139373 35 3444 10828 12818 1756 8959
5 12315 13 1325 1151 4796 7656 1693 170 808
1151 5013 39 1353 665 77126 978 895 4796 258
113 2029 25221 5232 343 11900 587 4725 333 20412
1101 98 73789 4 4796 604 81 1153 28]
[ 75 136 150000 120 469 434 5 438 4879 924
11788 4714 3530 4 121 120 11836 438 4 42018
9 13785 4 1948 89 9 2861 561 8419 438
1948 3971 42 10 65627 6520 6 12615 3246 987
3246 5 391 1163 3006 2743 1196 136 5460]]
where the vocab size is 150k.
So maybe the inputs are fine?
I guess changing only the learning rate will not help to train a correct model.
Any suggestions would be appreciated.
I'm interested in implementing something similar in a different framework, and I was wondering how out-of-vocabulary word vectors are handled. From the code I can see that for each example there's a vocab size of 50k plus the number of unknown words in the article. I'm not super familiar with TensorFlow, and I was wondering what the out-of-vocabulary word vectors look like. Is the vocab size really 50k + the maximum number of unknown words in an article? Or are the word vectors for the unknowns set to zero / randomized for every example?
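As I read data.py, the temporary OOV ids are never used for embedding lookup at all: the encoder/decoder inputs replace OOVs with UNK (so there is one shared UNK vector), and the ids above vocab_size only index the extended copy distribution. A hedged paraphrase of the relevant helper (the name follows data.py, but treat the details as assumptions):

def article2ids(article_words, vocab):
    """Map article words to ids; OOVs get temporary ids vocab_size + oov_index."""
    ids, oovs = [], []
    unk_id = vocab.word2id('[UNK]')
    for w in article_words:
        i = vocab.word2id(w)
        if i == unk_id:                      # out-of-vocabulary word
            if w not in oovs:
                oovs.append(w)
            ids.append(vocab.size() + oovs.index(w))
        else:
            ids.append(i)
    return ids, oovs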
Recently I used your project on another dataset, in particular the coverage loss, but I can't reproduce its ability to avoid repetition. I'm sure I followed the instructions in your paper, so I want to know whether there is anything else I need to be aware of.
As we can see in the coverage loss, sum(min(a, c)): if the decoding is long enough, c may be a vector whose values are all bigger than 1, and then sum(min(a, c)) may stay at 1 forever. So is there some improvement?
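For reference, the coverage vector and coverage loss as defined in the paper:

\[ c^t = \sum_{t'=0}^{t-1} a^{t'}, \qquad \mathrm{covloss}_t = \sum_i \min\left(a_i^t, c_i^t\right) \]

Since \(\sum_i a_i^t = 1\), the loss is bounded above by 1, and once \(c_i^t \ge a_i^t\) for every i it equals exactly 1, which matches the saturation described above.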
Hi,
Thanks for releasing the code!
In computing p_gen in this line, your input to the linear projection includes the decoder cell state and the decoder prev_output (eq. 8 of the paper).
Is there any reason why both the cell state and prev_output are used? Wouldn't that be redundant, since the output subsumes the cell state (via the output gate)? Have you tried using just the output?
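For reference, eq. 8 of the paper computes the generation probability from the context vector \(h_t^*\), the decoder state \(s_t\), and the decoder input \(x_t\):

\[ p_{\mathrm{gen}} = \sigma\left( w_{h^*}^{T} h_t^{*} + w_s^{T} s_t + w_x^{T} x_t + b_{\mathrm{ptr}} \right) \]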
Hi,
Sorry for bothering you. In enc_batch [hps.batch_size, None], the length of axis 1 is the max number of article words over the examples in the batch. In dec_batch [hps.batch_size, None], the length of axis 1 is max_dec_steps. Can I replace it with the max number of abstract words per example, and is there any difference between them?
Thanks a lot.
Thanks for your impressive work.
I have trained the model and run this command:
python run_summarization.py --mode=decode --data_path=/path/to/val.bin --vocab_path=/path/to/vocab --log_root=/path/to/a/log/directory --exp_name=myexperiment
but the screen prints like this:
INFO:tensorflow:ARTICLE: -lrb- cnn -rrb- a grand jury in clark county , nevada , has indicted a 19-year-old man accused of fatally shooting his neighbor in front of her house last month . erich nowsch jr. faces charges of murder with a deadly weapon , attempted murder and firing a gun from within a car . police say nowsch shot tammy meyers , 44 , in front of her home after the car he was riding in followed her home february 12 . nowsch 's attorney , conrad claus , has said his client will argue self-defense . the meyers family told police that tammy meyers was giving her daughter a driving lesson when there was a confrontation with the driver of another car . tammy meyers drove home and sent her inside to get her brother , brandon , who allegedly brought a 9mm handgun . tammy meyers and her son then went back out , police said . they encountered the other car again , and there was gunfire , police said . investigators found casings from six .45 - caliber rounds at that scene . nowsch 's lawyer said after his client 's first court appearance that brandon meyers pointed a gun before anyone started shooting . he said the family 's story about a road-rage incident and what reportedly followed do n't add up . after zipping away from the first shooting , tammy meyers drove home and the other car , a silver audi , went there also . police said nowsch shot at both tammy and brandon meyers . tammy meyers was hit in the head and died two days later at a hospital . brandon meyers , who police said returned fire at the home , was not injured . the driver of the silver audi has yet to be found by authorities . that suspect was n't named in thursday 's indictment . nowsch was arrested five days after the killing in his family 's house , just one block away from the meyers ' home . he is due in court tuesday for a preliminary hearing .
INFO:tensorflow:REFERENCE SUMMARY: erich nowsch will face three charges , including first-degree murder . he is accused of killing tammy meyers in front of her home . the two lived !!withing!! walking distance of each other .
INFO:tensorflow:GENERATED SUMMARY: . arrest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The GENERATED SUMMARY is ". arrest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .", and it is the same for the other test articles.
How can I fix it?
I have fixed the bug in beam_search.py following your latest commit, but that doesn't solve this problem.
When data.py encounters a word ID that is not in the vocabulary, the program terminates because FLAGS is not defined:
assert FLAGS.pointer_gen, "Error: model produced a word ID that isn't in the vocabulary"
Adding the lines below solves the problem:
import tensorflow as tf
FLAGS = tf.app.flags.FLAGS
Strange fact: I've been trying to train summarization with the default parameters and the default dataset.
It turns out that the GPU is detected (2017-06-01 17:57:05.548650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)) and some GPU memory is allocated for the process, but nothing is running on it (load: 0%). It sometimes peaks at around 27% for a very short time.
The GPU is a single 1080 with 8 GB RAM (4617 MiB allocated by the process).
Any idea?
I tried to train and test the system on our small-scale dev set, but it does not overfit, i.e. it does not show very high performance on the data it was trained on.
In addition, its performance on the dev set seems better when it is trained on our large-scale training set.
Has anyone observed similar results? It's really strange.
Hi,
Is it possible to add an open source license? Something like MIT License?
Thanks
Has anyone been able to successfully replicate the model from the paper? I've been training for about two weeks (over 240k iterations) using the published parameters, plus an additional ~3k iterations with coverage. Here is what my training loss looks like: [training-loss plot not reproduced here]
It's unclear what caused the increase starting around iteration 180k, but even before that the output was not looking great.
Here are some REF (gold) and DEC (system) summaries. As you can see, they are qualitatively bad. Unfortunately, at the moment I can't figure out how to get pyrouge to run, so I can't quantify the performance relative to the published results.
000000_reference.txt
REF: marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .
DEC: robin 's comments are aware of any video footage , german paris match . he 's accused into the crash of germanwings flight 9525 flight . prosecutor : `` it is a very disturbing scene ''

000001_reference.txt
REF: membership gives the icc jurisdiction over alleged crimes committed in palestinian territories since last june . israel and the united states opposed the move , which could open the door to war crimes investigations against israelis .
DEC: palestinians signed icc 's founding rome statute of alleged crimes in palestinian territories . israel says `` in the occupied palestinian territory to immediately end and injustice , she says . it 's founding rome .

000002_reference.txt
REF: amnesty 's annual death penalty report catalogs encouraging signs , but setbacks in numbers of those sentenced to death . organization claims that governments around the world are using the threat of terrorism to advance executions . the number of executions worldwide has gone down by almost 22 % compared with 2013 , but death sentences up by 28 % .
DEC: it 's death sentences and executions 2014 '' is some we are closer to abolition , to advance executions '' number of deterrence , `` a number of countries are abolitionist '' amnesty says he would not be used for the death penalty .

000003_reference.txt
REF: amnesty international releases its annual review of the death penalty worldwide ; much of it makes for grim reading . salil shetty : countries that use executions to deal with problems are on the wrong side of history .
DEC: soldiers who a china agreed to tackle a surge in death sentences to death . jordan ended china 's public mass sentencing is part a china 's northwestern xinjiang region . a sharp spike in december 2006 , 2014 , 2014 .

000004_reference.txt
REF: museum : anne frank died earlier than previously believed . researchers re-examined archives and testimonies of survivors . anne and older sister margot frank are believed to have died in february 1945 .
DEC: bergen-belsen concentration camp is believed death on march 31 , anne frank says . four the jewish diarist concentration camp , margot , margot , violent , died at the age of 15 . `` i am no more than a skeleton camp , '' witnesses say .
If anyone has had success reproducing the published model I would love to hear how you did it. I'm stumped.
Hello,
My question is about the paper rather than the code. Can I conclude that the coverage mechanism is an alternative to the input-feeding approach of "Effective Approaches to Attention-based Neural Machine Translation", since we use both to keep some memory of past alignments and avoid repetition? Please correct me if this is wrong. Thanks.
Hi,
I am getting too many errors while preparing the dataset; I will post them on the other GitHub page, but can you show me an example of the input to your model?
Hi, Abigail See!
I have read the papers "Get To The Point: Summarization with Pointer-Generator Networks" and "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond". I found that the experimental results on the CNN/Daily Mail dataset in your paper are not the same as in Nallapati's.
So I wonder whether you have the source code from Nallapati, or perhaps you implemented Nallapati's model according to his paper.
Can you share the code of Nallapati et al. (2016)?
Thanks a lot!
While decoding, the return statement in run_beam_search in beam_search.py will raise an IndexError when the hyps array is empty.
Hi,
In attention_decoder.py, you compute attention scores over the encoder states in L92-L103.
I can't seem to find where you've masked the encoder features of size (batch_size, attn_length, attention_vec_size). Without masking, the computed attention scores will be incorrect: upon taking the softmax, probability mass will be assigned to encoder states that correspond to PAD tokens.
Also, the word probability over the extended vocabulary is computed using tf.scatter_nd. This does not work in the case of duplicate indices, as the output will be non-deterministic and not summed. So if the source text has more than one occurrence of an OOV word, it will become a problem.
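On the masking point, a common pattern (which I believe the repo applies in pointer_gen mode, though treat this sketch and its names as assumptions) is to mask after the softmax and renormalize:

import tensorflow as tf

def masked_attention(scores, enc_padding_mask):
    """scores: [batch, enc_steps]; enc_padding_mask: 1.0 for real tokens, 0.0 for PAD."""
    attn_dist = tf.nn.softmax(scores)               # may put mass on PAD tokens
    attn_dist *= enc_padding_mask                   # zero out PAD positions
    masked_sums = tf.reduce_sum(attn_dist, axis=1)  # per-example renormalizers
    return attn_dist / tf.reshape(masked_sums, [-1, 1])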
Thanks for sharing the awesome implementation.
When I try to increase the lengths of the document and summary from 400/100 to 800/200, I get the error 'Resource exhausted: OOM when allocating tensor'. The GPU has 12 GB of memory. How can I deal with this?
Thanks a lot.
Could you please make your trained model available?
Hi @abisee, I've implemented your model and looked at your test results. It seems that the number of decoded sentences is quite fixed; in my results the number is roughly 3. The weird thing is that the program even replicates the second generated sentence to obtain the third one.
Is there any constraint on the number of decoded sentences in the code? Or do you just let the program determine where to stop the whole summary, which would mean it actually learns to place commas and periods properly?
I am trying to run mode=train and concurrently run another instance with mode=eval.
However, on the eval side I am getting the following errors:
2017-07-20 12:19:30.099613: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match.
lhs shape= [50000,128] rhs shape= [4,128]
[[Node: save/Assign_13 = Assign[T=DT_FLOAT, _class=["loc:@seq2seq/embedding/embedding"], use_locking=true, validate_shape=true,
_device="/job:localhost/replica:0/task:0/gpu:0"](seq2seq/embedding/embedding, save/RestoreV2_13/_49)]]
INFO:tensorflow:Failed to load checkpoint from ./log/testexp/train. Sleeping for 10 secs...
File "util.py", line 41, in load_ckpt
Any idea what could be the reason?
When I run the model in train mode, this problem happens after several training steps (about 300); I have no idea what causes it.
This is the error:
feed_dict = self._make_feed_dict(batch)
File "/users2/ddning/parapharse_generation/pointer-generator/pointer-generator-master/model.py", line 49, in _make_feed_dict
feed_dict[self._enc_batch] = batch.enc_batch
AttributeError: 'NoneType' object has no attribute 'enc_batch'
See here.
Thanks for sharing the awesome implementation.
In batcher.py, the padding function below seems to pad the decoder input to max_len.
Is your model able to generate variable-length summaries eventually?
def pad_decoder_inp_targ(self, max_len, pad_id):
"""Pad decoder input and target sequences with pad_id up to max_len."""
while len(self.dec_input) < max_len:
self.dec_input.append(pad_id)
while len(self.target) < max_len:
self.target.append(pad_id)
In this line it seems that the text_format function is not defined. Could anyone please tell me what it is?
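A guess, since the linked line isn't shown here: text_format is usually the protobuf text-format module, which TensorFlow code imports explicitly. If it is missing, the following import (an assumption about this particular case) would define it:

from google.protobuf import text_format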
Hi, I have some difficulty understanding your code, in particular the word embedding of the decoder input. As for emb_enc_inputs, it uses the source input with all the OOV words replaced by UNK tokens, and that makes sense to me. But emb_dec_inputs uses the target input in which the OOV words have temporary ids. Why do you use that for embedding_lookup? Please correct me if I have made any mistake. Thanks in advance!
After much work I was finally able to get the model to train successfully (220k iterations, 400/100 max enc/dec steps, then 3k iterations with coverage, bringing the coverage loss to 0.2).
I'm noticing that when running the model (with beam size 4) on my own data (financial news), the decoded summaries are almost entirely extractive. What could be causing this? I've left the vocab size at the default of 50k. Is it possible that this is a result of having too small a vocab?
Example (bold added to the source to show extracted regions):
Decoded
air line pilots association said friday that 79 % of voting aviators approved a deal running through january 2019 that the union said provided industry-leading pay and benefits .
investors are closely watching the outcome of contract talks involving other united labor groups and at other carriers , concerned about a repeat of previous industry cycles when market conditions deteriorated .
pilots at delta air lines inc and southwest airlines co. both rejected proposed deals last year .
flight attendants have been unable to reach agreement on a joint contract even though they have been bargaining since 2012 .
Source
Pilots at United Continental Holdings Inc. overwhelmingly approved a two-year contract extension, continuing the momentum of the airline's efforts to restore labor peace and complete the integration of staff following its creation in a 2010 merger.
The Air Line Pilots Association said Friday that 79% of voting aviators approved a deal running through January 2019 that the union said provided industry-leading pay and benefits.
Investors are closely watching the outcome of contract talks involving other United labor groups and at other carriers, concerned about a repeat of previous industry cycles when record profits resulted in more generous deals that hobbled carriers' finances when market conditions deteriorated. Pilots at Delta Air Lines Inc and Southwest Airlines Co. both rejected proposed deals last year.
The United contract provides for higher pay, restores benefits for previously furloughed pilots, and enhances scheduling rules for long-haul flights, according to the pilot union. Both sides declined to provide contract details, though a person familiar with the situation said it included a 13% pay rise this year followed by a 3% increase in 2017 and 2% in 2018.
United executives said this week that the pilot contract and a new deal being considered by its technicians would raise its unit costs excluding fuel by 2.5 percentage points this year compared with 2015. The airline forecast its unit costs this year excluding fuel and the two labor deals would rise by between 0.5% and 1.5%.
The airline has suffered from rocky labor relations since its 2010 merger with Continental Airlines and has been trying to smooth that friction under new management led by Chief Executive Oscar Munoz that took over in September.
United has fresh deals or tentative agreements with the majority of its unionized staff, with the results on a new contract for its mechanics due to be revealed next week, but the toughest challenge remains securing a pact with flight attendants.
Flight attendants have been unable to reach agreement on a joint contract even though they have been bargaining since 2012. United presented fresh proposals covering pay and work practices last week in talks brokered by federal mediators. Flight attendants this week held a world-wide protest in pursuit of a joint contract.
The airline in November reached a deal to start negotiations with the International Association of Machinists union more than a year before the contract covering 30,000 ramp workers, customer-service agents and reservation staff opens for renewal at the end of 2016. It also reached a new proposed joint labor agreement for its 9,000 mechanics, with the International Brotherhood of Teamsters union due to issue the results on Jan. 25.
Pilots at Delta, its only union-represented labor group, resumed contract talks last month after rejecting a deal endorsed by union leaders.
Southwest pilots turned down a proposed deal in November, with flight attendants having rejected a tentative new pact earlier in the year.
Without pointer_gen mode, the loss is calculated by tf.contrib.seq2seq.sequence_loss. With pointer_gen mode, however, the loss minimizes the negative log-likelihood of the target words. The loss makes sense to me, but I wonder: couldn't we also minimize the negative log-likelihood of the target words without pointer_gen mode?
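For concreteness, here is a hedged paraphrase of the pointer_gen loss (names follow my reading of model.py; treat the details as assumptions). It gathers the probability of each gold word from the final distribution and averages the masked negative logs; in principle the same computation could be run on the plain vocab distributions, which is essentially what sequence_loss computes from logits:

import tensorflow as tf

def nll_loss(final_dists, target_batch, dec_padding_mask, batch_size):
    """final_dists: list (one per decoder step) of [batch, extended_vsize] tensors."""
    loss_per_step = []
    for dec_step, dist in enumerate(final_dists):
        targets = target_batch[:, dec_step]                        # gold ids, [batch]
        indices = tf.stack([tf.range(batch_size), targets], axis=1)
        gold_probs = tf.gather_nd(dist, indices)                   # P(gold word), [batch]
        loss_per_step.append(-tf.log(gold_probs))
    losses = tf.stack(loss_per_step, axis=1) * dec_padding_mask    # zero out padding steps
    return tf.reduce_mean(tf.reduce_sum(losses, axis=1) /
                          tf.reduce_sum(dec_padding_mask, axis=1))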
Thanks for this awesome implementation!
I have a question regarding the dimension of p_gens. Why is it a list (with length equal to the number of decoder steps) of scalars, and not a list of [batch_size] tensors?
Hi @abisee,
I'm training the network and the TensorFlow loss is going down, but really, really slowly. Frankly, I have always worked with CNNs, and they're much faster to train. I just want to get a sense of what a good value for the training loss is, at which point I can say the network has learnt something meaningful. I read the issues, and you mentioned your model took 230k iterations to reach the results of the paper, and that you used a schedule where you changed some hyperparameters to speed up the initial iterations. I'm going for more of a brute-force approach right now, as I'm trying to get a proof of concept ready for my project rather than final results.
I'm currently at a TensorFlow loss of about 2.2 after 2-3 days of training; it started at over 6. What is a good value to stop at, or what was your model's value for the results in the paper?
Thanks :)
I trained the model with the default parameters and evaluated at step 27847. (I didn't do the coverage training, just launched training with the default parameters.)
These are the results I got:
ROUGE-1:
rouge_1_f_score: 0.6100 with confidence interval (0.6087, 0.6113)
rouge_1_recall: 0.5746 with confidence interval (0.5726, 0.5767)
rouge_1_precision: 0.6785 with confidence interval (0.6767, 0.6803)
ROUGE-2:
rouge_2_f_score: 0.4800 with confidence interval (0.4788, 0.4813)
rouge_2_recall: 0.4533 with confidence interval (0.4515, 0.4552)
rouge_2_precision: 0.5325 with confidence interval (0.5311, 0.5339)
ROUGE-l:
rouge_l_f_score: 0.5986 with confidence interval (0.5973, 0.5999)
rouge_l_recall: 0.5638 with confidence interval (0.5618, 0.5659)
rouge_l_precision: 0.6658 with confidence interval (0.6640, 0.6676)
How should I interpret these with respect to the results in the paper? Where could the problem be, given that the paper reports a ROUGE-1 of 0.3953 and the above is 0.61?
From the paper:
"When a word appears multiple times in the source text, we sum probability mass from all corresponding parts of the attention distribution."
However, I wasn't able to find the part of the code that sums up the probability mass when an index appears multiple times in the encoder input. As used in this line, tf.scatter_nd doesn't apply a summation when there are repeated indices, and I can't see an earlier point in the code where this operation is performed. I only ask such a detailed question because I'm trying to implement the copy mechanism, and this is a particularly tricky part to do without messing up the autograd mechanics in PyTorch.
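In case it helps with the PyTorch port: a hypothetical sketch (toy shapes, not this repo's code) of summing attention mass for repeated source words with scatter_add, which accumulates at duplicate indices and is differentiable with respect to the attention weights:

import torch

batch, enc_steps, ext_vsize = 2, 4, 10
attn_dist = torch.softmax(torch.randn(batch, enc_steps), dim=1)  # copy weights
enc_ids = torch.randint(0, ext_vsize, (batch, enc_steps))        # may contain repeats
copy_dist = torch.zeros(batch, ext_vsize).scatter_add(1, enc_ids, attn_dist)
# repeated ids in enc_ids have their attention weights summed in copy_dist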
Hi Abigail. I was trying to run the code using the already-uploaded pretrained model, as I do not have a powerful enough machine to train. I believe the vocab size is set to 50000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive; I didn't really encounter any situation where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?
Here is the error message:
Caused by op u'save/RestoreV2_19', defined at:
File "run_summarization.py", line 451, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "run_summarization.py", line 430, in main
setup_training(model, batcher)
File "run_summarization.py", line 216, in setup_training
saver = tf.train.Saver(max_to_keep=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1040, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1070, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 675, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key seq2seq/decoder/attention_decoder/coverage/w_c not found in checkpoint
[[Node: save/RestoreV2_19 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_19/tensor_names, save/RestoreV2_19/shape_and_slices)]]
Dear repository maintainer,
I would like to test the results on my side. Would it be possible for you to share the trained TensorFlow model on which you tested the results? I have checked the output and found the results interesting. Kindly let me know whether you can share the trained model, so that if I need to train it further I can start from where you stopped, i.e. from your model. Do let me know.
These are the results included with the pretrained model:
ROUGE-1:
rouge_1_f_score: 0.3882 with confidence interval (0.3860, 0.3904)
rouge_1_recall: 0.4146 with confidence interval (0.4120, 0.4174)
rouge_1_precision: 0.3884 with confidence interval (0.3857, 0.3911)
ROUGE-2:
rouge_2_f_score: 0.1681 with confidence interval (0.1658, 0.1703)
rouge_2_recall: 0.1792 with confidence interval (0.1768, 0.1816)
rouge_2_precision: 0.1688 with confidence interval (0.1663, 0.1713)
ROUGE-l:
rouge_l_f_score: 0.3571 with confidence interval (0.3548, 0.3592)
rouge_l_recall: 0.3811 with confidence interval (0.3787, 0.3837)
rouge_l_precision: 0.3576 with confidence interval (0.3549, 0.3602)
I have downloaded the updated code and rerun the decoding using the pretrained model on my machine, and I got:
ROUGE-1:
rouge_1_f_score: 0.3551 with confidence interval (0.3529, 0.3574)
rouge_1_recall: 0.3857 with confidence interval (0.3829, 0.3885)
rouge_1_precision: 0.3512 with confidence interval (0.3486, 0.3539)
ROUGE-2:
rouge_2_f_score: 0.1501 with confidence interval (0.1480, 0.1522)
rouge_2_recall: 0.1631 with confidence interval (0.1608, 0.1655)
rouge_2_precision: 0.1486 with confidence interval (0.1464, 0.1508)
ROUGE-l:
rouge_l_f_score: 0.3241 with confidence interval (0.3219, 0.3263)
rouge_l_recall: 0.3518 with confidence interval (0.3491, 0.3545)
rouge_l_precision: 0.3208 with confidence interval (0.3182, 0.3234)
Is it because it was run on a different machine? Thanks!
Hello!
When training, the loss is NaN and the coverage_loss is NaN.
What is wrong?
Hello, thank you for your work.
With the default settings on a 1080 and TF 1.0, I'm getting about 13 seconds per batch of size 16, which would mean one epoch takes about 3 days; this is clearly off. Do you have any idea what may be causing the slowdown?
Hi,
In the paper, P(w) = p_gen * P_vocab(w) + (1 - p_gen) * P_copy(w), where P_copy(w) is the sum of all the attention weights a_i over positions i with w_i = w.
But I did not find the code block that calculates this summation of the a_i; could someone point it out to me?
Thanks~