
hetersumgraph's Introduction

HeterSumGraph

Code for ACL 2020 paper: Heterogeneous Graph Neural Networks for Extractive Document Summarization

A fastNLP version will come soon.

Some code is borrowed from PG and Transformer. Thanks for their work.

Thanks to issue #28 for pointing out a flaw in the implementation of the GAT layers. The previous version ignored the hidden states of destination nodes when the source and destination nodes have different node types. Since this change affects the released checkpoints, we have updated the code in the dev branch.

Dependency

  • python 3.5+

  • PyTorch 1.0+

  • DGL 0.4

  • rouge 1.0.0

    • A full Python implementation of the ROUGE metric, used in the validation phase
  • pyrouge 0.1.3

  • others

    • nltk
    • numpy
    • sklearn

Data

We have preprocessed the CNN/DailyMail, NYT50 and Multi-News datasets for the TF-IDF features used in graph creation, which you can find here.

For CNN/DailyMail and Multi-News, we also provide the json-format datasets at this link. However, due to its license, NYT (The New York Times Annotated Corpus) is only available from LDC. We follow the preprocessing code of Durrett et al. (2016) to get the NYT50 dataset.

The example looks like this:

{
  "text":["deborah fuller has been banned from keeping animals ... 30mph",...,"a dog breeder and exhibitor... her dogs confiscated"],
  "summary":["warning : ... at a speed of around 30mph",... ,"she was banned from ... and given a curfew "],
  "label":[1,3,6]
}

Each line in the file is an example. For the text key, the value can be a list of strings (single-document) or a list of lists of strings (multi-document). Examples in the training set can omit the summary key, since only label is used during training. All strings must be lowercased and tokenized with the Stanford Tokenizer, and nltk.sent_tokenize is used to split sentences.
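
If you need to build this format for your own data, here is a minimal sketch (not the repository's official preprocessing). It assumes the raw strings are already lowercased and tokenized by the Stanford Tokenizer, and that the label indices have already been computed, e.g. with the greedy procedure quoted in the issues further below:

import json
import nltk  # nltk.download('punkt') may be required once

def make_example(document, summary, label):
    """document/summary: lowercased, Stanford-tokenized strings; label: extracted sentence indices."""
    example = {
        "text": nltk.sent_tokenize(document),
        "summary": nltk.sent_tokenize(summary),
        "label": label,
    }
    return json.dumps(example)

# one example per line
with open("train.label.jsonl", "w", encoding="utf-8") as f:
    f.write(make_example("deborah fuller has been banned from keeping animals ...",
                         "warning : ... at a speed of around 30mph ...",
                         [1, 3, 6]) + "\n")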

After getting the standard json format, you can prepare the dataset for the graph with PrepareDataset.sh in the project directory. The processed files will be put under the cache directory.
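
For reference, here is a hedged sketch of how the word-to-sentence TF-IDF weights stored in the cache (e.g. train.w2s.tfidf.jsonl) could be computed with sklearn; the actual script may differ in tokenization and filtering:

from sklearn.feature_extraction.text import TfidfVectorizer

def word_to_sent_tfidf(sentences):
    """sentences: whitespace-tokenized sentence strings of one document."""
    vectorizer = TfidfVectorizer(tokenizer=str.split, lowercase=False)
    tfidf = vectorizer.fit_transform(sentences)  # shape: [n_sents, n_words]
    idx2word = {j: w for w, j in vectorizer.vocabulary_.items()}
    weights = []
    for i in range(len(sentences)):
        row = tfidf[i].tocoo()
        weights.append({idx2word[j]: float(v) for j, v in zip(row.col, row.data)})
    return weights  # one {word: tfidf} dict per sentence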

The default file names for training, validation and test are train.label.jsonl, val.label.jsonl and test.label.jsonl. If you would like to use other names, please change the corresponding names in PrepareDataset.sh, Lines 321-322 in train.py and Line 188 in evaluation.py. (The default names are recommended.)

Train

For training, you can run commands like this:

python train.py --cuda --gpu 0 --data_dir <data/dir/of/your/json-format/dataset> --cache_dir <cache/directory/of/graph/features> --embedding_path <glove_path> --model [HSG|HDSG] --save_root <model path> --log_root <log path> --lr_descent --grad_clip -m 3

We also provide our checkpoints on CNN/DailyMail, NYT50 and Multi-News at this link. Besides, the outputs can be found here (NYT50 has been removed due to its license).

Test

For evaluation, the command may look like this:

python evaluation.py --cuda --gpu 0 --data_dir <data/dir/of/your/json-format/dataset> --cache_dir <cache/directory/of/graph/features> --embedding_path <glove_path>  --model [HSG|HDSG] --save_root <model path> --log_root <log path> -m 3 --test_model multi --use_pyrouge

Some options:

  • use_pyrouge: whether to use pyrouge for evaluation. Default is False (which means the rouge package is used).
    • Please change Lines 17-18 in tools/utils.py to your own ROUGE path and temp file path.
  • limit: whether to limit the output to the length of the gold summaries. This option is only set for evaluation on NYT50 (which uses ROUGE-recall instead of ROUGE-F). Default is False.
  • blocking: whether to use trigram blocking (see the sketch after this list). Default is False.
  • save_label: only save labels and do not calculate ROUGE. Default is False.
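
For reference, here is a minimal sketch of trigram blocking as it is commonly implemented (an assumption about the behavior, not necessarily the exact code in this repository): a candidate sentence is skipped if it shares any trigram with the sentences already selected.

def _trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_with_trigram_blocking(sentences, scores, m=3):
    """sentences: list of sentence strings; scores: model scores; m: number of sentences to keep."""
    selected, seen = [], set()
    for idx in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tri = _trigrams(sentences[idx].split())
        if tri & seen:  # shares a trigram with the current summary -> skip
            continue
        selected.append(idx)
        seen |= tri
        if len(selected) == m:
            break
    return sorted(selected)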

To load our checkpoint for evaluation, you should put it under save_root/eval/ and make the name passed to test_model start with eval. For example, if your save_root is "checkpoints", then the checkpoint "cnndm.ckpt" should be put under "checkpoints/eval" and test_model is evalcnndm.ckpt.

ROUGE Installation

In order to get correct ROUGE scores, we recommend using the following commands to install the ROUGE environment:

sudo apt-get install libxml-perl libxml-dom-perl
pip install git+git://github.com/bheinzerling/pyrouge
export PYROUGE_HOME_DIR=the/path/to/RELEASE-1.5.5
pyrouge_set_rouge_path $PYROUGE_HOME_DIR
chmod +x $PYROUGE_HOME_DIR/ROUGE-1.5.5.pl

You can refer to https://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-1.5.5 for RELEASE-1.5.5, and remember to build WordNet 2.0 instead of 1.6 in RELEASE-1.5.5/data:

cd $PYROUGE_HOME_DIR/data/WordNet-2.0-Exceptions/
./buildExeptionDB.pl . exc WordNet-2.0.exc.db
cd ../
ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db


hetersumgraph's Issues

question about label

It seems that there is no field named label in the original CNN/DM dataset. How can I get the corresponding label for each document? Thanks :)


Bug: always getting KeyError: 'sh'

@brxx122
Hello,
I downloaded the CNN dataset you provided and ran the command
python train.py --cuda --gpu 0 --data_dir ./data/middledata_2/ --cache_dir ./cache/cnn --embedding_path ./embedding_dir/glove.42B.300d.txt --model HSG --save_root ./data/model_path --log_root ./log --lr_descent --grad_clip -m 3. Nothing else was changed, but I keep getting a KeyError for 'sh'. I have been searching for a long time without finding the cause. Could you help? The detailed error message is as follows:
result = self.forward(*input, **kwargs)
File "/data/cxx/program/extractivemethod/heterogeneousgraph/module/GATLayer.py", line 119, in forward
h = g.ndata.pop('sh')
File "/home/cxx/anaconda3/envs/cxxnlp/lib/python3.6/_collections_abc.py", line 795, in pop
value = self[key]
File "/home/cxx/anaconda3/envs/cxxnlp/lib/python3.6/site-packages/dgl/view.py", line 66, in getitem
return self._graph._get_n_repr(self._ntid, self._nodes)[key]
File "/home/cxx/anaconda3/envs/cxxnlp/lib/python3.6/site-packages/dgl/frame.py", line 393, in getitem
return self._columns[name].data
KeyError: 'sh'

CUDA version is not compatible with the GPU when running the code in the T4 GPU runtime of Colab.

CODE:
!python HeterSumGraph/evaluation.py --cuda --gpu 1 --data_dir /content/dataset/cnndm --cache_dir /content/graphfile/cache/CNNDM --embedding_path /content/glove.6B.300d.txt --model HSG --save_root /content/models/ --log_root /content/logs -m 3 --test_model multi --use_pyrouge
ERROR:
File "/content/HeterSumGraph/evaluation.py", line 239, in
main()
File "/content/HeterSumGraph/evaluation.py", line 227, in main
model.to(torch.device("cuda:0"))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/_init.py", line 298, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Document node update

Where exactly are we updating the document node features as proposed in the paper?
I can only see the word and sentence nodes getting updated. Can you please point out the file and line number for the same?

Bug

Sorry to bother you. I ran into a bug and could not solve it after searching on Google:
RuntimeError: Input, output and indices must be on the current device
Traceback (most recent call last):
File "train.py", line 381, in
main()
File "train.py", line 377, in main
setup_training(model, train_loader, valid_loader, valid_dataset, hps)
File "train.py", line 71, in setup_training
run_training(model, train_loader, valid_loader, valset, hps, train_dir)
File "train.py", line 114, in run_training
outputs = model.forward(G) # [n_snodes, 2]
File "/home/a303/graphsum/HeterSumGraph-master/HiGraph.py", line 94, in forward
word_feature = self.set_wnfeature(graph) # [wnode, embed_size]
File "/home/a303/graphsum/HeterSumGraph-master/HiGraph.py", line 148, in set_wnfeature
w_embed = self._embed(wid) # [n_wnodes, D]
File "/home/a303/anaconda3/envs/gspy3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/a303/anaconda3/envs/gspy3.6/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 126, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/a303/anaconda3/envs/gspy3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 1852, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

The training process takes far too long, and I have already split the dataset into small parts.

I've noticed the earlier issues about small memory and slow training speed.
Because my server can't load the whole dataset at once, I have already split the dataset into small parts (calling readJson in get_example rather than in the init part, which really works). Is there any way I can still use dgl.data.utils.save_graphs to save the whole graph and save training time, even though I can't load the whole dataset at the same time?
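
One possible workaround (a sketch under the assumption that graphs fit in memory one shard at a time; not an officially supported path of this repository) is to cache the built graphs shard by shard with dgl.data.utils.save_graphs and then load only the shard that is needed:

import os
from dgl.data.utils import save_graphs, load_graphs

def cache_shard(graphs, shard_id, cache_dir="cache/graph_shards"):
    """Serialize one shard (a list of DGLGraphs) to disk."""
    os.makedirs(cache_dir, exist_ok=True)
    save_graphs(os.path.join(cache_dir, "shard_{}.bin".format(shard_id)), graphs)

def load_shard(shard_id, cache_dir="cache/graph_shards"):
    """Load one shard back; load_graphs returns (graph_list, label_dict)."""
    graphs, _ = load_graphs(os.path.join(cache_dir, "shard_{}.bin".format(shard_id)))
    return graphs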

There is no explanation of the embedding_path argument

In train.py line 266, there is an argument named embedding_path whose default path is not in the git directory.
I think the code will work once I download the GloVe embedding file to some directory and point the argument there, but this is not explained anywhere.

Question about R1, R2, RL score

@dqwang122, thanks for the great repo!
I tested on the Multi-News dataset and got scores from evaluation.py, but when I run the code the scores are very different from the ones published in your paper.

          R1       R2       RL
my test   35.6630  12.2370  31.3000
paper     46.05    16.35    42.08

my script is:

python evaluation.py --cuda --gpu 0  --model HDSG --save_root ./checkpoints --log_root ./log --use_pyrouge --test_model evalmultinews.ckpt -m 3

Maybe I am wrong in some step!
Many thanks for your response.

No 'summary' field in train.label.jsonl

Hi,

I encountered the same problem (#2 ): there is no summary field in the training file. The code of this project is really clean and nice. Although one can work around this by using the provided feature files, I want to conduct another experiment based on this code repo.

So could you please provide the complete training files? That would also help those who want to prepare the graph features by themselves. Thank you so much!

Chinese dataset

Hello, if I want to train on a Chinese dataset, do the parameter settings and the GAT part of the module need to be modified?

Both the model and the data should already be on CUDA, but I still get RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

File "train.py", line 418, in
main()
File "train.py", line 414, in main
setup_training(model, train_loader, valid_loader, valid_dataset, hps)
File "train.py", line 71, in setup_training
run_training(model, train_loader, valid_loader, valset, hps, train_dir)
File "train.py", line 125, in run_training
outputs = model.forward(G) # [n_snodes, 2]
File "/home/zggao/document-summarization/HeterSumGraph-master/HiGraph.py", line 94, in forward
word_feature = self.set_wnfeature(graph) # [wnode, embed_size]
File "/home/zggao/document-summarization/HeterSumGraph-master/HiGraph.py", line 148, in set_wnfeature
w_embed = self._embed(wid) # [n_wnodes, D]
File "/home/zggao/anaconda3/envs/testtransformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zggao/anaconda3/envs/testtransformers/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 126, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/zggao/anaconda3/envs/testtransformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1814, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Has anyone run into the same problem? I haven't changed the code at all.

if args.cuda:
    # model.to(torch.device("cuda:0"))
    model.to(torch.device("cuda"))  # modified
    logger.info("[INFO] Use cuda")

if hps.cuda:
    G.to(torch.device("cuda"))

Our approach follows the greedy extraction from SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents. Here is a simple reference:

Hello,

Is the label extracted with this method always correct? I tried it on my own dataset and found that the sentences in the document indicated by the labels do not necessarily match the abstract, and an abstract that is only one sentence long can give len(label) > 10. Is this normal, and will it affect the later use of HeterSumGraph?


You can refer to the greedy algorithm in SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents. Here is a simple way to do it:

import copy
import numpy as np
# rouge_eval(hyps, refer) returns a single ROUGE score for a hypothesis/reference pair (see the sketch below)

def calLabel(article, abstract):
    hyps_list = article
    refer = abstract
    scores = []
    for hyps in hyps_list:
        mean_score = rouge_eval(hyps, refer)
        scores.append(mean_score)

    selected = [int(np.argmax(scores))]
    selected_sent_cnt = 1

    best_rouge = np.max(scores)
    while selected_sent_cnt < len(hyps_list):
        cur_max_rouge = 0.0
        cur_max_idx = -1
        for i in range(len(hyps_list)):
            if i not in selected:
                temp = copy.deepcopy(selected)
                temp.append(i)
                hyps = "\n".join([hyps_list[idx] for idx in np.sort(temp)])
                cur_rouge = rouge_eval(hyps, refer)
                if cur_rouge > cur_max_rouge:
                    cur_max_rouge = cur_rouge
                    cur_max_idx = i
        if cur_max_rouge != 0.0 and cur_max_rouge >= best_rouge:
            selected.append(cur_max_idx)
            selected_sent_cnt += 1
            best_rouge = cur_max_rouge
        else:
            break
    # print(selected, best_rouge)
    return selected

Originally posted by @brxx122 in #7 (comment)
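
The helper rouge_eval used in the snippet above is not shown. A stand-in consistent with the rouge package listed in the dependencies could average the ROUGE-1/2/L F-scores; this is an assumption about its behavior, not the authors' exact helper:

import numpy as np
from rouge import Rouge

_rouge = Rouge()

def rouge_eval(hyps, refer):
    """Mean of ROUGE-1/2/L F-scores for a hypothesis/reference pair."""
    try:
        score = _rouge.get_scores(hyps, refer)[0]
        mean_score = np.mean([score["rouge-1"]["f"],
                              score["rouge-2"]["f"],
                              score["rouge-l"]["f"]])
    except ValueError:  # empty hypothesis or reference
        mean_score = 0.0
    return mean_score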

create my dataset?

Hi,
Thanks for your contribution.
I want to do multi-document summarization with a dataset I prepared myself.
Can you provide more details on how to create the input dataset?
Thank you.

About Rouge-L score

Hi Danqing,

Thank you for sharing the clean and nice code.

I would like to know why the Rouge-L scores are much higher than the results from other papers. Is it because the gold summaries are different?

Why are Rouge-1, Rouge-2 and Rouge-L always far below the scores reported in the paper?

2022-06-20 01:51:31,428 INFO : | end of iter 0 | time: 5.51s | train loss 0.1361 |
2022-06-20 01:53:25,873 INFO : | end of iter 100 | time: 0.79s | train loss 13.1111 |
2022-06-20 01:55:01,330 INFO : | end of iter 200 | time: 0.84s | train loss 12.9818 |
2022-06-20 01:56:41,820 INFO : | end of iter 300 | time: 0.93s | train loss 13.0737 |
2022-06-20 01:58:23,479 INFO : | end of iter 400 | time: 0.93s | train loss 12.9725 |
2022-06-20 02:00:00,821 INFO : | end of iter 500 | time: 0.90s | train loss 13.0364 |
2022-06-20 02:01:36,305 INFO : | end of iter 600 | time: 0.91s | train loss 12.9674 |
2022-06-20 02:03:10,969 INFO : | end of iter 700 | time: 1.07s | train loss 13.1159 |
2022-06-20 02:04:48,047 INFO : | end of iter 800 | time: 0.95s | train loss 13.1056 |
2022-06-20 02:06:23,016 INFO : | end of iter 900 | time: 0.94s | train loss 13.0028 |
2022-06-20 02:07:59,683 INFO : | end of iter 1000 | time: 0.88s | train loss 13.1021 |
2022-06-20 02:09:35,608 INFO : | end of iter 1100 | time: 0.84s | train loss 13.0830 |
2022-06-20 02:11:10,917 INFO : | end of iter 1200 | time: 0.89s | train loss 13.2294 |
2022-06-20 02:12:54,307 INFO : | end of iter 1300 | time: 0.84s | train loss 13.0361 |
2022-06-20 02:14:30,141 INFO : | end of iter 1400 | time: 0.72s | train loss 13.0959 |
2022-06-20 02:14:34,509 INFO : | end of epoch 6 | time: 1406.15s | epoch train loss 13.0647 |
2022-06-20 02:14:34,510 INFO : [INFO] Found new best model with 13.065 running_train_loss. Saving to model-H/train/bestmodel
2022-06-20 02:14:34,656 INFO : [INFO] Saving model to model-H/train/bestmodel
2022-06-20 02:14:34,657 INFO : [INFO] Starting eval for this model ...
2022-06-20 02:20:08,271 INFO : [INFO] End of valid | time: 333.61s | valid loss 13.2387 |
2022-06-20 02:20:08,272 INFO : Rouge1:
p:0.547634, r:0.238038, f:0.322769
Rouge2:
p:0.214322, r:0.078234, f:0.110488
Rougel:
p:0.477257, r:0.206107, f:0.279990

2022-06-20 02:20:08,272 INFO : [INFO] Validset match_true 11189, pred 16864, true 50291, total 121375, match 76598
2022-06-20 02:20:08,273 INFO : [INFO] The size of totalset is 5622, sent_number is 121375, accu is 0.631085, precision is 0.663484, recall is 0.222485, F is 0.333229
2022-06-20 02:20:08,274 INFO : [INFO] Found new best model with 13.238739 running_avg_loss. The original loss is 13.245838, Saving to model-H/eval/bestmodel_2
2022-06-20 02:20:08,445 INFO : [INFO] Found new best model with 0.333229 F. The original F is 0.331025, Saving to model-H/eval/bestFmodel
I trained for 30 epochs in total, but this is the best result I got. I used the original model without any modification.

Actual Number of Training Epochs?

Dear author, may I ask about a detail of the implementation? Did you stop training early once the model reached the expected performance? If so, what is the actual number of training epochs?

Encoding error when reading the generated VOCAL_FILE: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 236: invalid start byte

Traceback (most recent call last):
File "E:\project\HeterSumGraph-master\train.py", line 438, in
main()
File "E:\project\HeterSumGraph-master\train.py", line 385, in main
vocab = Vocab(VOCAL_FILE, args.vocab_size)
File "E:\project\HeterSumGraph-master\module\vocabulary.py", line 53, in init
for line in vocab_f: # iterate over every line of the file
File "C:\Users\24672\anaconda3\envs\Pytorch\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 236: invalid start byte

G.to_device("cuda") needs to be changed to G = G.to_device("cuda")

While running, it seems that G.to_device("cuda") needs to be changed to G = G.to_device("cuda"); otherwise it raises RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select). I am not sure whether this is just an issue on my side.

GAT UPDATE

Just wanted to know: do the changes to the GAT layer in the dev branch actually improve the performance of the model (ROUGE scores)?

AssertionError: doc_feature_element

When I use the code to train the HDSG model on my own multi-document dataset, the following problem occurs:

Traceback (most recent call last):
File "train.py", line 384, in
main()
File "train.py", line 380, in main
setup_training(model, train_loader, valid_loader, valid_dataset, hps)
File "train.py", line 71, in setup_training
run_training(model, train_loader, valid_loader, valset, hps, train_dir)
File "train.py", line 116, in run_training
outputs = model.forward(G) # [n_snodes, 2]
File "/dat01/jttang/wpc/survey_generation/HeterSumGraph/HiGraph.py", line 222, in forward
doc_feature, snid2dnid = self.set_dnfeature(graph)
File "/dat01/jttang/wpc/survey_generation/HeterSumGraph/HiGraph.py", line 299, in set_dnfeature
assert not torch.any(torch.isnan(doc_feature)), "doc_feature_element"
AssertionError: doc_feature_element

It seems that the problem is that doc_feature is NaN and the snodes of the dnode are empty. I checked my dataset and didn't find any empty documents, so I am confused about the cause. Please help. @brxx122 Thanks.

Cannot get NYT dataset

I tried the links you provide ("NYT (The New York Times Annotated Corpus) can only be available from LDC. And we follow the preprocessing code of Durrett et al. (2016) to get the NYT50 datasets"), but none of them can be used due to the license issue. Could you provide the data (original data and preprocessing code) to us by email? [email protected]

Thanks a lot.

invalid pointer

I ran the code on Linux and got this error: 'Error in `python': munmap_chunk(): invalid pointer: 0x00007f730d8f76d8 ***'. How can I solve this problem? Thank you!

Question about evaluation

Thanks for the great repo @dqwang122. When I run evaluation.py following the README:

python evaluation.py --cuda --gpu 0 --data_dir ./datasets/cnndm --cache_dir ./cache/dnndm --embedding_path glove.840B.300d.txt --model HSG --save_root ./save --log_root ./log -m 3

I get this error:

2022-06-26 21:46:31,959 INFO    : Pytorch 1.8.0+cu111
2022-06-26 21:46:31,959 INFO    : [INFO] Create Vocab, vocab path is ./cache/dnndm/vocab
Traceback (most recent call last):
  File "train.py", line 388, in <module>
    main()
  File "train.py", line 342, in main
    vocab = Vocab(VOCAL_FILE, args.vocab_size)
  File "/home/tupk/tupk/TextSum/HeterSumGraph/module/vocabulary.py", line 50, in __init__
    with open(vocab_file, 'r', encoding='utf8') as vocab_f: #New : add the utf8 encoding to prevent error
FileNotFoundError: [Errno 2] No such file or directory: './cache/dnndm/vocab'

My question is: does the vocab file have to be created, or can it be downloaded from some package? If it must be created, does each new dataset need its own vocab file, and how do I create it? Many thanks for your help!
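
The official route is the vocab produced by PrepareDataset.sh (via script/createVoc.py, mentioned in another issue below). If you need a vocab for a new dataset, here is a hedged guess at an equivalent; the "word count" per-line format is an assumption, not a documented contract:

import json
from collections import Counter

counter = Counter()
with open("train.label.jsonl", encoding="utf-8") as f:
    for line in f:
        e = json.loads(line)
        # flatten multi-document examples (list of lists) into one sentence list
        sents = e["text"] if isinstance(e["text"][0], str) else [s for doc in e["text"] for s in doc]
        for sent in sents + e.get("summary", []):
            counter.update(sent.split())

with open("vocab", "w", encoding="utf-8") as out:
    for word, cnt in counter.most_common():
        out.write("{} {}\n".format(word, cnt))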

ValueError during training: Expect graph 0 and 96 to have the same edge attributes when edge_attrs=ALL, got {'tffrac', 'dtype'} and {'dtype'}.

Hello, I am running into a problem and am not sure whether it is caused by my data. Could you please advise?
Traceback (most recent call last):
File "train1.py", line 401, in
main()
File "train1.py", line 397, in main
setup_training(model, train_loader, valid_loader, valid_dataset, hps)
File "train1.py", line 71, in setup_training
run_training(model, train_loader, valid_loader, valset, hps, train_dir)
File "train1.py", line 104, in run_training
for i, (G, index) in enumerate(train_loader):
File "/home/ligang/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 568, in next
return self._process_next_batch(batch)
File "/home/ligang/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
File "/home/ligang/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ligang/HeterSUMGraph/module/dataloader.py", line 492, in graph_collate_fn
batched_graph = dgl.batch([graphs[idx] for idx in sorted_index])
File "/home/ligang/anaconda3/envs/myenv/lib/python3.7/site-packages/dgl/batched_graph.py", line 355, in batch
return BatchedDGLGraph(graph_list, node_attrs, edge_attrs)
File "/home/ligang/anaconda3/envs/myenv/lib/python3.7/site-packages/dgl/batched_graph.py", line 182, in init
edge_attrs = _init_attrs(edge_attrs, 'edge')
File "/home/ligang/anaconda3/envs/myenv/lib/python3.7/site-packages/dgl/batched_graph.py", line 171, in _init_attrs
.format(ref_g_index, i, mode, attrs, g_attrs))
ValueError: Expect graph 0 and 96 to have the same edge attributes when edge_attrs=ALL, got {'tffrac', 'dtype'} and {'dtype'}.

Question on the training, poor reproduced result

Hi,

Thank you for providing this source code. I followed your instructions; however, the reproduced result is poor.

--cuda --gpu 0 --data_dir dataset/multinews/ --cache_dir cache/multinews/ --embedding_path glove/glove.42B.300d.txt --model HDSG --save_root output --log_root output/logfile --lr_descent --grad_clip -m 3

The training loss decreased very slowly, from 14.99 to 12.67 after 14 epochs.
The best result during training:
Rouge1:
p:0.568634, r:0.246242, f:0.331011
Rouge2:
p:0.199369, r:0.084300, f:0.114196
Rougel:
p:0.450906, r:0.220580, f:0.288866
Can you provide any guidance on how to train the network?

problem with pretrain model

I have a problem with the pretrained model:
Using backend: pytorch
2021-05-17 04:28:43,255 INFO : Pytorch 1.8.1+cu101
2021-05-17 04:28:43,256 INFO : [INFO] Create Vocab, vocab path is /content/drive/MyDrive/HeterSumGraph/cache/MultiNews/vocab
2021-05-17 04:28:43,310 INFO : [INFO] max_size of vocab was specified as 50000; we now have 50000 words. Stopping reading.
2021-05-17 04:28:43,310 INFO : [INFO] Finished constructing vocabulary of 50000 total words. Last word added: medicated
2021-05-17 04:28:43,459 INFO : [INFO] Loading external word embedding...
2021-05-17 04:29:32,127 INFO : [INFO] External Word Embedding iov count: 48908, oov count: 1092
2021-05-17 04:29:32,288 INFO : Namespace(atten_dropout_prob=0.1, batch_size=32, bidirectional=True, blocking=False, cache_dir='/content/drive/MyDrive/HeterSumGraph/cache/MultiNews', cuda=True, data_dir='/content/drive/MyDrive/HeterSumGraph/cache/multinews', doc_max_timesteps=50, embed_train=False, embedding_path='/content/drive/MyDrive/HeterSumGraph/glove.42B.300d.txt', feat_embed_size=50, ffn_dropout_prob=0.1, ffn_inner_hidden_size=512, gcn_hidden_size=64, gpu='0', hidden_size=128, limited=False, log_root='/content/drive/MyDrive/HeterSumGraph/log', lstm_hidden_state=64, lstm_layers=2, m=3, model='HSG', n_feature_size=64, n_head=16, n_iter=1, n_layers=1, recurrent_dropout_prob=0.1, save_label=False, save_root='/content/drive/MyDrive/HeterSumGraph/model', sent_max_len=100, test_model='evalmultinews.ckpt', use_orthnormal_init=True, use_pyrouge=True, vocab_size=50000, word_emb_dim=300, word_embedding=True)
2021-05-17 04:29:32,411 INFO : [MODEL] HeterSumGraph
2021-05-17 04:29:32,411 INFO : [INFO] Start reading ExampleSet
2021-05-17 04:29:32,591 INFO : [INFO] Finish reading ExampleSet. Total time is 0.179303, Total size is 5622
2021-05-17 04:29:32,591 INFO : [INFO] Loading filter word File /content/drive/MyDrive/HeterSumGraph/cache/MultiNews/filter_word.txt
2021-05-17 04:29:32,692 INFO : [INFO] Loading word2sent TFIDF file from /content/drive/MyDrive/HeterSumGraph/cache/MultiNews/test.w2s.tfidf.jsonl!
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
2021-05-17 04:29:36,417 INFO : [INFO] Use cuda
2021-05-17 04:29:36,418 INFO : [INFO] Decoding...
2021-05-17 04:29:36,419 INFO : [INFO] Restoring evalmultinews.ckpt for testing...The path is /content/drive/MyDrive/HeterSumGraph/model/eval/multinews.ckpt
Traceback (most recent call last):
File "/content/drive/MyDrive/HeterSumGraph/evaluation.py", line 239, in
main()
File "/content/drive/MyDrive/HeterSumGraph/evaluation.py", line 236, in main
run_test(model, dataset, loader, hps.test_model, hps)
File "/content/drive/MyDrive/HeterSumGraph/evaluation.py", line 77, in run_test
model = load_test_model(model, model_name, eval_dir, hps.save_root)
File "/content/drive/MyDrive/HeterSumGraph/evaluation.py", line 57, in load_test_model
model.load_state_dict(torch.load(bestmodel_load_path))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for HSumGraph:
Missing key(s) in state_dict: "word2sent.layer.heads.8.fc.weight", "word2sent.layer.heads.8.feat_fc.weight", "word2sent.layer.heads.8.attn_fc.weight", "word2sent.layer.heads.9.fc.weight", "word2sent.layer.heads.9.feat_fc.weight", "word2sent.layer.heads.9.attn_fc.weight", "word2sent.layer.heads.10.fc.weight", "word2sent.layer.heads.10.feat_fc.weight", "word2sent.layer.heads.10.attn_fc.weight", "word2sent.layer.heads.11.fc.weight", "word2sent.layer.heads.11.feat_fc.weight", "word2sent.layer.heads.11.attn_fc.weight", "word2sent.layer.heads.12.fc.weight", "word2sent.layer.heads.12.feat_fc.weight", "word2sent.layer.heads.12.attn_fc.weight", "word2sent.layer.heads.13.fc.weight", "word2sent.layer.heads.13.feat_fc.weight", "word2sent.layer.heads.13.attn_fc.weight", "word2sent.layer.heads.14.fc.weight", "word2sent.layer.heads.14.feat_fc.weight", "word2sent.layer.heads.14.attn_fc.weight", "word2sent.layer.heads.15.fc.weight", "word2sent.layer.heads.15.feat_fc.weight", "word2sent.layer.heads.15.attn_fc.weight".
Unexpected key(s) in state_dict: "dn_feature_proj.weight".
size mismatch for cnn_proj.weight: copying a param with shape torch.Size([128, 300]) from checkpoint, the shape in current model is torch.Size([64, 300]).
size mismatch for cnn_proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([512, 300]) from checkpoint, the shape in current model is torch.Size([256, 300]).
size mismatch for lstm.weight_hh_l0: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([256, 64]).
size mismatch for lstm.bias_ih_l0: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.bias_hh_l0: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.weight_ih_l0_reverse: copying a param with shape torch.Size([512, 300]) from checkpoint, the shape in current model is torch.Size([256, 300]).
size mismatch for lstm.weight_hh_l0_reverse: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([256, 64]).
size mismatch for lstm.bias_ih_l0_reverse: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.bias_hh_l0_reverse: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.weight_ih_l1: copying a param with shape torch.Size([512, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for lstm.weight_hh_l1: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([256, 64]).
size mismatch for lstm.bias_ih_l1: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.bias_hh_l1: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.weight_ih_l1_reverse: copying a param with shape torch.Size([512, 256]) from checkpoint, the shape in current model is torch.Size([256, 128]).
size mismatch for lstm.weight_hh_l1_reverse: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([256, 64]).
size mismatch for lstm.bias_ih_l1_reverse: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm.bias_hh_l1_reverse: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for lstm_proj.weight: copying a param with shape torch.Size([128, 256]) from checkpoint, the shape in current model is torch.Size([64, 128]).
size mismatch for lstm_proj.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for n_feature_proj.weight: copying a param with shape torch.Size([64, 256]) from checkpoint, the shape in current model is torch.Size([128, 128]).
size mismatch for word2sent.ffn.w_1.weight: copying a param with shape torch.Size([512, 64, 1]) from checkpoint, the shape in current model is torch.Size([512, 128, 1]).
size mismatch for word2sent.ffn.w_2.weight: copying a param with shape torch.Size([64, 512, 1]) from checkpoint, the shape in current model is torch.Size([128, 512, 1]).
size mismatch for word2sent.ffn.w_2.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for word2sent.ffn.layer_norm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for word2sent.ffn.layer_norm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for sent2word.layer.heads.0.fc.weight: copying a param with shape torch.Size([50, 64]) from checkpoint, the shape in current model is torch.Size([50, 128]).
size mismatch for sent2word.layer.heads.1.fc.weight: copying a param with shape torch.Size([50, 64]) from checkpoint, the shape in current model is torch.Size([50, 128]).
size mismatch for sent2word.layer.heads.2.fc.weight: copying a param with shape torch.Size([50, 64]) from checkpoint, the shape in current model is torch.Size([50, 128]).
size mismatch for sent2word.layer.heads.3.fc.weight: copying a param with shape torch.Size([50, 64]) from checkpoint, the shape in current model is torch.Size([50, 128]).
size mismatch for sent2word.layer.heads.4.fc.weight: copying a param with shape torch.Size([50, 64]) from checkpoint, the shape in current model is torch.Size([50, 128]).
size mismatch for sent2word.layer.heads.5.fc.weight: copying a param with shape torch.Size([50, 64]) from checkpoint, the shape in current model is torch.Size([50, 128]).

Can anyone help me with this problem?

Why is there a big gap between the ROUGE evaluation result and the paper for the single-document summary?

My ROUGE installation should be fine, as I have no problem with the CNN/DailyMail dataset at all, but the ROUGE scores on the Multi-News dataset are: Rouge-1 = 40.4, Rouge-2 = 15.7, Rouge-L = 35.5

------------------ Original email ------------------
From: "Danqing @.>"
Date: Wednesday, August 17, 2022, 4:10 PM
Subject: Re: [dqwang122/HeterSumGraph] Question about R1, R2, RL score (Issue #32)

Yes, I get a ROUGE score on the published output and a 6% difference on the multipurpose news dataset from the data listed by the author

What does "multipurpose news dataset" refer to? Is it the multi-news?
What is the exact "a ROUGE score"? Is it R1 40.4? If you cannot get the reported scores (R1 46.05) from the released outputs, you had better check the installation of ROUGE. You can follow the instruction here(https://github.com/dqwang122/HeterSumGraph#rouge-installation).
Besides, you should also recheck the data format and preprocessing.



Originally posted by @suwu-suwu in #32 (comment)

Question about an implementation detail

Hello, thank you for releasing your code. The implementation seems to differ from my understanding of GAT, so I would like to ask about it.
When GAT computes the edge attention weight between a word and a sentence, the usual approach is to concatenate the embeddings of the edge's source and destination nodes and then transform them. However, in your implementation, when the node is a word, the vector passed in is [0, 0, 0, 0, 0, 0, 0, 0]. Is this an intentional design choice or an omission in the implementation?
Thanks for your reply!

Random seed

Why does the code not set a random seed? Will the results vary a lot without one?

label

Hello, are the labels annotated by hand? Are the numbers under the label key the indices of the sentences in text that you want to extract as the candidate summary?

Memory issue

Hi, because there are so many files to read, my 16 GB of RAM cannot handle it. What can I change to reduce memory consumption a bit? (RAM, not GPU memory)

Errors reported when running the train and evaluation files

Traceback (most recent call last):
File "C:/Users/MacBook/Desktop/HeterSumGraph-master/HeterSumGraph-master3/HeterSumGraph-master/train.py", line 381, in
main()
File "C:/Users/MacBook/Desktop/HeterSumGraph-master/HeterSumGraph-master3/HeterSumGraph-master/train.py", line 339, in main
vectors = embed_loader.load_my_vecs(args.word_emb_dim)
File "C:\Users\MacBook\Desktop\HeterSumGraph-master\HeterSumGraph-master3\HeterSumGraph-master\module\embedding.py", line 37, in load_my_vecs
with open(self._path, encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/remote-home/dqwang/Glove/glove.42B.300d.txt'

Process finished with exit code 1
How can I solve this? Where can I find this GloVe file and where should I put it?

Convergence Analysis

Thanks for sharing the code!
The paper analyzes the number of iterations from an experimental point of view. I wonder whether the number of iterations could be derived through a rigorous proof; could you please give me some pointers? Thank you in advance!

Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal

Whether I try train.py or evaluation.py with supplied checkpoints, I get the same error message: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal

$ python train.py --cuda --gpu 0 --data_dir ./datasets/multinews --cache_dir ./cache/MultiNews --embedding_path /opt/mr/embeddings/glove.840B.300d.txt --model HDSG --save_root ./save --log_root ./log --lr_descent --grad_clip -m 3

Using backend: pytorch
2021-03-07 17:54:52,953 INFO    : Pytorch 1.8.0+cu111
2021-03-07 17:54:52,953 INFO    : [INFO] Create Vocab, vocab path is ./cache/MultiNews/vocab
2021-03-07 17:54:52,986 INFO    : [INFO] max_size of vocab was specified as 50000; we now have 50000 words. Stopping reading.
2021-03-07 17:54:52,986 INFO    : [INFO] Finished constructing vocabulary of 50000 total words. Last word added: medicated
2021-03-07 17:54:53,077 INFO    : [INFO] Loading external word embedding...
^@2021-03-07 17:55:29,241 INFO    : [INFO] External Word Embedding iov count: 46079, oov count: 3921
2021-03-07 17:55:29,357 INFO    : Namespace(atten_dropout_prob=0.1, batch_size=32, bidirectional=True, cache_dir='./cache/MultiNews', cuda=True, data_dir='./datasets/multinews', doc_max_timesteps=50, embed_train=False, embedding_path='/opt/mr/embeddings/glove.840B.300d.txt', feat_embed_size=50, ffn_dropout_prob=0.1, ffn_inner_hidden_size=512, gpu='0', grad_clip=True, hidden_size=64, log_root='./log', lr=0.0005, lr_descent=True, lstm_hidden_state=128, lstm_layers=2, m=3, max_grad_norm=1.0, model='HDSG', n_epochs=20, n_feature_size=128, n_head=8, n_iter=1, n_layers=1, recurrent_dropout_prob=0.1, restore_model='None', save_root='./save', sent_max_len=100, use_orthnormal_init=True, vocab_size=50000, word_emb_dim=300, word_embedding=True)
2021-03-07 17:55:29,463 INFO    : [MODEL] HeterDocSumGraph 
2021-03-07 17:55:29,463 INFO    : [INFO] Start reading MultiExampleSet
2021-03-07 17:55:30,740 INFO    : [INFO] Finish reading MultiExampleSet. Total time is 1.277061, Total size is 44972
2021-03-07 17:55:30,740 INFO    : [INFO] Loading filter word File ./cache/MultiNews/filter_word.txt
2021-03-07 17:55:30,808 INFO    : [INFO] Loading word2sent TFIDF file from ./cache/MultiNews/train.w2s.tfidf.jsonl!
2021-03-07 17:55:37,931 INFO    : [INFO] Loading word2doc TFIDF file from ./cache/MultiNews/train.w2d.tfidf.jsonl!
2021-03-07 17:55:42,741 INFO    : [INFO] Start reading MultiExampleSet
2021-03-07 17:55:42,838 INFO    : [INFO] Finish reading MultiExampleSet. Total time is 0.097269, Total size is 5622
2021-03-07 17:55:42,839 INFO    : [INFO] Loading filter word File ./cache/MultiNews/filter_word.txt
2021-03-07 17:55:42,909 INFO    : [INFO] Loading word2sent TFIDF file from ./cache/MultiNews/val.w2s.tfidf.jsonl!
2021-03-07 17:55:43,825 INFO    : [INFO] Loading word2doc TFIDF file from ./cache/MultiNews/val.w2d.tfidf.jsonl!
2021-03-07 17:55:46,275 INFO    : [INFO] Use cuda
2021-03-07 17:55:46,275 INFO    : [INFO] Create new model for training...
2021-03-07 17:55:46,275 INFO    : [INFO] Starting run_training
Traceback (most recent call last):
  File "train.py", line 381, in <module>
    main()
  File "train.py", line 377, in main
    setup_training(model, train_loader, valid_loader, valid_dataset, hps)
  File "train.py", line 71, in setup_training
    run_training(model, train_loader, valid_loader, valset, hps, train_dir)
  File "train.py", line 114, in run_training
    outputs = model.forward(G)  # [n_snodes, 2]
  File "/home/matt/HeterSumGraph/HiGraph.py", line 201, in forward
    doc_feature, snid2dnid = self.set_dnfeature(graph)
  File "/home/matt/HeterSumGraph/HiGraph.py", line 237, in set_dnfeature
    snodes = [nid for nid in graph.predecessors(dnode) if graph.nodes[nid].data["dtype"]==1]
  File "/home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/heterograph.py", line 2647, in predecessors
    return self._graph.predecessors(self.get_etype_id(etype), v)
  File "/home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/heterograph_index.py", line 370, in predecessors
    self, int(etype), int(v)))
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [17:55:54] /opt/dgl/src/array/cuda/utils.cu:19: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
  [bt] (0) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f59f26abc8f]
  [bt] (1) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::cuda::AllTrue(signed char*, long, DLContext const&)+0x10f) [0x7f59f32f81ef]
  [bt] (2) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(std::pair<bool, bool> dgl::aten::impl::COOIsSorted<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x9d) [0x7f59f2efaeed]
  [bt] (3) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::COOIsSorted(dgl::aten::COOMatrix)+0x1e3) [0x7f59f2690893]
  [bt] (4) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x4c8) [0x7f59f2ef9378]
  [bt] (5) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x3f3) [0x7f59f268f553]
  [bt] (6) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::GetInCSR(bool) const+0x300) [0x7f59f2e9d2e0]
  [bt] (7) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::GetFormat(dgl::SparseFormat) const+0x4d) [0x7f59f2e9e25d]
  [bt] (8) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::Predecessors(unsigned long, unsigned long) const+0x34) [0x7f59f2e9e784]

Here's my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:21:00.0  On |                  N/A |
| 30%   35C    P8    32W / 350W |    589MiB / 24265MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:4A:00.0 Off |                  N/A |
|  0%   38C    P8    26W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1960      G   /usr/lib/xorg/Xorg                448MiB |
|    0   N/A  N/A      2917      G   cinnamon                           44MiB |
|    0   N/A  N/A      4333      G   ...AAAAAAAA== --shared-files       76MiB |
|    0   N/A  N/A     12990      G   ...oken=16001321251127579134       15MiB |
|    0   N/A  N/A     13177      G   /usr/bin/nvidia-settings            0MiB |
|    0   N/A  N/A     33039      G   /usr/bin/nvidia-settings            0MiB |
|    1   N/A  N/A      1960      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Question about data preprocessing

Hello!
First I downloaded datasets.tar.gz and extracted it.
Then I ran ./PrepareDataset.sh CNNDM ./datasets/cnndm single, but got this error:

Traceback (most recent call last):
  File "script/createVoc.py", line 72, in <module>
    summary = " ".join(e["summary"])
KeyError: 'summary'

Then I looked at datasets/cnndm/train.label.jsonl and found that it does not contain the summary key. How can this problem be solved?

Memory issue

Memory usage keeps growing during training until it runs out of memory?
