
mtgnn-sum's Introduction

MTGNN-SUM

This repository contains the implementation for our paper: Multi Graph Neural Network for Extractive Long Document Summarization

Installation

The code is written in Python 3.6+. Its dependencies are listed in requirements.txt and can be installed with:

pip install -r requirements.txt
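
Optionally, create an isolated environment first (any Python 3.6+ interpreter should do) and run the command above inside it, for example:

python3 -m venv venv
source venv/bin/activate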

Datasets

Download the PubMed and arXiv datasets from here

Preprocess data

For the PubMed dataset:

python preprocess_data.py --input_path dataset/pubmed-dataset --output_path dataset/pubmed --task train
python preprocess_data.py --input_path dataset/pubmed-dataset --output_path dataset/pubmed --task val
python preprocess_data.py --input_path dataset/pubmed-dataset --output_path dataset/pubmed --task test

For the arXiv dataset:

python preprocess_data.py --input_path dataset/arxiv-dataset --output_path dataset/arxiv --task train
python preprocess_data.py --input_path dataset/arxiv-dataset --output_path dataset/arxiv --task val
python preprocess_data.py --input_path dataset/arxiv-dataset --output_path dataset/arxiv --task test

After converting the data to the standard JSON format, process the dataset by running sh PrepareDataset.sh in the project directory. The processed files will be placed under the cache directory.
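
To sanity-check the preprocessing output, a quick inspection along these lines can help (the path assumes the PubMed commands above; the exact field names depend on preprocess_data.py):

import json

# Peek at the first record of the preprocessed training split.
with open("dataset/pubmed/train.label.jsonl") as f:
    first = json.loads(f.readline())
print(sorted(first.keys()))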

Get contextualized embeddings

For the PubMed dataset:

python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/pubmed/train.label.jsonl --output ./bert_features_pubmed/bert_features_train --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/pubmed/val.label.jsonl --output ./bert_features_pubmed/bert_features_val --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/pubmed/test.label.jsonl --output ./bert_features_pubmed/bert_features_test --batch_size 100

For the arXiv dataset:

python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/arxiv/train.label.jsonl --output ./bert_features_arxiv/bert_features_train --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/arxiv/val.label.jsonl --output ./bert_features_arxiv/bert_features_val --batch_size 100
python feature_extraction.py --bert_model bert-base-uncased --data ./dataset/arxiv/test.label.jsonl --output ./bert_features_arxiv/bert_features_test --batch_size 100
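
For reference, the sketch below shows one common way to obtain mean-pooled BERT sentence vectors with a recent transformers version. It only illustrates the general idea; the pooling, batching, and output format used by feature_extraction.py may differ.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

sentences = ["the model is trained on pubmed .", "results are reported with rouge ."]
with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # [batch, seq_len, 768]
    mask = batch["attention_mask"].unsqueeze(-1).float()
    sent_vecs = (hidden * mask).sum(1) / mask.sum(1)      # one 768-d vector per sentence
print(sent_vecs.shape)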

Training

Run a command like this:

python train.py --cuda --gpu 0 --data_dir <data/dir/of/your/json-format/dataset> --cache_dir <cache/directory/of/graph/features> --embedding_path <glove_path> --model [HSG|MTHSG] --save_root <model path> --log_root <log path> --bert_path <bert feature path> --lr_descent --grad_clip -m 3

For example:

python train.py --cuda --gpu 0 --data_dir dataset/arxiv --cache_dir cache/arxiv --embedding_path glove.42B.300d.txt --model MTHSG --save_root models_arxiv --log_root log_arxiv/ --bert_path bert_features_arxiv --lr_descent --grad_clip -m 3
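
The --grad_clip and --lr_descent flags correspond to gradient-norm clipping and a decaying learning rate. A minimal PyTorch sketch of those two mechanics is shown below, using a toy model; the actual clip norm and schedule in train.py may differ.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
initial_lr, total_steps = 5e-4, 10000

for step in range(1, 101):
    loss = model(torch.randn(32, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # --grad_clip
    for group in optimizer.param_groups:                               # --lr_descent
        group["lr"] = initial_lr * max(0.1, 1.0 - step / total_steps)
    optimizer.step()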

Evaluation

For evaluation, the command looks like this:

python evaluation.py --cuda --gpu 0 --data_dir <data/dir/of/your/json-format/dataset> --cache_dir <cache/directory/of/graph/features> --embedding_path <glove_path> --model [HSG|MTHSG] --save_root <model path> --log_root <log path> --bert_path <bert feature path> -m 5 --test_model multi --use_pyrouge

For example:

python evaluation.py --cuda --gpu 0 --data_dir dataset/arxiv --cache_dir cache/arxiv --embedding_path glove.42B.300d.txt  --model MTHSG --save_root models_arxiv --log_root log_arxiv/ --bert_path bert_features_arxiv -m 5 --test_model multi --use_pyrouge

Note: To use ROUGE evaluation, you need to download the 'ROUGE-1.5.5' package and then use pyrouge.
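
Once ROUGE-1.5.5 and pyrouge are set up, a minimal evaluation call looks roughly like the following; the directory layout and filename patterns are only an illustration:

from pyrouge import Rouge155

r = Rouge155()  # assumes ROUGE-1.5.5 is installed and pyrouge is configured to find it
r.system_dir = "results/decoded"
r.model_dir = "results/reference"
r.system_filename_pattern = r"(\d+)_decoded.txt"
r.model_filename_pattern = "#ID#_reference.txt"
output = r.convert_and_evaluate()
print(r.output_to_dict(output)["rouge_1_f_score"])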

Error Handling: If you encounter the error Cannot open exception db file for reading: /path/to/ROUGE-1.5.5/data/WordNet-2.0.exc.db when using pyrouge, the problem can be solved as described here.
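
A commonly reported workaround is to rebuild the WordNet exceptions database that ships with ROUGE-1.5.5 (the script name really is spelled buildExeptionDB.pl):

cd /path/to/ROUGE-1.5.5/data/WordNet-2.0-Exceptions/
./buildExeptionDB.pl . exc WordNet-2.0.exc.db
cd ..
rm WordNet-2.0.exc.db
ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db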

Some code is borrowed from HSG. Thanks for their work.

Citation

@inproceedings{doan-etal-2022-multi,
    title = "Multi Graph Neural Network for Extractive Long Document Summarization",
    author = "Doan, Xuan-Dung  and Nguyen, Le-Minh  and Bui, Khac-Hoai Nam",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    year = "2022"
}

mtgnn-sum's Issues

num_workers problems

Sorry to bother you again. In train.py, setting the DataLoader's num_workers parameter to any value greater than 0 always produces errors like:
RuntimeError: DataLoader worker (pid 62777) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 62777) exited unexpectedly
How can I deal with this when num_workers > 0? Have you run into this problem as well? Thanks.

ZeroDivisionError in add_unknown_words_by_avg

Sorry to bother you again. I am running into the following error:
Traceback (most recent call last):
  File "train.py", line 379, in <module>
    main()
  File "train.py", line 340, in main
    pretrained_weight = embed_loader.add_unknown_words_by_avg(vectors, args.word_emb_dim)
  File "/home/cht/cht/MTGNN/MTGNN-SUM-main/module/embedding.py", line 85, in add_unknown_words_by_avg
    avg = col[m] / int(len(word_vecs_numpy))
ZeroDivisionError: float division by zero

Training always gets killed

Training always gets killed around iteration 700:
2022-11-05 18:25:35,346 INFO : | end of iter 0 | time: 21.63s | train loss 0.7058 |
2022-11-05 18:49:22,869 INFO : | end of iter 100 | time: 13.68s | train loss 11.6181 |
2022-11-05 19:12:35,681 INFO : | end of iter 200 | time: 13.09s | train loss 10.9483 |
2022-11-05 19:27:48,766 INFO : | end of iter 300 | time: 7.51s | train loss 10.7406 |
2022-11-05 19:39:26,067 INFO : | end of iter 400 | time: 6.64s | train loss 10.5892 |
2022-11-05 19:51:05,295 INFO : | end of iter 500 | time: 6.97s | train loss 10.4192 |
2022-11-05 20:02:30,155 INFO : | end of iter 600 | time: 7.10s | train loss 10.2825 |
2022-11-05 20:14:13,780 INFO : | end of iter 700 | time: 8.56s | train loss 10.2735 |
Killed
Why does training always get killed once it reaches iteration 700?

Also, the PubMed dataset has nearly 120,000 files, and processing 100 files takes about 12 minutes, so the whole run would take almost 30 days. Why is your V100 so much faster? What could be wrong? Please let me know, this is important. Thanks!

train time

Sorry to bother you. When I train this model, the training step itself takes only about 10 seconds, but data loading takes nearly 30 minutes. What can I do to make the data loader more efficient?

glove.42B.300d.txt

Hello, I cannot find this file. Where should I get it? Thanks, arnold.
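
The missing file is the 300-dimensional Common Crawl GloVe embedding referenced by --embedding_path; it is distributed by Stanford NLP, and a download along these lines should work (URL current at the time of writing):

wget https://nlp.stanford.edu/data/glove.42B.300d.zip
unzip glove.42B.300d.zip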

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.