
MeanSum: A Model for Unsupervised Neural Multi-Document Abstractive Summarization

Corresponding paper, accepted to ICML 2019: https://arxiv.org/abs/1810.05739.

Requirements

Main requirements:

  • python 3
  • torch 0.4.0

The rest of the Python packages are listed in requirements.txt. Tested in Docker with the image pytorch/pytorch:0.4_cuda9_cudnn7.
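
As a quick, optional sanity check that the environment matches the tested setup (not part of the repo), the versions can be verified from Python:

import torch

print(torch.__version__)          # tested with 0.4.0
print(torch.cuda.is_available())  # True if a CUDA device is visible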

General setup

Execute inside scripts/:

  1. Create the directories that aren't part of the Git repo (checkpoints/, outputs/):
    bash setup_dirs.sh
  2. Install the Python packages:
    bash install_python_pkgs.sh
  3. The default parameters for tensorboardX cause text logged with writer.add_text() not to show up. Update them by running:
    python update_tensorboard.py

Downloading data and pretrained models

Data

  1. Download Yelp data: https://www.yelp.com/dataset and place files in datasets/yelp_dataset/
  2. Run the pre-processing script to create train, val, and test splits:
    bash scripts/preprocess_data.sh
    
  3. Download subword tokenizer built on Yelp and place in datasets/yelp_dataset/processed/: link

Pre-trained models

  1. Download summarization model and place in stable_checkpoints/sum/mlstm/yelp/batch_size_16-notes_cycloss_honly-sum_lr_0.0005-tau_2.0/: link
  2. Download language model and place in stable_checkpoints/lm/mlstm/yelp/batch_size_512-lm_lr_0.001-notes_data260_fixed/: link
  3. Download classification model and place in stable_checkpoints/clf/cnn/yelp/batch_size_256-notes_data260_fixed/: link

Reference summaries

Download from: link. Each row contains "Input.business_id", "Input.original_review_<num>_id", "Input.original_review__<num>_", "Answer.summary", etc. The "Answer.summary" is the reference summary written by the Mechanical Turk worker.
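
As an illustration, here is a minimal sketch of loading the reference summaries and grouping them by business, assuming the downloaded file is a CSV with the column names above (the path below is a placeholder, not the actual filename):

import csv
from collections import defaultdict

# Group reference summaries by business ID (placeholder path for the downloaded file).
references = defaultdict(list)
with open('path/to/reference_summaries.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        references[row['Input.business_id']].append(row['Answer.summary'])

print(len(references), 'businesses,', sum(len(v) for v in references.values()), 'reference summaries')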

Running

Testing with the pretrained model. This will compute and save the automated metrics. Results will be in outputs/eval/yelp/n_docs_8/unsup_<run_name>

NOTE: Unlike some conventions, the 'gpus' option here is the ID of the GPU to make visible, NOT the number of GPUs. Hence, on a machine with a single GPU, pass gpus=0.

python train_sum.py --mode=test --gpus=0 --batch_size=16 --notes=<run_name>

Training the summarization model (using the pre-trained language model and default hyperparameters). The automated metrics will be in checkpoints/sum/mlstm/yelp/<hparams>_<additional_notes>:

python train_sum.py --batch_size=16 --gpus=0,1,2,3 --notes=<additional_notes> 

Contributors

ayushoriginal, sosuperic

Issues

Issues about training a model on a Chinese corpus

Hi, thanks very much for sharing your implementation.
I tried to apply this model to a Chinese corpus without labels. I reused the default parameter settings but decreased the batch size to 8 because an "out of memory" error occurred with larger batch sizes. However, the generations turned out to be poor after training for 20 epochs. By the way, I pretrained the language model with the "pretrained_lm" code.
Is there anything I need to pay special attention to during training, such as the number of documents, the length of each document, or the size of the vocabulary?
Looking forward to your reply.

About the Discriminator model

Hello, I have a question. I can't find the model path in project_settings.py, and the class Discriminator that appears in train_sum.py is also nowhere to be found.

About the NLL loss

Hi, when I train without the pre-trained language model, why is the NLL loss sometimes NaN?

Where can I get business.json?

I got an error which says
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/yelp_dataset/business.json'
Where can I get business.json?

About the language model

Hello, I have a question. Is the language model provided for download generated by pretrain_lm.py?

How to evaluate the model?

The model has been trained. Which file is used to measure its performance? Is it run_evaluations.py?

Running MeanSum without CUDA

Hi, I am trying to run MeanSum on a cluster where CUDA is not available. When loading the language model, deserialization fails with `AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'`, which I guess is expected since CUDA is not enabled. Is it possible to run MeanSum with just CPUs?
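
A common PyTorch-level workaround (not specific to this repo) is to remap CUDA storages to the CPU when deserializing the checkpoint; any explicit .cuda() calls elsewhere in the code would still need to be guarded separately. A minimal sketch, with a placeholder path:

import torch

# Remap all CUDA storages to CPU while loading a checkpoint that was saved on a GPU machine.
# The lambda form of map_location also works on older PyTorch versions.
checkpoint = torch.load('path/to/lm_checkpoint.pt',
                        map_location=lambda storage, loc: storage)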

Possible bug when setting k > 1 & Gumbel_hard = True

Thanks for the interesting work and code

I was trying to get my head around the code and I couldn't understand something:

When training the mlstm model, if we try the following set of parameters:

  • gumbel_hard = true
  • sampling method = "greedy" or "sample"
  • k > 1

In the line mlstm #L291, the logits_to_prob function will return a strict one-hot vector, according to the PyTorch Gumbel-Softmax implementation F.gumbel_softmax.

Afterwards, this probability vector is sent to the prob_to_vocab_id method, which is supposed to apply either torch.topk (beam search) or torch.multinomial (top-k sampling).

Implementation-wise, this shouldn't raise any errors in beam search, because torch.topk can handle ties; however, the top k you get aren't the actual top-k probabilities (screenshot omitted).

But if you try to sample with torch.multinomial from a one-hot vector where k > 1, you get a runtime error (screenshot omitted).

Am I missing something here?
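
For reference, a minimal sketch of the two behaviours described above, assuming a recent PyTorch with F.gumbel_softmax and that multinomial is called without replacement:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 5)
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # strict one-hot row

# topk runs without error, but everything after the first entry is an arbitrary tie at 0
values, indices = torch.topk(one_hot, k=3)
print(values)  # e.g. tensor([[1., 0., 0.]])

# multinomial without replacement fails: only one non-zero category to draw from
try:
    torch.multinomial(one_hot, num_samples=3, replacement=False)
except RuntimeError as err:
    print(err)  # invalid multinomial distribution (not enough non-zero categories)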

Use word embeddings

Hello, I want to train this model on my custom data, which is not as large as the Yelp dataset. I wanted to use fastText to initialize the word embeddings instead of initializing them with zeros. How can I do that in this code? A general sketch of the technique is below.
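
Not the repo's API, but a general sketch: build an embedding matrix from a pre-trained fastText .vec file and copy it into an nn.Embedding. The vocab argument is a hypothetical dict mapping token to index; adapt it to however the code builds its vocabulary.

import numpy as np
import torch
import torch.nn as nn

def load_fasttext_embeddings(vec_path, vocab, emb_size):
    # Start from small random vectors so out-of-vocabulary rows are not all zero.
    weights = np.random.normal(0, 0.1, (len(vocab), emb_size)).astype(np.float32)
    with open(vec_path, encoding='utf-8') as f:
        next(f)  # skip the "<n_words> <dim>" header line of the .vec format
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == emb_size:
                weights[vocab[word]] = np.asarray(vec, dtype=np.float32)
    emb = nn.Embedding(len(vocab), emb_size)
    emb.weight.data.copy_(torch.from_numpy(weights))
    return emb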

Can't find the reconstruction part

Hi @sosuperic, according to the paper, when we backpropagate through the model there should be an autoencoder reconstruction loss and an average summary similarity loss, but I couldn't find the code for autoenc_loss in train_sum.py (I only see the backward part, and the autoencoder file path is None in project_settings.py). Please tell me where they are, thank you.

Another problem: I think cycle_loss should be the average similarity loss, but I found this parameter

self.cycle_loss='enc' # When 'rec', reconstruct original texts. When 'enc', compare rev_enc and sum_enc embs

in project_settings.py. Now I am confused about the meaning of the cycle loss. Does it mean we can only choose one of the two losses to backpropagate, or what is the actual meaning of this parameter? Looking forward to your reply! Thank you again.

About the file: text_cnn.py

When I change batch_size = 16, emb_size = 32, and hidden_size = 64, text_cnn.py throws an error: RuntimeError: Expected 4-dimensional input for 4-dimensional weight [128, 1, 3, 32], but got 5-dimensional input of size [16, 1, 150, 31688, 32] instead.

How can I solve it?
