
MeanSum: A Model for Unsupervised Neural Multi-Document Abstractive Summarization

Corresponding paper, accepted to ICML 2019: https://arxiv.org/abs/1810.05739.

Requirements

Main requirements:

  • python 3
  • torch 0.4.0

The rest of the Python packages are listed in requirements.txt. Tested in Docker with the image pytorch/pytorch:0.4_cuda9_cudnn7.
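
As a quick, optional sanity check that the environment matches the tested setup (not part of the repo), the versions can be verified from Python:

import torch

print(torch.__version__)          # tested with 0.4.0
print(torch.cuda.is_available())  # True if a CUDA device is visible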

General setup

Execute inside scripts/:

  1. Create the directories that aren't part of the Git repo (checkpoints/, outputs/):
    bash setup_dirs.sh
  2. Install the Python packages:
    bash install_python_pkgs.sh
  3. The default parameters for tensorboardX cause text logged with writer.add_text() not to show up. Update them by running:
    python update_tensorboard.py

Downloading data and pretrained models

Data

  1. Download Yelp data: https://www.yelp.com/dataset and place files in datasets/yelp_dataset/
  2. Run the pre-processing script to create train, val, and test splits:
    bash scripts/preprocess_data.sh
    
  3. Download subword tokenizer built on Yelp and place in datasets/yelp_dataset/processed/: link

Pre-trained models

  1. Download summarization model and place in stable_checkpoints/sum/mlstm/yelp/batch_size_16-notes_cycloss_honly-sum_lr_0.0005-tau_2.0/: link
  2. Download language model and place in stable_checkpoints/lm/mlstm/yelp/batch_size_512-lm_lr_0.001-notes_data260_fixed/: link
  3. Download classification model and place in stable_checkpoints/clf/cnn/yelp/batch_size_256-notes_data260_fixed/: link

Reference summaries

Download from: link. Each row contains "Input.business_id", "Input.original_review_<num>_id", "Input.original_review__<num>_", "Answer.summary", etc. The "Answer.summary" is the reference summary written by the Mechanical Turk worker.
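
As an illustration, here is a minimal sketch of loading the reference summaries and grouping them by business, assuming the downloaded file is a CSV with the column names above (the path below is a placeholder, not the actual filename):

import csv
from collections import defaultdict

# Group reference summaries by business ID (placeholder path for the downloaded file).
references = defaultdict(list)
with open('path/to/reference_summaries.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        references[row['Input.business_id']].append(row['Answer.summary'])

print(len(references), 'businesses,', sum(len(v) for v in references.values()), 'reference summaries')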

Running

Testing with the pretrained model. This will compute and save the automated metrics. Results will be in outputs/eval/yelp/n_docs_8/unsup_<run_name>

NOTE: Unlike some conventions, the 'gpus' option here is the ID of the GPU to make visible, NOT the number of GPUs. Hence, on a machine with a single GPU, pass gpus=0.

python train_sum.py --mode=test --gpus=0 --batch_size=16 --notes=<run_name>

Training the summarization model (using the pre-trained language model and default hyperparameters). The automated metrics will be in checkpoints/sum/mlstm/yelp/<hparams>_<additional_notes>:

python train_sum.py --batch_size=16 --gpus=0,1,2,3 --notes=<additional_notes> 

Contributors

ayushoriginal, sosuperic

Issues

Issues about training a model on a Chinese corpus

Hi, thanks very much for sharing your implementation.
I tried to apply this model to a Chinese corpus without labels. I reused the default parameter settings but decreased the batch size to 8 because an "out of memory" error occurred with larger batch sizes. However, the generations turned out to be poor after training for 20 epochs. By the way, I pretrained the language model with the "pretrained_lm" code.
Is there anything I need to pay special attention to during training, such as the number of documents, the length of each document, or the size of the vocabulary?
Looking forward to your reply.

About the Discriminator model

Hello, I have a question. I can't find the model path in project_settings.py, and the class Discriminator that appears in train_sum.py is also nowhere to be found.

About the NLL loss

Hi, when I train without the pre-trained language model, why is the NLL loss sometimes NaN?

Where can I get business.json?

I got an error which says
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/yelp_dataset/business.json'
Where can I get business.json?

About the language model

Hello, I have a question. Is the language model provided for download generated by pretrain_lm.py?

How to evaluate the model?

The model has been trained. Which file is used to measure its performance? Is it run_evaluations.py?

Running MeanSum without CUDA

Hi, I am trying to run MeanSum on a cluster where CUDA is not available. When loading the language model, deserialization fails with `AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'`, which I guess is expected since CUDA is not enabled. Is it possible to run MeanSum with just CPUs?
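
A common PyTorch-level workaround (not specific to this repo) is to remap CUDA storages to the CPU when deserializing the checkpoint; any explicit .cuda() calls elsewhere in the code would still need to be guarded separately. A minimal sketch, with a placeholder path:

import torch

# Remap all CUDA storages to CPU while loading a checkpoint that was saved on a GPU machine.
# The lambda form of map_location also works on older PyTorch versions.
checkpoint = torch.load('path/to/lm_checkpoint.pt',
                        map_location=lambda storage, loc: storage)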

Possible bug when setting k > 1 & Gumbel_hard = True

Thanks for the interesting work and code

I was trying to get my head around the code and I couldn't understand something:

When training the mlstm model, if we try the following set of parameters:

  • gumbel_hard = true
  • sampling method = "greedy" or "sample"
  • k > 1

In the line mlstm #L291, the logits_to_prob function will return a strict one-hot vector, according to the PyTorch Gumbel-Softmax implementation F.gumbel_softmax.

Afterwards, this probability vector is sent to the prob_to_vocab_id method, which is supposed to apply either torch.topk (beam search) or torch.multinomial (top-k sampling).

Implementation-wise, this shouldn't raise any errors in beam search, because torch.topk can handle ties; however, the top k you get aren't the actual top-k probabilities (screenshot omitted).

But if you try to sample with torch.multinomial from a one-hot vector where k > 1, you get a runtime error (screenshot omitted).

Am I missing something here?
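
For reference, a minimal sketch of the two behaviours described above, assuming a recent PyTorch with F.gumbel_softmax and that multinomial is called without replacement:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 5)
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # strict one-hot row

# topk runs without error, but everything after the first entry is an arbitrary tie at 0
values, indices = torch.topk(one_hot, k=3)
print(values)  # e.g. tensor([[1., 0., 0.]])

# multinomial without replacement fails: only one non-zero category to draw from
try:
    torch.multinomial(one_hot, num_samples=3, replacement=False)
except RuntimeError as err:
    print(err)  # invalid multinomial distribution (not enough non-zero categories)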

Use word embeddings

Hello, I want to train this model on my custom data, which is not as large as the Yelp dataset. I wanted to use fastText to initialize the word embeddings instead of initializing them with zeros. How can I do that in this code? A general sketch of the technique is below.
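
Not the repo's API, but a general sketch: build an embedding matrix from a pre-trained fastText .vec file and copy it into an nn.Embedding. The vocab argument is a hypothetical dict mapping token to index; adapt it to however the code builds its vocabulary.

import numpy as np
import torch
import torch.nn as nn

def load_fasttext_embeddings(vec_path, vocab, emb_size):
    # Start from small random vectors so out-of-vocabulary rows are not all zero.
    weights = np.random.normal(0, 0.1, (len(vocab), emb_size)).astype(np.float32)
    with open(vec_path, encoding='utf-8') as f:
        next(f)  # skip the "<n_words> <dim>" header line of the .vec format
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == emb_size:
                weights[vocab[word]] = np.asarray(vec, dtype=np.float32)
    emb = nn.Embedding(len(vocab), emb_size)
    emb.weight.data.copy_(torch.from_numpy(weights))
    return emb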

Can't find the reconstruction part

Hi @sosuperic, according to the paper, when we backpropagate through the model there should be an autoencoder reconstruction loss and an average summary similarity loss, but I couldn't find the code for autoenc_loss in train_sum.py (I only see the backward part, and the autoencoder file path is None in project_settings.py). Please tell me where they are, thank you.

Another problem: I think cycle_loss should be the average similarity loss, but I found this parameter

self.cycle_loss='enc' # When 'rec', reconstruct original texts. When 'enc', compare rev_enc and sum_enc embs

in project_settings.py. Now I am confused about the meaning of the cycle loss. Does it mean we can only choose one of the two losses to backpropagate, or what is the actual meaning of this parameter? Looking forward to your reply! Thank you again.

About the file: text_cnn.py

When I change batch_size = 16, emb_size = 32, and hidden_size = 64, text_cnn.py throws an error: RuntimeError: Expected 4-dimensional input for 4-dimensional weight [128, 1, 3, 32], but got 5-dimensional input of size [16, 1, 150, 31688, 32] instead.

How can I solve it?
