
dialoglm's Introduction

DialogLM

Code for the AAAI 2022 paper DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization.

Pre-trained Models

We release two versions of pre-trained models.

  • DialogLM is based on UniLMv2. It comes in two versions, one with and one without sparse attention, for processing dialogues of different lengths.
  • DialogLED builds on the Longformer-Encoder-Decoder (LED) architecture and is further pre-trained on a large amount of long dialogue data with window-based denoising as the pre-training task. You can use its base and large versions directly through HuggingFace, as shown in the sketch below.
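
For example, here is a minimal sketch of loading DialogLED through the HuggingFace transformers library. The checkpoint name MingZhong/DialogLED-base-16384 follows the naming used in the issues below, and the generation settings are illustrative assumptions, not official usage instructions:

from transformers import LEDTokenizer, LEDForConditionalGeneration

# base and large variants are assumed to load the same way
tokenizer = LEDTokenizer.from_pretrained("MingZhong/DialogLED-base-16384")
model = LEDForConditionalGeneration.from_pretrained("MingZhong/DialogLED-base-16384")

dialogue = "A: Shall we start the meeting? B: Yes, let's go through the agenda."
inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs.input_ids, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))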

Datasets

Please download the five datasets we used in our paper here (AMI, ICSI, QMSum, ForeverDreaming, TVMegaSite).

Finetuning for Downstream Tasks

Please go to the corresponding folders to apply the pre-trained models to downstream tasks related to long dialogues.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

dialoglm's People

Contributors

microsoftopensource, nlpyang


dialoglm's Issues

Pre-trained and fine-tuned models not generating dialogue summaries.

Hello, I'm trying to generate summaries of dialogues using the pre-trained model (MingZhong/DialogLED-large-5120) as well as a fine-tuned one (on the AMI dataset). It does not produce meaningful summaries; instead, it just replicates part of the dialogue. I'm adapting the AllenAI example (allenai/led-large-16384-arxiv) with the HuggingFace LEDTokenizer and LEDForConditionalGeneration scripts:

LONG_DIALOGUE = """ Here goes my long dialogue.""""
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("AMI_DialogLED_large/")
input_ids = tokenizer(LONG_DIALOGUE, return_tensors="pt").input_ids.to("cuda")
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1

model = LEDForConditionalGeneration.from_pretrained("AMI_DialogLED_large/", return_dict_in_generate=True).to("cuda")
sequences = model.generate(input_ids, global_attention_mask=global_attention_mask).sequences

summary = tokenizer.batch_decode(sequences)
print(summary)

With this code the output is a gibberish dialogue, not an actual summary as shown in your model outputs. Is there another way to generate summaries, or should the model be used in a different way?

Thanks,

David
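
For reference, HuggingFace's generate() falls back to the checkpoint's default generation settings unless they are overridden, so one thing worth trying (an assumption, not a confirmed fix) is to pass explicit beam-search parameters, reusing model, input_ids, and global_attention_mask from the snippet above. The values here are illustrative:

sequences = model.generate(
    input_ids,
    global_attention_mask=global_attention_mask,
    max_length=256,          # budget for the generated summary
    num_beams=4,             # beam search instead of greedy decoding
    no_repeat_ngram_size=3,  # discourage verbatim repetition of the input
    early_stopping=True,
).sequences
print(tokenizer.batch_decode(sequences, skip_special_tokens=True))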

hyperparameters to replicate experiments

Hi! Thank you for sharing the DialogLM and DialogLED implementations. Would it be possible to release the hyperparameters used for ForeverDreaming and TVMegaSite? I see the DialogLM_UniLM folder uses TVMegaSite as an example and provides its hyperparameters. Does fine-tuning on ForeverDreaming use slightly different parameters? In particular, knowing num_training_steps and the batch size would be very useful for replicating the results presented in the paper.

error while running the script

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 1199) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

How is DialogLM able to process 5,120 tokens without using the hybrid attention approach?

As mentioned in the research paper:

    DialogLM is obtained by further pre-training UniLM-base with the window-based denoising method. Its maximum input length is 5,120, and tokens exceeding this length are truncated in the experiments. DialogLM-sparse additionally introduces the hybrid attention approach in the pre-training process of DialogLM, so its maximum length is increased to 8,192 tokens.

I don't understand how DialogLM is able to process 5,120 tokens without the hybrid attention approach. Since it uses UniLMv2 as the backbone, shouldn't the maximum number of tokens it can process be 512?
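
For context, a common way to exceed a backbone's original position limit (a sketch of the general technique only, not necessarily the exact procedure used for DialogLM) is to enlarge the learned position-embedding table, copy over the pre-trained rows, and train the new positions during further pre-training on longer inputs:

import torch
import torch.nn as nn

HIDDEN = 768                              # hypothetical hidden size
old_pos_emb = nn.Embedding(512, HIDDEN)   # backbone's original 512 positions
new_pos_emb = nn.Embedding(5120, HIDDEN)  # extended to 5,120 positions

with torch.no_grad():
    # reuse the pre-trained embeddings for the first 512 positions;
    # rows 512..5119 stay randomly initialized and are learned during
    # further pre-training (e.g., window-based denoising on long dialogues)
    new_pos_emb.weight[:512] = old_pos_emb.weight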

The outputs on the ICSI and AMI datasets

Hi, thanks for your contribution to meeting transcript summarization.

I've noticed that 12 of the 20 AMI outputs and 3 of the 6 ICSI outputs are incomplete. Is this caused by a bug?

URLs for unilm vocab.txt are deprecated

The vocab.txt links in tokenization_unilm.py are deprecated. They need to be updated to the latest checkpoint links (see https://unilm.blob.core.windows.net/ckpt/unilm1.2-base-uncased-vocab.txt):

PRETRAINED_VOCAB_FILES_MAP = {
    'vocab_file':
    {
        'unilm-large-cased': "https://conversationhub.blob.core.windows.net/beit-share-public/ckpt/unilm-large-cased-vocab.txt",
        'unilm-base-cased': "https://conversationhub.blob.core.windows.net/beit-share-public/ckpt/unilm-base-cased-vocab.txt",
        'unilm1-large-cased': "https://conversationhub.blob.core.windows.net/beit-share-public/ckpt/unilm1-large-cased-vocab.txt",
        'unilm1-base-cased': "https://conversationhub.blob.core.windows.net/beit-share-public/ckpt/unilm1-base-cased-vocab.txt",
        'unilm1.2-base-uncased': "https://conversationhub.blob.core.windows.net/beit-share-public/ckpt/unilm1.2-base-uncased-vocab.txt"
    }
}

Thus, the links in configuration_minilm.py, configuration_unilm.py, tokenization_minilm.py, and tokenization_unilm.py need to be updated as well.
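
A small sanity check (a sketch; it assumes the PRETRAINED_VOCAB_FILES_MAP dict above is in scope) to confirm that each candidate link actually resolves before patching the four files:

import urllib.request

# probe every vocab URL in the map shown above
for name, url in PRETRAINED_VOCAB_FILES_MAP['vocab_file'].items():
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{name}: HTTP {resp.status}")
    except Exception as exc:
        print(f"{name}: FAILED ({exc})")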
