- 2024 May-18 : Our paper is available here.
- 2024 Feb-20 : This work has been accepted by COLING 2024.
- 2023 Oct-13 : Updated the demo interface.
- 2023 Aug-02 : Released the demo video. [YouTube]
Figure 1: The overview of TIGER. Given the dialogue context, the response modal predictor first determines the modality of the response; the textual dialogue response generator then produces a text response, which the Text-to-Image translator converts into an image when an image response is required.
- We propose mulTImodal GEnerator for dialogue Response (TIGER), a unified generative model framework designed for multimodal dialogue response generation. Notably, this framework is capable of handling conversations involving any combination of modalities.
- We implement a system for multimodal dialogue response generation, incorporating both text and images, based on TIGER.
- Extensive experiments show that TIGER achieves new state-of-the-art results in both automatic and human evaluations, validating the effectiveness of our system in providing a superior multimodal conversational experience.
We implemented a multimodal dialogue system based on TIGER, as depicted in the figure above.
Our system offers various modifiable components:
- For the textual dialogue response generator, users can choose decoding strategies and adjust related parameters.
- For the Text-to-Image translator, users can freely modify the prompt template and negative prompt to suit different requirements. Default prompt templates and negative prompts are provided to enhance the realism of generated images.
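To illustrate what the decoding parameters above control, here is a minimal, self-contained sketch of temperature scaling and nucleus (top-p) sampling over a toy distribution; the function names and the logits are hypothetical, not the demo's actual code:

```python
import math

def apply_temperature(logits, temperature=1.0):
    """Scale logits by temperature, then softmax; a lower temperature
    sharpens the distribution, a higher one flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p (nucleus sampling); zero out the rest and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

# Toy next-token distribution over a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = apply_temperature(logits, temperature=0.7)
filtered = top_p_filter(probs, top_p=0.9)
```

Sampling then draws from `filtered` instead of `probs`, trading diversity for coherence as `top_p` and `temperature` shrink.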
Note: our research focuses on open-domain multimodal dialogue response generation, so the system may not possess strong instruction-following capabilities. Users can treat it as a companion or listener, but using it as a QA system or AI painting generator is not recommended.
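The prompt templates and negative prompts mentioned above amount to plain string formatting around the caption the dialogue model produces; the template and negative prompt below are illustrative placeholders, not the defaults shipped with the demo:

```python
# Hypothetical template: the generated image description is slotted into a
# style-bearing prompt, while the negative prompt lists artifacts to avoid.
PROMPT_TEMPLATE = "{caption}, realistic style, high quality, detailed"
NEGATIVE_PROMPT = "lowres, blurry, bad anatomy, watermark, text"

def build_prompts(caption, template=PROMPT_TEMPLATE, negative=NEGATIVE_PROMPT):
    """Fill the template with the caption produced by the dialogue model."""
    return template.format(caption=caption.strip()), negative

prompt, negative = build_prompts("a golden retriever playing in the snow")
```

Swapping the template or the negative prompt changes the style and quality constraints without touching the dialogue model itself.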
Due to page limitations, the paper gives only a concise introduction to our method. More implementation details, experimental results, and discussion can be found in the supplementary material.
A GPU with 24GB of memory (18GB at runtime) is sufficient for the demo.
cd TIGER/
conda env create -f environment.yml
conda activate tiger
Please download our model weights from here (Google Drive). The Text-to-Image Translator's weights have already been uploaded to Hugging Face, so you don't need to download them locally. More details can be found at friedrichor/stable-diffusion-2-1-realistic.
The final weights should be organized in a single folder with a structure similar to the following:
TIGER
├── demo
│   └── ...
├── model_weights
│   ├── tiger_response_modal_predictor.pth
│   ├── tiger_textual_dialogue_response_generator.pth
│   └── tiger_text2image_translator
│       ├── feature_extractor
│       │   └── preprocessor_config.json
│       ├── scheduler
│       │   └── scheduler_config.json
│       ├── text_encoder
│       │   ├── config.json
│       │   └── pytorch_model.bin
│       ├── tokenizer
│       │   ├── merges.txt
│       │   ├── special_tokens_map.json
│       │   ├── tokenizer_config.json
│       │   └── vocab.json
│       ├── unet
│       │   ├── config.json
│       │   └── diffusion_pytorch_model.bin
│       ├── vae
│       │   ├── config.json
│       │   └── diffusion_pytorch_model.bin
│       └── model_index.json
├── tiger
│   └── ...
├── utils
│   └── ...
├── demo.py
...
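A quick way to verify the layout before launching the demo is to check that every expected file is in place; `missing_weights` is a hypothetical helper, not part of the repository, and its file list simply mirrors the tree above:

```python
import os

# Files expected under model_weights/, mirroring the tree above.
EXPECTED = [
    "tiger_response_modal_predictor.pth",
    "tiger_textual_dialogue_response_generator.pth",
    "tiger_text2image_translator/model_index.json",
    "tiger_text2image_translator/feature_extractor/preprocessor_config.json",
    "tiger_text2image_translator/scheduler/scheduler_config.json",
    "tiger_text2image_translator/text_encoder/config.json",
    "tiger_text2image_translator/text_encoder/pytorch_model.bin",
    "tiger_text2image_translator/tokenizer/merges.txt",
    "tiger_text2image_translator/tokenizer/special_tokens_map.json",
    "tiger_text2image_translator/tokenizer/tokenizer_config.json",
    "tiger_text2image_translator/tokenizer/vocab.json",
    "tiger_text2image_translator/unet/config.json",
    "tiger_text2image_translator/unet/diffusion_pytorch_model.bin",
    "tiger_text2image_translator/vae/config.json",
    "tiger_text2image_translator/vae/diffusion_pytorch_model.bin",
]

def missing_weights(root="model_weights"):
    """Return the expected weight files that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.isfile(os.path.join(root, p))]
```

Running `missing_weights()` from the repository root should return an empty list once all downloads are unpacked.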
python demo.py --config demo/demo_config.yaml
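The launch command above passes a YAML config path via a `--config` flag; a minimal sketch of how such a flag is typically wired up (hypothetical — the actual demo.py may parse its arguments differently):

```python
import argparse

def parse_args(argv=None):
    """Parse the --config flag the demo is launched with."""
    parser = argparse.ArgumentParser(description="TIGER demo")
    parser.add_argument("--config", default="demo/demo_config.yaml",
                        help="path to the demo configuration file")
    return parser.parse_args(argv)

args = parse_args(["--config", "demo/demo_config.yaml"])
```

Editing `demo/demo_config.yaml` (or pointing `--config` at a copy) is the intended way to adjust the demo without touching the code.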
If you find our work useful in your research, please consider citing us:
@inproceedings{kong-etal-2024-tiger-unified,
title = "{TIGER}: A Unified Generative Model Framework for Multimodal Dialogue Response Generation",
author = "Kong, Fanheng and
Wang, Peidong and
Feng, Shi and
Wang, Daling and
Zhang, Yifei",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italy",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1403",
pages = "16135--16141",
}