Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

This repo provides the PyTorch source code for our paper: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data (ICLR 2024). Check out the project page here!

🔮 Abstract

Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.

💡 Approach

Figure: Overview of the motivation behind our approach, $C^3$. Our work provides a theoretical explanation of the unique geometry that arises from multi-modal contrastive learning, where a modality gap and alignment noise exist in the learned representation space. Building upon this observation, we present a straightforward technique, $C^3$, which enhances the interchangeability of embeddings between modalities, enabling the creation of cross-modal applications using only uni-modal data.
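
The three steps can be illustrated with a small NumPy sketch. This is not the repository's implementation. It assumes that "collapse" means subtracting each modality's mean embedding and that "corrupt" means adding Gaussian noise to the training embeddings. The function names, the toy data, and the noise level are all hypothetical, and the random vectors stand in for the outputs of a pre-trained contrastive encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def collapse(embeddings, modality_mean):
    """Collapse (assumed form): remove the modality's mean, then renormalize."""
    return l2_normalize(embeddings - modality_mean)

def corrupt(embeddings, noise_std, rng):
    """Corrupt (assumed form): add isotropic Gaussian noise, then renormalize."""
    noise = rng.normal(0.0, noise_std, size=embeddings.shape)
    return l2_normalize(embeddings + noise)

def modality_gap(a, b):
    """Distance between the mean embeddings of two modalities."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# Connect: toy stand-ins for paired embeddings from a pre-trained
# multi-modal contrastive encoder (hypothetical data, not real features).
rng = np.random.default_rng(0)
n, dim = 200, 16
shared = rng.normal(size=(n, dim))   # content common to both modalities
offset = np.zeros(dim)
offset[0] = 3.0                      # constant shift that creates a modality gap
text_emb = l2_normalize(shared + offset + 0.05 * rng.normal(size=(n, dim)))
image_emb = l2_normalize(shared - offset + 0.05 * rng.normal(size=(n, dim)))

gap_before = modality_gap(text_emb, image_emb)

# Collapse: center each modality with its own mean.
text_collapsed = collapse(text_emb, text_emb.mean(axis=0))
image_collapsed = collapse(image_emb, image_emb.mean(axis=0))
gap_after = modality_gap(text_collapsed, image_collapsed)

# Corrupt: noisy text embeddings would be the training inputs of a decoder.
train_inputs = corrupt(text_collapsed, noise_std=0.1, rng=rng)

print(f"gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

Under these assumptions, a captioning decoder would be trained on `train_inputs` (text only) and, at inference time, fed `image_collapsed` without added noise, so that no paired image-text data is needed for training.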

🚀 Getting Started

Reproduce embedding geometry analysis results here.
Reproduce image captioning results here.
Reproduce image generation results here.

🎯 Citation

If you use this repo in your research, please cite it as follows:

@inproceedings{C3,
  title={Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data},
  author={Zhang, Yuhui and Sui, Elaine and Yeung-Levy, Serena},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
