A curated list of papers on attacks and defenses targeting multimodal large language models (MLLMs).
Papers are sorted by release date in descending order.
Search the page for keywords such as an attack/defense type (e.g., Jailbreaking) or the adversarial knowledge setting (e.g., Black-box) to quickly locate related papers.
Attack papers sorted by year: | 2024 | 2023 |
Defense papers sorted by year: | 2024 | 2023 |
Survey papers sorted by year: | 2023 |
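Beyond searching the rendered page, the tables can also be filtered programmatically once the list is cloned locally. The sketch below is illustrative only: it assumes the list is saved as `README.md` (a hypothetical filename) and matches keywords against the pipe-separated paper rows.

```python
# Hypothetical helper: filter this list's markdown tables by a keyword.
# The README.md path is an assumption for illustration, not part of the list itself.
from pathlib import Path


def find_papers(readme_path: str, keyword: str) -> list[str]:
    """Return paper rows whose cells mention `keyword`, case-insensitively."""
    matches = []
    for line in Path(readme_path).read_text(encoding="utf-8").splitlines():
        # Paper rows are pipe-separated and begin with a year,
        # so skip the "Year | Title | ..." headers and "---|---" separators.
        if "|" in line and not line.lstrip().startswith(("Year", "---")):
            if keyword.lower() in line.lower():
                matches.append(line.strip())
    return matches


if __name__ == "__main__":
    # e.g., list every jailbreaking entry; swap in "Black-box", "Backdoor", etc.
    for row in find_papers("README.md", "Jailbreaking"):
        print(row)
```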
Attack papers: 2024

Year | Title | Attack Type | Adversarial Knowledge | Venue | Paper Link | Code Link |
---|---|---|---|---|---|---|
2024 | Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images | High Energy-Latency (Availability) | White-box/Black-box | ICLR 2024 | Link | Code |
2024 | Vulnerabilities Unveiled: Adversarially Attacking a Multimodal Vision Language Model for Pathology Imaging | Misprediction | White-box | Arxiv | Link | |
Attack papers: 2023

Year | Title | Attack Type | Adversarial Knowledge | Venue | Paper Link | Code Link |
---|---|---|---|---|---|---|
2023 | Privacy-Aware Document Visual Question Answering | Membership Inference | White-box | Arxiv | Link | |
2023 | OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization | Misprediction | Black-box | Arxiv | Link | |
2023 | On the Robustness of Large Multimodal Models Against Image Adversarial Attacks | Misprediction | White-box | Arxiv | Link | |
2023 | QuantAttack: Exploiting Dynamic Quantization to Attack Vision Transformers | Availability | | Arxiv | Link | |
2023 | Query-Relevant Images Jailbreak Large Multi-Modal Models | Jailbreaking | Black-box | Arxiv | Link | Code |
2023 | MMA-Diffusion: MultiModal Attack on Diffusion Models | Jailbreaking | White-box/Black-box | Arxiv | Link | |
2023 | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Jailbreaking | White-box/Black-box | Arxiv | Link | Code |
2023 | BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning | Backdoor | | Arxiv | Link | |
2023 | Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Jailbreaking | Black-box | Arxiv | Link | |
2023 | Magmaw: Modality-Agnostic Adversarial Attacks on Machine Learning-Based Wireless Communication Systems | | | Arxiv | Link | |
2023 | Composite Backdoor Attacks Against Large Language Models | Backdoor | | Arxiv | Link | |
2023 | VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models | Misprediction | Black-box | Arxiv | Link | Coming soon |
2023 | Can Language Models be Instructed to Protect Personal Information? | Jailbreaking | Black-box | Arxiv | Link | Code |
2023 | Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Misprediction | Black-box | Arxiv | Link | Code |
2023 | Black-box Attacks on Image Activity Prediction and its Natural Language Explanations | Misprediction | Black-box | Arxiv | Link | |
2023 | Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study | Membership Inference | Black-box | ICCV 2023 | Link | |
2023 | How Robust is Google’s Bard to Adversarial Image Attacks? | Jailbreaking | Black-box | Arxiv | Link | Code |
2023 | Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning | Misprediction | Black-box | Arxiv | Link | |
2023 | Adversarial Illusions in Multi-Modal Embeddings | Misprediction | Grey-box | Arxiv | Link | Code |
2023 | On the Adversarial Robustness of Multi-Modal Foundation Models | Misgeneration | White-box | ICCV 2023 | Link | |
2023 | Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | Jailbreaking | Grey-box | Arxiv | Link | |
2023 | Are aligned neural networks adversarially aligned? | Jailbreaking | Black-box | Arxiv | Link | |
2023 | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Jailbreaking | White-box/Black-box | Arxiv | Link | Code |
2023 | On Evaluating Adversarial Robustness of Large Vision-Language Models | Misgeneration | Black-box | Arxiv | Link | Code |
2023 | Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning | Backdoor | | Arxiv | Link | Coming soon |
Defense papers: 2024

Year | Title | Mitigating | Defense Strategy | Venue | Paper Link | Code Link |
---|---|---|---|---|---|---|
2024 | InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance | Jailbreaking | In-processing (Alignment Technique) | Arxiv | Link | Code |
2024 | The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline | Copyright Breaches | Pre-processing (Backdoor) | Arxiv | Link | Code |
2024 | MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance | Jailbreaking | Post-processing (Detector) | Arxiv | Link | Code |
Defense papers: 2023

Year | Title | Mitigating | Defense Strategy | Venue | Paper Link | Code Link |
---|---|---|---|---|---|---|
2023 | A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection | Jailbreaking | Pre-/Post-processing (Adding Noise/Detector) | Arxiv | Link | |
2023 | Effective Backdoor Mitigation Depends on the Pre-training Objective | Backdoor | Finetuning on Clean Data | Arxiv | Link | |
2023 | Adversarial Prompt Tuning for Vision-Language Models | Misprediction | Adversarial Training | Arxiv | Link | Code |
2023 | Watermarking Vision-Language Pre-trained Models for Multi-modal Embedding as a Service | Model Stealing | Graph Embedding Models | Arxiv | Link | Demo |
2023 | Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks | Backdoor | Adversarial Training | Arxiv | Link | Code |
2023 | Bidirectional Contrastive Split Learning for Visual Question Answering | Backdoor | Robust Training | AAAI 2024 | Link | |
Survey papers: 2023

Year | Title | Venue | Paper Link |
---|---|---|---|
2023 | A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection | Arxiv | Link |