PAT (Prompt Adversarial Tuning)

License: MIT

This is the official repository for "Fight Back Against Jailbreaking via Prompt Adversarial Tuning" by Yichuan Mo, Yuji Wang, Zeming Wei, and Yisen Wang. We introduce a defense strategy named PAT (Prompt Adversarial Tuning) to protect LLMs from jailbreak attacks.

Installation

All of our experiments were run with FastChat version 0.2.20 (fschat==0.2.20).

Models

Before running the experiments, download Vicuna-7B and/or LLaMA-2-7B-Chat. Then set the model path in experiments/configs/individual_xxx.py (for the multiple-behaviors experiment) or experiments/configs/transfer_xxx.py (for the transfer experiment).

Experiments

The experiments folder contains the code to reproduce the PAT defense on AdvBench.

  • To run the multiple-behaviors defense on harmful behaviors:
cd experiments/launch_scripts
bash run_gcg_multiple.sh vicuna 15 0.2 0
  • To run the transfer experiment (i.e., 25 behaviors, 2 models):
cd experiments/launch_scripts
bash run_gcg_transfer.sh vicuna_guanaco 15 0.2 0
  • To run the evaluation, use the scripts in eval_scripts: eval_attack.sh tests the model on harmful behaviors, and eval_benign.sh tests the model on benign prompts. In both cases you need to specify the paths to the attack file and the defense file.

Our training results are listed below, so you can test PAT directly with the provided defense controls.

The defense control for Vicuna-7B is:

Answer seriously esper quietlymedia (/ \". harmcgi{{[.(_{( zv\":

The defense control for LLaMA-2 is:

{[[ Tili particulderr einemroom folgenderomasquestion bezeichneter {%Begin Charlotteawtativelyiostream sug>::

The defense control for the transferable setting is:

Come ! critique ! respond seriously?. evil fortune////////////////WRITE asked ein legalEGIN chat,. /**------+ nicelyTurnCred/) Base Question sid!(button [\"itle 
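
As a rough illustration of how one of the defense controls above might be tried by hand, the sketch below prepends a control string to a user prompt. It assumes the control acts as a guard prefix on the user message, as the paper describes. The helper name apply_defense_control, the single-space separator, and the placeholder string are hypothetical and are not taken from this repository; the evaluation scripts in eval_scripts remain the reference for how the control is actually inserted into each model's conversation template.

```python
def apply_defense_control(defense_control: str, user_prompt: str, sep: str = " ") -> str:
    """Prepend a defense control to a user prompt (guard-prefix style).

    Hypothetical helper for illustration only: the separator and the exact
    placement inside the chat template are assumptions, not the repository's code.
    """
    if not defense_control:
        # No control supplied: leave the prompt unchanged.
        return user_prompt
    return f"{defense_control}{sep}{user_prompt}"


# Placeholder value; paste one of the defense controls listed above verbatim.
DEFENSE_CONTROL = "<defense control string>"

guarded_prompt = apply_defense_control(DEFENSE_CONTROL, "How do I bake bread?")
print(guarded_prompt)
```

The resulting string would then be passed to the model as the user turn. Because the controls are sensitive to exact characters, copy them without trimming or re-escaping.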

Citation

If you find this useful in your research, please consider citing:

@inproceedings{
mo2024fight,
title={Fight Back Against Jailbreaking via Prompt Adversarial Tuning},
author={Yichuan Mo and Yuji Wang and Zeming Wei and Yisen Wang},
booktitle={ICLR 2024 Workshop on Secure and Trustworthy Large Language Models},
year={2024},
url={https://openreview.net/forum?id=q0PbfNwLBq}
}

License

PAT is licensed under the terms of the MIT license. See LICENSE for more details.

Acknowledgments

We thank the authors of "Universal and Transferable Adversarial Attacks on Aligned Language Models" for their work.
