CipherChat 🔐

A novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages – ciphers (Demo).

LOVE💗 and Peace🌊

RESEARCH USE ONLY✅ NO MISUSE❌

🛠️ Usage

✨An example run:

python3 main.py \
 --model_name gpt-4-0613 \
--data_path data/data_en_zh.dict \
--encode_method caesar \
--instruction_type Crimes_And_Illegal_Activities \
--demonstration_toxicity toxic \
--language en

🔧 Argument Specification

--model_name: The name of the model to evaluate.
--data_path: Select the data to run.
--encode_method: Select the cipher to use.
--instruction_type: Select the domain of data.
--demonstration_toxicity: Select the toxic or safe demonstrations.
--demonstration_toxicity: Select the language of the data.

💡Framework

Our approach presumes that since human feedback and safety alignments are presented in natural language, using a human-unreadable cipher can potentially bypass the safety alignments effectively. Intuitively, we first teach the LLM to comprehend the cipher clearly by designating the LLM as a cipher expert, and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into a cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the LLMs. We finally employ a rule-based decrypter to convert the model output from a cipher format into the natural language form.

📃Our Results

The query-responses pairs in our experiments are all stored in the form of a list in the "experimental_results" folder, and torch.load() can be used to load data.

🌰Case Study

🫠Ablation Study

🦙Other Models

👉 Paper and Citation

For more details, please refer to our paper here.

Citation

If you find our paper&tool interesting and useful, please feel free to give us a star and cite us through:

@misc{yuan2023cipherchat,
      title={GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher}, 
      author={Youliang Yuan and Wenxiang Jiao and Wenxuan Wang and Jen-tse Huang and Pinjia He and Shuming Shi and Zhaopeng Tu},
      year={2023},
      eprint={2308.06463},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

singl3 / cipherchat Goto Github PK

cipherchat's Introduction

CipherChat 🔐

LOVE💗 and Peace🌊

RESEARCH USE ONLY✅ NO MISUSE❌

🛠️ Usage

🔧 Argument Specification

💡Framework

📃Our Results

🌰Case Study

🫠Ablation Study

🦙Other Models

👉 Paper and Citation

Citation

cipherchat's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent