This is the code repository accompanying the paper "Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models" that appeared at AIES 2023.
For the data, we used the Bias in Bios dataset, which can be generated with the code at https://github.com/microsoft/biosbias. Place the resulting BIOS.pkl file in data/cache/.
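Before training, it can help to confirm that the file is where the code expects it and that it unpickles cleanly. The following is a minimal sketch, not part of this repository: the helper name describe_pickle is hypothetical, and it makes no assumption about the internal structure of BIOS.pkl beyond it being a standard pickle file.

```python
import pickle
from pathlib import Path

# Location named in this README; adjust if your checkout differs.
BIOS_PATH = Path("data/cache/BIOS.pkl")


def describe_pickle(path):
    """Load a pickle file and return (type name, length or None).

    Hypothetical helper for a quick sanity check. Only unpickle files
    you generated yourself or otherwise trust.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


if BIOS_PATH.exists():
    kind, size = describe_pickle(BIOS_PATH)
    print(f"{BIOS_PATH}: {kind} with {size} entries")
else:
    print(f"{BIOS_PATH} not found; generate it with the biosbias repository first")
```

Run it from the repository root so that the relative path resolves.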
Training is configured with Hydra, so you can change any setting by overriding config values on the command line. For example, to train the self-destructing model with the default parameters, run:
ID=$(python -c "import random; chars='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890'; print(''.join([random.choice(chars) for _ in range(7)]))")
python -m train experiment=bios l_bad_adapted=1.0 l_linear_mi=1.0 l_bad_adapted_grad=0.0 max_adapt_steps=16 batch_hash=$ID
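The shell one-liner above packs the run ID generation into a single line. As a more readable sketch of the same idea, the function below draws 7 characters from the same set of letters and digits. The name make_batch_hash is hypothetical and does not exist in this repository.

```python
import random
import string


def make_batch_hash(length=7, rng=random):
    """Return a random alphanumeric string of the given length.

    Hypothetical helper mirroring the shell one-liner: it samples with
    replacement from a-z, A-Z and 0-9.
    """
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


print(make_batch_hash())
```

Passing a seeded random.Random instance as rng makes the ID reproducible, which may be convenient when re-running a batch under the same batch_hash.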
To run evaluation, pass the checkpoint directory of the trained self-destructing model via +eval_loaded_model_dir and run something along these lines:
python train.py -m hydra/launcher=RUN experiment=bios eval_only=True eval_network_type=loaded adversary.n_examples=20,50,100,200 seed=0,1,2,3,4,5 +eval_loaded_model_dir=./H63mlVF/regression__1.0__0.0__1.0__4__2022-05-19_11-26-58__15485531/ eval_only_bad=True batch_hash=$ID
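The comma-separated values in the command above are a Hydra multirun sweep. Assuming Hydra's default basic sweeper, each combination of adversary.n_examples and seed becomes its own job. The sketch below only enumerates those combinations to show how many runs to expect; it does not call Hydra, and expand_sweep is a hypothetical name.

```python
from itertools import product

# Sweep values copied from the eval command above.
sweep = {
    "adversary.n_examples": [20, 50, 100, 200],
    "seed": [0, 1, 2, 3, 4, 5],
}


def expand_sweep(sweep):
    """Return one list of key=value overrides per job (Cartesian product)."""
    keys = list(sweep)
    return [
        [f"{k}={v}" for k, v in zip(keys, values)]
        for values in product(*(sweep[k] for k in keys))
    ]


jobs = expand_sweep(sweep)
print(len(jobs))  # 24 jobs: 4 values of n_examples times 6 seeds
print(jobs[0])    # ['adversary.n_examples=20', 'seed=0']
```

All other overrides in the command, such as eval_only=True, are shared by every job.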
The plots used in the paper can be generated by running the command below; modify the config in the code to produce the different plots.
python aggregate_results.py
If you use this code, please cite the following paper:
@inproceedings{hendersonmitchell2023selfdestructing,
title={Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models},
author={Henderson\*, Peter and Mitchell\*, Eric and Manning, Christopher D. and Jurafsky, Dan and Finn, Chelsea},
booktitle={Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society},
pages={forthcoming},
year={2023}
}