Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"

License: MIT License

Weak-to-Strong Deception

This repository contains the code and data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization" [pdf].

The concepts studied in our paper.

Introduction

As LLMs ultimately become superhuman models, it is crucial and urgent to study whether such models, trained under humans' weak supervision, can demonstrate their full potential and, most importantly, still align well with human values. The Superalignment team has made an initial exploration and discovered a promising weak-to-strong generalization phenomenon. However, we are concerned about a potential safety issue that we call weak-to-strong deception: the strong model behaves in a well-aligned way in areas known to the weak supervisor but produces misaligned behaviors in cases beyond the weak supervisor's understanding.

Many situations could cause weak-to-strong deception. We take a preliminary look at a specific but realistic one: the multi-objective alignment scenario, where some alignment goals may conflict with each other. In such a case, the strong student may deceive the weak supervisor in one alignment dimension to gain a high reward in another.
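To make the idea concrete, here is a minimal Python sketch of how one might compare a student's behavior inside and outside the weak supervisor's knowledge. It is an illustration only, not the metric or code used in the paper: the `Example` fields, the notion of a model "knowing" an example, and the function names are all assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Example:
    # All three flags are hypothetical labels assumed for this sketch.
    weak_knows: bool       # the weak supervisor handles this example correctly
    strong_knows: bool     # the strong model is capable of handling it correctly
    student_correct: bool  # the weak-to-strong trained student handles it correctly


def visible_region(examples: Iterable[Example]) -> List[Example]:
    """Examples the weak supervisor understands, so it can check the student."""
    return [e for e in examples if e.weak_knows]


def blind_spot_region(examples: Iterable[Example]) -> List[Example]:
    """Examples the strong model knows but the weak supervisor does not."""
    return [e for e in examples if e.strong_knows and not e.weak_knows]


def error_rate(examples: Iterable[Example]) -> float:
    """Fraction of examples the student gets wrong (0.0 for an empty region)."""
    items = list(examples)
    if not items:
        return 0.0
    return sum(not e.student_correct for e in items) / len(items)


if __name__ == "__main__":
    data = [
        Example(weak_knows=True, strong_knows=True, student_correct=True),
        Example(weak_knows=True, strong_knows=True, student_correct=True),
        Example(weak_knows=False, strong_knows=True, student_correct=False),
        Example(weak_knows=False, strong_knows=True, student_correct=True),
        Example(weak_knows=False, strong_knows=False, student_correct=False),
    ]
    print("error rate where the supervisor can see:", error_rate(visible_region(data)))
    print("error rate in the supervisor's blind spot:", error_rate(blind_spot_region(data)))
```

Under these assumptions, a student that is reliable in the visible region but makes avoidable errors in the blind spot (cases the strong model is capable of getting right) shows the pattern described above.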

We conduct experiments on both the reward modeling task and the preference optimization scenario (with DPO and SimPO). The code for our weak-to-strong deception experiments is in the weak-to-strong directory.

Acknowledgement

Our code is mainly based on the original weak-to-strong repo provided by the Superalignment team. We greatly appreciate their open-sourcing! For the DPO and SimPO experiments, our implementation is mainly based on the official DPO repo, an unofficial DPO repo, and the official SimPO repo. We thank the authors for open-sourcing their code!

Citation

If you find this repo helpful, please cite our work as:

@article{yang2024super,
  title={Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization},
  author={Yang, Wenkai and Shen, Shiqi and Shen, Guangyao and Gong, Zhi and Lin, Yankai},
  journal={arXiv preprint arXiv:2406.11431},
  year={2024}
}
