AI Safety Cheatsheet

This is intended to be a compilation of the major AI safety ideas, problems, and proposed solutions. To keep it readable, we link to the content instead of reproducing it. Contributions are welcome!

Alignment Problems

Forecasting

Optimization issues

  • Inner alignment [1] [2]
    • Out of distribution alignment / goal misgeneralization [1] [2] [3]
  • Outer alignment and Mesa-optimizers [1] [2] [3]

Human-like behaviors

  • Instrumental convergence [1] [2] [3] [4]
    • Self-preservation
    • Goal-content integrity
    • Cognitive enhancement
    • Resource acquisition
    • Power/Influence acquisition [1]
  • Specification gaming [1] [2]
    • Goodhart's law [1] and Goodhart's curse [2]
  • Deception. This is the optimal behavior for a misaligned mesa-optimizer. [1]
  • Nearest unblocked strategy [1]
  • Collaboration with other AIs
  • Sycophant AI [1]

Alien behaviors

  • Orthogonality Thesis [1] [2] [3] [4]
  • Strawberry problem [1] [2]
  • Paperclip maximizing [1] [2]
  • Learning the wrong distribution [1] [2]
  • High impact [1]
  • Edge instantiation [1]
  • Context disaster [1]

Deployment Problems

  • Alignment tax / safety tax [1]
  • Collingridge dilemma [1]
  • Corrigibility [1] [2]
  • Humans are not secure [1]
  • AI-Box [1] [2]
  • We need to get alignment right on the 'first critical try' [1]
  • Shutdown problem [1]

Governance

Overview [1] [2] [3]

  • Robust totalitarianism [1] [2]
  • Extreme first-strike advantages [1] [2]
  • Misuse Risks [1]
  • Value Erosion through Competition [1] [2]
  • Windfall clause [1]
  • Compute governance [1]
  • Risks from malevolent actors [1]

Models for thinking about AGI agents

  • Human-anchors [1]
  • Bio-anchors [1]
  • A super-smart, deceptive, manipulative psychopath with arbitrary (and possibly absurd) goals.
  • A computer program that simply does what it’s programmed to do. Being super capable does not mean that it is wise, moral, or smart, or that it cares about what humans want.

Approaches to AI safety

  • Eliciting latent knowledge (Paul Christiano) (Alignment Research Center) [1]
  • Agent foundations (MIRI) [1]
  • Brain-like design [1]
  • Iterated Distillation and Amplification [1]
  • Humans Consulting Humans (Christiano) [1]
  • Learning from Humans [1] [2] [3]
  • Reward modeling (DeepMind) [1]
  • Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstration [1]
  • Imitation learning [1]
  • Myopic reinforcement learning [1]
  • Inverse reinforcement learning [1]
  • Cooperative inverse reinforcement learning [1]
  • Debate [1] [2]
  • Capability control methods
  • Transparency / Interpretability
    • Understandability principle [1]
    • Effability [2]

Definitions

General Intelligence

  • "General Intelligence or Universal Intelligence is the ability to efficiently achieve goals in a wide range of domains". (This is a commonly held definition) [1] [2]
  • "Intelligence is the ability to make models. General intelligence means that a sufficiently large computational substrate can be fitted to an arbitrary computable function, within the limits of that substrate." (Joscha Bach) [1]

Alignment

  • "AI that is trying to do what you want it to do". (Paul Christiano) [1]
  • "AI systems be designed with the sole objective of maximizing the realization of human preferences" (Stuart Russell) [1]
  • "AI should be designed to align with our ‘coherent extrapolated volition’ (CEV) [1]. CEV represents an integrated version of what we would want ‘if we knew more, thought faster, were more the people we wished we were, and had grown up farther together’" (Eliezer Yudkowsky) [1]

Meta Resources

About Cheatsheet

Contributions are welcome! Please open a merge request and I will do my best to approve it quickly.

Contributors

jakobovski
