AI Alignment

Alignment research aims to line up three different descriptions of an AI system:

  1. Intended goals (‘wishes’): “the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator”;
  2. Specified goals (or ‘outer specification’): The goals we actually specify — typically jointly through an objective function and a dataset;
  3. Emergent goals (or ‘inner specification’): The goals the AI actually advances.
  • ‘Outer misalignment’ is a mismatch between the intended goals (1) and the specified goals (2)
  • ‘inner misalignment’ is a mismatch between the human-specified goals (2) and the AI’s emergent goals (3)

Inner misalignment is often explained by analogy to biological evolution. In the ancestral environment, evolution selected human genes for inclusive genetic fitness, but humans evolved to have other objectives.