How can we align powerful AI systems with human values?

Status: Queued

As AI systems become more capable, keeping their objectives aligned with human intentions remains an open research problem with no settled solution. The 'alignment problem' covers reward specification, deceptive optimisation, scalable oversight, and interpretability.

Active areas include reinforcement learning from human feedback (RLHF) and its successors, mechanistic interpretability of neural networks, formal verification of learned policies, and constitutional methods. No proposal has yet been shown to scale to systems significantly more capable than current large language models (LLMs).

Sources

Wikipedia: AI alignment

Runs

No runs yet — this question is queued.