How can we align powerful AI systems with human values?
Category: Computer Science
Status: Queued
As AI systems become more capable, keeping their objectives aligned with human intentions is an open research problem with no settled solution. The 'alignment problem' covers reward specification, deceptive optimisation, scalable oversight and interpretability.
Active areas include reinforcement learning from human feedback (RLHF) and its successors, mechanistic interpretability of neural networks, formal verification of learned policies, and constitutional methods. No proposal has yet been shown to scale to systems significantly more capable than current large language models (LLMs).
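To make the reward-specification side of RLHF concrete, here is a minimal sketch of the pairwise preference loss commonly used to train a reward model from human comparisons (the Bradley-Terry formulation). The function name and the scalar reward inputs are illustrative assumptions, not taken from any particular system; a real reward model would produce these scores from a neural network.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Hypothetical helper: under a Bradley-Terry model,
    P(chosen preferred) = sigmoid(reward_chosen - reward_rejected),
    so the loss is -log(sigmoid(margin)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow
    # when the margin is a large negative number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model agrees with the human label
# and large when it ranks the rejected response higher.
agree = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagree = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agree < disagree)
```

The loss only constrains the difference between the two scores, which is one reason a learned reward can drift from what annotators actually intended.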
Sources
Runs
No runs yet — this question is queued.