AI alignment is the research field and engineering discipline concerned with ensuring that AI systems pursue the goals their developers and users actually intend. It addresses failure modes such as misinterpreting goals, gaming reward functions, or developing unintended behaviors. The work spans current techniques (reinforcement learning from human feedback (RLHF), Constitutional AI, evaluation against intended behaviors) and fundamental research into how to align increasingly capable systems whose internal reasoning may be opaque. It's a subset of AI safety focused specifically on the goal-correctness problem.
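To make "gaming reward functions" concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than something from a real system: the cleaning scenario, the action names, the numbers, and the helper functions `proxy_reward`, `intended_reward`, and `best_action` are all hypothetical. It shows how an optimizer that scores actions only by what the reward function can observe may prefer a different action than the one the developer intended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    visible_mess: int  # what the reward function can observe
    actual_mess: int   # what the developer actually cares about
    effort: int        # cost the agent pays for taking the action


# Hypothetical action set for a toy "cleaning agent"; all values are made up.
ACTIONS = {
    "clean_room": Outcome(visible_mess=0, actual_mess=0, effort=5),
    "hide_mess_under_rug": Outcome(visible_mess=0, actual_mess=10, effort=1),
    "do_nothing": Outcome(visible_mess=10, actual_mess=10, effort=0),
}


def proxy_reward(outcome: Outcome) -> int:
    """Reward as specified: penalizes only the mess that is visible."""
    return -outcome.visible_mess - outcome.effort


def intended_reward(outcome: Outcome) -> int:
    """Reward as intended: penalizes the mess that actually remains."""
    return -outcome.actual_mess - outcome.effort


def best_action(reward_fn) -> str:
    """Return the name of the action that maximizes the given reward."""
    return max(ACTIONS, key=lambda name: reward_fn(ACTIONS[name]))


if __name__ == "__main__":
    print("optimizing the proxy picks:   ", best_action(proxy_reward))
    print("optimizing the intent picks:  ", best_action(intended_reward))
```

In this toy setup the two reward functions agree on every action except the one that hides the mess, and that single gap is enough for the proxy optimizer to diverge from the intended behavior.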
The alignment problem:
How do you ensure an AI system pursues what you want rather than something else? It sounds simple but is technically...