Jan Leike on X: The Evolution of AI Alignment Research Over a Decade

TL;DR · AI Summary
Jan Leike reflects on how AI alignment research has changed over the past decade: from a field of about a dozen people working on it as a side gig, with no clear approach, to one made practical by RLHF on LLMs, with progress on scalable oversight, a constitution for Claude, and increasingly automated research.
Key Takeaways
- More than 10 years ago, only about a dozen people worked on alignment, mostly as a side gig.
- RLHF on LLMs made alignment work a lot more practical.
- Alignment research is increasingly automated, and Claude now has a constitution.
Outline
The author notes that more than ten years ago, nobody knew how AGI would be built or how to make it safe.
At that time, only about a dozen people worked on alignment, as a side gig, and there was no consensus on how to approach the problem.
RLHF on LLMs made alignment work a lot more practical, and the world has increasingly recognized the importance of alignment.
Today there is substantial progress on evaluating, investigating, steering, and fixing behavioral issues; Claude has a constitution; and alignment research is increasingly automated.
Mindmap
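- The evolution of AI alignment research
  - Early stage (more than 10 years ago)
    - Very few people (about a dozen)
    - No clear methodology
  - Mid-period breakthrough (late 2010s to early 2020s)
    - RLHF matures
    - More opportunities for experiments
  - Current trends (2020s to present)
    - Automated alignment research
    - Safety mechanisms built into models (e.g., Claude's constitution)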
Highlights
When I started to work on the alignment problem more than 10 years ago, we had no idea how AGI was going to be built or how to make it safe.
The field had maybe a dozen people who were working on it as a side gig. Everyone was pretty confused about how to approach the problem.
RLHF on LLMs made it a lot more practical. We've made a ton of progress on evaluating, investigating, steering and fixing behavioral issues.

When I started to work on the alignment problem more than 10 years ago, we had no idea how AGI was going to be built or how to make it safe. The field had maybe a dozen people who were working on it as a side gig. Everyone was pretty confused about how to approach the problem, and the number of people willing to run experiments with deep learning was tiny. So much has changed since then! The world woke up not just to AGI but also increasingly to the importance of alignment. RLHF on LLMs made it a lot more practical. We've made a ton of progress on evaluating, investigating, steering and fixing behavioral issues. Claude now has a constitution and we made some good progress on scalable oversight. More and more of our alignment research is getting automated.