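#539. Building AlphaGo by Hand: A Former DeepMind Scientist Breaks Down the Core Principles of AI Go and the Far-Reaching Lessons for LLM Reinforcement Learning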

跨国串门儿计划

跨国串门儿计划 · Podcast · May 17, 2026 · 2:09:53

Rebuilding AlphaGo: A Deep Dive into AI Go Core Principles and Implications for LLMs

TL;DR · AI Summary

AlphaGo uses MCTS and neural networks to achieve efficient search, showcasing the potential of reinforcement learning.

Key Takeaways

AlphaGo combines MCTS with neural networks for efficient search with clear supervisory targets.
Training guided by MCTS never starts from a 0% success rate, so it sidesteps the exploration problem that RL otherwise faces.
Go serves as a training ground for AI scientists, enabling low-cost validation of hypotheses.

Outline

Introduction: AlphaGo combines MCTS with neural networks to search efficiently, showing what reinforcement learning can do.
Core Mechanism: MCTS balances exploration using UCB and PUCT, and the policy network prunes the search effectively.
Search & Learning: MCTS never starts from a 0% success rate and provides clear supervisory targets.
Information Efficiency: Supervised learning provides more information per sample than RL, which is why knowledge distillation is effective.
Computational Cost & Research: Falling compute costs make it possible to rebuild AlphaGo cheaply, while pioneers pay a significant cost premium.
Go as a Research Tool: Go serves as a training ground for AI scientists, letting them validate hypotheses and experiments quickly.

Mindmap

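AlphaGo Core Principles
- MCTS mechanism
  - UCB/PUCT algorithms
- Neural networks and search
  - Policy network
- Information efficiency comparison
  - Supervised learning vs. RL
- Compute and research
  - Rebuilding AlphaGo on low compute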

Highlights

MCTS never starts from a 0% success rate and provides clear supervisory targets.
Supervised learning provides more information per sample than RL, which is why distillation is effective.
Pioneers must pay a high cost premium for being first.

Chapters

Opening & Podcast Introduction
Why AlphaGo Is So Fascinating: Using One Neural Network to Amortize a Nearly Intractable Search
Go Rules Crash Course: From Capturing Stones to Tromp-Taylor Scoring
Search Trees and Combinatorial Explosion: 361^300, More Than the Number of Atoms in the Universe
UCB and PUCT: Deciding Which Path to Explore While Building the Tree
Enter the Value Function: AI Can Also Have the Human Intuition of Calling a Game at a Glance
Policy Network: Guess Where to Search First, Pruning Heavily
The Four Steps of MCTS: Selection, Expansion, Evaluation, Backup
Architecture Choice: Why ResNet Still Beats Transformer on a Small Budget
The Magic of Initialization: First Teach the Model What Good Moves Look Like from Human Game Records
The Self-Play Loop: Letting Search Feed Back into the Network for Policy Iteration
MCTS as an Improvement Operator: It Always Gives You a Better Answer Than the Current Policy

Transcript

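Opening & Podcast Introduction

Why AlphaGo Is So Fascinating: Using One Neural Network to Amortize a Nearly Intractable Search

Go Rules Crash Course: From Capturing Stones to Tromp-Taylor Scoring

Search Trees and Combinatorial Explosion: 361^300, More Than the Number of Atoms in the Universe

UCB and PUCT: Deciding Which Path to Explore While Building the Tree

Enter the Value Function: AI Can Also Have the Human Intuition of Calling a Game at a Glance

Policy Network: Guess Where to Search First, Pruning Heavily

The Four Steps of MCTS: Selection, Expansion, Evaluation, Backup

Architecture Choice: Why ResNet Still Beats Transformer on a Small Budget

The Magic of Initialization: First Teach the Model What Good Moves Look Like from Human Game Records

The Self-Play Loop: Letting Search Feed Back into the Network for Policy Iteration

MCTS as an Improvement Operator: It Always Gives You a Better Answer Than the Current Policy

Knowledge Distillation: Internalizing Thousands of Search Steps into a Single Forward Pass

Value Function Training Tricks: Small-Board Pretraining and the Importance of End-of-Game Labels

A Profound Shock: How a 10-Layer Neural Network Amortizes an NP-Hard Problem

Comparison with LLM RL: Why Variance Explodes, and the "Sucking Signal Through a Straw" Dilemma

Can MCTS Be Used Directly for LLM Reasoning? Challenges of Breadth, Depth, and Action Space

Compute Scaling Firsthand: From Tens of Millions of Dollars to a Few Thousand, AlphaGo Got Cheap

Off-Policy Training and the Replay Buffer: How to Reuse Old Data

An Information-Theory View: Supervised Learning Carries Far More Bits per Sample Than RL, and Why Soft Labels Matter

Go as an Incubator for AI Scientists: Validating Research Intuitions with an Outer Loop

Research Taste and Verifiability: How to Design the Right RL Environment

Closing & Resource Recommendations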

#AI #Reinforcement Learning #Go #Neural Networks #Search Algorithms

Show notes

📝 Podcast Introduction

This episode is a voice-cloned Chinese version of an in-depth interview from the popular tech podcast *Dwarkesh Patel Podcast*, titled "What rebuilding AlphaGo teaches us about self-play, RL, and future of LLMs - Eric Jang". Host Dwarkesh Patel takes a deep technical dive with guest Eric Jang, who reconstructs the core ideas of AlphaGo from scratch using modern open-source tools and a minimal budget.

Eric Jang previously served as VP of AI at 1X Technologies and was a senior research scientist at Google DeepMind Robotics. While on leave, he independently reconstructed and improved AlphaGo and wrote a detailed technical tutorial that drew significant community attention. He is known for his insights into AlphaGo's core mechanisms and for his early thinking on automated AI research.

⏱️ Timestamps

00:00 Opening & Podcast Introduction

Understanding AlphaGo from Scratch

02:05 Why AlphaGo Is So Fascinating: One Neural Network That Amortizes a Nearly Intractable Search

03:43 Go Rules Crash Course: From Capturing Stones to Tromp-Taylor Scoring

08:38 Search Trees and Combinatorial Explosion: 361^300, Larger Than the Number of Atoms in the Universe

Monte Carlo Tree Search (MCTS) Core Principles

11:16 UCB and PUCT: How to Explore While Building the Tree

15:59 Enter the Value Function: AI Can Also Have the Human Intuition of Calling a Game at a Glance

21:02 Policy Network: Guessing Where to Search First, Pruning Heavily

Neural Networks and Search: A Perfect Match

24:54 The Four Steps of MCTS: Selection, Expansion, Evaluation, Backup

27:28 Architecture Choice: Why ResNet Still Outperforms Transformer on a Small Budget

34:23 The Magic of Initialization: Teaching the Model What Good Moves Look Like from Human Game Records

42:21 The Self-Play Loop: Letting Search Feed Back into the Network for Policy Iteration

The Elegance and Cruelty of Reinforcement Learning

47:41 MCTS as an Improvement Operator: Always Giving a Better Answer Than the Current Policy

52:00 Knowledge Distillation: Internalizing Thousands of Search Steps into a Single Forward Pass

57:04 Value Function Training Tricks: Small-Board Pretraining and End-of-Game Labels

01:03:01 A Profound Shock: How a 10-Layer Neural Network Amortizes an NP-Hard Problem

01:11:35 Comparison with LLM RL: Why Variance Explodes, and the "Sucking Signal Through a Straw" Dilemma

01:22:21 Can MCTS Be Used Directly for LLM Reasoning? Challenges of Breadth, Depth, and Action Space

Computational Efficiency and Automated Research

01:28:41 Compute Scaling Firsthand: From Tens of Millions of Dollars to a Few Thousand, AlphaGo Got Cheap

01:38:08 Off-Policy Training and the Replay Buffer: How to Reuse Old Data

01:47:04 An Information-Theory View: Supervised Learning Carries Far More Bits per Sample Than RL, and Why Soft Labels Matter

01:55:36 Go as an AI Scientist's Training Ground: Using an Outer Loop to Validate Research Intuitions

02:05:12 Research Taste and Verifiability: How to Design a Correct RL Environment

02:08:03 Closing & Resource Recommendations
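
The 361^300 figure from the 08:38 chapter can be checked with a few lines of arithmetic. The sketch below assumes a 19x19 board, a game length of 300 moves, and the commonly cited estimate of about 10^80 atoms in the observable universe; the function and constant names are illustrative only. It ignores captures and illegal moves, so it is a rough upper bound on move sequences, not an exact count.

```python
import math

BOARD_POINTS = 19 * 19   # 361 intersections on a standard board
GAME_LENGTH = 300        # moves per game in the rough estimate
ATOMS_LOG10 = 80         # assumed order of magnitude for atoms in the observable universe


def log10_move_sequences(points: int, moves: int) -> float:
    """Return log10(points ** moves), the size of the naive game tree."""
    return moves * math.log10(points)


exponent = log10_move_sequences(BOARD_POINTS, GAME_LENGTH)
print(f"361^300 is about 10^{exponent:.0f}, versus about 10^{ATOMS_LOG10} atoms")
```

Working in logarithms avoids building a 768-digit integer and makes the gap obvious: the exponent is nearly ten times larger than the one for the atom count.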

🌟 Highlights

💡 A 10-Layer Network That Amortizes an NP-Hard Problem

Eric points out that AlphaGo's most profound contribution is not the game itself but a conceptual breakthrough: a neural network of just 10 layers can, in a single forward pass, approximate the result of an extremely deep search with high accuracy. This suggests that macroscopic structure in a problem can undercut our traditional intuitions about computational complexity, much as we see in models like AlphaFold.

"This is a breakthrough, and I think most people today have not fully grasped how profound it is."

🛠️ The Elegance of MCTS: Never Starting from 0%

Unlike the naive gradient-based methods used for today's LLMs, AlphaGo's MCTS always produces an improved policy label for the current state. The learning process therefore never gets stuck in a "zero signal" desert: every step has a clear supervisory target, which leads to exceptional sample efficiency and stability.

"What makes AlphaGo elegant is that you never have to start from a 0% success rate, and you never have to solve the exploration problem of how to get a nonzero success rate."
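
The idea of search producing an improved label can be sketched in a few lines. This is a minimal illustration, not Eric's implementation: it uses the PUCT rule in the form popularized by AlphaGo Zero (Q plus a prior-weighted exploration bonus), and the function names, the `c_puct` value, and the `+ 1` under the square root are assumptions made here. The normalized visit counts at the root become the training target for the policy network.

```python
import math


def puct_scores(priors, value_sums, visits, c_puct=1.5):
    """Score each child as Q + U, where U favors high-prior, rarely visited moves."""
    total_visits = sum(visits)
    scores = []
    for p, w, n in zip(priors, value_sums, visits):
        q = w / n if n > 0 else 0.0  # mean value backed up through this child
        # The +1 under the root is an implementation choice made here so that
        # priors still break ties before any child has been visited.
        u = c_puct * p * math.sqrt(total_visits + 1) / (1 + n)
        scores.append(q + u)
    return scores


def select_child(priors, value_sums, visits, c_puct=1.5):
    """Return the index of the child with the highest PUCT score."""
    scores = puct_scores(priors, value_sums, visits, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)


def policy_target(visits):
    """Normalize root visit counts into a soft policy label (needs >= 1 visit)."""
    total = sum(visits)
    return [n / total for n in visits]


priors = [0.6, 0.3, 0.1]  # hypothetical policy-network output for three moves
first = select_child(priors, [0.0, 0.0, 0.0], [0, 0, 0])
after_bad_results = select_child(priors, [-5.0, 0.0, 0.0], [10, 0, 0])
target = policy_target([30, 60, 10])
print(first, after_bad_results, target)
```

In this toy run the search first follows the prior, then shifts to the second move once the first has backed up poor values, and the final visit counts give a target that disagrees with the prior. That disagreement is the supervised signal: the network is trained toward what the search found, so there is always a nonzero label to learn from.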

🚀 Supervised Learning Wins on Information Efficiency

Eric and Dwarkesh compare supervised learning and RL from an information-theory perspective. When the pass rate is low, RL provides very few bits of learning signal per sample, whereas supervised learning with soft labels (the entire probability distribution) provides far more. This explains why distillation is so effective: the MCTS visit-count distribution serves as a soft target that passes on far more "hidden knowledge" than a single action label.

"In a soft label, the amount of information per sample, measured in bits, is far greater. That is why distillation works so well."
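
One rough way to put numbers on this comparison, offered as a sketch rather than the accounting used in the episode: a binary pass/fail reward can carry at most the entropy of a coin with the current pass rate, while a single hard move label on a 19x19 board can carry up to log2(362) bits (361 points plus pass). The function names below are illustrative.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Entropy in bits of a pass/fail outcome that succeeds with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def hard_label_bits(num_actions: int) -> float:
    """Upper bound in bits for one label chosen among num_actions options."""
    return math.log2(num_actions)


NUM_MOVES = 19 * 19 + 1  # 361 board points plus the pass move

for pass_rate in (0.5, 0.1, 0.01):
    bits = binary_entropy_bits(pass_rate)
    print(f"pass rate {pass_rate}: at most {bits:.3f} bits per episode")
print(f"one hard move label: at most {hard_label_bits(NUM_MOVES):.2f} bits")
```

The binary signal peaks at one bit when the pass rate is 50% and shrinks toward zero as the pass rate falls, and it arrives once per episode rather than once per move. A soft label goes further than the hard-label bound because it specifies a probability for every move, including how the alternatives rank against each other.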

⚖️ Being First Always Costs the Most Compute

Eric shares his experience of reconstructing AlphaGo with just $10,000 of compute, in contrast to the millions of dollars and custom TPU clusters DeepMind used. He emphasizes, “The computing power needed to become the first to do something is always much greater than the power required to catch up later.” The same principle holds in the age of LLMs, where pioneers pay a significant premium for exploring unknown territory.

🧪 Go as an AI Scientist's Training Ground

Eric is currently building a Go environment as an "outer loop" for training automated AI research agents. Because results in Go can be verified quickly and win/loss outcomes are unambiguous, it offers a low-cost way to test whether agents can propose hypotheses, design experiments, and explain results, with potential applications to more complex scientific discovery tasks.

"One of my motivations for building this Go environment is that Go carries a lot of very interesting research questions, and verification is fast."

🌐 Additional Podcast Information

Translated from the *Dwarkesh Patel Podcast*.

This episode uses AI voice cloning to render the voices of the original host and guest in Chinese, so pronunciation may differ slightly from the originals.

The translation was produced by AI, so some passages may read awkwardly.

If you'd like to listen to other foreign podcasts in Chinese in the future, please contact us via WeChat: iEvenight.
