#569. Deep Dive into xAI: Building Grok Imagine in 3 Months, Video Generation, World Models, and Video Agents

TL;DR · AI Summary
A former Nvidia researcher explains how xAI built Grok Imagine in three months, revealing the training pipeline of video generation models, the definition of world models, and the future trends of Video Agents.
Key Takeaways
- xAI built Grok Imagine 0.9 from scratch within three months, thanks to talent de
- The progress of video models largely comes from advancements in language models,
- World Models are defined as 'real-time, interactive, long-term videos,' and futu
Outline
Introduces guest Ethan He's background and an overview of the program's content.
Analyzes the key factors behind xAI building Grok Imagine in three months.
Details the complete training process from data acquisition to diffusion Transformer.
Elaborates on the definition of world models and their applications in interactive systems.
Discusses audio processing challenges and audio-video alignment issues.
Mindmap
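- Inside xAI's technology
- Developing Grok Imagine
- How it was built in three months
- Training video models
- Data and architecture
- World models
- Real-time interactivity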
Highlights
The truth about model quality improvements: many breakthroughs come from small bugs in data and training pipelines.
Many advances in video models actually come from language models, not video diffusion itself.
World models are 'real-time, interactive, long-duration video'.
Chapters
Opening & podcast overview
Guest introduction: how Ethan He connected with the Latent Space community
Why he left Nvidia: video models also follow scaling laws and need more compute
Starting from scratch at xAI: building Grok Imagine 0.9 in three months
The secret of rapid iteration: talent, infra, compute, and low communication overhead
The truth about model quality improvements: many breakthroughs come from small bugs in the data and training pipelines
How coding models change the pace of research: code gets written faster, and compute becomes the bottleneck again
High-pressure R&D culture: compute is expensive, but this is a marathon
Why image models usually come before video models
Where the data comes from: detailed human annotation and VLM-generated synthetic captions
Why training video models needs both paired and unlabeled data
VAE / tokenizer: why you cannot train directly on pixels
Transcript
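Opening & podcast overview
Guest introduction: how Ethan He connected with the Latent Space community
Why he left Nvidia: video models also follow scaling laws and need more compute
Starting from scratch at xAI: building Grok Imagine 0.9 in three months
The secret of rapid iteration: talent, infra, compute, and low communication overhead
The truth about model quality improvements: many breakthroughs come from small bugs in the data and training pipelines
How coding models change the pace of research: code gets written faster, and compute becomes the bottleneck again
High-pressure R&D culture: compute is expensive, but this is a marathon
Why image models usually come before video models
Where the data comes from: detailed human annotation and VLM-generated synthetic captions
Why training video models needs both paired and unlabeled data
VAE / tokenizer: why you cannot train directly on pixels
Diffusion transformer: generating images and video by denoising step by step from noise
How image models bootstrap video models: language and images are more densely connected
Video compression routes: per-frame compression vs. temporal compression
Why not train directly on MP4 tokens: the latent space must be model-friendly
The cost of real time: temporal compression saves context but adds response latency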
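Flipbook: exploring model-imagined web pages as if in a browser
Generative UI: straight from user intent to pixels, rather than writing code first and then rendering
Diffusion frontend, deterministic backend: how future interfaces might be restructured
The bandwidth of human-computer interaction: humans output by voice and take input through vision
NeuroOS: simulating operating systems and games with video models
From overfitting existing interfaces to imagining entirely new interaction systems
Why video models can generate supernatural content that is not in the training set
How expensive video models are: training costs approach those of a mid-sized LLM
Underestimated costs: video storage, feature storage, IO, and egress
Training scale: tens of trillions of visual tokens and tens of billions of active parameters
Inference-side acceleration: how step distillation turns a hundred steps into a few
How consistency models, GANs, and few-step generation relate
Grok Imagine 0.9: a large-scale joint audio-video generation model
Why audio is hard: speech is more discrete, music is more continuous
Audio-video alignment: the model must understand how sound and picture relate at every second
Sense of time: why LLMs do not truly perceive time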
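What a world model is: real-time, interactive, long-duration video
Interactivity: keyboard, mouse, and voice can all be input modalities
Real time: games need millisecond-level response, and digital humans need close to 200 milliseconds
Long duration: a world model cannot generate just a few seconds; it has to keep going for minutes or even hours
Video extension: the first step toward long-duration world models
The long-context challenge: a five-second video can already be fifty to sixty thousand tokens
Why users like video extension: it is an intermediate product on the way to the end goal
Redundancy in long videos: not all history needs to stay in context
Reference video: using characters, objects, and scenes as generation conditions
Why references are a kind of "cheating" and also an important mechanism
FramePack and dynamic context selection: the further from the present, the more compressed the information
A problem shared by LLMs and video models: context pruning still relies heavily on heuristics
A possible breakthrough for continual learning: letting models manage their own context
Lessons from human attention: not remembering everything, but dynamically pulling in relevant information
What is underrated about xAI: move fast, build, ambitious goals, and first principles
How to work backward from a three-month goal: breaking it down by data, training, human annotation, and GPU turnaround time
Elon Musk's working style: very hands-on, with direct feedback
Grok Voice: real-time voice experience, interruption handling, and in-car use
Generative video safety: watermarks, takedowns, and social platform governance
Limits of SynthID: once the paper is public, the watermark can also be reverse engineered
AI-generated content is getting harder to spot: from checking the fingers to checking whether the logic holds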
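Core claim: a large part of visual intelligence comes from language models
Prompt rewriter: the "brain" behind video models
Why video diffusion models are very "literal": if the user says "a cat", the model may generate just a cat that does not move
Why GPT Image-style models "think for a few minutes": the time goes to reasoning, rewriting the prompt, and organizing content
Different architectural routes: separate LLM + diffusion, omni models, and discrete image tokens
Generate, understand, regenerate: how an omni model might iteratively refine images
The prompt rewriter and the diffusion head are not the same thing, but in both cases the language side contributes intelligence
No joint training needed: rewriting the prompt alone can significantly improve visual quality
The Video Agent vision: calling tools, editing, and iterating like a human creator
Grok Imagine Agent beta: from video generation to a video creation workflow
Why "generate a one-minute video" is an agent task, not a single video-model task
From Copilot to Claude Code: video creation will also become agentic
Speed, thinking budget, and inference infra
The real value of Video Agents: not that models have peaked, but that harnesses and toolchains unlock new capabilities
AI models understand AI models better: future models will specialize in prompting and orchestrating generative models
Why deterministic tools still matter: subtitles, layout, and precise edits need not rely entirely on video models
Ethan's timing call: Video Agents will be a major trend by the end of the year
Production-grade video: once it is usable for ads and showcases, budgets will grow exponentially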
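World models need not serve only robots, but robots will naturally become tools that AI can call
Physical AI may not need to be solved in the real world first; strong video models may solve it first
Why he left xAI: to pursue research outside the company's priorities, especially on language models
The bottleneck for video models is shifting toward language models and agents
What to watch over the next year: models that perceive and manage their own context
Context awareness: a model should know when it is nearing its context limit
Context addition / removal / compaction: done by the harness today, possibly absorbed by the model in the future
Self-modifying harness: the model programs itself at test time, like a program
Career path: from vision research in the ResNet era to FAIR, Cosmos, MoE, and xAI
Why switching fields is not as hard as it seems: the principles of training large models are largely shared
Closing: there are many layers behind xAI that have yet to be fully explained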
Show notes
📝 Overview of This Podcast Episode
In this episode, we cloned: Latent Space: Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents—Ethan He
Original content update time: June 1, 2026
This episode is a high-density technical interview about video generation, world models, and Video Agents. The guest, Ethan He, worked on the Cosmos world model at Nvidia and later joined xAI, where he worked on Grok Imagine, joint audio-video generation, reference video, video extension, and world-model work built from scratch. In the episode, he recounts how xAI shipped Grok Imagine 0.9 in just three months, starting with no infrastructure, data, or models, and walks through the full training pipeline for video models: data, captions, VAEs, diffusion transformers, and distillation.
More importantly, Ethan offers several notable views. Many advances in video models actually come from language models rather than from video diffusion itself. In his view, a world model is "real-time, interactive, long-duration video". And future Video Agents will, like human creators, call video models, image editors, FFmpeg, and various deterministic tools to iteratively produce video that is genuinely usable for advertising, creative work, and production environments. This episode suits listeners who want to understand the technical roadmap of video generation, as well as those who want an early look at where AI interfaces, generative media, and Agents are heading.
👨💻 Guest of This Episode
Ethan He worked on the Cosmos world model and Megatron-LM MoE, among other projects, at Nvidia. He then joined xAI to work on Grok Imagine, video generation, joint audio-video generation, reference video, video extension, and world models. His research spans computer vision, self-supervised learning, large-scale MoE, video diffusion, world models, and LLM Agents.
⏱️ Timestamps
00:00 Opening & podcast overview
From Cosmos to xAI: creating Grok Imagine in three months
02:42 Guest introduction: how Ethan He connected with the Latent Space community
04:14 Why he left Nvidia: video models also follow scaling laws and need more compute
05:43 Starting from scratch at xAI: building Grok Imagine 0.9 in three months
06:15 The secret of rapid iteration: talent, infra, compute, and low communication overhead
08:23 The truth about model quality improvements: many breakthroughs come from small bugs in the data and training pipelines
08:37 How coding models change the pace of research: code gets written faster, and compute becomes the bottleneck again
09:54 High-pressure R&D culture: compute is expensive, but this is a marathon
How video models are trained
11:46 Why image models usually come before video models
12:50 Where the data comes from: detailed human annotation and VLM-generated synthetic captions
14:12 Why training video models needs both paired and unlabeled data
15:07 VAE / tokenizer: why you cannot train directly on pixels
17:08 Diffusion transformer: generating images and video by denoising step by step from noise
17:27 How image models bootstrap video models: language and images are more densely connected
18:24 Video compression routes: per-frame compression vs. temporal compression
18:55 Why not train directly on MP4 tokens: the latent space must be model-friendly
20:00 The cost of real time: temporal compression saves context but adds response latency
Early forms of generative UI and world models
20:51 Flipbook: exploring model-imagined web pages as if in a browser
22:31 Generative UI: straight from user intent to pixels, rather than writing code first and then rendering
24:09 Diffusion frontend, deterministic backend: how future interfaces might be restructured
25:15 The bandwidth of human-computer interaction: humans output by voice and take input through vision
26:15 NeuroOS: simulating operating systems and games with video models
27:52 From overfitting existing interfaces to imagining entirely new interaction systems
28:47 Why video models can generate supernatural content that is not in the training set
Video model costs, acceleration, and joint audio-video generation
31:05 How expensive video models are: training costs approach those of a mid-sized LLM
31:52 Underestimated costs: video storage, feature storage, IO, and egress
33:29 Training scale: tens of trillions of visual tokens and tens of billions of active parameters
34:16 Inference-side acceleration: how step distillation turns a hundred steps into a few
36:36 How consistency models, GANs, and few-step generation relate
37:48 Grok Imagine 0.9: a large-scale joint audio-video generation model
38:00 Why audio is hard: speech is more discrete, music is more continuous
40:25 Audio-video alignment: the model must understand how sound and picture relate at every second
41:20 Sense of time: why LLMs do not truly perceive time
Ethan's definition of world models
43:47 What a world model is: real-time, interactive, long-duration video
44:03 Interactivity: keyboard, mouse, and voice can all be input modalities
45:00 Real time: games need millisecond-level response, and digital humans need close to 200 milliseconds
46:00 Long duration: a world model cannot generate just a few seconds; it has to keep going for minutes or even hours
47:00 Video extension: the first step toward long-duration world models
48:00 The long-context challenge: a five-second video can already be fifty to sixty thousand tokens
49:03 Why users like video extension: it is an intermediate product on the way to the end goal
Reference video and dynamic context management
51:24 Redundancy in long videos: not all history needs to stay in context
52:01 Reference video: using characters, objects, and scenes as generation conditions
52:46 Why references are a kind of "cheating" and also an important mechanism
54:34 FramePack and dynamic context selection: the further from the present, the more compressed the information
55:52 A problem shared by LLMs and video models: context pruning still relies heavily on heuristics
56:14 A possible breakthrough for continual learning: letting models manage their own context
57:00 Lessons from human attention: not remembering everything, but dynamically pulling in relevant information
xAI culture and generative video safety
58:35 What is underrated about xAI: move fast, build, ambitious goals, and first principles
59:30 How to work backward from a three-month goal: breaking it down by data, training, human annotation, and GPU turnaround time
60:12 Elon Musk's working style: very hands-on, with direct feedback
61:09 Grok Voice: real-time voice experience, interruption handling, and in-car use
61:56 Generative video safety: watermarks, takedowns, and social platform governance
62:19 Limits of SynthID: once the paper is public, the watermark can also be reverse engineered
63:04 AI-generated content is getting harder to spot: from checking the fingers to checking whether the logic holds
Why visual intelligence comes from language
64:31 Core claim: a large part of visual intelligence comes from language models
65:00 Prompt rewriter: the "brain" behind video models
65:40 Why video diffusion models are very "literal": if the user says "a cat", the model may generate just a cat that does not move
66:10 Why GPT Image-style models "think for a few minutes": the time goes to reasoning, rewriting the prompt, and organizing content
67:07 Different architectural routes: separate LLM + diffusion, omni models, and discrete image tokens
68:21 Generate, understand, regenerate: how an omni model might iteratively refine images
69:54 The prompt rewriter and the diffusion head are not the same thing, but in both cases the language side contributes intelligence
70:33 No joint training needed: rewriting the prompt alone can significantly improve visual quality
Video Agent: the next wave of generative media
71:54 The Video Agent vision: calling tools, editing, and iterating like a human creator
72:13 Grok Imagine Agent beta: from video generation to a video creation workflow
72:29 Why "generate a one-minute video" is an agent task, not a single video-model task
73:30 From Copilot to Claude Code: video creation will also become agentic
74:17 Speed, thinking budget, and inference infra
75:12 The real value of Video Agents: not that models have peaked, but that harnesses and toolchains unlock new capabilities
76:21 AI models understand AI models better: future models will specialize in prompting and orchestrating generative models
77:28 Why deterministic tools still matter: subtitles, layout, and precise edits need not rely entirely on video models
78:02 Ethan's timing call: Video Agents will be a major trend by the end of the year
78:20 Production-grade video: once it is usable for ads and showcases, budgets will grow exponentially
Robots, LLMs, and the next stage of research
78:36 World models need not serve only robots, but robots will naturally become tools that AI can call
79:12 Physical AI may not need to be solved in the real world first; strong video models may solve it first
80:10 Why he left xAI: to pursue research outside the company's priorities, especially on language models
81:06 The bottleneck for video models is shifting toward language models and agents
81:31 What to watch over the next year: models that perceive and manage their own context
82:00 Context awareness: a model should know when it is nearing its context limit
82:30 Context addition / removal / compaction: done by the harness today, possibly absorbed by the model in the future
83:59 Self-modifying harness: the model programs itself at test time, like a program
85:22 Career path: from vision research in the ResNet era to FAIR, Cosmos, MoE, and xAI
86:44 Why switching fields is not as hard as it seems: the principles of training large models are largely shared
87:33 Closing: there are many layers behind xAI that have yet to be fully explained
🌟 Highlights
💡 Making Grok Imagine in Three Months: Speed Comes from Iteration Ability
Ethan recalled the situation when he joined xAI: no infra, no data, no models, only a few engineers and a very clear goal. The team went on to release Grok Imagine 0.9 within three months. He believes the key to training models is not some magical algorithm but end-to-end iteration speed: how many rounds of experiments you can run each day, how many bugs you can find, and how many data and training pipeline issues you can fix.
"When I look at training models, what's most important is actually how many rounds of iterations you can do each day."
🧠 Much of the Progress in Video Models Comes from Language Models
The most counterintuitive view in this episode is that visual intelligence largely comes from language. Ethan explained that video diffusion models are often very literal on their own; they need a stronger language model to rewrite the prompt, expanding a user's simple instruction into an extremely detailed visual description. Many improvements in images and video come not from diffusion models suddenly getting smarter, but from language models getting better at thinking, writing prompts, and calling tools.
"I have a pretty big judgment: a large part of visual intelligence actually comes from language, especially these video models."
🌍 What World Models Are: Real-time, Interactive, Long-duration Video
Ethan does not try to settle on a single standard definition of world models; he gives his own from the perspective of video generation: a world model is real-time, interactive, long-duration video. It must respond to keyboard, mouse, and voice input, run at low latency, and keep generating for minutes or even hours while keeping characters, voices, objects, and events consistent.
"In my view, world models are real-time, interactive, long-duration video."
🧩 Core Challenges of Long Videos: Not Longer Context, But Context Management
Video generation puts enormous pressure on context. Ethan mentioned that in Cosmos a five-second video might take fifty to sixty thousand tokens, so the context for long videos blows up quickly. The key going forward is therefore not just to brute-force longer context, but to let models learn to select historical information dynamically: when to remember the previous second in full, when to compress distant history, and when to pull back references for particular characters.
"Models should selectively know where to fetch references."
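To make the context pressure concrete, here is a rough back-of-the-envelope sketch. It assumes token count grows linearly with duration at the rate implied by the "fifty to sixty thousand tokens per five seconds" figure, which is a simplification. The second function is a toy geometric schedule for compressing older history; the `decay` and `floor` parameters are hypothetical and this is not FramePack or anything xAI is stated to use.

```python
def tokens_for_duration(seconds: float, tokens_per_5s: int) -> int:
    """Linearly extrapolate the token count for a clip lasting `seconds`."""
    return round(seconds * tokens_per_5s / 5)


def compressed_budget(seconds: int, tokens_per_second: int,
                      decay: float = 0.5, floor: int = 16) -> int:
    """Total tokens when the second that is `age` seconds old keeps
    tokens_per_second * decay**age tokens, but never fewer than `floor`.
    Hypothetical schedule, for illustration only."""
    return sum(
        max(floor, int(tokens_per_second * decay ** age))
        for age in range(seconds)
    )


if __name__ == "__main__":
    # Uncompressed growth under the linear assumption.
    for label, seconds in [("5 s", 5), ("1 min", 60), ("1 h", 3600)]:
        low = tokens_for_duration(seconds, 50_000)
        high = tokens_for_duration(seconds, 60_000)
        print(f"{label:>5}: {low:,} to {high:,} tokens")

    # One minute of history with older seconds compressed geometrically.
    print("1 min, compressed:", f"{compressed_budget(60, 10_000):,}", "tokens")
```

Under the linear assumption, one minute already needs 600,000 to 720,000 tokens, which is why selecting and compressing history matters more than simply extending the window.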
🎬 Video Agents Will Be the Next Wave of Generative Media
Ethan believes Video Agents will not simply "generate a few clips and stitch them together". Like human creators, they will use video models, image editing tools, video editors, FFmpeg, subtitle tools, and other deterministic tools, repeatedly generating, checking, modifying, and combining until they produce production-grade video. He predicts that Video Agents will become a major trend by the end of the year, and that once generated video meets the bar for ads and showcases, corporate budgets will quickly flow in.
"AI models understand AI models better."
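As an illustration of the deterministic-tool side of such an agent, the sketch below prepares an FFmpeg concat-demuxer job that joins already-generated clips. The clip names are hypothetical, the command is only printed and not executed, and `-c copy` assumes all clips share the same codec settings. This is one possible tool call, not a description of how Grok Imagine Agent works.

```python
import shlex


def concat_list(clips: list[str]) -> str:
    """Build the text of an ffmpeg concat-demuxer list file.
    A single quote inside a path is escaped as '\\'' per ffmpeg's quoting rules."""
    lines = []
    for clip in clips:
        escaped = clip.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path: str, output: str) -> list[str]:
    """Argument list for a stream-copy concat (no re-encoding)."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]


if __name__ == "__main__":
    # Hypothetical clip names produced by earlier generation steps.
    clips = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]
    print(concat_list(clips), end="")
    print(shlex.join(concat_command("clips.txt", "final.mp4")))
```

An agent would write the list text to `clips.txt` and then run the printed command; keeping this step deterministic means exact cut points do not depend on the video model.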
🔊 Challenges of Audio-visual Joint Generation: Temporal Alignment
Ethan called Grok Imagine 0.9 the first joint audio-video generation model deployed at large scale. The challenge is not just generating sound, but keeping sound, music, dialogue, and visuals precisely aligned in time. Alignment between text and images can be relatively loose, but audio and video must correspond at every time step, which makes data annotation, captioning, and model design more complex.
"The model must know there's a time-based alignment relationship between video and audio."
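A small sketch of what "corresponding at every time step" means in raw-signal terms: each video frame covers a fixed span of audio samples. The default frame rate (24 fps) and sample rate (48 kHz) are illustrative assumptions, not figures from the episode, and a real model would align latent tokens rather than raw samples.

```python
def audio_span_for_frame(frame_index: int, fps: int = 24,
                         sample_rate: int = 48_000) -> tuple[int, int]:
    """Return the half-open [start, end) range of audio samples
    that plays while the given video frame is on screen."""
    start = frame_index * sample_rate // fps
    end = (frame_index + 1) * sample_rate // fps
    return start, end


if __name__ == "__main__":
    for frame in (0, 1, 24):
        start, end = audio_span_for_frame(frame)
        print(f"frame {frame:>2}: samples {start} to {end - 1}")
```

With these assumed rates every frame maps to 2,000 audio samples, so a caption or label that is off by even a few frames points at the wrong sound.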
🖥️ Generative UI: Future Interfaces May Be Directly Generated by Models
Ethan envisions a future in which, if inference becomes cheap enough, user interfaces need not be written as code and rendered by a browser; a generative model can produce the pixels directly from user intent. You could present email in a TikTok-style feed, or generate Instagram stories without a like button. LLMs and coding models would handle backend logic, while diffusion models become the frontend visual layer.
"Generative UI goes directly from user intent to pixels."
🧠 The Next Step for LLMs: Perceiving and Managing Their Own Context
Since leaving xAI, Ethan has focused more on language models. He believes future models will need to know the state of their own context: when they are approaching the limit, when to compress, when to delete tool-call results, and when to add certain information back into the context. Today this work is mostly done by heuristics in Agent harnesses, but in the future it may be absorbed by the models themselves.
"Many things in heuristic engineering will also be absorbed by the models themselves in the end."
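The kind of harness heuristic described here can be sketched as follows. This is a minimal, hypothetical example: the `Message` type, the 80% threshold, and the "drop the oldest tool results first" rule are all assumptions chosen for illustration, not the behavior of any particular agent framework.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str    # "user", "assistant", or "tool"
    text: str
    tokens: int  # token count, assumed to be precomputed by the caller


def compact(history: list[Message], limit: int,
            threshold: float = 0.8) -> list[Message]:
    """Return a copy of `history` in which the oldest tool results are
    replaced by one-token stubs until usage is at most threshold * limit."""
    budget = int(limit * threshold)
    used = sum(m.tokens for m in history)
    result = list(history)
    for i, msg in enumerate(result):
        if used <= budget:
            break
        if msg.role == "tool" and msg.tokens > 1:
            result[i] = Message("tool", "[tool result removed]", 1)
            used -= msg.tokens - 1
    return result


if __name__ == "__main__":
    history = [
        Message("user", "Find the bug", 10),
        Message("tool", "<large log dump>", 900),
        Message("assistant", "The failure is in the loader", 20),
        Message("tool", "<short diff>", 50),
    ]
    for m in compact(history, limit=1000):
        print(m.role, m.tokens, m.text)
```

A rule like this is fixed in the harness; the point in the episode is that a model aware of its own context could make the same decision itself, case by case.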
🌐 Podcast Information Supplement
The audio for this podcast is produced using the original speakers' voices, so some parts might sound a bit odd.
The translation is done by AI, so some passages may not read smoothly.
If you would like to hear other foreign-language podcasts in Chinese, feel free to get in touch on WeChat: iEvenight