跨国串门儿计划 · Podcast · 1:29:05

#569. Deep Dive into xAI: Building Grok Imagine in 3 Months, Video Generation, World Models, and Video Agents

Score: 8.8
TL;DR · AI Summary

A former Nvidia researcher explains how xAI built Grok Imagine in three months, revealing the training pipeline of video generation models, the definition of world models, and the future trends of Video Agents.

Key Takeaways

  • xAI built Grok Imagine 0.9 from scratch within three months, thanks to talent de
  • The progress of video models largely comes from advancements in language models,
  • World Models are defined as 'real-time, interactive, long-term videos,' and futu

Outline

  1. Introduces guest Ethan He's background and an overview of the program's content.

  2. Secrets of xAI's Rapid Development

    Analyzes the key factors behind xAI building Grok Imagine in three months.

  3. Details the complete training process from data acquisition to diffusion Transformer.

  4. Elaborates on the definition of world models and their applications in interactive systems.

  5. Discusses audio processing challenges and audio-video alignment issues.

Mindmap

  • Deep dive into xAI's technology
    • Building Grok Imagine
      • The secret to shipping in three months
    • Video model training
      • Data and architecture
    • World models
      • Real-time interactivity

Chapters

  1. Opening & podcast overview

  2. Guest appearance: Ethan He and the origin of his connection with the Latent Space community

  3. Why he left Nvidia: video models also have scaling laws and need more compute

  4. Starting from scratch at xAI: building Grok Imagine 0.9 in three months

  5. The secret of rapid iteration: talent, infra, compute, and low communication costs

  6. The truth behind model quality improvements: many breakthroughs come from data and small bugs in the training pipeline

  7. How coding models change the research pace: code gets written faster, so compute becomes the bottleneck again

  8. High-pressure R&D culture: compute is expensive, but this is a marathon

  9. Why image models are usually built before video models

  10. Where the data comes from: detailed human annotation and VLM-generated synthetic captions

  11. Why training video models requires both paired and unlabeled data

  12. VAE / tokenizer: why training directly on pixels is not feasible

#AI #Video Generation #World Models #Deep Learning

Show notes

#569. Delving into xAI: Creating Grok Imagine in Three Months, Video Generation and the Battle of World Models, and Video Agents

📝 Overview of This Podcast Episode

This episode is a clone of the Latent Space episode "Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents" with Ethan He.

Original content updated: June 1, 2026

This episode is a high-density technical interview about video generation, world models, and Video Agents. The guest, Ethan He, worked on the Cosmos world model at Nvidia and later joined xAI, where he worked from scratch on Grok Imagine, audio-video joint generation, reference video, video extension, and world-model-related projects. In the episode, he recounts how xAI shipped Grok Imagine 0.9 in just three months, starting with no infrastructure, data, or models, and walks through the complete training pipeline for video models: data, captions, VAEs, diffusion transformers, and distillation.

More importantly, Ethan offers several notable viewpoints: many advances in video models come from language models rather than from video diffusion itself; a world model, in his view, is "real-time, interactive, long-duration video"; and future Video Agents will, like human creators, call video models, image editors, FFmpeg, and other deterministic tools to iteratively produce video that is actually usable for advertising, creative work, and production environments. This episode suits listeners who want to understand the technical roadmap of video generation as well as those who want an early look at where AI interfaces, generative media, and Agents are heading.

👨‍💻 Guest of This Episode

Ethan He previously worked at Nvidia on the Cosmos world model and Megatron-LM MoE, among other projects, then joined xAI to work on Grok Imagine, video generation, audio-video joint generation, reference video, video extension, and world models. His research experience spans computer vision, self-supervised learning, large-scale MoE, video diffusion, world models, and LLM Agents.

⏱️ Timestamps

00:00 Opening & podcast overview

From Cosmos to xAI: creating Grok Imagine in three months

02:42 Guest appearance: the origin of Ethan He's connection with the Latent Space community

04:14 Why he left Nvidia: video models also have scaling laws and need more compute

05:43 Starting from scratch at xAI: creating Grok Imagine 0.9 within three months

06:15 Secret of rapid iteration: talent, infra, compute, and low communication costs

08:23 Truth behind model quality improvement: many breakthroughs come from data and small bugs in the training pipeline

08:37 How coding models change the research pace: code gets written faster, so compute becomes the bottleneck again

09:54 High-pressure R&D culture: expensive computation, but it's a marathon

How video models are trained

11:46 Why image models are usually done before video models

12:50 Where does the data come from: manual detailed annotations and VLM-generated synthetic captions

14:12 Why paired data as well as unlabeled data are needed for training video models

15:07 VAE/tokenizer: why direct training on pixels is not feasible

17:08 Diffusion transformer: denoising step by step to generate images and videos from noise

17:27 How image models bootstrap video models: denser connections between language and images

18:24 Video compression routes: frame-by-frame compression versus temporal dimension compression

18:55 Why not directly train with MP4 tokens: the latent space must be friendly to the model

20:00 The cost of real-time: temporal compression saves context but adds response latency

Early forms of generative UI and world models

20:51 Flipbook: exploring web pages imagined by the model like using a browser

22:31 Generative UI: directly from user intent to pixels, without writing code first and then rendering

24:09 Diffusion frontend, deterministic backend: how future interfaces might be restructured

25:15 Human-computer interaction bandwidth: humans output through speech and input through vision

26:15 NeuroOS: simulating operating systems and games using video models

27:52 From overfitting existing interfaces to imagining entirely new interaction systems

28:47 Why video models can generate supernatural content not present in the training set

Costs, acceleration, and audio-video joint generation of video models

31:05 How costly are video models: training costs approach those of medium-sized LLMs

31:52 Underestimated costs: video storage, feature storage, IO, and egress

33:29 Training scale: tens of trillions of visual tokens, active parameters on the order of ten billion

34:16 Inference-side acceleration: how step distillation reduces one hundred steps to just a few

36:36 Relationship between consistency models, GANs, and few-step generation

37:48 Grok Imagine 0.9: large-scale audio-video joint generation model

38:00 Why audio is difficult: speech is more discrete, music is more continuous

40:25 Audio-video alignment: the model must understand the relationship between every second of sound and visuals

41:20 Sense of time: why LLMs themselves do not truly perceive time

Ethan's Definition of World Models

43:47 What is a world model: real-time, interactive, long-duration video

44:03 Interactivity: keyboards, mice, voice can all be input modalities

45:00 Real-time: games need millisecond-level responses, digital humans need to be close to two hundred milliseconds

46:00 Long duration: world models should generate not just a few seconds but continue for minutes or even hours

47:00 Video extension: the first step towards long-duration world models

48:00 Challenges of long contexts: five-second videos may already have fifty to sixty thousand tokens

49:03 Why users like video extensions: they are intermediate products leading to the ultimate goal

Reference Videos and Dynamic Context Management

51:24 Redundancy in long videos: not all history needs to be continuously fed into the context

52:01 Reference video: using characters, objects, scenes as generation conditions

52:46 Why references are a kind of "cheating", yet an important mechanism

54:34 FramePack and dynamic context selection: the further away from the current point, the more compressed the information

55:52 Shared issues between LLMs and video models: context pruning currently heavily relies on heuristics

56:14 Possible breakthroughs in continual learning: enabling models to manage their own contexts

57:00 Human Attention Inspiration: Not Remembering Everything, But Dynamically Pulling Relevant Information

xAI Culture and Generative Video Security

58:35 Underestimated Aspects of xAI: Move Fast, Build, Grand Goals, and First Principles

59:30 How to Work Backward from a Three-Month Goal: Breaking It Down by Data, Training, Manual Annotation, and GPU Turnaround Time

60:12 Elon Musk's Working Style: Very Hands-On, Direct Feedback

61:09 Grok Voice: Real-time Voice Experience, Interruption Capability, In-vehicle Scenarios

61:56 Generative Video Security: Watermarks, Takedowns, and Social Platform Governance

62:19 Limitations of SynthID: After the Paper is Public, Watermarks May Also Be Reverse Engineered

63:04 AI-generated Content Becoming Harder to Identify: From Looking at Fingers to Checking Whether Logic Holds Up

Why Visual Intelligence Comes from Language

64:31 Core Judgment: A Large Part of Visual Intelligence Comes from Language Models

65:00 Prompt Rewriter: The "Brain" Behind Video Models

65:40 Why Video Diffusion Model Is Very "Literal": If a User Says "A Cat", It Might Only Generate a Stationary Cat

66:10 Why GPT Image-like Models Need to "Think for a Few Minutes": Time Spent on Reasoning, Rewriting Prompts, and Organizing Content

67:07 Different Architecture Routes: Independent LLM + Diffusion, Omni Model, Discrete Image Tokens

68:21 Generation—Understanding—Regeneration: How an Omni Model Might Iteratively Optimize Images

69:54 Prompt Rewriter and Diffusion Head Are Not the Same Thing, but the Language Side Contributes to Intelligence

70:33 No Need for Joint Training, Just Rewriting Prompts Can Significantly Improve Image Quality

Video Agent: The Next Wave of Generative Media

71:54 Vision of Video Agent: Like Human Creators in Invoking Tools, Editing, Iterating

72:13 Grok Imagine Agent Beta: From Video Generation to Video Creation Workflow

72:29 Why "Generating a One-Minute Video" Is an Agent Task, Not a Single Video Model Task

73:30 From Copilot to Claude Code: Video Creation Will Also Go Through Agentization

74:17 Speed, Thinking Budget, and Inference Infra

75:12 True Value of Video Agent: Not That Models Have Reached Their Limits, but Harnesses and Toolchains Unlock New Capabilities

76:21 AI Models Understand AI Models Better: There Will Be Models Dedicated to Prompting and Orchestrating Generative Models

77:28 Why Deterministic Tools Are Still Important: Subtitles, Typesetting, Precise Editing Don't Have to Rely Entirely on Video Models

78:02 Ethan's Timing Judgment: By the End of the Year, Video Agents Will Become a Big Trend

78:20 Production Grade Videos: Once Usable for Ads and Displays, Budgets Will Grow Exponentially

Robots, LLMs, and the Next Stage of Research

78:36 World Models Don't Necessarily Serve Only Robots, But Robots Naturally Become AI-invocable Tools

79:12 Physical AI Perhaps Doesn't Need to Solve Problems in the Real World First; It Can Be Solved by Strong Video Models First

80:10 Why He Left xAI: To Pursue Research Outside the Company's Priorities, Especially on Language Models

81:06 Bottlenecks of Video Models Are Shifting Toward Language Models and Agents

81:31 What to Focus on in the Coming Year: Models Sensing and Managing Their Own Context

82:00 Context Awareness: Models Should Know They're Approaching Their Context Limits

82:30 Context Addition/Removal/Compaction: Currently Done by Harnesses, Potentially Absorbed by Models in the Future

83:59 Self-modifying Harness: Models Program Themselves Like Programs at Test Time

85:22 Career Path: From Visual Research in the ResNet Era to FAIR, Cosmos, MoE, xAI

86:44 Why Switching Research Directions Isn't as Difficult as Imagined: The Principles of Training Large Models Largely Carry Over

87:33 Conclusion: There Are Many Layers Behind xAI That Haven't Been Fully Explained Yet

🌟 Highlights

💡 Making Grok Imagine in Three Months: Speed Comes from Iteration Ability

Ethan recalled the situation when he joined xAI: no infra, no data, no models, only a few engineers and a very clear goal. The team released Grok Imagine 0.9 within three months. He believes the key to training models is not some magical algorithm but end-to-end iteration speed: how many rounds of experiments you can run each day, how many bugs you can find, and how many data and training pipeline issues you can fix.

"When I look at training models, what's most important is actually how many rounds of iterations you can do each day."

🧠 Much of the Progress in Video Models Comes from Language Models

The most counterintuitive view in this episode is that visual intelligence largely comes from language. Ethan explained that video diffusion models are often very literal on their own; they need a stronger language model to rewrite prompts, expanding a user's simple instruction into an extremely detailed visual description. Many improvements in images and videos happen not because diffusion models suddenly became smarter, but because language models got better at thinking, writing prompts, and invoking tools.

"I hold a fairly strong view: a large part of visual intelligence actually comes from language, especially for these video models."

🌍 What World Models Are: Real-time, Interactive, Long-duration Video

Ethan does not try to settle the one standard definition of world models; he gives his own from the perspective of video generation: a world model is real-time, interactive, long-duration video. It must respond to keyboard, mouse, and voice input; achieve low latency; and keep generating content for several minutes or even hours while maintaining consistency in characters, voices, objects, and events.

"In my view, world models are real-time, interactive, long-duration videos."

🧩 Core Challenges of Long Videos: Not Longer Context, But Context Management

Video generation faces huge context pressure. Ethan mentioned that a five-second video in Cosmos might have fifty to sixty thousand tokens, so the token count for long videos explodes quickly. The key going forward is therefore not just to force the context length longer, but to let models learn to choose historical information dynamically: when to remember the previous second in full, when to compress distant history, and when to pull back references for particular characters.

"Models should selectively know where to fetch references."

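The token pressure described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the frame rate, temporal compression factor, and tokens-per-latent-frame values are hypothetical, not Cosmos or Grok Imagine settings, and the halving schedule is just one simple way to compress history more the further back it sits, not the actual FramePack algorithm.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def clip_tokens(seconds, fps, tokens_per_latent_frame, temporal_compression=1):
    """Token count for one clip under a hypothetical video tokenizer."""
    frames = int(seconds * fps)
    latent_frames = ceil_div(frames, temporal_compression)
    return latent_frames * tokens_per_latent_frame


def history_budgets(num_chunks, tokens_per_chunk, min_tokens=1):
    """Token budget per history chunk; index 0 is the most recent chunk.

    Each step further back halves the budget, so the total stays below
    twice the size of a single uncompressed chunk.
    """
    return [max(min_tokens, tokens_per_chunk >> age) for age in range(num_chunks)]


# Assumed numbers: 5 s at 24 fps, 8x temporal compression,
# 3600 tokens per latent frame.
per_clip = clip_tokens(5, 24, 3600, temporal_compression=8)
budgets = history_budgets(6, per_clip)
print(per_clip, sum(budgets), 6 * per_clip)
```

With these assumed numbers a five-second clip comes to 54,000 tokens, and six clips of history fit in 106,312 tokens instead of 324,000.
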
🎬 Video Agents Will Be the Next Wave of Generative Media

Ethan believes that Video Agents will not simply "generate a few clips and stitch them together". Like human creators, they will use video models, image editing tools, video editors, FFmpeg, subtitle tools, and other deterministic tools, repeatedly generating, checking, modifying, and combining until they produce production-grade video. He predicts that Video Agents will become a major trend by the end of the year, and that once generated video meets the bar for ads and displays, corporate budgets will quickly enter the field.

"AI models understand AI models better."

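As one illustration of the deterministic tools such an agent might call, the sketch below prepares an FFmpeg concat-demuxer job that joins already generated clips without re-encoding. The function name and file names are hypothetical, and the sketch only builds the list file text and the argument vector; a harness would write the list to disk and run the command itself. Stream copy (`-c copy`) assumes all clips share the same codec and encoding parameters.

```python
def concat_plan(clips, output, list_path="clips.txt"):
    """Return (list_file_text, argv) for joining clips with FFmpeg's concat demuxer."""
    for clip in clips:
        # Keep the sketch simple: refuse paths that would need escaping.
        if "'" in clip or "\n" in clip:
            raise ValueError(f"unsupported character in path: {clip!r}")
    list_text = "".join(f"file '{clip}'\n" for clip in clips)
    argv = [
        "ffmpeg",
        "-f", "concat",   # read a list of input files
        "-safe", "0",     # allow arbitrary paths in the list
        "-i", list_path,
        "-c", "copy",     # copy streams instead of re-encoding
        output,
    ]
    return list_text, argv


list_text, argv = concat_plan(["shot1.mp4", "shot2.mp4"], "final.mp4")
print(list_text)
print(" ".join(argv))
```

Because the join is deterministic, the agent can spend model calls on generating and checking shots while leaving assembly to a tool whose output is predictable.
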
🔊 Challenges of Audio-visual Joint Generation: Temporal Alignment

Ethan called Grok Imagine 0.9 the first audio-video joint generation model deployed at scale. The challenge is not just generating sound, but keeping sound, music, dialogue, and visuals precisely aligned in time. Alignment between text and images can be relatively loose, but audio and video must correspond at every time step, which makes data annotation, captioning, and model design more complex.

"The model must know there's a time-based alignment relationship between video and audio."

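To show what "corresponding at every time step" means at the data level, here is a minimal sketch of the bookkeeping that maps video frames to audio samples and back. It assumes a constant frame rate and sample rate and says nothing about how Grok Imagine represents audio internally; the function names are made up for illustration.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def audio_span(frame_idx, fps, sample_rate):
    """Half-open range [start, end) of audio samples that play during one frame."""
    start = ceil_div(frame_idx * sample_rate, fps)
    end = ceil_div((frame_idx + 1) * sample_rate, fps)
    return start, end


def frame_for_sample(sample_idx, fps, sample_rate):
    """Index of the video frame on screen when a given audio sample plays."""
    return sample_idx * fps // sample_rate


print(audio_span(0, 24, 48000))         # (0, 2000)
print(audio_span(1, 24, 44100))         # (1838, 3675)
print(frame_for_sample(1838, 24, 44100))  # 1
```

Integer arithmetic keeps the two functions exact inverses of each other even when the sample rate is not a multiple of the frame rate, as in the 44,100 Hz case.
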
🖥️ Generative UI: Future Interfaces May Be Directly Generated by Models

Ethan envisions a future in which, if inference costs fall low enough, user interfaces no longer have to be written as code and rendered by a browser; generative models can produce them directly as pixels based on user intent. You could present email in a TikTok-style feed or generate Instagram stories without a like button. LLMs and coding models handle backend logic, while diffusion models become the frontend visual layer.

"Generative UI goes directly from user intent to pixels."

🧠 The Next Step for LLMs: Perceiving and Managing Their Own Context

Since leaving xAI, Ethan has focused more on language models. He believes future models need to know the state of their own context: when they are approaching the limit, when to compress, when to delete tool call results, and when to add certain information back into the context. Today this work is mostly done by heuristics in Agent harnesses, but in the future the models themselves may absorb it.

"Many things in heuristic engineering will also be absorbed by the models themselves in the end."

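A harness-side heuristic of the kind described above might look like the following sketch. Everything here is an assumption for illustration: the `Message` type, the word-count stand-in for a tokenizer, and the policy of replacing the oldest tool outputs with a stub are not taken from any particular agent framework.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "user", "assistant", or "tool"
    text: str


STUB = "[tool output removed]"


def count_tokens(text):
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def compact(history, limit, keep_recent=2):
    """Stub out the oldest tool outputs until the history fits within limit.

    The last keep_recent messages are never touched, and the input list
    is left unmodified.
    """
    history = list(history)
    total = sum(count_tokens(m.text) for m in history)
    cutoff = max(0, len(history) - keep_recent)
    for i in range(cutoff):
        if total <= limit:
            break
        msg = history[i]
        saved = count_tokens(msg.text) - count_tokens(STUB)
        if msg.role == "tool" and saved > 0:
            history[i] = Message("tool", STUB)
            total -= saved
    return history


history = [
    Message("user", "make a video"),
    Message("tool", "x " * 50),
    Message("assistant", "ok"),
    Message("tool", "y " * 50),
    Message("user", "next"),
]
compacted = compact(history, limit=60)
print([count_tokens(m.text) for m in compacted])
```

In this example only the older tool output is replaced, which brings the total from 105 to 58 tokens. The policy is fixed in code, which is exactly the kind of heuristic Ethan expects models to take over.
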
🌐 Podcast Information Supplement

This podcast reproduces the original speakers' voices in its audio production, so some parts might sound a bit odd.

The translation is done by AI, so some passages may not read smoothly.

If you would like to hear other foreign-language podcasts in Chinese, feel free to get in touch on WeChat: iEvenight

AI may generate inaccurate information. Please verify important content.