The Endgame of Robotics: NVIDIA's Jim Fan Declares the VLA Era Over as WAM Arrives


TL;DR · AI Summary

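NVIDIA's Jim Fan declares the VLA approach outdated and proposes a new paradigm, WAM, with DreamZero as its flagship model. He predicts, with 95% confidence, that robots will design and manufacture their own successors by 2040.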



Jim Fan heads NVIDIA's GEAR Lab (Generalist Embodied Agent Research), which for the past few years has promoted the GR00T humanoid robot model, built on the VLA (Vision-Language-Action) architecture. He recently gave a 20-minute talk at Sequoia AI Ascent 2026 titled "Robotics' End Game," in which he opened by declaring the VLA approach outdated, including his own GR00T project from just six months earlier.

The new paradigm is called World Action Models (WAM); its flagship project is DreamZero, released by NVIDIA in February. He calls the overall strategy the "Great Parallel": replicate the three steps that large language models (LLMs) went through (pre-training → alignment → reinforcement learning), but swap language models for video world models and teleoperation data for first-person human video. The ultimate goal is for robots to design and manufacture their own next generation by 2040, a prediction he makes with 95% confidence.

Source: Sequoia Capital AI Ascent 2026 talk, published April 30, 2026. Original video: https://www.youtube.com/watch?v=3Y8aq_ofEVs

Key Takeaways

  • End of the VLA Route: Jim publicly declared the VLA route over. The new paradigm is World Action Models (WAM), and its flagship project is DreamZero (14 billion parameters).
  • Farewell to Teleoperation Data: Teleoperation has a low physical ceiling. Jim predicts its use will drop to nearly zero within one to two years, replaced by sensorized human data.
  • Neural Scaling Law: EgoScale was pre-trained on 21,000 hours of first-person human video, and the team found a neural scaling law for dexterous manipulation (R² = 0.998).
  • Neural Simulator: Dream Dojo used 44,000 hours of human video to train a neural simulator that bypasses the physics engine entirely.
  • Countdown to the Endgame: Jim predicts, with 95% confidence, that the robotics endgame (automated research in the physical world) will be reached by 2040.

From DGX-1 Signature to “Great Parallel”

Jim opened with a story from 2016. That summer, Jensen Huang, in his iconic leather jacket, carried a large metal tray into what was then OpenAI's office. It was inscribed: "To Elon and the OpenAI team, to the future of computing and humanity." This was the world's first DGX-1.

Image 1: Jensen Huang and Elon Musk viewing the first DGX-1

Jim, then OpenAI's first intern, quickly lined up to sign it. "Back then, I had no idea what I was signing." Andrej Karpathy was also there. The machine is now housed in the Computer History Museum, which, Jim added, makes him feel like a dinosaur.

Image 2: DGX-1 signature board screenshot

_Note: Jim Fan (Fan Linxi) is Director of Robotics and AI and a Distinguished Scientist at NVIDIA, where he leads the GEAR Lab and the GR00T humanoid robot project. During his 2016 internship at OpenAI his mentors were Ilya Sutskever and Andrej Karpathy; he later completed his Ph.D. under Fei-Fei Li at Stanford._

This story leads into his core framework. He quoted Ilya: "If you believe in deep learning, deep learning will believe in you." He then argued that LLMs reached today's level in just six years and three stages (GPT-3 pre-training, InstructGPT supervised fine-tuning, and o1-style reinforcement learning), with automated research as the end point.

So he made a decision: copy the homework, rename it, and call it the "Great Parallel." Replace "simulating the next state of a string" with "simulating the next state of the physical world," use action fine-tuning to narrow the model down to the part a robot actually needs, and let reinforcement learning cover the last mile.

Image 3: Jim Fan demonstrates the "Great Parallel": LLM training path corresponds to the robot training path

"If you can't beat them, join them."

What's Wrong with VLA: Parameters Are Piled Up on Language

Over the past three years, the mainstream architecture in robotics has been the VLA (Vision-Language-Action) model. NVIDIA's own GR00T and Physical Intelligence’s π0 both fall into this category.

Jim pointed out a structural issue: these models should really be called LVA, because most of their parameters sit on the language side. Language is the first-class citizen, vision comes second, and action is at the bottom.

Image 4: VLA architecture illustration: vision-language model with an action head on top

VLA excels at encoding knowledge and nouns but struggles with physics and verbs. The focus is misplaced.

He cited a classic demo from the original RT-2 paper: instructing a robot to push a soda can next to a photo of Taylor Swift. The model hadn’t seen Taylor Swift before but could generalize. The problem is, it generalized the noun (recognizing Taylor Swift) rather than the verb (how to push, what angle to use, how much force).

Image 5: RT-2 paper's soda can and Taylor Swift generalization example

From AI Video Slop to DreamZero

If VLA isn't the answer, what is the next pre-training paradigm? Jim's answer is video models, which internally learn to simulate the next state of the physical world.

Image 6: Jim Fan introduces video world models using "AI video slop"

How do we make these world models useful? Through action fine-tuning, which collapses the superposition of all possible futures into one meaningful trajectory for a real robot.

NVIDIA's answer is DreamZero. This is a new type of policy model that "dreams" a few seconds into the future before executing an action, then acts according to the dream. DreamZero decodes the next frame and the next action simultaneously. Here, vision and action become true "first-class citizens."
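The "dream, then act" loop described above can be illustrated with a toy sketch. This is not DreamZero's implementation: `ToyWorldActionModel`, `dream`, and `act_from_dream` are hypothetical names, and frames and actions are plain lists of floats instead of video latents and robot commands. The only point it illustrates is that a single model step returns both the next frame and the next action, and that the policy rolls forward on its own predictions before executing.

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[float]   # stand-in for an image; a real model would use video latents
Action = List[float]  # stand-in for a robot command, e.g. joint deltas


@dataclass
class ToyWorldActionModel:
    """Hypothetical stand-in for a WAM: one step returns BOTH outputs."""
    gain: float = 0.5

    def step(self, frame: Frame, goal: Frame) -> Tuple[Frame, Action]:
        # Action output: move a fraction of the way toward the goal.
        action = [self.gain * (g - f) for f, g in zip(frame, goal)]
        # Frame output: the predicted next observation if that action is taken.
        next_frame = [f + a for f, a in zip(frame, action)]
        return next_frame, action


def dream(model: ToyWorldActionModel, frame: Frame, goal: Frame,
          horizon: int) -> Tuple[List[Frame], List[Action]]:
    """Roll the model forward on its own predictions for `horizon` steps."""
    frames: List[Frame] = []
    actions: List[Action] = []
    for _ in range(horizon):
        frame, action = model.step(frame, goal)
        frames.append(frame)
        actions.append(action)
    return frames, actions


def act_from_dream(start: Frame, actions: List[Action]) -> Frame:
    """Execute the dreamed actions in a toy environment that just adds them up."""
    state = list(start)
    for action in actions:
        state = [s + a for s, a in zip(state, action)]
    return state


model = ToyWorldActionModel(gain=0.5)
dreamed_frames, dreamed_actions = dream(model, [0.0, 0.0], [1.0, 2.0], horizon=3)
final_state = act_from_dream([0.0, 0.0], dreamed_actions)
```

In this toy setting the environment matches the model exactly, so `final_state` equals the last dreamed frame. On a real robot the two would diverge, which is why such a policy would need to re-plan in closed loop.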

Image 7: DreamZero displaying the world model view while performing tasks

Jim candidly admitted that DreamZero is not yet reliable on every task. "It's roughly at the GPT-2 stage, heading in the right direction but not yet stable and reliable." He named this new architecture WAM (World Action Models).

Let us mourn our dear VLA for a moment. It has completed its historical mission. Rest in peace. Long live World Action Models.

_Note: The DreamZero paper (arXiv 2602.15922) was released in February 2026. The model has 14 billion parameters and is based on the Wan2.1 video diffusion model. A key limitation is that the 14B model needs 38x system-level optimization plus GB200 hardware to reach 7Hz closed-loop control, which makes deployment extremely challenging._
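To put the 7Hz figure in perspective, the arithmetic below converts it to a per-step time budget. It also estimates the unoptimized rate under one loud assumption that the source does not confirm: that the 38x figure is a uniform speedup of per-step latency.

```python
def period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given control rate."""
    return 1000.0 / rate_hz


optimized_hz = 7.0   # closed-loop rate quoted in the note above
speedup = 38.0       # ASSUMPTION: treated as a pure per-step latency speedup

optimized_ms = period_ms(optimized_hz)   # roughly 143 ms per step
baseline_ms = optimized_ms * speedup     # roughly 5.4 s per step before optimization
baseline_hz = optimized_hz / speedup     # roughly 0.18 Hz before optimization

print(f"optimized: {optimized_ms:.1f} ms/step at {optimized_hz:.0f} Hz")
print(f"baseline (assumed): {baseline_ms / 1000:.2f} s/step, {baseline_hz:.2f} Hz")
```

Under that assumption, the unoptimized model would take several seconds per control step. That is the gap the system-level optimization has to close before closed-loop control becomes possible at all.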

Q: Is VLA really dead? A: On the slides, yes. However, the paper for NVIDIA's latest GR00T N1.7 (April 2026) still explicitly refers to a "VLA model," so the paradigm shift has not yet been completed internally.

Q: Can DreamZero be used in production environments now? A: No. Jim himself says it is "roughly at the GPT-2 stage." The paper reports that the 14B model reaches only 7Hz closed-loop control, and only on GB200 hardware.

Q: Will teleoperation really become obsolete? A: Jim predicts it will drop to nearly zero within one to two years. However, unlike driving, wearing sensor devices while doing household chores is not something people need to do, and the industry's existing teleoperation infrastructure won't vanish overnight.

Q: What does the scaling law for dexterous manipulation mean? A: If R² = 0.998 holds up, adding more human video data will improve robot dexterity in a predictable way. This is the most important piece of empirical evidence in the entire presentation.

Q: What does NVIDIA gain from this? A: WAM and neural simulators have extremely high compute demands. Jim's line “buy more, save more” makes the commercial logic plain: the paradigm shift naturally benefits chip sales.
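For readers unfamiliar with the R² statistic cited in the scaling-law answer above, here is a minimal sketch of how such a figure is typically computed. The data points are synthetic and invented purely for illustration, not EgoScale results. The power-law form (a straight line in log-log space) is also an assumption here; the source does not state the functional form of the law.

```python
import math
from typing import List, Tuple


def fit_log_log(hours: List[float], loss: List[float]) -> Tuple[float, float, float]:
    """Least-squares line through (log hours, log loss); returns slope, intercept, R^2."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# SYNTHETIC data: loss = 2 * hours^-0.3, nudged by small fixed factors as "noise".
hours = [1000.0, 2000.0, 5000.0, 10000.0, 20000.0]
noise = [1.02, 0.99, 1.01, 0.98, 1.00]
loss = [2.0 * h ** -0.3 * m for h, m in zip(hours, noise)]

slope, intercept, r_squared = fit_log_log(hours, loss)
print(f"fitted exponent: {slope:.3f}, R^2: {r_squared:.4f}")
```

An R² close to 1 means the fitted line accounts for almost all of the variation in the measured points. That is what would make the effect of adding more data predictable, provided the trend continues beyond the measured range.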

Finally: Three Open Questions Worth Tracking

Three things are worth tracking:

  1. How will DreamZero get past the "GPT-2 stage"? Whether the largest models can be made stable and reliable over the next 12-18 months will determine how powerful this paradigm really is.
  2. The moment NVIDIA moves away from the VLA paradigm internally: watch for substantive changes in its product updates. If the next generation still uses VLA, the presentation was closer to conceptual marketing.
  3. Who carries the flywheel for first-person video data: since NVIDIA has no consumer hardware entry point, watch who (Apple or Meta, for example) can actually collect this data at the scale of millions of hours.
