Just Now, Fei-Fei Li Defines the World Model

TL;DR · AI Summary
Fei-Fei Li defines the three functions of the world model: rendering, simulation, and planning.
Key Takeaways
- The world model has three functions: rendering, simulation, and planning.
- Simulators are the bridge between rendering and planning.
- The boundaries between the three functions are blurring, and they are converging.
Just Now, Li Feifei Stepped In Personally to Define World Models
2026-06-04 08:44:04 Source: QbitAI
Rendering, simulation, and planning—the boundaries between these three functions are blurring.
By Yu Yang, reporting from Aofeisi
QbitAI | WeChat official account: QbitAI
The concept of world models is trending, but it has also become chaotic. The definition alone is hotly debated: video generation models are called world models, language models that generate games are called world models, and some people even put physics engines in the same category.
It's so chaotic that Li Feifei herself couldn't help but take notice. Just now, she personally wrote an article to clarify the functional classification of world models.

She did not mince words: the term "world model" is one of the most important, and most misused, concepts in artificial intelligence today.
The ancient Greeks could not reach a consensus on the composition of the world because "the world" has never been a single entity. Artificial intelligence has inherited the same problem, and at this point, this field needs precision more than ever.
At least, let's first clarify three things:
Rendering, Simulation, Planning.
No need for much introduction; let's get straight into taking notes.
Three Functions of World Models
Li Feifei first unpacked what gives the term "world model" its technical meaning.
An agent (human, robot, or system) takes actions that affect the state of the world. By "state," we mean a complete description of everything happening in the world at a specific moment, including every object, position, velocity, and property.
An observation is what the agent perceives of this world. An action is how the agent responds to what it perceives.
Agent → action → state → observation → back to the agent: this loop is what gives "world model" its technical meaning. The many things now called world models are different projections of the same loop.
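The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not anything taken from Li Feifei's article: the one-dimensional world, the `World` and `Agent` class names, and the rule the agent follows are all assumptions chosen to keep the cycle visible.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Hypothetical 1-D world: the full state is a position and a velocity."""
    position: float = 0.0
    velocity: float = 0.0

    def apply(self, action: float, dt: float = 1.0) -> None:
        # The action (an acceleration) changes the state of the world.
        self.velocity += action * dt
        self.position += self.velocity * dt

    def observe(self) -> float:
        # The agent perceives only part of the state: position, not velocity.
        return self.position


class Agent:
    """Hypothetical agent that pushes toward a target position."""

    def __init__(self, target: float) -> None:
        self.target = target

    def act(self, observation: float) -> float:
        # Respond to what was observed: accelerate toward the target.
        return 1.0 if observation < self.target else -1.0


def run_loop(steps: int) -> list[float]:
    """Run agent -> action -> state -> observation for a number of steps."""
    world, agent = World(), Agent(target=3.0)
    observations = []
    for _ in range(steps):
        obs = world.observe()      # state -> observation
        action = agent.act(obs)    # observation -> action
        world.apply(action)        # action -> new state
        observations.append(world.observe())
    return observations
```

The agent here never sees the velocity, which is the point of separating state from observation: what is perceived is a projection of the full state, not the state itself.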
In terms of functionality, Li Feifei believes that world models have three major functions: rendering, simulation, and planning.
Among them, simulators receive the least attention but are crucial, serving as the bridge connecting rendering and planning.

Renderers
Renderers output visual results intended for human viewing, and their core metric is visual fidelity.
Google's Genie 3 and RTFM, from Li Feifei's own World Labs, both fall into the renderer category.
These models have no explicit understanding of three-dimensional structure. They generate the frames a viewer sees, not a scene that actually exists.
For example, in AI-generated drone footage the buildings may look perfect from above, but travel through the streets below and they turn out to be on the verge of falling apart.

Li Feifei argues that renderers are currently the most commercially mature of the three. Nano Banana, for instance, is a globally prominent example.
Their limitation is that they optimize for visual realism rather than physical accuracy. The output looks appealing but cannot be used for architectural design, robot training, or other scenarios that are tied closely to the real world.
Planners
Planners take observations and goals as input and output the next action.
VLA (vision-language-action) models and the new generation of world action models both belong to the planner category: they decide what a robot should do in an unstructured environment.
Planners are the most attractive and promising area for development. Embodied intelligence is closely related to this, and significant amounts of venture capital are flowing into this segment.
But Li Feifei points out that many impressive robot demonstrations in recent years have been limited to highly restricted laboratory environments, with narrow target ranges and short task cycles, making it difficult to verify the complexity, variability, and duration required for real-world deployment.
Simulators
Simulators output computable, interactive states, with an emphasis on geometric, physical, and dynamic consistency.
In a simulator, the geometry must hold up under inspection, the physics must obey physical laws, and the dynamics must match how the world actually behaves.
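To make the three consistency requirements concrete, here is a minimal sketch of a simulator for a single falling ball. It is an illustration under stated assumptions, not a description of any real product: the names `BallState`, `step`, and `simulate` are hypothetical, gravity is assumed constant, and the ball simply comes to rest on landing instead of bouncing.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2, assumed constant


@dataclass(frozen=True)
class BallState:
    height: float    # metres above the floor
    velocity: float  # metres per second, positive is up


def step(state: BallState, dt: float) -> BallState:
    """Advance a falling ball by dt seconds (exact for constant acceleration)."""
    height = state.height + state.velocity * dt + 0.5 * GRAVITY * dt * dt
    velocity = state.velocity + GRAVITY * dt
    if height <= 0.0:
        # Geometric consistency: the ball may rest on the floor,
        # but it may never pass through it.
        return BallState(height=0.0, velocity=0.0)
    return BallState(height=height, velocity=velocity)


def simulate(initial: BallState, dt: float, steps: int) -> list[BallState]:
    """Return the initial state followed by one state per time step."""
    states = [initial]
    for _ in range(steps):
        states.append(step(states[-1], dt))
    return states
```

The output is a sequence of states, not images: other programs can query it, check it against the closed-form free-fall equation, or feed it to a renderer or planner. That is the sense in which a simulator's output is computable and interactive.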

Simulators serve two user groups:
- Professionals such as architects, designers, filmmakers, and game developers, who need precision beyond mere visual realism.
- Reinforcement learning agents, robot controllers, and autonomous driving systems, which use simulators as training grounds to interact with the world at scale and to test scenarios that are dangerous, expensive, or impossible to run in reality.
Li Feifei argues that simulation is the bridge connecting rendering and planning.
If language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the substance of the world itself.
The simulator is the structural layer from which both visual appearance (for renderers) and the consequences of actions (for planners) are generated.
Simulation models can turn their understanding into images for humans to use, and they can predict the behavior of agents. Applications such as robot training, autonomous driving testing, architectural visualization, engineering design, and drug development all depend on some form of simulation.
The commercial prospects are broad: NVIDIA's Omniverse platform targets this potential trillion-dollar market.

The problem is that there is too little data for training simulators: three-dimensional data with explicit geometry, material properties, and physical annotations is several orders of magnitude scarcer than the internet video used to train renderers.
Simulation inherently differs from reality, and generative simulators introduce new risks: AI-generated content may look correct but, on closer inspection, violate physical principles in many places.
Large-scale multi-physics simulation (rigid bodies, deformable objects, fluids, cloth, and so on) costs several orders of magnitude more than single-domain simulation.
Marble, a World Labs product, aims to break through this bottleneck in simulation: it accepts multimodal input such as text, images, video, or spatial sketches, generates explorable 3D environments, and outputs Gaussian splats and collision meshes suitable for use in physics engines.
But Li Feifei also stresses that Marble is only the beginning of a long journey.
Blurring Boundaries
Another key point in Li Feifei's article is that the three types of models are gradually merging.
Rendering a world, simulating a world, and acting within a world all require knowledge that is largely the same.
For example:
If a model truly understands how a cup is placed on a table, including its geometric structure, material properties, and force responses, then it should be able to render this cup from any angle, simulate what happens when the cup is pushed, and plan how a hand should pick it up.
These three capabilities are projections of the same underlying understanding.
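The cup example can be sketched as one shared state with three functions that each project it differently. Everything here is a hypothetical toy: the `Cup` fields, the ASCII "rendering", the constant-friction model, and the function names are assumptions made for illustration, not how any real world model is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cup:
    """Hypothetical shared state: a cup on a table, reduced to a few numbers."""
    x: float      # position along the table, metres
    width: float  # metres
    mass: float   # kilograms


def render(cup: Cup, table_length: float = 1.0, columns: int = 10) -> str:
    """Renderer: project the state into something a human looks at (ASCII here)."""
    cells = []
    for i in range(columns):
        centre = (i + 0.5) * table_length / columns
        cells.append("U" if abs(centre - cup.x) <= cup.width / 2 else "_")
    return "".join(cells)


def simulate_push(cup: Cup, impulse: float, friction_decel: float = 2.0) -> Cup:
    """Simulator: where the cup stops after a push, under constant friction."""
    speed = impulse / cup.mass                      # v = J / m
    slide = speed * speed / (2.0 * friction_decel)  # d = v^2 / (2a)
    return Cup(x=cup.x + slide, width=cup.width, mass=cup.mass)


def plan_grasp(cup: Cup, hand_x: float) -> float:
    """Planner: how far the hand must move to be centred over the cup."""
    return cup.x - hand_x
```

A short usage example: after a push, the same updated state drives both the new picture and the new plan.

```python
cup = Cup(x=0.25, width=0.1, mass=0.5)
pushed = simulate_push(cup, impulse=0.5)
print(render(cup), render(pushed), plan_grasp(pushed, hand_x=0.0))
```

None of the three functions stores its own copy of the cup. They all read the same geometry and mass, which is what "projections of the same underlying understanding" means in code terms.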
Recent research has shown that at least conceptually, a pre-trained video renderer can serve as the backbone network for joint world prediction and action prediction.
This implies a bridge between renderers and planners:
Let the same model imagine both what will happen next and what should be done next.
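The wiring this implies, one shared backbone feeding a world-prediction head and an action-prediction head, can be sketched as follows. This shows the structure only: the functions are hand-written stand-ins for what would be learned networks, and the names `encode`, `predict_next`, `predict_action`, and `imagine_and_act` are hypothetical.

```python
def encode(history: list[float]) -> tuple[float, float]:
    """Shared backbone (hypothetical): estimate position and velocity
    from the two most recent observations."""
    position = history[-1]
    velocity = history[-1] - history[-2]
    return position, velocity


def predict_next(features: tuple[float, float]) -> float:
    """World-prediction head: the next position if nothing intervenes."""
    position, velocity = features
    return position + velocity


def predict_action(features: tuple[float, float], goal: float) -> float:
    """Action-prediction head: the velocity change that reaches the goal
    on the next step."""
    position, velocity = features
    return (goal - position) - velocity


def imagine_and_act(history: list[float], goal: float) -> tuple[float, float]:
    """Answer 'what happens next' and 'what to do next' from one encoding."""
    features = encode(history)  # computed once, shared by both heads
    return predict_next(features), predict_action(features, goal)
```

The design choice worth noticing is that `encode` runs once and both heads consume its result, so whatever the backbone has captured about motion serves prediction and planning alike.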
Marble outputs Gaussian splats and collision meshes from a single model, a sign that the boundary between renderers and simulators is dissolving.
Each layer is moving from passive output toward interactive systems. Renderers are becoming conditioned on actions. Simulators are generating more controllable and editable worlds. Planners are moving from simple reaction to genuine deliberation.
The logical endpoint is a unified world model: a foundation model that can generate photorealistic views and accurate physical structure while also planning action sequences.
The core challenge remains data.
Renderers have access to vast amounts of internet videos, but simulators and planners lack sufficient 3D assets and robot demonstration data.
Optimizing for visual appeal can compromise the precision needed for robotics or high-fidelity simulation.
How to reconcile these tensions within a single architecture is the central open question in world model research today.
But Li Feifei optimistically concludes: The direction is clear.
Three independent lines of research have each driven and shaped billion-dollar industries. Now they are beginning to act as one.
When their boundaries collapse together, the change will reframe a larger question: the relationship between machine intelligence and the physical world.
This is the arc of spatial intelligence: language gave machines a way to talk about the world, and world models will let machines understand, imagine, reason about, and interact with it.
Original link:
https://x.com/drfeifei/status/2062247238143996275
All rights reserved. Unauthorized reproduction or use is strictly prohibited.