VLA and World Models Are Not the Endgame, There Will Be Unique Physical World Models

TL;DR · AI Summary
Shen Yujun believes that VLA and world models are not the endgame for embodied intelligence; a unique physical-world model will emerge. Robots lack data, so spatial perception must be built from sensor inputs.
Key Takeaways
- Robotics lacks physical-world data, requiring specialized models for the real world.
- Neither VLA nor world models are final; they'll merge into new paradigms.
- Spatial perception is key to robot intelligence, starting from sensor input.
Outline
Large models rely on internet data dividends, but robotics still lacks physical data, needing migration to the real world.
Robotics data scarcity requires building spatial perception capabilities from sensor inputs to intelligent decisions.
Neither path is final; they’ll converge into a new paradigm suited to the physical world.
Robots will gradually enter daily life, becoming generalized brains enhancing human productivity.
Expect benchmark cases in 1–2 years, mass replication in 2–3 years, and consumer adoption afterward.
Mindmap
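- Models unique to the physical world
- Data scarcity
- The robot data gap
- Data dividends of the digital world
- Convergence of technical routes
- VLA and world models
- Spatial perception capability
- Future application scenarios
- Factories and logistics
- Household services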
Highlights
Robotics data is nearly blank—physical-world data is key to AI 2.0's next phase.
VLA and world models are not the endgame; a unique physical-world model will emerge.
Spatial perception is core to robot intelligence, starting with sensor input optimization.
2026-05-25 14:56:42 Source: QuantumBit
"Building an Android System for the Age of Robots"
The explosion of large models is fueled by the data dividends accumulated over decades on the internet.
But as AI transitions from the digital world to the physical world, Shen Yujun, Chief Scientist at Ant Lingbo Technology, discovered that the data for robots is almost entirely blank.
Previously, he proposed the concept of AIGA in a public speech — in the second half of AI 2.0, artificial intelligence should shift from the "entertainment" of the digital world to the "work" of the physical world, moving from Content (content generation) to Action (action generation).
At the GenAI Talk session of the 2026 China AIGC Industry Summit, Shen Yujun held an in-depth conversation on this topic with Li Gen, co-founder and chief editor of QuantumBit, titled *AI 2.0 Second Half: From AIGC to AIGA*.
It was precisely from the theme of "data" that he put forward a judgment that left those working on VLA and world models "slightly startled":
Neither standalone VLA nor world models will be the endgame of embodied intelligence.
Just as humans both integrate many kinds of information and predict what will happen next, intelligence requires the two abilities combined; neither can be missing.
What will they ultimately evolve into? Shen Yujun's current answer is — a model unique to the physical world.

To fully reflect Shen Yujun’s thinking, QuantumBit edited and organized the content of his speech without altering its original meaning, hoping to provide more inspiration.
The 2026 China AIGC Industry Summit, hosted by QuantumBit, brought together nearly 20 industry representatives for discussions. Over a thousand people attended in person and nearly 4 million watched the live stream online, and the event received widespread attention and coverage from mainstream media.
Core Points Summary
- Large models have taken advantage of the data dividends accumulated over decades on the internet, but there remains a significant gap in physical world data for robots. The key to the second half of AI lies in how data transitions from the digital world to the physical world.
- To build a general-purpose robot brain that exists in the physical world, one critical element is spatial perception capability. Understanding the world starts at the sensor input side, so turning sensor inputs into better information for the model is crucial.
- Regarding the technical debate between VLA and world models, first, no matter how the technology evolves, data is indispensable. Secondly, neither path leads to the final solution. When robot data accumulates to a certain extent, both paths will converge, giving rise to models uniquely suited to the physical world.
- Prediction: In 1–2 years, some benchmark examples will emerge where models are truly deployed; in 2–3 years, these examples will be replicated across industries; afterward, robots will attempt to enter the consumer market; eventually, they will become widely adopted in households.
- When everyone can generate data for robots, it will be the ChatGPT moment for embodied intelligence.
Below is the full transcript of the dialogue:
"Large Models Tapped Into Decades of Internet Data Dividends"
Li Gen: In the first half of AIGC, everyone was talking about anxiety, but looking further ahead, once the direction is clear, it's all about execution. Every year we hope to find a guest who has both academic insights and industrial practice, someone who understands both "know" and "know-how." Dr. Shen fits this profile perfectly. Let's go from past to present. AI 2.0 began with ChatGPT and evolved from writing and painting to coding. What do you think of this trajectory?

Shen Yujun: Large models started with the breakthrough of ChatGPT. Initially, people found it fun, later it became practical, especially with the recent surge in coding capabilities. As someone in the robotics field, I see that large models really capitalized on the data dividends accumulated over decades on the internet.
How so? The internet has accumulated massive amounts of text, images, and video, and this coincided with the growth of computing power. When the two converged, large models were able to make full use of the decades of data accumulated on the internet.
Looking further, autonomous driving has developed for nearly ten years, gradually accumulating its own data — from early days when cars had few sensors to now, where human driving data can be automatically recorded. In contrast, the robotics industry still faces a major data gap. We don’t have decades of internet accumulation or ten years of autonomous driving experience, and robot data is currently very scarce.
Some say AI has finally entered the second half, transitioning from the digital world to the physical world. We introduced the concept of AIGA (AI Generated Action). But at the core of building models is data. I believe the more important question is: how does data evolve? How can data truly transition from the digital world to the physical world?
Li Gen: So the data in the physical space is a blank slate and a new frontier?
Shen Yujun: Exactly. Over the past year, more and more robot body (hardware) manufacturers have appeared, and they have developed quite well. This year, a clear trend is emerging: various data collection methods are starting to appear. This shows that people are gradually realizing that embodied intelligence, both the embodiment and the intelligence, cannot exist without data. However, they haven't yet figured out what kind of data physical intelligence needs, or how to standardize it as much as possible.
Standardization is very important. Going back to the success of large models, a big part of it stems from the internet standardizing data. In fields like coding and dialogue, the internet has already done a great job, and now we're enjoying the benefits of that.
But in the physical world, looking at the diverse data collection methods today, while people realize the importance of data, they haven’t yet found the right path. I believe that in the near future, data will also gradually converge.
"AIGC Isn’t Enough — Models Need to Generate Productivity"
Li Gen: You proposed the AIGA application paradigm. Can you share more details? Why introduce AIGA?
Shen Yujun: It again comes from the perspective of model deployment. Initially, we started with Chat, then moved to Coding, and models are slowly shifting toward production. In the digital world, programming and content creation are excellent production directions. But we live in the physical world, and truly meaningful services require real-world interaction.
So AIGC alone may not be enough. At the end of the day, can intelligence actually solve specific problems? Everyone talks about Agents, which can help with many workflow issues and tool usage in the digital world. But many things that we can actually feel still require physical action. For example, I want a cup of coffee, which is a somewhat clichéd scenario.
Especially in embodied intelligence, we hope the model can produce not just content, but actual productivity.

Li Gen: What kind of imagination and practical scenarios does this productivity offer?
Shen Yujun: This has been a recurring discussion in the industry recently. For instance, robots have entered factories, logistics, and warehouses to move goods and sort items. We’ve also collaborated with body manufacturers to explore these scenarios. Recently, our robot even entered a pharmacy retail store.
If we reach the day when robots are highly advanced and intelligent, every aspect of life could be transformed. For example, before I came on stage, a staff member had to bring up a chair and wait nearby. If the timing changed, they had to keep waiting there.
If robots could handle such tasks, they could stand there without issue, know when to act, and do it well — freeing up human labor for more valuable work. I believe robots will gradually permeate all aspects of life.
Li Gen: So every job that requires a human to be present could potentially be taken over by robots?
Shen Yujun: Yes, I think it's more about freeing people for activities that require human skills — such as creativity and culture — rather than doing repetitive physical labor.
"Lingbo’s Position Is to Build a General Brain, Similar to a Mobile OS"
Li Gen: What is the technical positioning and strategy of Ant Lingbo?
Shen Yujun: Our positioning is clear: focus on the intelligence side. What does that mean?
I’ll give an analogy — perhaps not perfect — but similar to a mobile operating system. In our view, whether robots enter enterprises or homes, people’s hardware demands will vary greatly. It’s impossible to have just one unified robot. Just like phones — Huawei, Xiaomi, Apple — each has its preferences, reflecting personal needs. Enterprises are even more diverse: some need strong robots, others agile ones.
But all these robots share one common demand: intelligence. Intelligence isn’t like industrial robots that follow fixed trajectories and perform fixed tasks at fixed times. Life is full of randomness, and intelligence means being able to handle such variability. For example, if the conference time changes, can the robot know when to bring the chair? That’s a simple case.
So our positioning is clear: we aim to create a relatively universal “brain,” allowing all robots to perform tasks better under this brain.

Compared to the digital world, the physical world has two advantages.
First, there are definitely more modalities: hearing, temperature, touch, and so on. These modalities are hard to obtain in the digital world, but that doesn’t mean they’re unimportant. Often, intelligence comes from combining more and more modalities. Today’s multimodal models in the digital world still rely mostly on text, images, video, and sound. Can they sense "force"? Not really. Because the physical world offers richer modalities, it might foster stronger intelligence.
Second, the physical world provides real feedback. In the digital world, tasks are often artificially defined — humans set standards, expecting certain outputs from models. But in the physical world, many things are defined by nature: for example, an apple falls when released, without needing anyone to define it — it’s a natural law. With connections to the physical world, intelligence might learn directly from reality, even surpassing manually defined loss or reward functions.
With these two advantages, the potential for physical intelligence is huge. Of course, there are too many variables now, and too much needs verification. Many factors are interlinked, so the industry will initially split into factions, but it will eventually converge.
"VLA and World Models Are Not the Endgame — There Will Be Models Unique to the Physical World"
Li Gen: To summarize, Lingbo complements body manufacturers like Unitree by providing brains/OS; physical AI may give rise to more fundamental intelligence. Now there are different views on routes, such as VLA and world models. What’s your take?
Shen Yujun: Before addressing VLA, let me first state my overall judgment on the technical path of embodied intelligence.
People often discuss how to fuse modalities, mainly through two paths: VLA and world models. But I want to highlight another point — Lingbo aims to build a general brain, one that exists in the physical world. One key but rarely mentioned aspect in our strategy is: spatial perception ability, also known as spatial intelligence.
Robots live in the physical world, and their inputs come from various sensors, not just text or photos. In the physical world these include depth, distance, and force sensors. Turning these inputs into effective information is a vital part of embodied intelligence. Currently, people focus more on the core model and often overlook the input side, that is, how to turn sensor inputs into better information for the model.
My view is: no matter how the core model’s technical path evolves, better understanding of the world from the sensor input side is crucial. In building our embodied brain, a key part is mastering spatial perception from the input side.
Now back to the core part that people love to discuss. At the beginning of this year, we shared some thoughts: we’ve explored both paths, VLA and VA (now called WAM, World Action Model).
My impression is: First, the core is still data. No matter how the paradigm changes, understanding the data is the key capability. People now talk about data volume: tens of thousands, hundreds of thousands, even millions of hours. But talking only about quantity without quality is unscientific. What makes data good? That is the key question.
In our past work with VLA, a key task was getting the data pipeline right: how to process the data and whether to feed it to the model. This is the core chain. No matter how the technology evolves, data is unavoidable.
Second, we've explored both paths. My judgment is: neither path is the endgame. Why? Because VLA and world models solve different problems. VLA excels in human-robot interaction, extending from multi-modal models into the physical world. World models resemble video generation models applied to the physical world, excelling in future prediction.

I believe humans possess both abilities: integrating various information and predicting the future. Robots need both — they can’t just predict the future without fusing modalities, nor can they just fuse modalities without predicting the future.
In my view, VLA is easier to implement and more efficient in industry, so more people are pursuing it. But if world models can truly predict the future, they will definitely help robots. I believe that when robot data accumulates sufficiently, these two paths will deeply merge.
This fusion won’t be like today — applying digital models to physical applications — but rather could give rise to models unique to the physical world. This model is designed from the start based on more physical world modalities, tailored specifically for robot applications. It might not chat with humans, but it can execute tasks better.
To summarize in three points: First, physical intelligence depends on spatial perception built from sensor inputs. Lingbo will start from the input side to help robots better understand the world. Second, regardless of how the technology evolves, data is unavoidable; we must understand the data robots need and even push for standards. Third, the technical paths currently under discussion are not the endgame; there will definitely be models unique to the physical world in the future.
"When Everyone Can Generate Data for Robots, It Will Be the ChatGPT Moment for Embodied Intelligence"
Li Gen: Thank you, Dr. Shen, for your straightforward answers. So, what is the development timeline and milestones for the embodied brain?
Shen Yujun: In the short term, there are several changes. First, hardware will increasingly converge — not in form, but in supply chains, becoming modular and less coupled. Hardware and sensors will become more standardized. Second, data standards will also converge.
Once these two converge, debates around technical approaches at the model level will intensify. With the first two settled, the remaining variable is modeling. After a period of model debates, paradigms may also converge. Once models converge, they will drive hardware upgrades: new hardware designed specifically for embodied intelligence rather than carried over from the previous generation. The cycle runs like this: hardware fluctuates, converges, and then models iterate. This is something to look forward to.
From an industry implementation perspective, there are also some things to expect. Between this year and next, there will be landmark cases where models are truly deployed in production, no longer just demos but real commercial applications. In 2–3 years, these cases will be replicated in batches, and more industries will start adopting models. Later on, robots will attempt to enter the consumer market in some way. Perhaps not everything can be done yet, but a foothold can be found. Gradually, they'll become as common in households as electric vehicles are today.
Li Gen: When will we see the “ChatGPT moment” for embodied intelligence?
Shen Yujun: The training of large models is a continuous process—from GPT 1.0, 2.0 to 3.0. But why did ChatGPT become a milestone? Because it truly entered every household, accessible and experienceable by everyone. If we draw a parallel to embodied intelligence, when can most people participate in it? That’s what I consider the ChatGPT moment for embodied intelligence.
Participation has two levels. The most intuitive understanding is that embodied intelligence becomes accessible to everyone—which may still be quite far off. But before that, there's another stage: the data phase. Just like how people now drive cars and contribute human driving experiences to autonomous driving systems.
When will we have a data standard such that our daily behaviors can serve as training data for robots? When everyone can generate data for robots, in my view, that’s the ChatGPT moment for embodied intelligence.

Li Gen: How long do you think it will take?
Shen Yujun: There are already many companies working on data, though different schools of thought exist. The next one or two years will likely be a period of adjustment between model-focused companies and data-focused ones. That is because data standards must be defined by models, while the demands of models in turn require hardware iterations to catch up. After about a year or two of adjustment, around 2028, we can expect everyone to become a data provider for embodied intelligence.
From that point onward, the pace of embodied intelligence will accelerate.
Li Gen: Will operating systems similar to Android and iOS emerge at the same time?
Shen Yujun: Yes, this distinction already exists now. Lingbo follows a general brain approach, while some companies like Tesla build both their own body and their own brain—the model is specifically designed for their own robot. Robots are like smartphones; people won’t all use the same model because personalization is always a factor. So we firmly believe in the universal brain model.
Li Gen: So does Ant Lingbo hope to become the Android of the robotics era?
Shen Yujun: Yes, that’s part of our vision.
Li Gen: To summarize, the second half of AI 2.0 unfolds with the exploration of the physical world, shifting paradigms from AIGC to AIGA. Data is key, and technical paths are converging. Around 2028, we might see convergence in embodied brains. Ant Lingbo aims to be the Android within this ecosystem. Thank you, Mr. Shen!
Shen Yujun: Thank you!