CVPR 2026: NVIDIA, Tesla, and Waymo Gather to Hear a Chinese Company Discuss Physical AI

QbitAI · June 4, 2026

TL;DR · AI Summary

XPeng unveiled its Physical AI foundation model at CVPR 2026, achieving co-evolution of dense physical prediction and sparse human intent by fusing Gen-2 VLA with World Models. Leveraging X-World tech and self-developed chips, it reduced vehicle inference latency to 80ms, validating Autonomous Driving Scaling Laws.

Outline

§Physical AI Foundation Architecture
XPeng proposes a co-evolutionary architecture of Gen-2 VLA and World Models, combining sparse human intent with dense physical signals for controllability.
·Three Core World Model Capabilities
X-World, X-Foresight, and X-Cache enable controllable generation, causal prediction, and inference acceleration for long-horizon autonomous driving.
·Scaling Law & Engineering Deployment
XPeng's GPU utilization reached 90% and single-task training efficiency surged 4360%, confirming Scaling Laws apply to Physical AI.
§Hardware-Software Co-optimization
Deep co-design of Turing Chip and compilers boosted vehicle compute utilization from 22.8% to 82.5%, compressing latency to 80ms.
·Data Advantage of Vision-Only Route
Cameras provide billions of data points per second versus LiDAR's millions, better supporting massive data-driven Physical AI understanding.

Mindmap

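XPeng Physical AI Foundation Model
- Dual-model co-evolution architecture
  - Second-generation VLA (sparse intent)
  - World model (dense physical prediction)
- Core technical components
  - X-World (controllable generation)
  - X-Foresight (causal prediction)
  - X-Cache (inference acceleration)
- Engineering validation
  - Turing chip hardware-software co-design
  - Scaling-law data scale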

Highlights

World Models receive denser supervision than human actions: every frame and motion serves as a training signal, essentially adopting the 'next token prediction' paradigm.
— Core Mechanism Section
X-Cache reduces redundant computation by ~70% without quality loss, achieving up to 2.7x inference acceleration for the World Model denoising backbone.
— Tech Stack Section
The full self-developed stack (Gen-2 VLA + Turing Chip) spikes compute utilization to 82.5% and cuts latency to 80ms, far exceeding the 22.8% of generic solutions.
— Engineering Section
In the year ending March, XPeng's single-GPU training efficiency rose 1010%, and single-task training efficiency surged 4360%.
— Scaling Law Section

Tags: Physical AI, World Model, VLA, XPeng, Autonomous Driving

2026-06-04 19:56:35 Source: QbitAI

First to Achieve a Closed-Loop Flywheel for Physical AI

By Jia Haonan | QbitAI

Physical AI—the hottest concept in the 2026 AI landscape!

Autonomous driving companies are talking about it, automakers are discussing it, large model players are exploring it, and investors are betting on it...

As the concept becomes a consensus, the real dividing line begins to emerge: Who is the first to present a complete technology stack, papers, and code, validated on production vehicles already running on public roads?

At this year's CVPR, during the inaugural "Workshop on Foundation Models for Embodied Intelligence Deployment," clarity finally began to emerge from the chaos.

The session was packed with top-tier players in the field: Tesla, NVIDIA, Waymo, and the only invited Chinese company—XPeng.

Fred Lambert, Editor-in-Chief of the leading US EV media outlet Electrek, noted even before CVPR 2026 began that XPeng’s Liu Xianming and Tesla’s Ashok Elluswamy would be sharing the stage at this premier global conference to present their technical achievements.

Countless participants discuss cutting-edge AI topics at top conferences, but few command the focused attention of industry and academic giants like Tesla, Waymo, and NVIDIA.

XPeng is one of those few.

Sharing the Stage with NVIDIA, Tesla, and Waymo: What Did XPeng Present?

While this was the first "Workshop on Foundation Models for Embodied Intelligence Deployment" at CVPR, it marked the seventh edition of the "Embodied AI Workshop" series.

These forums typically feature invited talks by leading experts from academia and industry, sharing the latest research findings and forward-looking insights. This year's participants included Waymo, Tesla, NVIDIA, and others—representing the global first tier of Physical AI.

△ Counting from the left: Liu Xianming, Head of XPeng Group's General Intelligence Center (3rd); Ashok Elluswamy, VP of AI Software at Tesla (5th); Dragomir Anguelov, VP at Waymo (6th).

XPeng was represented by Liu Xianming, current Head of XPeng's General Intelligence Center.

This marks XPeng's third invitation to speak at CVPR. However, unlike the previous two occasions, this was the first time XPeng comprehensively showcased its world model technology roadmap.

△ XPeng's Physical World Foundation Model Technology Roadmap

Based on a series of recently published academic papers—including X-World, X-Foresight, and X-Cache—the presentation systematically analyzed XPeng's world model technology.

The talk first laid out the core effort: XPeng is developing a world model capable of proactive reasoning, controllable generation, and long-horizon inference. Together with the second-generation VLA (Vision-Language-Action) model, this world model forms XPeng's Physical AI foundation model.

The two evolve synergistically through different training signals.

Human actions contain rich high-level semantics, implicitly encoding perception, reasoning, intent, risk assessment, social interaction, and understanding of the physical world.

However, such supervisory signals are relatively sparse temporally; they typically supervise only the final behavioral outcome, making it difficult to cover every potential physical state transition leading to that behavior.

In contrast, the world model learns directly from the world itself. It predicts not just the next action, but also future states, future observations, or future representations in latent space.

Consequently, the world model receives significantly denser supervisory signals: every frame, every motion, and every interaction can serve as a training signal. Essentially adopting the "next token prediction" paradigm from Large Language Models (LLMs), it gradually learns the dynamics and causal structure of the physical world through dense prediction of the next frame or state on massive amounts of unlabeled video data.
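The density gap described above can be made concrete with a small counting sketch. It is illustrative only: the clip length, the number of visual tokens per frame, and the action dimensionality are hypothetical values chosen for the example, not figures from XPeng.

```python
from dataclasses import dataclass


@dataclass
class ClipSpec:
    """Hypothetical description of one driving clip (all values are assumptions)."""
    frames: int            # number of video frames in the clip
    tokens_per_frame: int  # visual tokens produced by tokenizing one frame
    action_steps: int      # number of labeled driver actions in the clip
    action_dims: int       # scalars per action, e.g. steering and acceleration


def action_targets(clip: ClipSpec) -> int:
    """Supervised scalars available from human actions alone."""
    return clip.action_steps * clip.action_dims


def next_frame_targets(clip: ClipSpec) -> int:
    """Supervised tokens when every frame after the first is a prediction target."""
    return (clip.frames - 1) * clip.tokens_per_frame


clip = ClipSpec(frames=100, tokens_per_frame=256, action_steps=100, action_dims=2)
sparse = action_targets(clip)
dense = next_frame_targets(clip)
print(f"action targets: {sparse}")
print(f"next-frame targets: {dense}")
print(f"dense/sparse ratio: {dense / sparse:.1f}x")
```

Under these assumed numbers the same clip yields two orders of magnitude more prediction targets for the world model than for action imitation, which is the sense in which the supervision is "denser."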

In practical engineering deployment, conventional VLA and world model approaches are often seen as opposing paths. XPeng’s strategy, however, combines sparse human intent with dense physical prediction, enabling the model to learn not only "what a human driver would do" but also to deeply understand "what will happen next in the physical world."

This parallel evolution toward dual objectives ensures system controllability and safety in complex environments, while endowing autonomous driving systems with deeper physical perception and logical reasoning capabilities.

"Should we pursue the VLA path or the world model path?" Liu Xianming answered: XPeng's Physical World Foundation Model is both a second-generation VLA and a world model.

Returning to Physical AI: truly learning about the objective world requires two things. The first is grounding in real-world physical laws, causal logic, and long-horizon inference. The second is repeated practice in virtual worlds to validate strategies, handle long-tail scenarios, and achieve closed-loop optimization.

The co-evolution of world models and VLAs essentially represents a generalized data-driven system: extracting model intelligence from larger-scale, high-quality data, encompassing both understanding of human behavior and knowledge of the world.

For AI to truly act in the physical world, it needs two things: to know "how to act," and to understand "how the world changes after an action" so that it can adjust its action strategy to those potential changes. These are the respective responsibilities of XPeng's second-generation VLA and its world model.

"How to act" was the theme of Liu Xianming's CVPR presentation last year, where he introduced the architecture and training methodology of XPeng's second-generation VLA.

"How the world changes after an action" is precisely this year's topic: how XPeng develops its world model. XPeng's world model can also be understood through several recent key papers from the team.

How to Enable AI to Understand Environment, Spatiotemporal Dynamics, and Causality?

Liu Xianming believes that an excellent world model must possess three core capabilities: proactive reasoning, controllable generation, and long-horizon inference. These embody intelligence and are prerequisites for applying world models in autonomous driving. Several technical reports recently released by XPeng's R&D team correspond directly to these key capabilities.

X-World is a controllable multi-view generative world model built on video diffusion generation technology. Given specific actions, it generates future videos that adhere to physical constraints while maintaining strong controllability and stability throughout continuous generation. It has been deployed in XPeng's R&D processes, including closed-loop simulation testing, online reinforcement learning, and data generation.
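To illustrate what "closed-loop simulation" with an action-conditioned world model means, here is a minimal sketch. It is not X-World: the learned video generator is replaced by a toy kinematic function, and `toy_world_model`, `toy_policy`, and all constants are hypothetical stand-ins. The point is only the loop structure, in which each action is fed to the world model and the predicted result is fed back to the policy.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]  # (position in m, speed in m/s)


def toy_world_model(state: State, accel: float, dt: float = 0.1) -> State:
    """Stand-in for a learned world model: predict the next state given an action.

    A generative world model would output future camera views; simple
    kinematics is used here so the example stays runnable.
    """
    pos, speed = state
    new_speed = max(0.0, speed + accel * dt)
    return (pos + new_speed * dt, new_speed)


def toy_policy(state: State, target_speed: float = 10.0) -> float:
    """Stand-in for the driving policy: proportional control with clipped output."""
    _, speed = state
    return max(-3.0, min(2.0, 0.5 * (target_speed - speed)))


def closed_loop_rollout(
    policy: Callable[[State], float],
    world_model: Callable[[State, float], State],
    start: State,
    steps: int,
) -> List[State]:
    """Alternate policy and world model so each action shapes the next observation."""
    states = [start]
    for _ in range(steps):
        action = policy(states[-1])
        states.append(world_model(states[-1], action))
    return states


trajectory = closed_loop_rollout(toy_policy, toy_world_model, (0.0, 0.0), steps=200)
final_pos, final_speed = trajectory[-1]
print(f"after {len(trajectory) - 1} steps: position {final_pos:.1f} m, speed {final_speed:.2f} m/s")
```

Because the policy's own actions determine what it sees next, errors compound over the rollout, which is why the stability of continuous generation matters for this kind of testing.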

X-Foresight is a vision-action causal prediction network based on a predictive world model. Architecturally integrated with the VLA, X-Foresight jointly predicts future multi-view imagery and ego-vehicle actions within a unified token space, providing core support for VLA vehicle control decisions. Its predictive decision-making logic compels the model to "understand the world," mastering the movement patterns of vehicles and pedestrians and the causal chains of scenarios.
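One common way to predict images and actions "within a unified token space" is to give both modalities disjoint ID ranges in a single vocabulary and train next-token prediction over the interleaved sequence. The sketch below shows only that bookkeeping. It is an assumption about the general technique, not a description of X-Foresight's actual tokenizer, and the vocabulary sizes and helper names are hypothetical.

```python
from typing import List, Tuple

IMAGE_VOCAB = 1024  # hypothetical number of visual codebook entries
ACTION_VOCAB = 64   # hypothetical number of discretized action bins


def action_token(bin_index: int) -> int:
    """Map an action bin into the shared vocabulary, after the image token IDs."""
    if not 0 <= bin_index < ACTION_VOCAB:
        raise ValueError("action bin out of range")
    return IMAGE_VOCAB + bin_index


def interleave(frames: List[List[int]], action_bins: List[int]) -> List[int]:
    """Build one sequence: frame_0 tokens, action_0, frame_1 tokens, action_1, ..."""
    if len(frames) != len(action_bins):
        raise ValueError("need exactly one action per frame")
    sequence: List[int] = []
    for frame_tokens, bin_index in zip(frames, action_bins):
        sequence.extend(frame_tokens)
        sequence.append(action_token(bin_index))
    return sequence


def next_token_pairs(sequence: List[int]) -> List[Tuple[int, int]]:
    """(input, target) pairs for next-token prediction over the joint sequence."""
    return list(zip(sequence[:-1], sequence[1:]))


frames = [[5, 17, 900], [6, 18, 901]]  # toy visual tokens for two frames
actions = [3, 4]                       # toy action bins, one per frame
sequence = interleave(frames, actions)
pairs = next_token_pairs(sequence)
action_targets = sum(1 for _, target in pairs if target >= IMAGE_VOCAB)
image_targets = len(pairs) - action_targets
print(sequence)
print(f"image targets: {image_targets}, action targets: {action_targets}")
```

With this layout a single model and a single loss cover both "what will the cameras see" and "what should the vehicle do," so the action prediction is conditioned on the same representation that has to explain the scene.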

X-Cache is a cross-segment block-level cache designed for few-step autoregressive world models. It reduces redundant computation by approximately 70% with negligible loss in image quality, achieving up to ~2.7x inference acceleration for the denoising backbone of the world model.
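The general idea of block-level caching can be sketched as follows: if a block's input has barely changed since the last time it ran, reuse the stored output instead of recomputing. This is a generic illustration, not the X-Cache algorithm; the similarity test, the threshold, and the toy "blocks" are all assumptions. The `ideal_speedup` helper adds a simple bound: if exactly 70% of the backbone compute were skipped at zero overhead, the speedup would be 1 / (1 - 0.7), about 3.3x, so a reported figure of up to ~2.7x would be consistent with some remaining cost for cache checks and uncached work.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def ideal_speedup(skipped_fraction: float) -> float:
    """Upper bound on speedup when a fraction of the work is skipped at zero cost."""
    if not 0.0 <= skipped_fraction < 1.0:
        raise ValueError("skipped_fraction must be in [0, 1)")
    return 1.0 / (1.0 - skipped_fraction)


def relative_change(new: Vector, old: Vector) -> float:
    """Euclidean distance between inputs, relative to the norm of the old input."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, old)))
    norm = math.sqrt(sum(b * b for b in old))
    return diff / (norm + 1e-8)


class BlockCache:
    """Reuse a block's stored output when its input is close to the cached input."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # hypothetical similarity threshold
        self.inputs: Dict[int, Vector] = {}
        self.outputs: Dict[int, Vector] = {}
        self.hits = 0
        self.misses = 0

    def run(self, block_id: int, block_fn: Callable[[Vector], Vector], x: Vector) -> Vector:
        old = self.inputs.get(block_id)
        if old is not None and relative_change(x, old) < self.threshold:
            self.hits += 1
            return self.outputs[block_id]
        y = block_fn(x)  # cache miss: compute and store
        self.inputs[block_id] = list(x)
        self.outputs[block_id] = y
        self.misses += 1
        return y


# Toy "backbone" of three blocks; real blocks would be transformer layers.
blocks = [
    lambda v: [math.tanh(t) for t in v],
    lambda v: [math.sin(t) for t in v],
    lambda v: [math.cos(t) for t in v],
]
cache = BlockCache(threshold=0.05)
base = [-1.0 + 2.0 * i / 15 for i in range(16)]
for segment in range(4):
    # Consecutive segments differ only slightly, mimicking redundant computation.
    x = [t * (1.0 + 0.001 * segment) for t in base]
    for block_id, block_fn in enumerate(blocks):
        x = cache.run(block_id, block_fn, x)

print(f"hits: {cache.hits}, misses: {cache.misses}")  # hits: 9, misses: 3
print(f"ideal speedup at 70% skipped: {ideal_speedup(0.7):.2f}x")
```

The trade-off in any such scheme is between the threshold (how much approximation is tolerated) and output quality, which is why the quality-loss claim and the speedup claim are reported together.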

Liu Xianming also revealed that a paper titled "X-mind" will be published soon, analyzing how the model performs "proactive reasoning" and visually presenting the intermediate reasoning process behind driving decisions. Interpretability is crucial for debugging autonomous driving software performance, building user trust, and enabling rapid model iteration.

Behind these architectural innovations, scaling laws remain applicable to Physical AI, and the dividends of scale are just beginning.

Over the past year, XPeng has continuously iterated across three core dimensions—models, computing power, and data—constantly pushing the performance limits of its foundation models.

Currently, XPeng's second-generation VLA model has reached billions of parameters, trained on over 100 million video clips. Total training tokens per model version have exceeded 4 trillion, placing its data and model scale firmly in the industry's top tier.

Data provided by XPeng shows that in the year ending March, training efficiency per GPU in XPeng's intelligent computing cluster increased by 1,010%, single-task training efficiency surged by 4,360%, and GPU hardware utilization rose from 40% to 90%, matching the standards of leading domestic AI companies.
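Percentage increases of this size are easier to read as multipliers. The short calculation below converts the quoted figures, assuming "increased by N%" means the new value is (1 + N/100) times the old one; the function name is only illustrative.

```python
def increase_to_multiplier(percent_increase: float) -> float:
    """Convert 'increased by N%' into a new/old ratio."""
    return 1.0 + percent_increase / 100.0


per_gpu = increase_to_multiplier(1010)   # training efficiency per GPU
per_task = increase_to_multiplier(4360)  # single-task training efficiency
utilization_gain = 90 / 40               # GPU utilization, 40% -> 90%

print(f"per-GPU efficiency: {per_gpu:.1f}x")
print(f"single-task efficiency: {per_task:.1f}x")
print(f"GPU utilization: {utilization_gain:.2f}x")
```

Read this way, the figures correspond to roughly an 11x gain per GPU, a 45x gain per task, and a 2.25x gain in hardware utilization.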

Beyond cloud computing, XPeng has also maximized the utilization of in-vehicle computing power.

Through deep hardware-software co-design spanning chips, compilers, and models, XPeng has unlocked far more of its in-vehicle computing resources, boosting overall in-vehicle model inference speed by 12x.

Three sets of comparative data disclosed by Liu Xianming vividly demonstrate the overwhelming advantage of this proprietary system:

• Generic chips + open-source models: compute utilization is only 22.8%, with inference latency as high as 800 ms;

• XPeng's proprietary Turing chip + open-source models: compute utilization rises to 35.1%, with latency reduced to 300 ms;

• Full proprietary stack (second-generation VLA model + Turing chip): compute utilization reaches 82.5% and inference latency drops to just 80 ms.
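The three configurations can be compared directly against the generic baseline. The sketch below uses only the figures quoted above; the `max_rate_hz` column additionally assumes inferences run strictly one after another, which the source does not state. It does not attempt to reproduce the separate 12x inference-speed figure, whose baseline is not given.

```python
from typing import Dict, List, NamedTuple


class StackResult(NamedTuple):
    name: str
    utilization_pct: float
    latency_ms: float


# Figures as quoted in the talk.
RESULTS = [
    StackResult("generic chip + open-source model", 22.8, 800.0),
    StackResult("Turing chip + open-source model", 35.1, 300.0),
    StackResult("Turing chip + second-generation VLA", 82.5, 80.0),
]


def compare(results: List[StackResult], baseline_index: int = 0) -> List[Dict[str, object]]:
    """Express each configuration relative to the chosen baseline."""
    base = results[baseline_index]
    rows: List[Dict[str, object]] = []
    for r in results:
        rows.append({
            "name": r.name,
            "utilization_gain": r.utilization_pct / base.utilization_pct,
            "latency_speedup": base.latency_ms / r.latency_ms,
            "max_rate_hz": 1000.0 / r.latency_ms,  # assumes sequential inference
        })
    return rows


rows = compare(RESULTS)
for row in rows:
    print(
        f"{row['name']:<38} "
        f"utilization x{row['utilization_gain']:.2f}  "
        f"latency speedup x{row['latency_speedup']:.2f}  "
        f"up to {row['max_rate_hz']:.1f} inferences/s"
    )
```

On these numbers the full stack is 10x faster in latency than the generic baseline while using about 3.6x more of the available compute, and most of that gain comes from pairing the chip with the in-house model rather than from the chip alone.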

The mass-production performance of the second-generation VLA serves as the best proof of the Scaling Law for autonomous driving.

In the first month after the official rollout of the second-generation VLA, assisted driving mileage accounted for over 50% of total mileage for equipped vehicles. Advanced intelligent driving is transitioning from an "optional feature" to a frequent necessity.

Behind every software upgrade lies XPeng's rapid model iteration. XPeng Group previously revealed that from November last year to March this year, the R&D team iterated an average of four model versions per day. In the AI era, "speed" itself is a core competitive advantage.

Everyone Talks About Physical AI: What Makes XPeng Different?

First, on the path from L2 to L4, XPeng was the first to present a complete technical roadmap.

In-vehicle AI has entered a phase of competition based on "model intelligence," rather than merely comparing parameter counts or piling up in-vehicle hardware.

For instance, constrained by physical limits, LiDAR operates at lower frequencies with additional latency overhead, generating only millions of data points per second.

Cameras, conversely, respond faster with significantly higher frequencies, producing billions of rich visual information points per second.

Thus, while LiDAR has a lower processing threshold, it suffers from poor long-range accuracy and susceptibility to false positives. Cameras require massive computing power to process vast amounts of data—but given sufficient compute, pure vision solutions far surpass LiDAR in upper-limit capability.
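A back-of-the-envelope calculation shows how the "millions versus billions" gap can arise. Every sensor specification below is a hypothetical round number chosen for illustration, not the configuration of any XPeng or Tesla vehicle, and a pixel sample and a LiDAR point are not equivalent units of information.

```python
def camera_samples_per_second(num_cameras: int, megapixels: float, fps: float) -> float:
    """Raw pixel samples per second across a camera rig."""
    return num_cameras * megapixels * 1_000_000 * fps


# Hypothetical sensor specifications (assumptions, not vendor data).
ASSUMED_CAMERAS = 8
ASSUMED_MEGAPIXELS = 8.0
ASSUMED_FPS = 30.0
ASSUMED_LIDAR_POINTS_PER_SECOND = 1_500_000.0

camera_rate = camera_samples_per_second(ASSUMED_CAMERAS, ASSUMED_MEGAPIXELS, ASSUMED_FPS)
ratio = camera_rate / ASSUMED_LIDAR_POINTS_PER_SECOND

print(f"camera rig: {camera_rate:.2e} pixel samples/s")
print(f"LiDAR: {ASSUMED_LIDAR_POINTS_PER_SECOND:.2e} points/s")
print(f"ratio: about {ratio:.0f}x")
```

Under these assumptions the camera rig produces on the order of a thousand times more raw samples per second, which is also why the vision-only route depends on having enough compute to process them.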

This trend, of course, was driven by Tesla.

Elon Musk's insistence on pure vision is not fundamentally about a "sensor type debate," but rather "which data type better supports ultra-large-scale data-driven approaches"—this is the essence of the first principles of autonomous driving.

XPeng's second-generation VLA is a prime example of this approach. Rather than relying solely on more cameras or higher-compute chips to improve capability, it deploys a unified Physical World Foundation Model on physical-world terminals, combined with sufficient in-vehicle and cloud computing power, world models, and road-test data.

In terms of parameter scale, data types, and underlying architecture, XPeng's second-generation VLA comprehensively surpasses traditional autonomous driving models, demonstrating AI's ability to understand the physical world and solve corner cases more efficiently beyond real-world road-collected data.

On a deeper level, XPeng's full-stack world model technology system transcends autonomous driving.

It is not merely an autonomous driving model, but a unified Physical World Foundation Model.

The underlying logic of multimodal large models is universal: the question it addresses is not "how to drive," but "how to understand and predict a dynamically changing physical world."

In a sense, XPeng's world model isn't teaching AI to drive; it's teaching AI to "see and understand" the physical world—driving is just one specific manifestation of that understanding.

From smart cars to humanoid robots, this methodology possesses inherent cross-domain transferability.

While the industry still treats "Physical AI" as a marketing buzzword to attract capital, XPeng has taken the lead in establishing a closed-loop data flywheel for Physical AI.

This represents not only technological leadership but also a role in setting the terms of the Physical AI discussion:

• Ending the binary opposition between VLA and world model technical routes;

• Exploring effective technical pathways for scaling from L2 to L4;

• Pioneering methods to "extract world knowledge" in autonomous driving, then applying them to broader application scenarios.

XPeng is a regular at CVPR, having taken the stage at this premier global AI conference three times since 2023, a distinction virtually unique among global automakers.

This seemingly "misplaced" competitive approach explains precisely why XPeng is often categorized alongside tech companies rather than traditional automakers.

The data provides the answer: Technical prowess drives appeal. A survey shows that over 60% of car buyers rank "intelligent driving capability" and "technological leadership" among their top three factors in purchasing decisions.

They are choosing not just a mode of transportation, but a continuously evolving AI system with monthly OTA updates. From highway NGP to city NGP, from rule-driven VLAs to data-driven world models—every technological leap translates directly into enhanced driving experience and purchase confidence for users.

The showcase at CVPR 2026 serves as the latest validation: XPeng's technology brand is not built on marketing rhetoric, but on published papers, continuous OTA updates, and millions of kilometers of intelligent driving data. It now stands at the very forefront of Physical AI world models.

This represents not only a disruptive advantage in the smart vehicle competition but also enables cross-domain transfer to robotics and flying cars.

In 2023, XPeng made its debut at CVPR, presenting XNet—China's first mass-produced BEV perception architecture.

In 2025, XPeng took the stage for the second time. Liu Xianming, Head of World Foundation Models, presented a 72-billion-parameter foundation model, providing the industry's first validation that scaling laws remain effective for autonomous driving VLA models.

In 2026, at the CVPR "Workshop on Foundation Models for Embodied Intelligence Deployment," XPeng made its third appearance. The company presented its insights on VLA and world models, unveiled a complete technology stack comprising X-World, X-Foresight, and X-Cache, and shared mass-production validation data for its second-generation VLA.

Over four years, XPeng has progressed from engineering practice to theoretical breakthroughs, and finally to mass production. This production validation, in turn, provides real-world feedback data essential for the scalable deployment of world models.

It is this sustained continuity that forms XPeng's truly insurmountable competitive moat.

This consistent accumulation over time has enabled XPeng's remarkable leap from an "EV startup" to a "Physical AI company":

The goal is no longer just to build an AI for a single car, but to create a universal cognitive foundation for the physical world.