NVIDIA Introduces Cosmos 3: A Unified Multimodal Model for Physical AI

NVIDIA Developer · Video · June 1, 2026

Introducing NVIDIA Cosmos 3: Unified Multimodal Model for Physical AI

Score: 9.2

TL;DR · AI Summary

NVIDIA has launched Cosmos 3, which it presents as the first unified multimodal model to take language, video, sound, and action as both inputs and outputs. The model is built on a Mixture of Transformers architecture and is released openly, with weights available on Hugging Face. NVIDIA reports top scores across physical AI benchmarks, including Robo Lab, PiBench, and Vintage.

Key Takeaways

Cosmos 3 is the first omni-model combining language, video, audio, and action modalities.
The Super version delivers state-of-the-art accuracy in physical AI tasks; Nano is a lightweight version for edge devices.
Cosmos 3 ranks #1 in Robo Lab policy evaluation, PiBench, Vintage, and TA benchmarks.

Outline

§Release Context & Goal
NVIDIA introduces Cosmos 3 to accelerate the physical AI revolution by providing a unified foundation model for customization and deployment.
·Architectural Innovation
Built on a Mixture of Transformers architecture with two towers, an autoregressive tower on the left and a diffusion tower on the right, with support for vision-language-action models.
·Versions & Deployment
Two variants: Super (high accuracy) and Nano (lightweight, for edge devices). Weights are available on Hugging Face and code on GitHub.
·Performance & Benchmarks
Top-ranked across physical AI benchmarks including Robo Lab, PiBench, Vintage, and TA; first in open-source image-to-video generation.
·Open Ecosystem Support
Provides training scripts and datasets to empower developers to build downstream applications using the open model.
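
The dual-tower idea named under Architectural Innovation can be pictured with a toy sketch. This is not NVIDIA's implementation: the summary does not say which modalities each tower handles or how the towers exchange information, so the modality-to-tower mapping, the shared mixing step, and every name below (`Token`, `TOWER_FOR_MODALITY`, `shared_attention`, `mixture_layer`) are assumptions made purely for illustration. Scalars stand in for embedding vectors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    modality: str  # "language", "video", "sound", or "action"
    value: float   # scalar stand-in for an embedding vector


# ASSUMPTION: which modality goes to which tower is not stated in the source.
TOWER_FOR_MODALITY: Dict[str, str] = {
    "language": "autoregressive",
    "action": "autoregressive",
    "video": "diffusion",
    "sound": "diffusion",
}

# Each tower owns its own parameters; here a tower is just a toy function.
TOWERS: Dict[str, Callable[[float], float]] = {
    "autoregressive": lambda x: 2.0 * x,
    "diffusion": lambda x: x - 1.0,
}


def shared_attention(tokens: List[Token]) -> List[Token]:
    """Toy joint mixing step (ASSUMPTION): each token is averaged with the
    mean of the whole sequence, so modalities can see each other."""
    mean = sum(t.value for t in tokens) / len(tokens)
    return [Token(t.modality, 0.5 * t.value + 0.5 * mean) for t in tokens]


def mixture_layer(tokens: List[Token]) -> List[Token]:
    """Mix all tokens together, then route each one through the tower
    assigned to its modality, preserving sequence order."""
    mixed = shared_attention(tokens)
    return [
        Token(t.modality, TOWERS[TOWER_FOR_MODALITY[t.modality]](t.value))
        for t in mixed
    ]


if __name__ == "__main__":
    sequence = [Token("language", 2.0), Token("video", 4.0)]
    for token in mixture_layer(sequence):
        print(token)
```

The point of the sketch is only the routing pattern: one sequence, a shared mixing step, and separate per-tower parameters selected by modality.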

Mindmap

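NVIDIA Cosmos 3: Unified Multimodal Model for Physical AI
- Core architecture
  - Mixture of Transformers
  - Dual-tower design: autoregressive + diffusion
- Model variants
  - Super model: high-accuracy physical AI tasks
  - Nano model: deployment on edge devices
- Performance
  - Ranked first in Robo Lab policy evaluation
  - Tops the PiBench / Vintage / TA benchmarks
  - Ranked first in open-source image-to-video generation
- Open-source ecosystem
  - Open weights on Hugging Face
  - Example code and training scripts on GitHub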

Highlights

Cosmos 3 is the first omni-model integrating language, video, sound, and action inputs and outputs, using a novel Mixture of Transformers architecture that combines autoregressive and diffusion mechanisms.
— Paragraphs 0:27–0:46
Ranked #1 in Robo Lab policy evaluation and multiple physical AI benchmarks including PiBench, Vintage, and TA, demonstrating superior physical reasoning and generation capabilities.
— Paragraphs 1:58–2:05
NVIDIA offers Super and Nano versions: Super for high-performance AI tasks and Nano for edge devices, lowering deployment barriers for developers.
— Paragraphs 1:28–1:38

#NVIDIA · #Physical AI · #Multimodal Model · #Mixture of Transformers · #Open Source