NVIDIA and Tsinghua Propose Gamma-World: World Models Evolve from "Solo Play" to "Multi-Agent Coexistence"

QbitAI, May 30, 2026




2026-05-30 14:33:58 Source: QbitAI

Enabling World Models for Multi-Agent Interactive Simulation

By Yun Zhong, from Aofeisi
QbitAI | WeChat Official Account QbitAI

Current video-based world models have matured significantly under the single-agent setting.

However, multi-agent scenarios—where multiple players share a single evolving world—have long lacked systematic architectural solutions.

The issue is not insufficient compute power, but rather that existing positional encoding schemes and attention mechanisms were never designed with multiple agents in mind.

Recently, NVIDIA, in collaboration with Tsinghua University, the University of Toronto, and the Vector Institute, introduced Gamma-World (γ-World), tackling this challenge at the foundational level by rethinking two core components: RoPE extension and attention topology.

Paper Title: Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Why Multi-Agent World Modeling Is a Hard Problem

Existing video world models are almost universally built upon the single-agent assumption:

Given an action sequence from one player, predict future observations from that agent’s perspective.

The multi-agent setting fundamentally alters the problem: the model must no longer merely predict *“what this agent will see next,”* but simultaneously answer:

How should Player A’s movement appear in Player B’s field of view? When two players interact with the same object simultaneously, how should its state evolve?

This is not a matter of generating *N independent video sequences*, but rather generating *N coupled perspectives projecting the same evolving world*.

Technically, this requires the model to maintain three types of consistency simultaneously:

Temporal consistency: frames remain coherent over time;
Cross-view consistency: Player A’s appearance in Player B’s view aligns with A’s own trajectory;
Interaction consistency: operations performed by multiple agents on the shared environment produce consistent state changes across all views.

The single-agent framework was designed only to guarantee temporal consistency; the latter two were never considered—this represents a structural deficiency at the architectural level, which cannot be resolved by scaling up data or model size.

Prior to Gamma-World, attempts had been made in this direction.

Solaris achieved promising results on two-player Minecraft, yet it exposed two structural limitations that precisely illustrate why naively extending single-agent frameworks to multi-agent settings is fundamentally unworkable.

First, identity encoding breaks symmetry.

Solaris assigns fixed, learnable slot-based identity vectors to each player, effectively training “Slot 1” and “Slot 2” as two distinct role types.

In real multi-agent worlds, players with identical capabilities are inherently interchangeable. This loss of symmetry causes the model to learn *interaction patterns specific to particular roles*, rather than *general laws governing multiple equal agents sharing a world*. Generalization is thus fundamentally limited, and supporting new numbers of players necessitates full retraining.

Second, fully connected attention hits a scalability ceiling.

Allowing all players' tokens to interact pairwise makes computational cost grow quadratically with the number of players: expanding from 2 to 8 players increases computation from 477.8G to 7.6T FLOPs, a roughly 16× increase.

This ceiling is dictated by algorithmic complexity and cannot be overcome via engineering optimizations.
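The 16× figure follows directly from quadratic scaling. The short check below is only an arithmetic sketch: it assumes the number of tokens per player stays fixed, and it uses nothing beyond the two FLOP figures quoted in the article.

```python
# Quadratic scaling check using the two figures quoted in the article.
base_players = 2
base_flops = 477.8e9        # 477.8 GFLOPs at 2 players
target_players = 8

# With a fixed number of tokens per player, dense attention cost
# scales with the square of the player count.
ratio = (target_players / base_players) ** 2
predicted_flops = base_flops * ratio

print(ratio)                              # 16.0
print(round(predicted_flops / 1e12, 2))   # 7.64 (TFLOPs), matching the quoted ~7.6T
```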

Both issues point to the same conclusion: multi-agent world modeling demands not incremental fixes, but a complete redesign of two core components—how agent identities are represented, and how inter-agent communication is structured.

Core Design I: Simplex Rotary Agent Encoding — Ensuring Agents Are “Equidistant and Equal”

The central conflict this design resolves is:

How can the model distinguish different players without privileging any one player in representation?

Video Transformers use RoPE (Rotary Position Embedding) to encode spatial-temporal relationships—assigning each token a rotation angle, where relative positions are expressed via angular differences.
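The "relative positions via angular differences" property can be checked on a single 2-D rotation pair. This is a minimal sketch of the general RoPE idea, not Gamma-World's implementation; the vectors and angles are arbitrary illustrative values.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.5)   # arbitrary query/key pair

# Two different absolute positions with the same angular offset (0.3 rad).
score_a = dot(rotate(q, 0.9), rotate(k, 0.6))
score_b = dot(rotate(q, 2.3), rotate(k, 2.0))

# The attention score depends only on the angle difference.
print(math.isclose(score_a, score_b))   # True
```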

Standard video RoPE encodes three axes: time, height, and width.

Gamma-World adds a fourth axis—the player axis—dedicating a dimension solely to agent identity, without altering the original spatiotemporal encoding.

Adding an axis is easy; designing the encoding on this new axis is hard.

Direct indexing fails.

Assigning angles sequentially (e.g., Player 1 = 0°, Player 2 = 1°, Player 3 = 2°) results in unequal rotational distances between pairs: distance(1,2)=1, distance(1,3)=2.

Although the players are physically equivalent, the geometric relationship between "Player 1 & 2" and "Player 1 & 3" becomes asymmetric in representation space: permutation symmetry is broken by the encoding itself.

Learnable slot embeddings also fail.

Binding fixed trainable vectors to fixed seats locks the model to the exact number of players seen during training, preventing scalability—this is precisely Solaris’s core limitation.

The Regular Simplex: Naturally Equidistant Agents

Gamma-World’s solution is elegant: place all players at the vertices of a regular simplex.

What does this mean?

Imagine an equilateral triangle—all vertex-to-vertex distances are identical, with no vertex being special.

2 players → endpoints of a line segment
3 players → vertices of an equilateral triangle
4 players → vertices of a regular tetrahedron

No matter which two players are selected, their angular distance in the rotation space remains identical. The model perceives any pair of agents symmetrically—no agent is privileged.
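The contrast between direct indexing and simplex placement can be verified numerically. The sketch below uses one standard construction of a regular simplex (centered standard basis vectors). How Gamma-World maps vertices to rotation angles is not described in the article, so this only illustrates the equidistance property itself.

```python
import itertools
import math

def simplex_vertices(n):
    """n vertices of a regular simplex: standard basis vectors of R^n, centered."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
            for i in range(n)]

def distinct_pairwise_distances(points):
    """Sorted list of the distinct distances between all pairs of points."""
    return sorted({round(math.dist(a, b), 6)
                   for a, b in itertools.combinations(points, 2)})

sequential = [[0.0], [1.0], [2.0], [3.0]]   # direct indexing: player i -> i
simplex = simplex_vertices(4)               # four players on a tetrahedron

print(distinct_pairwise_distances(sequential))   # [1.0, 2.0, 3.0]
print(distinct_pairwise_distances(simplex))      # [1.414214]
```

Direct indexing yields three different distances among four players, while the simplex yields exactly one.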

This encoding requires no learnable parameters.

During training, active players are randomly assigned to different vertices in the vertex pool; the model identifies agents solely via geometric coordinates.

At inference, supporting more players simply involves selecting additional vertices from the same pool—no architectural modification or retraining is needed.
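One reason a shared vertex pool can support different player counts is that any subset of a regular simplex's vertices is itself equidistant. The sketch below illustrates this with random assignment. The pool size of 8 and the helper names are hypothetical, since the article does not specify them.

```python
import itertools
import math
import random

POOL_SIZE = 8   # hypothetical pool size; not stated in the article

# Regular simplex with POOL_SIZE vertices (centered standard basis vectors).
pool = [[(1.0 if i == j else 0.0) - 1.0 / POOL_SIZE for j in range(POOL_SIZE)]
        for i in range(POOL_SIZE)]

def assign_vertices(num_players, rng):
    """Give each active player a distinct, randomly chosen pool vertex."""
    return rng.sample(range(POOL_SIZE), num_players)

def distinct_distances(vertex_ids):
    return {round(math.dist(pool[a], pool[b]), 6)
            for a, b in itertools.combinations(vertex_ids, 2)}

rng = random.Random(0)
train_ids = assign_vertices(2, rng)   # training-time setup: two players
infer_ids = assign_vertices(4, rng)   # inference-time setup: four players

# Both setups see the same single pairwise distance.
print(distinct_distances(train_ids) == distinct_distances(infer_ids))   # True
```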

This is precisely why Gamma-World achieves “train on 2-player data, run directly on 4-player scenes.”

Core Design II: Sparse Hub Attention — From “Fully Connected” to “Hub-Based Broadcasting”

Inter-agent communication is indispensable for multi-agent world modeling, yet prior approaches incur prohibitive costs.

As noted above, dense pairwise token-level interaction among all players grows quadratically: going from 2 to 8 players raises FLOPs from 477.8G to 7.6T (~16×), a ceiling set by algorithmic complexity rather than by engineering.

But do agents truly need to "talk directly"?

The root cause lies in a flawed assumption. Dense attention implicitly assumes that every token-level detail must be exchanged directly among all agents, an assumption that does not hold in most scenarios.

In reality, when Player A places a block, Player B only needs to perceive "a block has appeared in the world" (a compact world-state update), not A's full visual details.

Gamma-World introduces a set of learnable hub tokens, forming a hub-and-spoke topology:

Each agent interacts only with its own history and the hub tokens;
Hub tokens aggregate information from all agents into a compressed shared-world state summary, then broadcast it back to each agent stream;
Direct attention between agents is fully masked; information flows via two hops: Agent → Hub → Agent.
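The three rules above can be written as a boolean attention mask. This is a minimal sketch under simplifying assumptions: it ignores the time axis and causality, and the token counts and the `build_hub_mask` helper are hypothetical rather than taken from the paper.

```python
def build_hub_mask(num_agents, tokens_per_agent, num_hub_tokens):
    """Return (owners, mask); mask[q][k] is True if query q may attend to key k."""
    owners = [a for a in range(num_agents) for _ in range(tokens_per_agent)]
    owners += ["hub"] * num_hub_tokens
    mask = [[(q == "hub")      # hub tokens aggregate from every stream
             or (k == "hub")   # every agent token reads the hub
             or (q == k)       # agent tokens see their own stream
             for k in owners]
            for q in owners]
    return owners, mask

owners, mask = build_hub_mask(num_agents=3, tokens_per_agent=2, num_hub_tokens=1)

print(owners)        # [0, 0, 1, 1, 2, 2, 'hub']
print(mask[0][1])    # True:  agent 0 -> its own stream
print(mask[0][2])    # False: agent 0 -> agent 1 is masked
print(mask[0][6])    # True:  agent 0 -> hub
print(mask[6][3])    # True:  hub -> agent 1
```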

This structure reduces computational complexity from quadratic to linear.

△Sparse Hub Attention (blue) vs Dense Attention (red): FLOP gap approaches 8× as player count increases
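Counting attended query-key pairs shows the quadratic-versus-linear trend. The token counts below are hypothetical, and pair counts are only a proxy for FLOPs, so this sketch shows the scaling behavior rather than reproducing the figure's numbers.

```python
def dense_pairs(n, t):
    """Query-key pairs when all n*t agent tokens attend to each other."""
    return (n * t) ** 2

def hub_pairs(n, t, h):
    """Query-key pairs under a hub-and-spoke mask with h hub tokens."""
    agent_side = n * t * (t + h)   # each agent token: own stream + hub tokens
    hub_side = h * (n * t + h)     # each hub token: all agent tokens + hub tokens
    return agent_side + hub_side

T, H = 1024, 64   # hypothetical tokens per agent and hub-token count

for n in (2, 4, 8):
    print(n, dense_pairs(n, T), hub_pairs(n, T, H))
```

Doubling the player count quadruples the dense count, while the hub count grows by a fixed amount per added player.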

Crucially, sparse hub attention is not just computationally efficient—it embodies a more principled inductive bias: it explicitly encodes the prior that *inter-agent information should flow through a shared world-state bottleneck*, rather than expecting the model to implicitly discover this from data.

At inference, sparse communication topology is preserved via independent KV caches, enabling real-time 24 FPS action-response rollout.

Method Overview

(Note: Method overview—left: synchronized multi-agent inputs; center: Tokenization; right: Causal Multi-Agent DiT; bottom: schematics of Simplex Rotary Agent Encoding and Sparse Hub Attention)

The overall architecture takes synchronized multi-agent observations and action sequences as input, tokenizes each agent stream using shared visual and action encoders, and generates future multi-stream rollouts via a causal multi-agent DiT equipped with sparse hub attention.

At inference, streaming generation uses KV caches, with separate caches maintained for each agent stream and the hub.
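One way to picture "separate caches for each agent stream and the hub" is a small container like the one below. It is a hypothetical data-structure sketch: the class names, the `context_for` method, and the string placeholders for keys and values are illustrative assumptions, not the released code.

```python
from dataclasses import dataclass, field

@dataclass
class StreamCache:
    """Cached keys and values for one stream (a single agent, or the hub)."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

@dataclass
class MultiAgentCache:
    agents: dict = field(default_factory=dict)            # agent id -> StreamCache
    hub: StreamCache = field(default_factory=StreamCache)

    def context_for(self, agent_id):
        """Keys an agent may attend to: its own history plus the hub's."""
        return self.agents[agent_id].keys + self.hub.keys

cache = MultiAgentCache(agents={0: StreamCache(), 1: StreamCache()})
cache.agents[0].append("k_a0_t0", "v_a0_t0")
cache.agents[1].append("k_a1_t0", "v_a1_t0")
cache.hub.append("k_hub_t0", "v_hub_t0")

print(cache.context_for(0))   # ['k_a0_t0', 'k_hub_t0']
```

Because agent 1's keys never enter agent 0's context, the sparse topology holds during streaming without a runtime mask.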

Core Design III: Three-Stage Distillation — From “Seeing Everything” to “Running Fast”

In diffusion models, generation quality and inference speed are inherently conflicting: bidirectional models achieve highest fidelity but cannot support streaming inference; causal models enable real-time generation but sacrifice quality.

Gamma-World bridges this gap via a three-stage training strategy.

Stage 1: Train a Bidirectional Teacher.

The teacher model accesses the full sequence (including future frames), providing the highest-quality generation distribution—used exclusively for training, not inference.

Stage 2: Train a Causal Student.

The student model observes only current and past frames, adapted for streaming inference with sparse hub attention.

Critically, the student is trained end-to-end as a multi-step diffusion model—not merely as a warm-up for distillation. Before distillation begins, the student already produces reasonable rollouts, providing a stable starting point for Stage 3.
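The difference between what the teacher and the student can see comes down to the temporal attention mask. The sketch below works at frame granularity with a hypothetical `frame_mask` helper. It illustrates bidirectional versus causal visibility in general, not the paper's exact masking scheme.

```python
def frame_mask(num_frames, causal):
    """mask[q][k] is True if frame q may attend to frame k."""
    return [[(k <= q) or not causal for k in range(num_frames)]
            for q in range(num_frames)]

teacher = frame_mask(4, causal=False)   # bidirectional: every frame sees all frames
student = frame_mask(4, causal=True)    # causal: only current and past frames

print(teacher[0])   # [True, True, True, True]
print(student[0])   # [True, False, False, False]
print(student[3])   # [True, True, True, True]
```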

Stage 3: Conditional Self-Forcing Distillation.

Starting from the causal student and targeting the bidirectional teacher, distribution-matching distillation (DMD) compresses multi-step sampling into 4-step sampling.

Distillation occurs under autoregressive self-rollout, aligning training and inference distributions to effectively mitigate error accumulation.

Throughout, the initial frame and per-agent action sequences are retained as conditioning signals, ensuring controllability is preserved after compression—ultimately achieving 24 FPS streaming rollout.
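The inference-time loop that Stage 3 trains against can be sketched as follows: each new frame starts from noise, is refined in four steps, and is then fed back as context for the next frame. This is a toy illustration only. Frames are single floats, and `denoise_step` is a hypothetical stand-in for the distilled student, not a real denoiser.

```python
import random

NUM_STEPS = 4   # few-step sampling after distillation, per the article

def denoise_step(noisy, context, action, step):
    """Toy stand-in for the distilled student (hypothetical).
    Moves the sample toward 'last frame + action' and lands on it at the final step."""
    target = context[-1] + action
    return noisy + (target - noisy) / (NUM_STEPS - step)

def rollout(first_frame, actions, rng):
    """Autoregressive self-rollout conditioned on the first frame and the actions."""
    frames = [first_frame]
    for action in actions:
        x = rng.gauss(0.0, 1.0)        # each frame starts from noise
        for step in range(NUM_STEPS):
            x = denoise_step(x, frames, action, step)
        frames.append(x)               # the model's own output becomes context
    return frames

frames = rollout(0.0, [1.0, 1.0, 1.0], random.Random(0))
print([round(f, 6) for f in frames])   # [0.0, 1.0, 2.0, 3.0]
```

Every frame after the first is conditioned on generated rather than ground-truth history, which is the training/inference alignment described above.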

Experimental Results

1. Outperforming State-of-the-Art Across the Board

On five scenario categories in multi-player Minecraft (memory retention, spatial localization, movement, construction, and cross-view consistency), Gamma-World surpasses both frame-stitching baselines and the current strongest multi-agent world model, Solaris, in every category. FVD, a standard measure of video generation quality, drops by over 40% on average.

2. Ablation: Every Design Step Delivers Measurable Gains

Ablation results show that replacing “learnable slot identity” with “simplex encoding” reduces FVD from 256.3 to 228.5—achieving the largest single-step improvement in the entire ablation study, without adding any parameters.
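For scale, the two FVD values quoted above translate into absolute and relative improvements as follows. This is plain arithmetic on the article's numbers.

```python
fvd_slot = 256.3      # learnable slot identity (from the article)
fvd_simplex = 228.5   # simplex encoding (from the article)

absolute_drop = fvd_slot - fvd_simplex
relative_drop = absolute_drop / fvd_slot

print(round(absolute_drop, 1))          # 27.8
print(round(100 * relative_drop, 1))    # 10.8 (percent)
```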

This result’s significance goes beyond “simplex encoding works better”; it demonstrates a deeper principle:

Explicitly encoding permutation symmetry constraints in the architecture yields substantially higher sample efficiency and final performance compared to letting the model implicitly learn such structure from data.

Symmetry is a known prior; embedding it directly into architecture is inherently more efficient than requiring the model to rediscover it—ablation experiments numerically confirm this.

3. Train on Two Players, Run on Four Directly

△Zero-shot generalization to four players: model trained only on two-player data, generates four synchronized views at inference

The model is trained solely on two-player data. At inference, two new vertices are activated from the vertex pool, enabling direct generation of four synchronized views—without modifying any architectural parameters—while maintaining consistent shared-world states across all four views.

This result directly validates the core objective of simplex encoding: generalization to arbitrary numbers of players, without requiring training data for that specific count.

Solaris, Enigma Labs' Multiverse, and Odyssey's Agora-1 all demonstrate that multi-agent world modeling is feasible, yet none of them offers this kind of scalable generalization.

4. Qualitative Demonstrations on Two Representative Tasks

△Two-agent interaction example—two views stay synchronized; Agent 1’s actions are correctly reflected in Agent 2’s view

In the “Place & Mine” task, the two views remain synchronized in real time, with one agent’s actions accurately reflected in the other’s view.

In the “Build Tower” task, blocks collaboratively placed by both agents occupy consistent positions across views, fully preserving the shared world state.

When a player temporarily leaves the other’s field of view, the model still maintains correct spatial localization—indicating it tracks a shared latent world state, rather than stitching independently generated videos together.

5. From Game Agents to Real Robots

△From game agents to real dual-arm robot collaboration: generated future frames preserve coordinated motion

The research team applied Gamma-World to the RealOmni-Open dataset for real dual-arm robotic collaboration tasks, treating left and right robotic arms as independent agents.

Generated future frames preserve coordinated motion and spatial layout between the two arms. The same framework transfers directly from Minecraft multi-player scenes to real-world physical manipulation—requiring no additional adaptation.

This result confirms that the multi-agent world modeling framework is intrinsically general rather than a scene-specific solution.

It naturally invites broader speculation: virtually all valuable real-world scenarios involve multiple agents collaborating or competing within a shared environment—multi-arm coordination in operating rooms, multi-robot scheduling on factory assembly lines, or multi-vehicle interaction in autonomous driving.


If a unified multi-agent world model framework could cover these scenarios, it would represent much more than an improvement in simulation capability: it would provide new infrastructure for data production and policy training across the entire physical AI domain.

Conclusion

The three core designs of Gamma-World (Simplex Rotary Agent Encoding, Sparse Hub Attention, and conditional teacher-student distillation) address three long-standing challenges of multi-agent world modeling:

Symmetric representation of identities;
Efficient modeling of interactions;
Balancing generation quality against real-time performance.

None of these is a patch; each is a fundamental redefinition of how multi-agent systems are understood. The common methodology behind them is to encode our understanding of the problem structure directly into the architecture, rather than relying on the model to learn approximate behavior from data.

Real understanding of multi-agent systems should be built into the structure itself, not fitted from data after enough exposure. The generalization of a model trained on two agents to four-agent scenarios directly supports this judgment.

This methodology also points to a larger possibility: once multi-agent world models generate data of sufficient quality to faithfully reflect real physical laws, the way we collect data will change fundamentally, from relying on real-world collection to generation by neural simulation.

The limits that human labor, space, and time place on data collection will one day be overcome by the unlimited scalability of neural simulation.

From the block world to the robot arm, Gamma-World takes a validating step toward this vision.

A true world model will learn not just "pictures" but "rules".

Paper: Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Institutions: NVIDIA / Tsinghua University / University of Toronto / Vector Institute

Project homepage: https://research.nvidia.com/labs/sil/projects/gamma-world/

GitHub: https://github.com/nv-tlabs/Gamma-World/

Hugging Face: https://huggingface.co/papers/2605.28816