GB200s Change How Prefill and Decode Are Disaggregated When Serving Large MoE Models Like Qwen

Aravind Srinivas (@AravSrinivas), May 12, 2026

TL;DR · AI Summary

GB200s change how prefill and decode are disaggregated when serving large MoE models like Qwen, and Perplexity reports higher throughput than when serving on the Hopper platform.

Key Takeaways

GB200s are better suited than Hopper to high-throughput inference on large MoE models.
Perplexity has published research on serving post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks.
GB200s are not just a training platform but also a high-performance inference platform.

#NVIDIA #MoE #Qwen #Hopper #GB200

Aravind Srinivas (@AravSrinivas)

GB 200s change how one does the prefill and decode disaggregation when serving large MoEs like Qwen. We’ve published details of our stack quantifying the throughput benefits compared to serving on Hoppers.

Quote

Perplexity (@perplexity_ai) · 10h

We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. GB200 is a major step up over Hopper for high-throughput inference on large MoE models, not just a training platform.

2:27 PM · May 12, 2026
