What's new with MoE lately?

traeai has indexed 14 items related to MoE. The most recent is "Chinese AI Company Breaks Bottleneck to Run 60 Billion Parameter Model on Mobile," published by 爱范儿.

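Concept

MoE

Alias: Mixture of Experts

A mixture-of-experts model architecture that uses sparse activation, running only a few experts per token, to keep inference efficient at large parameter counts.

14 highly relevant items tracked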
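
To make "sparse activation" concrete, here is a minimal sketch of top-k expert routing for a single token. Everything in it is illustrative: the names `moe_forward` and `make_expert`, the toy experts (plain scalings), and the sizes are assumptions, not taken from any model listed on this page.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    # Router: one logit per expert (dot product of the token with that expert's router row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Gate weights are renormalized over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](token)  # only the chosen experts are ever evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen

def make_expert(scale):
    # Toy stand-in for an expert feed-forward network (hypothetical).
    return lambda x: [scale * v for v in x]

random.seed(0)
NUM_EXPERTS, DIM = 8, 4
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [0.5, -1.0, 0.25, 2.0]

output, chosen = moe_forward(token, router_weights, experts, top_k=2)
print(f"experts evaluated: {chosen} ({len(chosen)} of {NUM_EXPERTS})")
```

All eight experts hold parameters, but each token pays the compute cost of only two, which is the trade the definition above describes.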

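TraeAI Observations

If you only read 3

Chinese AI Company Breaks Bottleneck to Run 60 Billion Parameter Model on Mobile

爱范儿 · score 9.2

A Chinese AI company has made a breakthrough in ternary quantization that lets a 60-billion-parameter model run on a phone, cutting memory use by 6x with minimal performance loss.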

https://t.co/nw0GoHamCI

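宝玉(@dotey) · score 9.2

DeepSeek is building a low-cost, high-efficiency model stack through multiple low-level technical innovations, aiming to leverage China's $10 trillion AI hardware ecosystem and reach a $1 trillion market value of its own.

DeepSeek's $10 Trillion Grand Strategy

宝玉的分享 · score 9.2

DeepSeek has sharply reduced the KV cache needed for large-model inference through a series of technical innovations, advancing China's AI hardware ecosystem with the goal of building a $10 trillion industry giant.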

Chinese AI Company Breaks Bottleneck to Run 60 Billion Parameter Model on Mobile

爱范儿 · May 25 · 2,653 words (about 11 min)

A Chinese AI company has broken the bottleneck of running a 60-billion-parameter model on mobile devices using ternary quantization, cutting memory use by 6x with minimal performance loss.

Why selected: Ternary quantization cuts memory use by 6x while retaining 97% of model capability, allowing a 60-billion-parameter model to run on a phone with 8 GB of memory.

Featured · Article · #AI Model #Ternary Quantization #Ascend Chip #Edge AI #Model Compression · Chinese
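
For readers unfamiliar with ternary quantization, the sketch below shows one common way to do it: an "absmean" scheme that maps each weight to -1, 0, or +1 plus a single shared scale, then packs four codes per byte. This is an assumed, generic scheme for illustration only; the summary above does not say which method the company used, and the function names here are hypothetical.

```python
def ternary_quantize(weights):
    """Map float weights to {-1, 0, +1} plus one float scale (absmean scheme, assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original weights."""
    return [c * scale for c in codes]

def pack_2bit(codes):
    """Pack four ternary codes per byte, 2 bits each (code + 1 is 0, 1, or 2)."""
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= (c + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)

weights = [0.9, -0.05, -1.3, 0.4, 0.02, -0.7, 1.1, 0.0]
codes, scale = ternary_quantize(weights)
packed = pack_2bit(codes)

print("codes:", codes)
print("bytes at 16 bits per weight:", 2 * len(weights))
print("bytes packed at 2 bits per weight:", len(packed), "(plus one scale value)")
```

With this simple packing, weights take 2 bits each instead of 16. The 6x figure reported in the article depends on the baseline precision and on overheads such as scales and unquantized layers, none of which the summary states.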

DeepSeek's $10 Trillion Grand Strategy [Translation]

宝玉(@dotey) · May 25 · 5,655 words (about 23 min)

DeepSeek builds a low-cost, high-efficiency model system through multiple foundational innovations to drive China's $10 trillion AI hardware ecosystem and achieve its own $1 trillion valuation.

Why selected: DeepSeek V4 Pro needs only 5.48 GB of HBM at a 1M-token context, far below the 60-89 GB required by competing models.

Featured · Tweet · #DeepSeek #AI Model #MoE #KV Cache Optimization #Hardware Ecosystem · Chinese

DeepSeek's 10 Trillion USD Grand Strategy

宝玉的分享 · May 24 · 5,756 words (about 24 min)

DeepSeek reduces KV cache requirements through innovations, driving China's AI hardware ecosystem toward a $10 trillion industry.

Why selected: DeepSeek V4 Pro needs only 5.48 GB of HBM, a substantial saving compared with 60 GB for GLM5 and 89 GB for Qwen3-235B-A22B.

Featured · Article · #AI Model #Hardware Ecosystem #KV Cache #DeepSeek #China AI · Chinese
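
To put the KV cache figures in context, the sketch below uses the standard size formula for an uncompressed attention cache (keys and values stored per layer, per KV head, per token) and then compares the three memory figures quoted above. The model shape passed to `kv_cache_bytes` is hypothetical and does not describe any of the named models; the function name is also an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of an uncompressed KV cache for one sequence."""
    # 2 tensors (keys and values) per layer, one head_dim vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only to show the order of magnitude at 1M tokens.
example = kv_cache_bytes(num_layers=60, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"hypothetical uncompressed cache: {example / 1e9:.2f} GB")

# Memory figures as reported in the entries above (GB).
reported_gb = {"DeepSeek V4 Pro": 5.48, "GLM5": 60.0, "Qwen3-235B-A22B": 89.0}
baseline = reported_gb["DeepSeek V4 Pro"]
ratios = {name: gb / baseline for name, gb in reported_gb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the DeepSeek V4 Pro figure")
```

Because the cache grows linearly with sequence length, any reduction in per-token cache size matters most at long contexts such as the 1M-token setting cited here.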

[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - Advances SOTA Realtime Voice and Kills Standard VAD

Latent Space · May 12 · 2,369 words (about 10 min)

Thinking Machines released TML-Interaction-Small 276B-A12B, a 276-billion-parameter MoE model with only 12 billion active parameters, achieving sub-200ms end-to-end latency, surpassing GPT-4o and Gemini 3.1-Flash in real-time voice interaction, time-aligned microturns, and visual proactivity, effectively eliminating standard VAD.

Why selected: TML-Interaction-Small is a 276B-parameter MoE model with only 12B active parameters, achieving end-to-end latency below 200 ms.

Featured · Article · #AI #Realtime Voice #MoE #Multimodal #Model Architecture · Chinese

Mellum2 Goes Open Source: A Fast Model for AI Workflows

The JetBrains Blog · June 2 · 606 words (about 3 min)

Mellum2 is an open-source 12B parameter AI model from JetBrains, using MoE architecture to activate only 2.5B parameters per token, reducing inference time by over 50% compared to similar-sized models, specifically designed for software engineering environments with applications in routing, RAG pipelines, and private AI deployment.

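Why selected: Mellum2 uses an MoE architecture in which the 12B-parameter model activates only 2.5B parameters per token, making inference 50% faster than comparable models and significantly reducing latency and cost in production.

Featured · Article · #AI #Model #Mellum2 #MoE #Software Engineering · Chinese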

Task Cost Only 1/9 of Claude Opus 4.6, Step Refreshes Flash Model Efficiency

爱范儿 · June 2 · 4,293 words (about 18 min)

Step 3.7 Flash by StepFun is a new-generation Flash model for production-grade AI agents, featuring native multimodal understanding, high throughput with low latency, and enhanced web search. It achieves 97% of Claude Opus 4.6's coding performance at only 1/9 the cost per task, which suits high-frequency, complex real-world workflows.

Why selected: Step 3.7 Flash uses a sparse MoE architecture with only 11B active parameters, reaches a peak generation speed of 400 tokens/s, and supports 40 agents running in parallel.

Featured · Article · #AI Agent #Multimodal #Flash Model #Yujue Star #Production Deployment · Chinese

Step 3.7 Flash from @StepFun_ai is live on OpenRouter. A multimodal (image/video/text) MoE that act...

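OpenRouter(@OpenRouterAI) · May 29 · 166 words (about 1 min)

Step 3.7 Flash is a multimodal MoE model that activates only 11B of its 196B parameters and is suited to coding, agentic workflows, and structured output.

Why selected: Step 3.7 Flash handles 196B-parameter-scale tasks while activating only 11B parameters, significantly reducing compute cost.

Featured · Tweet · #MoE #Multimodal #AI Model #OpenRouter · Chinese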
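
The three MoE models listed so far all report a total and an active parameter count. As a quick worked comparison, the snippet below computes the active fraction for each, using only the figures quoted in the entries above and reading Step 3.7 Flash's 196B as its total parameter count (an assumption based on the truncated tweet).

```python
# (total parameters, active parameters per token), in billions, as quoted above.
models = {
    "TML-Interaction-Small": (276.0, 12.0),
    "Mellum2": (12.0, 2.5),
    "Step 3.7 Flash": (196.0, 11.0),  # 196B taken as the total (assumption)
}

active_fraction = {name: active / total for name, (total, active) in models.items()}

for name, frac in active_fraction.items():
    total, active = models[name]
    print(f"{name}: {active:g}B of {total:g}B active ({frac:.1%})")
```

The active fraction is only a rough proxy for per-token compute; memory footprint still scales with the total parameter count, since every expert must be resident or fetchable.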

MoE Travelogue: 8. Forced Sequence-Level Load Balancing

科学空间 · May 23 · 4,785 words (about 20 min)

This article proposes a new sequence-level load balancing method called Moving Quantile Balancing (MQB), which achieves fine-grained balance in MoE models without relying on auxiliary losses.

Why selected: MQB evolved from Quantile Balancing and is suited to sequence-level load balancing.

Featured · Article · #MoE #Load Balancing #Deep Learning #Routing Mechanism #MQB · Chinese
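
The summary above does not give MQB's actual update rule, so the following is not MQB. It is only a generic illustration of the underlying idea that a per-expert score quantile can force equal load within one sequence without an auxiliary loss. The function names, the skewed toy scores, and the sizes are all assumptions.

```python
import random

def topk_token_choice(scores, k):
    """Each token picks its k best experts; per-expert load is unconstrained."""
    num_experts = len(scores[0])
    load = [0] * num_experts
    for row in scores:
        for e in sorted(range(num_experts), key=lambda i: row[i], reverse=True)[:k]:
            load[e] += 1
    return load

def quantile_threshold_choice(scores, k):
    """Each expert keeps the tokens at or above its own score quantile."""
    num_tokens, num_experts = len(scores), len(scores[0])
    capacity = num_tokens * k // num_experts  # equal share of the routing slots
    load = []
    for e in range(num_experts):
        column = sorted((row[e] for row in scores), reverse=True)
        threshold = column[capacity - 1]  # the (1 - capacity / num_tokens) quantile
        load.append(sum(1 for row in scores if row[e] >= threshold))
    return load

random.seed(0)
NUM_TOKENS, NUM_EXPERTS, K = 64, 4, 2
# Toy router scores for one sequence, skewed so that expert 0 is systematically favored.
scores = [[random.gauss(0, 1) + (1.0 if e == 0 else 0.0) for e in range(NUM_EXPERTS)]
          for _ in range(NUM_TOKENS)]

plain_load = topk_token_choice(scores, K)
balanced_load = quantile_threshold_choice(scores, K)
print("top-k per token, expert loads:   ", plain_load)
print("quantile threshold, expert loads:", balanced_load)
```

The trade-off in this simplified version is that a token may end up with more or fewer than k experts; how the article's method handles that is not covered by the summary.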

We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks

Perplexity(@perplexity_ai) · May 13 · 101 words (about 1 min)

Perplexity published new research on deploying the Qwen3 235B model on NVIDIA GB200 NVL72 Blackwell racks, showing that GB200 outperforms Hopper in high-throughput inference for large MoE models.

Why selected: The Qwen3 235B model achieves efficient high-throughput inference on NVIDIA GB200.

Featured · Tweet · #NVIDIA #GB200 #Qwen3 #MoE #High Performance Computing · Chinese

The benchmarks show the gap. NVLS all-reduce latency drops from 586.1µs on H200 to 313.3µs on GB200....

Perplexity(@perplexity_ai) · May 13 · 107 words (about 1 min)

NVLS all-reduce latency significantly improves from 586.1µs on H200 to 313.3µs on GB200, with notable performance gains in MoE prefill and decode throughput.

Why selected: NVLS all-reduce latency drops from 586.1µs on H200 to 313.3µs on GB200.

Featured · Tweet · #NVLS #H200 #GB200 #MoE #Performance · English
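
As a quick worked check of what the two latency figures imply, the snippet below derives the speedup and the relative reduction from the numbers quoted above. Nothing beyond those two measurements is assumed.

```python
# NVLS all-reduce latency in microseconds, as quoted in the tweet above.
h200_us, gb200_us = 586.1, 313.3

speedup = h200_us / gb200_us        # how many times faster GB200 is
reduction = 1 - gb200_us / h200_us  # fraction of latency removed

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {reduction:.1%}")
```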

Today we're shipping Nemotron 3 Ultra.

A 550B MoE frontier-intelligence open model built for long-r...

NVIDIA Ships Nemotron 3 Ultra: 550B MoE Open Model

NVIDIA AI(@NVIDIAAI) · June 5 · 104 words (about 1 min)

NVIDIA released Nemotron 3 Ultra, a 550B MoE open model for long-running agents, offering 5x faster inference and 30% lower costs.

Why selected: Nemotron 3 Ultra uses a 550B-parameter MoE architecture and is an open model aimed at frontier intelligence.

Featured · Tweet · #NVIDIA #Nemotron #MoE #AI Agent #Open Source · English

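7/ 🧩 This is not pruning

ZEDA is more like giving MoE "compute-budget awareness."

In the future, models will not only decide what to answer; they will also decide whether each token is worth thinking hard about.

Paper: Post-Trained MoE C...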

7/ 🧩This Is Not Pruning

AI Will(@FinanceYF5) · May 25 · 244 words (about 1 min)

ZEDA is a novel MoE technique that uses self-distillation to dynamically skip experts, improving inference efficiency and giving models compute budget awareness.

Why selected: ZEDA uses self-distillation to let an MoE model skip half of its experts, improving inference efficiency.

Featured · Tweet · #MoE #Mixture-of-Experts #AI Efficiency #Self-Distillation #ZEDA · Chinese

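The paper currently ranked first on Hugging Face this week: MACE

It uses an MoE architecture for music-driven dance video generation.

Hahaha, it looks like AI dance videos on Douyin are about to get even more realistic.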

https://t.co/qmSpyQGC0a

This Week's Top Paper on Hugging Face: MACE

向阳乔木(@vista8) · May 11 · 124 words (about 1 min)

The MACE paper proposes a MoE-based architecture for music-driven dance video generation, improving motion-rhythm synchronization, with potential applications in TikTok-style AI dance videos.

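Why selected: MACE uses an MoE architecture to align music with dance motion at high precision, making the generated videos more realistic.

Featured · Tweet · #MACE #MoE #Music-to-Dance #Hugging Face #AI Video Generation · Chinese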

Nemotron 3 Ultra is coming

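NVIDIA Developer · June 2 · 395 words (about 2 min)

NVIDIA announces Nemotron 3 Ultra, an open-source model claimed to be 5x faster at 30% lower cost, though the announcement lacks technical details.

Why selected: Nemotron 3 Ultra uses a hybrid SSM and MoE architecture, with inference 5x faster than existing open-source models.

Featured · Video · #Nemotron #SSM #MoE #Open Source Model #NVIDIA · English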
