Ok that's so cool

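TL;DR · AI summary
Multi-token prediction (MTP) makes the Gemma 4 model run 1.5x faster locally, reaching 138 tokens/s.
Key points
- With MTP, Gemma 4 goes from 97 tokens/s to 138 tokens/s.
- The open-source project includes the assistant model and the code, and is easy for non-technical users to install and use.
- The research matters because it delivers higher performance on the same hardware.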
Paul Couvert on X

Ok that's so cool

Multi-token prediction makes Gemma 4 run way faster locally!
Same model, same laptop, 1.5x faster.
Everything is open source from the assistant model to the code.
- 97 tokens/s without MTP
- 138 tokens/s with MTP

That's why research is so important. You're getting much more from the exact same machine and running the same powerful model.
And making it available to non-technical folks just by installing an app is amazing.
Quote

atomic.chat (@atomic_chat_hq) · 16h
Multi-Token Prediction (MTP) for llama.cpp! Running the Gemma 4 local model 1.5x faster. We patched llama.cpp. Quantized Gemma 4 assistant models into GGUF format. We ran tests on a MacBook Pro M5 Max. Gemma 26B with MTP drafts tokens 40% faster. Benchmarks, source code and models
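
For a quick sanity check on the numbers quoted above, here is a minimal sketch of the arithmetic, assuming the two figures (97 and 138 tokens/s) are steady-state generation rates. The function name `speedup` and the 1,000-token reply length are hypothetical, chosen only for illustration; they do not come from the benchmark.

```python
def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of throughput with MTP to throughput without it."""
    return mtp_tps / baseline_tps


# Figures quoted in the post (tokens per second).
baseline_tps = 97.0
mtp_tps = 138.0

ratio = speedup(baseline_tps, mtp_tps)

# Per-token latency in milliseconds, before and after.
ms_per_token_before = 1000.0 / baseline_tps
ms_per_token_after = 1000.0 / mtp_tps

# Hypothetical reply length, used only to make the difference tangible.
reply_tokens = 1000
seconds_before = reply_tokens / baseline_tps
seconds_after = reply_tokens / mtp_tps

print(f"speedup: {ratio:.2f}x")
print(f"latency: {ms_per_token_before:.1f} ms/token -> {ms_per_token_after:.1f} ms/token")
print(f"{reply_tokens}-token reply: {seconds_before:.1f} s -> {seconds_after:.1f} s")
```

On these two figures the ratio comes out at about 1.42x, so the "1.5x" in the posts reads as a rounded headline number.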
