Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation

May 25, 2026, by Sergio De Simone
Gemma 4 can be paired with multi-token prediction (MTP) drafters that use speculative decoding to generate multiple tokens in parallel, allowing the model to verify them in a single pass and achieve up to ~3× faster inference without quality loss.
Multi-token prediction drafters are lightweight auxiliary models that work alongside Gemma 4 to address the LLM memory-bandwidth bottleneck. As Google engineers explain, during inference the processor spends most of its time repeatedly moving billions of parameters from VRAM to compute units for each token. This constant data movement increases latency and leaves compute resources underutilized, particularly on consumer hardware.
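To see why this bottleneck caps generation speed, a back-of-the-envelope bound helps: if every generated token requires streaming all of the weights from memory once, throughput cannot exceed memory bandwidth divided by model size in bytes. The sketch below is illustrative only. The function name, the 2 bytes per parameter (16-bit weights), and the 1 TB/s bandwidth figure are assumptions chosen for round numbers, not measurements from Google or from any specific device, and the bound ignores KV-cache traffic, activations, and MoE sparsity.

```python
def max_tokens_per_second(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decoding speed when each token
    requires reading every weight from memory exactly once."""
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical setup: a 31B-parameter dense model stored in 16-bit
    # weights on a device with 1 TB/s of memory bandwidth (assumed values).
    bound = max_tokens_per_second(31e9, 2, 1e12)
    print(f"{bound:.1f} tokens/s")  # prints "16.1 tokens/s"
```

Under these assumptions the compute units could do far more arithmetic than one token's worth per weight read, which is the idle capacity a drafter tries to exploit.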
This inefficiency is amplified by the fact that an LLM spends as much computation predicting an obvious next token as it does solving a "complex logic puzzle", which is where multi-token prediction drafters can help. As Google explains:
"By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to 'predict' several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel."
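The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a minimal greedy variant under stated assumptions, not Google's implementation: `toy_target` and `toy_draft` are hypothetical next-token functions over integers, and the verification loop calls the target once per position, whereas a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    # Hypothetical "heavy" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical "light" drafter: agrees with the target except after a 6.
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10


def greedy_generate(target: NextToken, prompt: List[int], n: int) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n: int, k: int = 4
                         ) -> Tuple[List[int], int]:
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every drafted position; this counts as a
        #    single verification pass (simulated here position by position).
        passes += 1
        accepted: List[int] = []
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft was accepted, so the pass also yields a bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:len(prompt) + n], passes


if __name__ == "__main__":
    tokens, passes = speculative_generate(toy_target, toy_draft, [0], 12)
    print(tokens == greedy_generate(toy_target, [0], 12), passes)  # True 3
```

Because a drafted token is kept only when it matches what the target would have produced, the output is identical to plain greedy decoding with the target; the gain is that 12 tokens here take 3 verification passes instead of 12 sequential target calls.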
Google says multi-token prediction drafters improve responsiveness and speed up inference across devices, from personal computers and consumer GPUs running the Gemma 4 26B MoE and 31B dense models to mobile devices running the E2B and E4B variants, all without sacrificing response quality:
"Because the primary Gemma 4 model retains the final verification, you get identical frontier-class reasoning and accuracy, just delivered significantly faster."
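How much faster depends mainly on how often the target accepts the drafter's tokens. A common simplified model from the speculative decoding literature assumes each drafted token is accepted independently with probability `alpha`, so a pass that checks `k` drafts yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The sketch below applies that formula; the function names and the sample values (`alpha = 0.8`, `k = 4`, a drafter costing 5% of a target pass per token) are illustrative assumptions, not figures reported for Gemma 4.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Average tokens produced per target verification pass, assuming each
    of the k drafted tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # all drafts accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Speedup over one-token-per-pass decoding, where draft_cost_ratio is the
    cost of one drafter step relative to one target pass (an assumed input)."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost_ratio + 1.0)


if __name__ == "__main__":
    print(f"{expected_tokens_per_pass(0.8, 4):.2f}")  # prints "3.36"
    print(f"{estimated_speedup(0.8, 4, 0.05):.2f}")   # prints "2.80"
```

The model also shows why the gain is bounded: with a low acceptance rate the expected yield falls toward one token per pass, and the drafter's cost then makes decoding slightly slower rather than faster.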
Google implemented a series of architectural enhancements and hardware-specific optimizations to ensure that MTP drafters deliver maximum efficiency, and provided an in-depth visual explanation of how the drafters work in an x.com thread.
Reddit commenter FarrisAT described Gemma 4 MTP as "pretty impressive stuff", but cautioned that local models still make too many mistakes, suggesting the real benefits will emerge when "those models get closer to the leading edge".
Another user, Gohab2001, noted that MTP itself is a well-known technique with a major drawback for local deployments: two models have to be loaded in memory. They also pointed out that the real advance in the Gemma 4 MTP drafter implementation is that the drafters share the target model's KV cache, which helps reduce the technique's overhead.
On Hacker News, zozbot234 noted that "MTP is mostly useful when you have one or a few users, which means compute is abundant", as in mobile or edge scenarios, while offering limited benefits for large-scale API providers.
Gemma 4 MTP-enabled variants are available on several platforms, including Hugging Face, Kaggle, Ollama, and others.
About the Author

#### Sergio De Simone
Sergio De Simone has been working as a software engineer for over twenty-five years across a range of projects and companies, in work environments as different as Siemens, HP, and small startups. For the last 10+ years, his focus has been on development for mobile platforms and related technologies. He is currently working for BigML, Inc., where he leads iOS and macOS development.