InfoQ

Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation


TL;DR · AI Summary

Gemma 4 can be paired with multi-token prediction (MTP) drafters, which deliver up to ~3x faster token generation and improve inference efficiency without changing the model's output quality.

Key Takeaways

  • Gemma 4 uses multi-token prediction to achieve up to 3x faster token generation.
  • The technique enables parallel processing of multiple tokens, reducing redundant computation.
  • It is especially beneficial for real-time applications like chatbots and code generation.


Mindmap

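  • Gemma 4 multi-token prediction
    • How it works
      • Parallel prediction
      • Less redundant computation
    • Performance gains
      • Up to 3x faster
      • Lower latency
    • Use cases
      • Conversational systems
      • Code generation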


#AI #LLM #Gemma #Transformer #Token Generation

May 25, 2026 2 min read


Gemma 4 can be paired with multi-token prediction (MTP) drafters that use speculative decoding to generate multiple tokens in parallel, allowing the model to verify them in a single pass and achieve up to ~3× faster inference without quality loss.

Multi-token prediction drafters are lightweight auxiliary models that work alongside Gemma 4 to address the LLM memory-bandwidth bottleneck. As Google engineers explain, during inference the processor spends most of its time repeatedly moving billions of parameters from VRAM to compute units for each token. This constant data movement increases latency and leaves compute resources underutilized, particularly on consumer hardware.
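A rough calculation makes the bottleneck concrete. The sketch below assumes a dense model whose weights are all read from memory once per generated token and ignores KV cache and activation traffic; the function name and the hardware figures (16-bit weights, 1000 GB/s of memory bandwidth) are illustrative assumptions, not numbers from Google or the article.

```python
def min_ms_per_token(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    bytes_moved = n_params * bytes_per_param
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical setup: 31e9 parameters, 2 bytes each, 1000 GB/s memory bandwidth.
latency = min_ms_per_token(31e9, 2, 1000)
print(f"{latency:.0f} ms/token, about {1000 / latency:.1f} tokens/s at best")
# prints: 62 ms/token, about 16.1 tokens/s at best
```

Under these assumptions the ceiling is set by memory traffic rather than arithmetic, which is why committing several tokens per weight read can raise throughput.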

This inefficiency is compounded by the fact that an LLM spends the same amount of computation predicting an obvious next token as it does solving a "complex logic puzzle", which is where multi-token prediction drafters can help.

As Google's engineers put it: "By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to 'predict' several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel."
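The draft-and-verify loop described here can be sketched with toy models. This is a minimal illustration of greedy speculative decoding in general, not Gemma 4's implementation: `toy_target`, `toy_drafter`, `speculative_step`, and `generate` are hypothetical names, the "models" are plain functions over integer tokens, and the verification loop only simulates what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token sequence to its greedy next token


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens it commits."""
    # Draft phase: the cheap model proposes k tokens one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # Verify phase: the target checks each drafted position. A real engine scores
    # all positions in one batched forward pass; this loop only simulates that.
    accepted: List[int] = []
    for proposed in draft:
        expected = target(context + accepted)
        accepted.append(expected)   # the target's token is always the one kept
        if proposed != expected:    # first disagreement: drop the rest of the draft
            return accepted
    accepted.append(target(context + accepted))  # all k accepted: one extra token
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens and count how many verification rounds were needed."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out += speculative_step(target, drafter, out, k)
        passes += 1
    return out[:len(prompt) + n_new], passes


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10        # "correct" model: count upward modulo 10


def toy_drafter(seq: List[int]) -> int:
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10  # wrong whenever the last token is 2


tokens, passes = generate(toy_target, toy_drafter, [0], n_new=12)
print(tokens, passes)
```

Because every committed token comes from the target, the output matches plain greedy decoding with the target alone; the drafter only changes how many target passes are needed to produce it.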

Google says that multi-token prediction drafters improve responsiveness and speed up inference across devices, from personal computers and consumer GPUs running the Gemma 4 26B MoE and 31B dense models to mobile devices running the E2B and E4B variants, all without sacrificing response quality:

"Because the primary Gemma 4 model retains the final verification, you get identical frontier-class reasoning and accuracy, just delivered significantly faster."

Google implemented a series of architectural enhancements and hardware-specific optimizations to ensure that MTP drafters deliver maximum efficiency, and provided an in-depth visual explanation of how the drafters work in an x.com thread.

Reddit commenter FarrisAT described Gemma 4 MTP as "pretty impressive stuff", but cautioned that local models still make too many mistakes, suggesting the real benefits will emerge when "those models get closer to the leading edge".

Another user, Gohab2001, noted that MTP itself is a well-known technique with a major drawback for local deployments: two models have to be loaded in memory. They also pointed out that the real advance in Gemma 4's MTP drafter implementation is that the drafters share the target model's KV cache, which helps reduce the technique's overhead.

On Hacker News, zozbot234 notes that "MTP is mostly useful when you have one or a few users, which means compute is abundant", as in mobile or edge scenarios, while offering limited benefits for large-scale API providers.
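A simple cost model illustrates this trade-off. The sketch below uses the standard expected-acceptance formula for speculative decoding, assuming each drafted token is accepted independently with probability `alpha`; the `verify_cost` parameter and every number in the example are hypothetical stand-ins for how expensive a multi-token verification pass becomes once compute is saturated, not measurements of Gemma 4.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass when k tokens are drafted and
    each is accepted independently with probability alpha."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Speedup over plain decoding.

    draft_cost:  cost of one drafter step relative to one target decode step.
    verify_cost: cost of verifying k + 1 tokens relative to decoding one token
                 (close to 1.0 when memory-bound, larger when compute-bound).
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + verify_cost)


# Hypothetical numbers: 80% acceptance, 4 drafted tokens, drafter at 5% of target cost.
print(f"{estimated_speedup(0.8, 4, 0.05):.2f}")                   # prints: 2.80
print(f"{estimated_speedup(0.8, 4, 0.05, verify_cost=3.0):.2f}")  # prints: 1.05
```

With spare compute, verification costs about as much as decoding one token and the gain is large; when a server is already batching many users, the same verification pass competes for saturated compute and most of the benefit disappears.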

Gemma 4 MTP-enabled variants are available on several platforms, including Hugging Face, Kaggle, Ollama, and others.

About the Author


#### Sergio De Simone

Sergio De Simone is a software engineer. He has worked as a software engineer for over twenty-five years across a range of projects and companies, in work environments as different as Siemens, HP, and small startups. For the last 10+ years, his focus has been on development for mobile platforms and related technologies. He is currently working for BigML, Inc., where he leads iOS and macOS development.


#### This content is in the AI, ML & Data Engineering topic
