New @GoogleGemma 4 QAT (Quantization-Aware Training) checkpoints are here, so you can run models locally on consumer GPUs and mobile devices with minimal quality loss.

TL;DR · AI Summary
Google releases Gemma 4 QAT checkpoints, enabling local inference on consumer GPUs and mobile devices with Q4_0 GGUF format, keeping memory below 1GB while preserving high inference quality.
Key Takeaways
- Gemma 4 QAT checkpoints use Q4_0 GGUF format, compatible with all model sizes fo
- Custom mobile schema shrinks Gemma 4 to <1GB, employing 2‑bit decoding layers, o
- Training‑time quantization (QAT) preserves inference quality and speeds decoding
Outline
Jump quickly between sections.
Google publicly releases Gemma 4 QAT checkpoints for local inference on consumer GPUs and mobile devices.
Using Q4_0 GGUF format achieves maximum local inference performance across all model sizes.
A custom mixed‑precision schema shrinks Gemma 4 to <1GB, featuring 2‑bit decoding layers, optimized KV caches, and static activations.
Simulating compression during training (QAT) preserves inference quality and accelerates decoding better than post‑training quantization.
Mindmap
See how the topics connect at a glance.
查看大纲文本(无障碍 / 无 JS 友好)
- Gemma 4 QAT 检查点
- GGUF (Q4_0)
- 最高本地性能
- 所有尺寸模型兼容
- 移动 Schema
- <1GB 内存
- 2-bit 解码层
- 优化 KV 缓存
- 静态激活
- QAT vs PTQ
- 保持推理质量
- 加速解码速度
Highlights
Key sentences worth saving and sharing.
Gemma 4 QAT checkpoints use Q4_0 GGUF format, compatible with all model sizes for maximum local inference performance.
Custom mobile schema shrinks Gemma 4 to <1GB, employing 2‑bit decoding layers, optimized KV caches, and static activations to reduce memory footprint.
Training‑time quantization (QAT) preserves inference quality and speeds decoding more effectively than post‑training quantization (PTQ).

New
4 QAT (Quantization‑Aware Training) checkpoints are here, so you can run models locally on consumer GPUs and mobile devices with minimal quality loss. What’s new: GGUF (Q4_0): Checkpoints: Max local performance across all sizes and drafter models
Custom Mobile Schema: We shrunk Gemma 4 down to less than 1 GB for mobile devices by using a custom mixed‑precision schema designed for edge hardware (featuring targeted 2‑bit decoding layers, optimized KV caches, and static activations). By simulating compression during training rather than after (Post‑Training Quantization), we've drastically reduced the memory footprint and accelerated decode speeds while preserving reasoning quality.