New @GoogleGemma 4 QAT (Quantization-Aware Training) checkpoints are here, so you can run models locally on consumer GPUs and mobile devices with minimal quality loss.

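Q: Gemma 4 QAT checkpoint release

Google has released QAT checkpoints for Gemma 4, so the models can run locally on consumer GPUs and mobile devices.

Q: GGUF (Q4_0) format

The Q4_0 GGUF checkpoints are aimed at maximum local inference performance across all model sizes and the drafter models.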
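To make the Q4_0 format less abstract, here is a minimal sketch of block-wise 4-bit quantization in the style commonly described for ggml's Q4_0 type: weights are grouped into blocks of 32, each block stores one scale plus 32 four-bit codes. The function names are hypothetical, the scale is kept as a Python float rather than fp16, and this is an illustration, not the code used to produce the Gemma 4 checkpoints.

```python
BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32 (assumption based on ggml's layout)


def quantize_q4_0_block(weights):
    """Quantize one block of 32 floats to (scale, list of 4-bit codes in 0..15)."""
    assert len(weights) == BLOCK_SIZE
    # Take the value with the largest magnitude, keeping its sign.
    extreme = max(weights, key=abs)
    scale = extreme / -8.0  # the extreme value maps to code 0
    if scale == 0.0:
        # All-zero block: every weight maps to the zero point (code 8).
        return 0.0, [8] * BLOCK_SIZE
    # Shift by 8 so codes are unsigned, round to nearest, clamp to 4 bits.
    codes = [min(15, max(0, int(w / scale + 8.5))) for w in weights]
    return scale, codes


def dequantize_q4_0_block(scale, codes):
    """Recover approximate floats from a scale and its 4-bit codes."""
    return [(c - 8) * scale for c in codes]


def bits_per_weight():
    """Storage cost: a 2-byte scale plus 32 four-bit codes per block."""
    block_bytes = 2 + BLOCK_SIZE // 2
    return block_bytes * 8 / BLOCK_SIZE


if __name__ == "__main__":
    block = [(i - 16) / 16 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_q4_0_block(block)
    restored = dequantize_q4_0_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale}, worst error={worst}, bits/weight={bits_per_weight()}")
```

Because every block carries its own scale, the effective cost is slightly above 4 bits per weight, which is why file sizes for 4-bit GGUF models are a little larger than a naive 4-bit estimate.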

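Q: Custom mobile schema

A custom mixed-precision schema designed for edge hardware shrinks Gemma 4 to under 1GB for mobile devices. It features targeted 2-bit decoding layers, optimized KV caches, and static activations.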
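The arithmetic behind a mixed-precision footprint can be sketched as follows. The parameter counts and bit widths below are purely hypothetical placeholders chosen for illustration; the announcement does not state how Gemma 4's parameters are split across precisions, and this estimate covers weights only, not KV caches or activations.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


def footprint_gib(groups):
    """Total weight storage in GiB for an iterable of (num_params, bits_per_weight)."""
    total = sum(weight_bytes(n, b) for n, b in groups)
    return total / 2**30


# HYPOTHETICAL split, not taken from the announcement:
# 1.5B parameters stored at 2 bits, 0.5B parameters stored at 4 bits.
HYPOTHETICAL_GROUPS = [
    (1_500_000_000, 2),
    (500_000_000, 4),
]

# The same hypothetical parameter count stored entirely at 16 bits, for comparison.
HYPOTHETICAL_BASELINE = [(2_000_000_000, 16)]

if __name__ == "__main__":
    print(f"mixed precision: {footprint_gib(HYPOTHETICAL_GROUPS):.2f} GiB")
    print(f"16-bit baseline: {footprint_gib(HYPOTHETICAL_BASELINE):.2f} GiB")
```

The point of the sketch is only that pushing most weights to 2 bits is what makes a sub-1GB total plausible; the real schema's layer assignment would have to come from Google's documentation.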

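Q: Difference between QAT and PTQ

QAT simulates compression during training, whereas Post-Training Quantization (PTQ) compresses the model only afterward. According to the announcement, this reduces the memory footprint and speeds up decoding while preserving reasoning quality.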
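"Simulating compression during training" is usually done with fake quantization: the forward pass uses weights rounded to the low-precision grid, while gradients update a full-precision copy (the straight-through estimator). The toy below shows that mechanism on a one-parameter model in plain Python. All names, the grid scale, and the data are assumptions for illustration; it is not Google's training recipe and makes no claim about quality.

```python
def fake_quantize(w, scale, bits=4):
    """Round w to the nearest point on a signed integer grid, then map back to float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(xs, ys, scale=0.25, bits=4, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # full-precision weight kept only during training
    for _ in range(steps):
        wq = fake_quantize(w, scale, bits)  # forward pass uses the quantized weight
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(wq)/dw as 1 and update w directly.
        w -= lr * grad
    return w, fake_quantize(w, scale, bits)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [1.1 * x for x in xs]  # the ideal weight 1.1 is not on the 0.25 grid
    w, wq = train_qat(xs, ys)
    print(f"full-precision weight: {w:.4f}, deployed quantized weight: {wq}")
```

In PTQ the rounding step would be applied once after ordinary training; in QAT the model sees the rounding error at every step and can adapt its other parameters to it, which is the intuition behind the quality claim above.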

Google AI Developers (@googleaidevs), June 5, 2026

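TL;DR · AI summary

Google has released QAT checkpoints for Gemma 4 for local use on consumer GPUs and mobile devices: Q4_0 GGUF checkpoints across all model sizes, plus a custom mobile schema that shrinks the model to under 1GB, with minimal quality loss.

Key points

The Gemma 4 QAT checkpoints ship in Q4_0 GGUF format across all model sizes and drafter models, targeting maximum local inference performance.
The custom mobile schema shrinks Gemma 4 to under 1GB using targeted 2-bit decoding layers, optimized KV caches, and static activations.
Simulating compression during training (QAT) rather than after it (PTQ) reduces the memory footprint and accelerates decoding while preserving reasoning quality.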

Mind map

Gemma 4 QAT checkpoints
- GGUF (Q4_0)
  - Maximum local performance
  - Covers all model sizes
- Mobile schema
  - <1GB memory
  - 2-bit decoding layers
  - Optimized KV caches
  - Static activations
- QAT vs PTQ
  - Preserves reasoning quality
  - Faster decoding

#Gemma #QAT #GGUF #MobileInference #Quantization

What’s new:

🔹 GGUF (Q4_0) Checkpoints: Max local performance across all sizes and drafter models

🔹 Custom Mobile Schema: We shrunk Gemma 4 down to less than 1GB for mobile devices by using a custom mixed precision schema designed for edge hardware (featuring targeted 2-bit decoding layers, optimized KV caches, and static activations)

By simulating compression during training rather than after (Post-Training Quantization), we've drastically reduced the memory footprint and accelerated decode speeds while preserving reasoning quality.