What's new with llama.cpp?

traeai has indexed 7 items related to llama.cpp. The most recent is "Gemma 4 12B: The Developer Guide", published by Google Developers Blog.

Product

llama.cpp

Alias: llama-server

A C++ inference engine for running large language models locally, with support for CPUs, GPUs, and Apple silicon.

7 highly relevant items tracked

TraeAI Observations

If you only read three

Gemma 4 12B: The Developer Guide

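Google Developers Blog · Score 9.2

Gemma 4 12B uses an encoder-free multimodal architecture, runs locally on devices with 16GB of VRAM, and natively supports audio input. Removing the separate vision and audio encoders significantly reduces latency, and a dedicated MTP model speeds up inference. It is the first mid-sized multimodal model to support fully offline interaction on the macOS desktop.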

How to Run LLMs Locally (Great For Learning and Privacy)

ByteByteGo · Score 8.5

Large language models (LLMs) can be run locally with tools such as llama.cpp, Ollama, and LM Studio, which serves both privacy and learning.

Reachy Mini goes fully local

Hugging Face Blog · Score 8.5

Reachy Mini can now run its voice backend locally, with no connection to a cloud server.

Gemma 4 12B: The Developer Guide

Google Developers Blog · June 5 · 1171 words (about 5 min)

Gemma 4 12B features an encoder-free multimodal architecture that runs locally on 16GB VRAM devices with native audio support. By eliminating separate vision and audio encoders, it reduces latency and pairs with a dedicated MTP model for faster inference, marking the first mid-sized multimodal model with a macOS desktop app for fully offline interaction.

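Why it was selected: Gemma 4 12B removes the separate encoders; vision uses only a 35M-parameter embedding layer, and audio is linearly projected directly into the LLM input space.

Featured · Article · #Gemma 4 #Multimodal LLM #Encoder-Free Architecture #Local AI #Google · English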

How to Run LLMs Locally (Great For Learning and Privacy)

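ByteByteGo · Yesterday · 1316 words (about 6 min)

Large language models (LLMs) can be run locally with tools such as llama.cpp, Ollama, and LM Studio, which serves both privacy and learning.

Why it was selected: llama.cpp can run large models on consumer-grade hardware and supports 4-bit quantization.

Featured · Video · #LLM #Run Locally #AI #Quantization #Ollama · English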
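
To make the 4-bit quantization point concrete, here is a rough back-of-the-envelope sketch of how bit width affects the memory needed for model weights. The function name and the 27B parameter count are illustrative assumptions, not taken from the video. The estimate counts weights only: it ignores the KV cache, activations, and the per-block scale data that real quantized formats store, so actual files are somewhat larger.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 27e9  # assumed example size: a 27B-parameter model
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{bits:>2}-bit weights: about {gb:.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is why quantized models fit on consumer hardware that could not hold the full-precision weights.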

Reachy Mini goes fully local

Hugging Face Blog · May 27 · 1966 words (about 8 min)

Reachy Mini now runs its voice backend locally, eliminating the need for cloud servers.

Why it was selected: it deploys a local voice backend on Reachy Mini.

Featured · Article · #Reachy Mini #Voice Backend #Local Service · Chinese

This is where we are right now. And i’m not gonna lie it feels pretty magical 🧚‍♀️ Qwen3.6 27B run...

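Julien Chaumond (@julien_c) · May 2 · 376 words (about 2 min)

Julien Chaumond shows the Qwen3.6-27B model running the Pi coding agent locally on a MacBook Pro via llama.cpp. On Hugging Face codebase tasks its performance approaches that of Claude Opus, and it runs fully offline.

Why it was selected: Qwen3.6-27B can now run coding tasks locally and efficiently on a consumer Mac.

Featured · Tweet · #Qwen #Llama.cpp #Pi Agent #Local LLM #Hugging Face · Chinese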

llama.cpp with MTP Support Makes Local Models Fast Enough for Daily Use

clem 🤗 (@ClementDelangue) · May 24 · 92 words (about 1 min)

With MTP support, llama.cpp improves local model inference speed by 78%, boosting Qwen3.6-27B from 25 to 45 tokens/sec on A10G.

Why it was selected: MTP support increases llama.cpp inference speed by 78%.

Featured · Tweet · #llama.cpp #MTP #Qwen #local model #inference speed · English
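
As a quick sanity check on the throughput figures quoted in this tweet, the sketch below computes the relative gain from the two tokens/sec values and the time saved on a fixed-length response. The helper names and the 500-token response length are hypothetical, chosen only for illustration.

```python
def speedup_percent(before_tps: float, after_tps: float) -> float:
    """Relative throughput gain, in percent."""
    return (after_tps - before_tps) * 100 / before_tps


def seconds_to_generate(n_tokens: int, tps: float) -> float:
    """Wall-clock time to generate n_tokens at a steady tokens/sec rate."""
    return n_tokens / tps


if __name__ == "__main__":
    before, after = 25.0, 45.0  # tokens/sec figures quoted in the tweet
    print(f"gain: {speedup_percent(before, after):.0f}%")
    n_tokens = 500  # assumed response length
    print(f"before: {seconds_to_generate(n_tokens, before):.1f} s")
    print(f"after:  {seconds_to_generate(n_tokens, after):.1f} s")
```

The rounded 25 and 45 tokens/sec figures work out to an 80% gain, close to the 78% quoted, so the tokens/sec values are presumably rounded measurements.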

How to Run llama.cpp with MTP (Multi-token Prediction)

Julien Chaumond (@julien_c) · May 20 · 255 words (about 2 min)

MTP is a new speculative decoding feature built into llama.cpp that can approximately double token generation speed for most use cases, achieving ~30 tok/sec with the Dense 27B model and ~100 tok/sec with the MoE model.

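Why it was selected: MTP is a new speculative decoding feature built into the model itself, and it can roughly double token generation speed.

Featured · Tweet · #llama.cpp #MTP #Speculative Decoding #Qwen #LLM Inference Optimization · English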

> Ecosystem: Compatible with llama.cpp, MLX, @LMStudio, vLLM, @ollama, @UnslothAI, and SGLang.
> ...

Google AI Developers: Gemma 4 Ecosystem Compatibility and Downloads

Google AI Developers (@googleaidevs) · June 4 · 78 words (about 1 min)

Google announces its model weights are compatible with major open-source ecosystems and can be directly downloaded from Hugging Face and Kaggle, lowering deployment barriers.

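Why it was selected: Gemma 4 weights are compatible with ecosystems such as llama.cpp, vLLM, and Ollama, which makes local deployment and inference easier.

Featured · Tweet · #Gemma #Open-source Ecosystem #Model Deployment #Hugging Face #Kaggle · English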
