DeepSeek V4 Flash 可以在 128GB 的 M3 Max 运行,还是 1M 上下文
TL;DR · AI Summary
DeepSeek V4 Flash 模型通过不对称优化和硬件特性绑定,在 128GB 内存的 M3 Max MacBook Pro 上实现了 1M 上下文的稳定运行。
Key Takeaways
- DeepSeek V4 Flash 使用不对称 2-bit 量化,仅对 MoE 专家部分进行量化,保持关键路径全精度。
- KV Cache 被优化至 SSD,利用 Apple Silicon 的统一内存架构和 NVMe SSD,实现长上下文的高效处理。
- ds4-engine 采用纯 Metal 实现,仅支持官方发布的 DeepSeek V4 Flash 模型,性能适合作为 agent 工具使用。
Outline
Jump quickly between sections.
- §背景介绍
Redis 创始人 Antirez 开源了 ds4,展示了如何在有限资源下运行 1M 上下文的 DeepSeek V4 Flash 模型。
模型的 MoE 专家部分使用 2-bit 量化,而关键路径保持全精度,有效降低了内存占用。
KV Cache 被优化至 SSD,利用 Apple Silicon 的统一内存架构和 NVMe SSD,实现长上下文的高效处理。
ds4-engine 采用纯 Metal 实现,仅支持官方发布的 DeepSeek V4 Flash 模型,性能适合作为 agent 工具使用。
- ·性能测试
在 M3 Max 128GB q2 版本下,短 prompt 生成 26.68 t/s,长 prompt 生成 21.47 t/s。
Mindmap
See how the topics connect at a glance.
查看大纲文本(无障碍 / 无 JS 友好)
- DeepSeek V4 Flash 在 128GB M3 Max 上的运行
Highlights
Key sentences worth saving and sharing.
ds4 把 KV Cache 做成「内存活跃状态」配合 「磁盘持久化前缀缓存」的组合,KV Cache 可以移到 SSD ,用 SHA1 哈希 token 前缀做 key,压缩后 KV row 直接 plain read/write 落地。
2-bit 量化有一定损失,目前只有 Metal、无 CUDA,同时 server 是单请求序列化,CPU path 还会触发 macOS kernel bug。
Antirez 提到过 CUDA 端口正在开发中,目前 private branch 上在 DGX Spark(GB10)跑通了 ~12 t/s generation + ~200 t/s prefill。
DeepSeek V4 Flash Can Run on a 128GB M3 Max with 1M Context
Read in 5 minutes Published on 2026-05-11, 2,763 views
Column: AI Thinking Notes
Follow
Recently, the founder of Redis, Antirez, released a project called [_ds4_](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4"), which uses pure C code to run the "DeepSeek V4 Flash MoE model" with a 1M context on a 128GB memory M3 Max MacBook Pro, while still supporting stable coding agent loops.
The key point here is that [_ds4_](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4") is not just a simple quantization operation, but rather an approach that combines "asymmetric optimization" with "deep hardware binding" to break the limitation of "long contexts requiring massive GPU/memory."
[_ds4_](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4") is actually not a general inference engine (unlike llama.cpp or vLLM), but is specifically tailored for the DeepSeek V4 Flash model. The core can be summarized into three technical concepts:
(1) Asymmetric 2-bit Quantization
The core approach here is to quantize "over 90% of the model parameters" on the routed experts of the MoE using 2-bit quantization (up/gate using IQ2_XXS, down using Q2_K), while keeping the critical path (routing, shared experts, projections, etc.) at full precision.
Because the expert part of the MoE model is large but activations are sparse, quantizing them has a much smaller impact on the final output than quantizing dense parts. Antirez himself verified this:
The q2 version performs reliably in the coding agent, with good loop behavior.
Compared to traditional 2-bit quantization, which would drastically reduce quality, this asymmetric approach, which "compresses the big parts but retains the essence," compresses the memory footprint to 128GB levels while keeping perplexity/quality loss within acceptable limits.
Thus, this is a form of "model-aware quantization" rather than generic quantization.
(2) Disk-native KV Cache
ds4 implements the KV Cache as a combination of "memory-active state" and "disk-persistent prefix cache." The KV Cache can be moved to SSD, using SHA1 hash token prefixes as keys, and compressed rows can be directly read/written without mmap to avoid macOS VM pressure.
It supports strategies like cold/continue/evict/shutdown, and includes a tool-call replay map to ensure precise replay of DSML.
Currently, there is still a live KV checkpoint in memory for the current session, but different sessions, restarts, and long prefix reuse can rely on disk KV cache to recover, avoiding the need to refill from token zero each time.
Thanks to Apple Silicon's Unified Memory architecture + ultra-fast NVMe SSD, the bandwidth and latency combination far exceed typical scenarios. The large KV Cache size (tens to hundreds of GB) generated by long contexts (1M tokens) is manageable due to the SSD throughput, only slightly reducing the generation speed:
From 26.68 t/s to 21.47 t/s with 11k+ token prefill.
This represents a complete paradigm shift? Generally, people widely believe that KV Cache must reside entirely in memory to avoid latency explosions, but Antirez's tests using the disk as "extended memory" prove that it is feasible under specific hardware + compression + optimized I/O conditions.
A 1M context doesn't need expanded memory; simply using SSD as swap can maintain a stable 27 tok/s, and Apple Silicon's unified memory + NVMe IO link perform surprisingly well in long contexts.
(3) Pure Metal Native Implementation
The entire engine consists of only a few thousand lines of C + Metal shaders, with no generic framework overhead (not relying on GGML/llama.cpp):
- Metal worker runs serially to avoid race conditions.
- Supports official DeepSeek V4 Flash GGUF (q2/q4 versions), with custom tensor layouts and metadata.
- Experimental MTP (speculative decoding) is supported, but with limited benefits.
In terms of official benchmarks, the performance test for the M3 Max 128GB q2 version:
- Short prompt: prefill 58.52 t/s, generation 26.68 t/s
- 11k+ token long prompt: prefill 250+ t/s, generation 21.47 t/s
27 t/s may not seem fast, but it is sufficient for the agent loop (thinking - calling tools - continuing generation), because the agent scenario is not real-time chat, and it still works well across multiple iterations.
Additionally, 2-bit quantization introduces some loss, currently only Metal support, no CUDA, and the server is serialized per request, triggering a macOS kernel bug on the CPU path.
Despite these limitations, a 128GB M3 Max can run it! Even with the ds4-server compatible with OpenAI/Anthropic, you can directly connect to OpenClaw, Claude Code, etc., using high-end models for planning and review, and local models for simple execution, achieving a hybrid mode.
However, to be honest, 27 t/s is suitable for agents but not for high concurrency or real-time conversations. The actual recommended context size for a 128GB machine is 100k–300k (1M is the theoretical upper limit, with memory reserved for the system and other processes). It does not support Windows or Linux, and the CUDA version is reportedly in development, but this indeed seems like a promising direction.
Antirez mentioned that the CUDA port is being developed, and it has been successfully tested on DGX Spark (GB10) with ~12 t/s generation + ~200 t/s prefill on the private branch.
The overall performance of ds4 can be referenced as follows:
Many people have already successfully tested running it on a 128GB M3 Max, downloading the q2 version allows direct execution. However, during testing, occasional hallucination of end tokens or parser state issues were observed with q2 quantization.
Additional Testing Results
Default DS4 settings have been tested to achieve 14–15 tokens per second (t/s) during the actual encoding of a 62K pre-filled conversation. Memory usage remains stable at around 85GB during the generation process, for a complete context window of 100K. The disk cache is approximately 8GB, and the maximum limitation is about one minute of waiting time for each 10K context segments to resume operations after compression.
According to the comment 「#46 FYI: Works with 96 GB as well」, it turns out that 96GB also works, indicating there is still room for further performance improvements. Optimizations such as Metal 4/M5 prefill, Linux build support, and typo fixes are still ongoing.
If you have an M3 Max with 128GB now, you can try it directly via GitHub with a single command:
make + download_model.sh.
Project Address
[github.com/antirez/ds4](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4")
Tags:
FrontendAI ProgrammingArtificial Intelligence
Topics:
This article is included in the following columns
AI Thoughts Diary
Column Directory
Interpretation and Reflection on AI Articles
44 Subscribers
43 Articles
Subscribe
Previous Article
AI Era Open Source Licenses Will Disappear, malus Satirically Demonstrates This
Comments: 17
0/ 1000
Punctuation marks and links do not count towards the effective word count
⌘ + Enter
Send
Login / Register to post comments!
Hot
Latest
When will it run smoothly on my M1 16GB laptop?
1 hour ago
Like
Comment
- Block Author: java菜小鸟
- Report
The token rate per second is slow, and is it not possible to use it concurrently (with multiple agents)?
17 hours ago
Like
Comment
- Block Author: 执器
- Report
@明略科技
Using Metal natively in ds4 is indeed cleaner than the MPS backend path in llama.cpp. Real-world testing shows that the M4 Pro 36GB can stably achieve over 60 tokens per second decoding DeepSeek-V3 Q4 quantized model, without memory issues. In end-user scenarios, when combined with tools like mano-cua, the local model + local control loop can already handle many tasks.
1 day ago
1
Comment
- Block Author: 明略科技
- Report
View all 17 comments
9
17
Favorite
Follow to get updates!
Follow
Follow to get updates!
Followed
Table of Contents
Collapse
Related Recommendations
[Practical Max, New Flutter & Dart Agent Skills in Depth Analysis 1.4k Reads · 22 Likes](https://juejin.cn/post/7637046499474538559 "Practical Max, New Flutter & Dart Agent Skills in Depth Analysis")[AndroidX Introduces a New AppState for Managing Compose State 842 Reads · 11 Likes](https://juejin.cn/post/7638535912314929206 "AndroidX Introduces a New AppState for Managing Compose State")[I Made Two Tools, One 7MB Shell, One That Remembers 631 Reads · 9 Likes](https://juejin.cn/post/7637754131332890659 "I Made Two Tools, One 7MB Shell, One That Remembers")[Open Source 4B Local Model, Use Any App as a Skill! Say Goodbye to Token Anxiety, Private and Secure~ 472 Reads · 3 Likes](https://juejin.cn/post/7637885957680939051 "Open Source 4B Local Model, Use Any App as a Skill! Say Goodbye to Token Anxiety, Private and Secure~")[Free Tokens for 0 Yuan/Month! SenseNova by SenseTime, Don't Miss Out! 98 Reads · 0 Likes](https://juejin.cn/post/7637804704889913385 "Free Tokens for 0 Yuan/Month! SenseNova by SenseTime, Don't Miss Out!")
Featured Content
[Bun v1.3.14 in-depth analysis: Image API, HTTP/3, Global Virtual Storage, and Fifty Changes iDao Technology Cube · 74 Reads · 2 Likes](https://juejin.cn/post/7639025195580194862 "Bun v1.3.14 in-depth analysis: Image API, HTTP/3, Global Virtual Storage, and Fifty Changes")[Boss Forced Me to Go into AI, I Secretly Ran LLaMA in My Browser, Saving 200,000 API Fees kyriewen · 98 Reads · 0 Likes](https://juejin.cn/post/7639265898830970921 "Boss Forced Me to Go into AI, I Secretly Ran LLaMA in My Browser, Saving 200,000 API Fees")[From Frontend to Backend: What is SQL 小小小小宇 · 67 Reads · 0 Likes](https://juejin.cn/post/7639208988976644111 "From Frontend to Backend: What is SQL")[React Observer Hooks: Seven Ways to Listen to DOM Without Boilerplate Code 前端导师顾北 · 37 Reads · 2 Likes](https://juejin.cn/post/7639270931059867694 "React Observer Hooks: Seven Ways to Listen to DOM Without Boilerplate Code")[【To Be Continued】React High-Frequency Interview Questions 卷帘依旧 · 27 Reads · 2 Likes](https://juejin.cn/post/7639181027916267535 "【To Be Continued】React High-Frequency Interview Questions")
Find Your Tech Community
Reply 'join' to join the official WeChat group

Recommended for You
* [DeepSeek V4 Release: 1.6 trillion parameters, 1 million context, breakthrough floor pricing](https://juejin.cn/post/7633624945063378984 "DeepSeek V4 Release: 1.6 trillion parameters, 1 million context, breakthrough floor pricing") After waiting for stars to fall three times, the domestically produced AI star DeepSeek finally released the latest DeepSeek V4. This period, the whole country has been urging, and competitors have been continuously releasing new models, various benchmarks, but DeepSeek remains
- ServBay
- 15 days ago
- 65
- 1
- Comment
[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")[AIGC](https://juejin.cn/tag/AIGC "AIGC")[AI Programming](https://juejin.cn/tag/AI%E7%BC%96%E7%A8%8B "AI Programming")
* [Redis Creator's Move: ds4, Crafting a DeepSeek V4 Flash Local Inference Engine in C, 6600+ Stars Behind the Hardcore Technical Breakdown](https://juejin.cn/post/7638437596683550726 "Redis Creator's Move: ds4, Crafting a DeepSeek V4 Flash Local Inference Engine in C, 6600+ Stars Behind the Hardcore Technical Breakdown") antirez (Redis Father) wrote a pure C + Metal local inference engine for DeepSeek V4 Flash from scratch. With 2-bit quantization, a 128GB MacBook can run a 284B parameter MoE model, and the KV Cache is directly persisted to SSD
- Wu Qiongqiong
- 2 days ago
- 14
- 1
- Comment
[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")
* [DeepSeek V4 Released: 1M Context Becomes Standard, Trillion Parameter MoE, Price Reduced to One-Fifth of Competitors](https://juejin.cn/post/7632208925454680098 "DeepSeek V4 Released: 1M Context Becomes Standard, Trillion Parameter MoE, Price Reduced to One-Fifth of Competitors") On April 24, the preview version of DeepSeek V4 was officially launched and simultaneously open-sourced. After numerous rumors and speculations in both Chinese and English AI circles about the delay of V4, it has finally come to fruition.
- Old Wang's AI Programming
- 19 days ago
- 65
- Like
- Comment
[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")
* [DeepSeek V4 Arrives: I Stayed Up All Night to Read the Technical Report](https://juejin.cn/post/7632208925455319074 "DeepSeek V4 Arrives: I Stayed Up All Night to Read the Technical Report") Preface: Been waiting for a long time. This morning when I woke up and checked my phone, DeepSeek V4 was released. It wasn't a teaser or a rumor; it was an actual release with open-source code. To be honest, I had grown numb waiting for this day—AI circles are full of "big news next week," and my ears have become calloused.
- HeteroCat
- 19 days ago
- 314
- 1
- Comment
[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")
* [DeepSeek-V4 Released: 1.6T MoE + 1M Context Open Source, How Will QA Industry Testing Be Reshaped?](https://juejin.cn/post/7632506858189144064 "DeepSeek-V4 Released: 1.6T MoE + 1M Context Open Source, How Will QA Industry Testing Be Reshaped?") On April 24, DeepSeek officially released the V4 preview version and simultaneously open-sourced it. This is the second time since V3 that DeepSeek has set new benchmarks for open-source large models. As a QA industry veteran with over 10 years of experience, today I want to focus on discussing how this update will impact the industry.
- Spring Breeze Through the Courtyard
- 17 days ago
- 31
- Like
- Comment
[Test](https://juejin.cn/tag/%E6%B5%8B%E8%AF%95 "Test")
* [DeepSeek V4 Released: How to Respond](https://juejin.cn/post/7635869939149717555 "DeepSeek V4 Released: How to Respond") As of April 24, 2026, DeepSeek V4 Preview is no longer just a rumor: official news pages, API update logs, pricing pages, and Hugging Face model cards all feature V4-Pro.
- User652060307843
- 8 days ago
- 44
- Like
- Comment
[Algorithm](https://juejin.cn/tag/%E7%AE%97%E6%B3%95 "Algorithm")
* [Hands-On Test of DeepSeek V4: Not Explosive, but Doing More Important Things](https://juejin.cn/post/7632237134600060980 "Hands-On Test of DeepSeek V4: Not Explosive, but Doing More Important Things") Hello everyone, I'm Cold Yi. After much anticipation, DeepSeek V4 has finally been released. This version comes in two flavors: V4 Pro and V4 Flash, both with 1M context and are open-sourced.
- Woyin AI
- 19 days ago
- 108
- Like
- Comment
[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")
* [DeepSeek V4 Fully Open-Sourced: Innovation Behind 1.6T Parameters](https://juejin.cn/post/7633987404987170826 "DeepSeek V4 Fully Open-Sourced: Innovation Behind 1.6T Parameters") What Happened On April 24, DeepSeek-AI officially released the V4 series preview version, simultaneously open-sourcing on Hugging Face and the ModelScope community under the MIT license, allowing commercial use. Two versions: V4-Pro (flagship): 1.6
- Quest Lab
- 14 days ago
- 40
- Like
- Comment
[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")
* [Easy Guide to Integrating DeepSeek V4 with Claude Code](https://juejin.cn/post/7632644475747860515 "Easy Guide to Integrating DeepSeek V4 with Claude Code") On April 24, 2026, DeepSeek v4 was released. This article provides a more reasonable guide and configuration content for setting up DeepSeek with Claude Code.
- Sigmarising
- 17 days ago
- 2.3k
- 1
- 2
[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")[Claude](https://juejin.cn/tag/Claude "Claude")
* [DeepSeek V4 Released: What Worries NVIDIA the Most Isn't the Model](https://juejin.cn/post/7632228821949136905 "DeepSeek V4 Released: What Worries NVIDIA the Most Isn't the Model") April 24, 2026. No press conference. No pre-launch hype. No countdown to reveal. DeepSeek simply dropped V4—open-sourced, launched its website, and rolled out its app and API updates all at once. Zero frame rate. Then...
- Xiao Tao
- 19 days ago
- 55
- Like
- Comment
[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")[AI Programming](https://juejin.cn/tag/AI%E7%BC%96%E7%A8%8B "AI Programming")
* [How to White-Label Access DeepSeek V4 with Claude Code](https://juejin.cn/post/7637065398849257498 "How to White-Label Access DeepSeek V4 with Claude Code") Using the free credits from Alibaba Cloud Bai Lian / ModelScope Community, you can quickly integrate Claude Code with DeepSeek V4 series models using the CC Switch desktop tool.
- Xing Hao AI
- 6 days ago
- 725
- 7
- Comment
[LLM](https://juejin.cn/tag/LLM "LLM")[VibeCoding](https://juejin.cn/tag/VibeCoding "VibeCoding")[Claude](https://juejin.cn/tag/Claude "Claude")
* [DeepSeek-TUI: A Terminal-Based Programming Agent Built on DeepSeek V4](https://juejin.cn/post/7635465776091824178 "DeepSeek-TUI: A Terminal-Based Programming Agent Built on DeepSeek V4") DeepSeek-TUI is a terminal-native programming agent built on the DeepSeek V4 model. This article analyzes its architectural features, capability boundaries, and applicable scenarios from a technical perspective. 01. Project Background and Challenges Faced Currently, terminal AI
- GitFun
- 10 days ago
- 55
- Like
- Comment
[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")
* [DeepSeek V4 Breakdown: The Technical Cards Behind Millions of Contexts, Domestic Computing Power Finally Crosses the Threshold](https://juejin.cn/post/7632264475764867126 "DeepSeek V4 Breakdown: The Technical Cards Behind Millions of Contexts, Domestic Computing Power Finally Crosses the Threshold") Table of Contents 1. Three Postponements Later: DeepSeek Finally Reveals Its Hand 2. Essential Changes: The Shift from a Compute Race to an Efficiency Race 3. Core Mechanism Breakdown: Three Dimensions of Technological Breakthroughs 4. Case Studies and Comparisons: Where Does V4 Stand 5. Engineering Implementation Insights: What Are Your Available
- Hogwarts Testing Development Club
- 18 days ago
- 29
- Like
- Comment
[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")
* [A Musician Who Wrote a Programming Tool Using AI: DeepSeek TUI, and This Article Was Written by It](https://juejin.cn/post/7637488101486002202 "A Musician Who Wrote a Programming Tool Using AI: DeepSeek TUI, and This Article Was Written by It") This article, from topic selection, outline, to every word, was written using DeepSeek TUI. 0. An Even More Surprising Story Before discussing this tool, let's talk about its creator. The creator of DeepSeek TUI is a person named Hunter B
- Star AI
- 4 days ago
- 115
- 1
- Comment
[Agent](https://juejin.cn/tag/Agent "Agent")[GitHub](https://juejin.cn/tag/GitHub "GitHub")[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")
* [Open Source Project Observation | ds4: Local Agent Inference, More Than Just Running Models](https://juejin.cn/post/7638839672551342118 "Open Source Project Observation | ds4: Local Agent Inference, More Than Just Running Models") Antirez, the author of Redis, has open-sourced a local inference engine for DeepSeek V4 Flash. It does not aim for universality but focuses on compressing model loading, KV caching, tool invocation, and Agent API adaptation into a dedicated solution
- Qiniu Developer
- 1 day ago
- 7
- Like
- Comment
[AI Programming](https://juejin.cn/tag/AI%E7%BC%96%E7%A8%8B "AI Programming")
Collection successful!
Already added to '____', click to change
- WeChat
WeChat QR code sharing
- Sina Weibo

AI Code Assistant is now live
Select your code and experience AI interpreting it for you instantly
Try Now
APP内打开
Download App Download App
WeChat Scan WeChat Official Account
- Sina Weibo
Choose the technology direction you're interested in
Backend
Frontend
Android
iOS
Artificial Intelligence
Development Tools
Code Life
Reading
Skip
Previous Step
At least select one category
Tip
Current operation failed. If you have any questions, you can click to appeal.
Go to Appeal I Understand
Immersive Reading
Confirm Blocking User
After blocking, the other party will not be able to follow you, interact with you, or view your profile.
Cancel Confirm