T
traeai
登录
返回首页
掘金本周最热

DeepSeek V4 Flash 可以在 128GB 的 M3 Max 运行,还是 1M 上下文

8.5Score
DeepSeek V4 Flash 可以在 128GB 的 M3 Max 运行,还是 1M 上下文

TL;DR · AI 摘要

DeepSeek V4 Flash 模型通过不对称优化和硬件特性绑定,在 128GB 内存的 M3 Max MacBook Pro 上实现了 1M 上下文的稳定运行。

核心要点

  • DeepSeek V4 Flash 使用不对称 2-bit 量化,仅对 MoE 专家部分进行量化,保持关键路径全精度。
  • KV Cache 被优化至 SSD,利用 Apple Silicon 的统一内存架构和 NVMe SSD,实现长上下文的高效处理。
  • ds4-engine 采用纯 Metal 实现,仅支持官方发布的 DeepSeek V4 Flash 模型,性能适合作为 agent 工具使用。

Outline

Jump quickly between sections.

  1. Redis 创始人 Antirez 开源了 ds4,展示了如何在有限资源下运行 1M 上下文的 DeepSeek V4 Flash 模型。

  2. 模型的 MoE 专家部分使用 2-bit 量化,而关键路径保持全精度,有效降低了内存占用。

  3. KV Cache 被优化至 SSD,利用 Apple Silicon 的统一内存架构和 NVMe SSD,实现长上下文的高效处理。

  4. ds4-engine 采用纯 Metal 实现,仅支持官方发布的 DeepSeek V4 Flash 模型,性能适合作为 agent 工具使用。

  5. 在 M3 Max 128GB q2 版本下,短 prompt 生成 26.68 t/s,长 prompt 生成 21.47 t/s。

Mindmap

See how the topics connect at a glance.

查看大纲文本(无障碍 / 无 JS 友好)
  • DeepSeek V4 Flash 在 128GB M3 Max 上的运行

Highlights

Key sentences worth saving and sharing.

  • ds4 把 KV Cache 做成「内存活跃状态」配合 「磁盘持久化前缀缓存」的组合,KV Cache 可以移到 SSD ,用 SHA1 哈希 token 前缀做 key,压缩后 KV row 直接 plain read/write 落地。

    第 3 段

    ⬇︎ 下载 PNG𝕏 分享到 X
  • 2-bit 量化有一定损失,目前只有 Metal、无 CUDA,同时 server 是单请求序列化,CPU path 还会触发 macOS kernel bug。

    第 5 段

    ⬇︎ 下载 PNG𝕏 分享到 X
  • Antirez 提到过 CUDA 端口正在开发中,目前 private branch 上在 DGX Spark(GB10)跑通了 ~12 t/s generation + ~200 t/s prefill。

    第 6 段

    ⬇︎ 下载 PNG𝕏 分享到 X
#DeepSeek#MoE#量化#Apple Silicon#CUDA
Open original article

DeepSeek V4 Flash Can Run on a 128GB M3 Max with 1M Context

恋猫de小郭

Read in 5 minutes Published on 2026-05-11, 2,763 views

Column: AI Thinking Notes

Follow

Recently, the founder of Redis, Antirez, released a project called [_ds4_](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4"), which uses pure C code to run the "DeepSeek V4 Flash MoE model" with a 1M context on a 128GB memory M3 Max MacBook Pro, while still supporting stable coding agent loops.

Image 10

The key point here is that [_ds4_](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4") is not just a simple quantization operation, but rather an approach that combines "asymmetric optimization" with "deep hardware binding" to break the limitation of "long contexts requiring massive GPU/memory."

[_ds4_](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4") is actually not a general inference engine (unlike llama.cpp or vLLM), but is specifically tailored for the DeepSeek V4 Flash model. The core can be summarized into three technical concepts:

(1) Asymmetric 2-bit Quantization

The core approach here is to quantize "over 90% of the model parameters" on the routed experts of the MoE using 2-bit quantization (up/gate using IQ2_XXS, down using Q2_K), while keeping the critical path (routing, shared experts, projections, etc.) at full precision.

Because the expert part of the MoE model is large but activations are sparse, quantizing them has a much smaller impact on the final output than quantizing dense parts. Antirez himself verified this:

The q2 version performs reliably in the coding agent, with good loop behavior.

Compared to traditional 2-bit quantization, which would drastically reduce quality, this asymmetric approach, which "compresses the big parts but retains the essence," compresses the memory footprint to 128GB levels while keeping perplexity/quality loss within acceptable limits.

Thus, this is a form of "model-aware quantization" rather than generic quantization.

(2) Disk-native KV Cache

ds4 implements the KV Cache as a combination of "memory-active state" and "disk-persistent prefix cache." The KV Cache can be moved to SSD, using SHA1 hash token prefixes as keys, and compressed rows can be directly read/written without mmap to avoid macOS VM pressure.

It supports strategies like cold/continue/evict/shutdown, and includes a tool-call replay map to ensure precise replay of DSML.

Currently, there is still a live KV checkpoint in memory for the current session, but different sessions, restarts, and long prefix reuse can rely on disk KV cache to recover, avoiding the need to refill from token zero each time.

Thanks to Apple Silicon's Unified Memory architecture + ultra-fast NVMe SSD, the bandwidth and latency combination far exceed typical scenarios. The large KV Cache size (tens to hundreds of GB) generated by long contexts (1M tokens) is manageable due to the SSD throughput, only slightly reducing the generation speed:

From 26.68 t/s to 21.47 t/s with 11k+ token prefill.

This represents a complete paradigm shift? Generally, people widely believe that KV Cache must reside entirely in memory to avoid latency explosions, but Antirez's tests using the disk as "extended memory" prove that it is feasible under specific hardware + compression + optimized I/O conditions.

A 1M context doesn't need expanded memory; simply using SSD as swap can maintain a stable 27 tok/s, and Apple Silicon's unified memory + NVMe IO link perform surprisingly well in long contexts.

(3) Pure Metal Native Implementation

The entire engine consists of only a few thousand lines of C + Metal shaders, with no generic framework overhead (not relying on GGML/llama.cpp):

  • Metal worker runs serially to avoid race conditions.
  • Supports official DeepSeek V4 Flash GGUF (q2/q4 versions), with custom tensor layouts and metadata.
  • Experimental MTP (speculative decoding) is supported, but with limited benefits.

In terms of official benchmarks, the performance test for the M3 Max 128GB q2 version:

  • Short prompt: prefill 58.52 t/s, generation 26.68 t/s
  • 11k+ token long prompt: prefill 250+ t/s, generation 21.47 t/s

27 t/s may not seem fast, but it is sufficient for the agent loop (thinking - calling tools - continuing generation), because the agent scenario is not real-time chat, and it still works well across multiple iterations.

Additionally, 2-bit quantization introduces some loss, currently only Metal support, no CUDA, and the server is serialized per request, triggering a macOS kernel bug on the CPU path.

Despite these limitations, a 128GB M3 Max can run it! Even with the ds4-server compatible with OpenAI/Anthropic, you can directly connect to OpenClaw, Claude Code, etc., using high-end models for planning and review, and local models for simple execution, achieving a hybrid mode.

However, to be honest, 27 t/s is suitable for agents but not for high concurrency or real-time conversations. The actual recommended context size for a 128GB machine is 100k–300k (1M is the theoretical upper limit, with memory reserved for the system and other processes). It does not support Windows or Linux, and the CUDA version is reportedly in development, but this indeed seems like a promising direction.

Antirez mentioned that the CUDA port is being developed, and it has been successfully tested on DGX Spark (GB10) with ~12 t/s generation + ~200 t/s prefill on the private branch.

Image 11
Image 12

The overall performance of ds4 can be referenced as follows:

Image 13

Many people have already successfully tested running it on a 128GB M3 Max, downloading the q2 version allows direct execution. However, during testing, occasional hallucination of end tokens or parser state issues were observed with q2 quantization.

Image 14

Additional Testing Results

Default DS4 settings have been tested to achieve 14–15 tokens per second (t/s) during the actual encoding of a 62K pre-filled conversation. Memory usage remains stable at around 85GB during the generation process, for a complete context window of 100K. The disk cache is approximately 8GB, and the maximum limitation is about one minute of waiting time for each 10K context segments to resume operations after compression.

Image 15: ezgif-761cbe557836730e.gif

According to the comment 「#46 FYI: Works with 96 GB as well」, it turns out that 96GB also works, indicating there is still room for further performance improvements. Optimizations such as Metal 4/M5 prefill, Linux build support, and typo fixes are still ongoing. Image 16

If you have an M3 Max with 128GB now, you can try it directly via GitHub with a single command: make + download_model.sh.

Project Address

[github.com/antirez/ds4](https://link.juejin.cn/?target=https%3A%2F%2Fgithub.com%2Fantirez%2Fds4 "https://github.com/antirez/ds4")

Tags:

FrontendAI ProgrammingArtificial Intelligence

Topics:

Daily Selection Articles

This article is included in the following columns

Image 17: cover

AI Thoughts Diary

Column Directory

Interpretation and Reflection on AI Articles

44 Subscribers

43 Articles

Subscribe

Previous Article

AI Era Open Source Licenses Will Disappear, malus Satirically Demonstrates This

Comments: 17

Image 18: avatar

0/ 1000

Punctuation marks and links do not count towards the effective word count

⌘ + Enter

Send

Login / Register to post comments!

Hot

Latest

java菜小鸟

When will it run smoothly on my M1 16GB laptop?

1 hour ago

Like

Comment

  • Block Author: java菜小鸟
  • Report

执器

The token rate per second is slow, and is it not possible to use it concurrently (with multiple agents)?

17 hours ago

Like

Comment

  • Block Author: 执器
  • Report

明略科技

@明略科技

Using Metal natively in ds4 is indeed cleaner than the MPS backend path in llama.cpp. Real-world testing shows that the M4 Pro 36GB can stably achieve over 60 tokens per second decoding DeepSeek-V3 Q4 quantized model, without memory issues. In end-user scenarios, when combined with tools like mano-cua, the local model + local control loop can already handle many tasks.

1 day ago

1

Comment

  • Block Author: 明略科技
  • Report

View all 17 comments

Image 22 9

Image 23 17

Image 24 Favorite

Follow to get updates!

Image 25: avatar

Follow

572 Posts5.6m Reads37k Fans

Follow to get updates!

Followed

Private Message

Table of Contents

Collapse

Related Recommendations

[Practical Max, New Flutter & Dart Agent Skills in Depth Analysis 1.4k Reads · 22 Likes](https://juejin.cn/post/7637046499474538559 "Practical Max, New Flutter & Dart Agent Skills in Depth Analysis")[AndroidX Introduces a New AppState for Managing Compose State 842 Reads · 11 Likes](https://juejin.cn/post/7638535912314929206 "AndroidX Introduces a New AppState for Managing Compose State")[I Made Two Tools, One 7MB Shell, One That Remembers 631 Reads · 9 Likes](https://juejin.cn/post/7637754131332890659 "I Made Two Tools, One 7MB Shell, One That Remembers")[Open Source 4B Local Model, Use Any App as a Skill! Say Goodbye to Token Anxiety, Private and Secure~ 472 Reads · 3 Likes](https://juejin.cn/post/7637885957680939051 "Open Source 4B Local Model, Use Any App as a Skill! Say Goodbye to Token Anxiety, Private and Secure~")[Free Tokens for 0 Yuan/Month! SenseNova by SenseTime, Don't Miss Out! 98 Reads · 0 Likes](https://juejin.cn/post/7637804704889913385 "Free Tokens for 0 Yuan/Month! SenseNova by SenseTime, Don't Miss Out!")

Featured Content

[Bun v1.3.14 in-depth analysis: Image API, HTTP/3, Global Virtual Storage, and Fifty Changes iDao Technology Cube · 74 Reads · 2 Likes](https://juejin.cn/post/7639025195580194862 "Bun v1.3.14 in-depth analysis: Image API, HTTP/3, Global Virtual Storage, and Fifty Changes")[Boss Forced Me to Go into AI, I Secretly Ran LLaMA in My Browser, Saving 200,000 API Fees kyriewen · 98 Reads · 0 Likes](https://juejin.cn/post/7639265898830970921 "Boss Forced Me to Go into AI, I Secretly Ran LLaMA in My Browser, Saving 200,000 API Fees")[From Frontend to Backend: What is SQL 小小小小宇 · 67 Reads · 0 Likes](https://juejin.cn/post/7639208988976644111 "From Frontend to Backend: What is SQL")[React Observer Hooks: Seven Ways to Listen to DOM Without Boilerplate Code 前端导师顾北 · 37 Reads · 2 Likes](https://juejin.cn/post/7639270931059867694 "React Observer Hooks: Seven Ways to Listen to DOM Without Boilerplate Code")[【To Be Continued】React High-Frequency Interview Questions 卷帘依旧 · 27 Reads · 2 Likes](https://juejin.cn/post/7639181027916267535 "【To Be Continued】React High-Frequency Interview Questions")

Find Your Tech Community

Reply 'join' to join the official WeChat group

Image 28

Recommended for You

* [DeepSeek V4 Release: 1.6 trillion parameters, 1 million context, breakthrough floor pricing](https://juejin.cn/post/7633624945063378984 "DeepSeek V4 Release: 1.6 trillion parameters, 1 million context, breakthrough floor pricing") After waiting for stars to fall three times, the domestically produced AI star DeepSeek finally released the latest DeepSeek V4. This period, the whole country has been urging, and competitors have been continuously releasing new models, various benchmarks, but DeepSeek remains

[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")[AIGC](https://juejin.cn/tag/AIGC "AIGC")[AI Programming](https://juejin.cn/tag/AI%E7%BC%96%E7%A8%8B "AI Programming")

* [Redis Creator's Move: ds4, Crafting a DeepSeek V4 Flash Local Inference Engine in C, 6600+ Stars Behind the Hardcore Technical Breakdown](https://juejin.cn/post/7638437596683550726 "Redis Creator's Move: ds4, Crafting a DeepSeek V4 Flash Local Inference Engine in C, 6600+ Stars Behind the Hardcore Technical Breakdown") antirez (Redis Father) wrote a pure C + Metal local inference engine for DeepSeek V4 Flash from scratch. With 2-bit quantization, a 128GB MacBook can run a 284B parameter MoE model, and the KV Cache is directly persisted to SSD

[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")

* [DeepSeek V4 Released: 1M Context Becomes Standard, Trillion Parameter MoE, Price Reduced to One-Fifth of Competitors](https://juejin.cn/post/7632208925454680098 "DeepSeek V4 Released: 1M Context Becomes Standard, Trillion Parameter MoE, Price Reduced to One-Fifth of Competitors") On April 24, the preview version of DeepSeek V4 was officially launched and simultaneously open-sourced. After numerous rumors and speculations in both Chinese and English AI circles about the delay of V4, it has finally come to fruition.

[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")

* [DeepSeek V4 Arrives: I Stayed Up All Night to Read the Technical Report](https://juejin.cn/post/7632208925455319074 "DeepSeek V4 Arrives: I Stayed Up All Night to Read the Technical Report") Preface: Been waiting for a long time. This morning when I woke up and checked my phone, DeepSeek V4 was released. It wasn't a teaser or a rumor; it was an actual release with open-source code. To be honest, I had grown numb waiting for this day—AI circles are full of "big news next week," and my ears have become calloused.

[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")

Image 29: DeepSeek V4 Arrives: I Stayed Up All Night to Read the Technical Report

* [DeepSeek-V4 Released: 1.6T MoE + 1M Context Open Source, How Will QA Industry Testing Be Reshaped?](https://juejin.cn/post/7632506858189144064 "DeepSeek-V4 Released: 1.6T MoE + 1M Context Open Source, How Will QA Industry Testing Be Reshaped?") On April 24, DeepSeek officially released the V4 preview version and simultaneously open-sourced it. This is the second time since V3 that DeepSeek has set new benchmarks for open-source large models. As a QA industry veteran with over 10 years of experience, today I want to focus on discussing how this update will impact the industry.

[Test](https://juejin.cn/tag/%E6%B5%8B%E8%AF%95 "Test")

* [DeepSeek V4 Released: How to Respond](https://juejin.cn/post/7635869939149717555 "DeepSeek V4 Released: How to Respond") As of April 24, 2026, DeepSeek V4 Preview is no longer just a rumor: official news pages, API update logs, pricing pages, and Hugging Face model cards all feature V4-Pro.

[Algorithm](https://juejin.cn/tag/%E7%AE%97%E6%B3%95 "Algorithm")

Image 30: DeepSeek V4 Released: How to Respond

* [Hands-On Test of DeepSeek V4: Not Explosive, but Doing More Important Things](https://juejin.cn/post/7632237134600060980 "Hands-On Test of DeepSeek V4: Not Explosive, but Doing More Important Things") Hello everyone, I'm Cold Yi. After much anticipation, DeepSeek V4 has finally been released. This version comes in two flavors: V4 Pro and V4 Flash, both with 1M context and are open-sourced.

[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")

Image 31: Hands-On Test of DeepSeek V4: Not Explosive, but Doing More Important Things

* [DeepSeek V4 Fully Open-Sourced: Innovation Behind 1.6T Parameters](https://juejin.cn/post/7633987404987170826 "DeepSeek V4 Fully Open-Sourced: Innovation Behind 1.6T Parameters") What Happened On April 24, DeepSeek-AI officially released the V4 series preview version, simultaneously open-sourcing on Hugging Face and the ModelScope community under the MIT license, allowing commercial use. Two versions: V4-Pro (flagship): 1.6

[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")

Image 32: DeepSeek V4 Fully Open-Sourced: Innovation Behind 1.6T Parameters

* [Easy Guide to Integrating DeepSeek V4 with Claude Code](https://juejin.cn/post/7632644475747860515 "Easy Guide to Integrating DeepSeek V4 with Claude Code") On April 24, 2026, DeepSeek v4 was released. This article provides a more reasonable guide and configuration content for setting up DeepSeek with Claude Code.

[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")[Claude](https://juejin.cn/tag/Claude "Claude")

* [DeepSeek V4 Released: What Worries NVIDIA the Most Isn't the Model](https://juejin.cn/post/7632228821949136905 "DeepSeek V4 Released: What Worries NVIDIA the Most Isn't the Model") April 24, 2026. No press conference. No pre-launch hype. No countdown to reveal. DeepSeek simply dropped V4—open-sourced, launched its website, and rolled out its app and API updates all at once. Zero frame rate. Then...

[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")[AI Programming](https://juejin.cn/tag/AI%E7%BC%96%E7%A8%8B "AI Programming")

Image 33: DeepSeek V4 Released: What Worries NVIDIA the Most Isn't the Model

* [How to White-Label Access DeepSeek V4 with Claude Code](https://juejin.cn/post/7637065398849257498 "How to White-Label Access DeepSeek V4 with Claude Code") Using the free credits from Alibaba Cloud Bai Lian / ModelScope Community, you can quickly integrate Claude Code with DeepSeek V4 series models using the CC Switch desktop tool.

[LLM](https://juejin.cn/tag/LLM "LLM")[VibeCoding](https://juejin.cn/tag/VibeCoding "VibeCoding")[Claude](https://juejin.cn/tag/Claude "Claude")

Image 34: How to White-Label Access DeepSeek V4 with Claude Code

* [DeepSeek-TUI: A Terminal-Based Programming Agent Built on DeepSeek V4](https://juejin.cn/post/7635465776091824178 "DeepSeek-TUI: A Terminal-Based Programming Agent Built on DeepSeek V4") DeepSeek-TUI is a terminal-native programming agent built on the DeepSeek V4 model. This article analyzes its architectural features, capability boundaries, and applicable scenarios from a technical perspective. 01. Project Background and Challenges Faced Currently, terminal AI

  • GitFun
  • 10 days ago
  • 55
  • Like
  • Comment

[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")

Image 35: DeepSeek-TUI: A Terminal-Based Programming Agent Built on DeepSeek V4

* [DeepSeek V4 Breakdown: The Technical Cards Behind Millions of Contexts, Domestic Computing Power Finally Crosses the Threshold](https://juejin.cn/post/7632264475764867126 "DeepSeek V4 Breakdown: The Technical Cards Behind Millions of Contexts, Domestic Computing Power Finally Crosses the Threshold") Table of Contents 1. Three Postponements Later: DeepSeek Finally Reveals Its Hand 2. Essential Changes: The Shift from a Compute Race to an Efficiency Race 3. Core Mechanism Breakdown: Three Dimensions of Technological Breakthroughs 4. Case Studies and Comparisons: Where Does V4 Stand 5. Engineering Implementation Insights: What Are Your Available

[Artificial Intelligence](https://juejin.cn/tag/%E4%BA%BA%E5%B7%A5%E6%99%BA%E8%83%BD "Artificial Intelligence")

* [A Musician Who Wrote a Programming Tool Using AI: DeepSeek TUI, and This Article Was Written by It](https://juejin.cn/post/7637488101486002202 "A Musician Who Wrote a Programming Tool Using AI: DeepSeek TUI, and This Article Was Written by It") This article, from topic selection, outline, to every word, was written using DeepSeek TUI. 0. An Even More Surprising Story Before discussing this tool, let's talk about its creator. The creator of DeepSeek TUI is a person named Hunter B

[Agent](https://juejin.cn/tag/Agent "Agent")[GitHub](https://juejin.cn/tag/GitHub "GitHub")[DeepSeek](https://juejin.cn/tag/DeepSeek "DeepSeek")

Image 36: A Musician Who Wrote a Programming Tool Using AI: DeepSeek TUI, and This Article Was Written by It

* [Open Source Project Observation | ds4: Local Agent Inference, More Than Just Running Models](https://juejin.cn/post/7638839672551342118 "Open Source Project Observation | ds4: Local Agent Inference, More Than Just Running Models") Antirez, the author of Redis, has open-sourced a local inference engine for DeepSeek V4 Flash. It does not aim for universality but focuses on compressing model loading, KV caching, tool invocation, and Agent API adaptation into a dedicated solution

[AI Programming](https://juejin.cn/tag/AI%E7%BC%96%E7%A8%8B "AI Programming")

Image 37: Open Source Project Observation | ds4: Local Agent Inference, More Than Just Running Models

Collection successful!

Already added to '____', click to change

  • WeChatImage 38WeChat QR code sharing
  • Sina Weibo
  • QQ
Image 39: image

AI Code Assistant is now live

Select your code and experience AI interpreting it for you instantly

Try Now

APP内打开

Image 42Choose the technology direction you're interested in

Backend

Frontend

Android

iOS

Artificial Intelligence

Development Tools

Code Life

Reading

Skip

Previous Step

At least select one category

Image 43

Tip

Current operation failed. If you have any questions, you can click to appeal.

Go to Appeal I Understand

Immersive Reading

Confirm Blocking User

After blocking, the other party will not be able to follow you, interact with you, or view your profile.

Cancel Confirm

AI 可能会生成不准确的信息,请核实重要内容