DeepSeek's $10 Trillion Grand Strategy

Shared by 宝玉 (Baoyu), May 23, 2026

DeepSeek’s $10 Trillion Grand Strategy

Author: GDP (@bookwormengr)

Have you ever wondered how DeepSeek plans to make money — and not just a little bit, but a lot?

Unlike Zhipu (GLM), Moonshot, and MiniMax, they haven't launched competitive coding subscription plans. They have no multimodal, voice, or video models. As of today they still don't even have a harness (an evaluation framework for testing and benchmarking model performance), although there are recent reports that they have started hiring people to build one. On top of that, DeepSeek has long been committed to open source, happily giving away its “exclusive secrets.” Is this madness? Is it just burning money? Are the investors preparing to put $10 billion into the company throwing their cash away?

No. In my opinion, it's exactly the opposite!

Here I want to lay out what I've observed of their actions so far and the strategic path they seem to be pursuing. DeepSeek founder Liang Wenfeng clearly has his sights set on a much bigger prize: not only could DeepSeek itself reach a $100 billion market cap, it could also help China nurture an industry worth $1 trillion to as much as $10 trillion!

Revisiting DeepSeek’s "Hero’s Journey"

DeepSeek always goes against the grain. They have no interest in competing on incremental improvements to other people's fine-tuned models, and they are in no hurry to sell today's applications (such as the various coding plans). I posted a viral tweet on January 27, 2025 describing what I saw, and the plot has only grown more exciting since.

While everyone else was obsessed with dense models (the traditional architecture in which every parameter participates in every computation), DeepSeek took on the notoriously hard-to-train Mixture-of-Experts (MoE) architecture.
Starting from first principles, they invented GRPO (Group Relative Policy Optimization), an algorithm that replaces the dominant but costly PPO in reinforcement learning.
They explored Reinforcement Learning with Verifiable Rewards (RLVR) and used it as a killer feature for improving model reasoning.
With Multi-Token Prediction (MTP), they proposed a brilliant speculative decoding strategy (a technique that accelerates large model generation by predicting subsequent tokens) while making training signals denser.
They perfected the Zero-Bubble pipeline parallelism technology, squeezing every last bit out of limited GPU resources.
They open-sourced the Expert Load Balancer, allowing everyone to easily deploy MoE models. Especially through the Wide Expert Parallel strategy, models can run under large batches, significantly reducing service costs.
They invented a series of modified attention mechanisms, including MLA, DSA, CSA, and HCA, that dramatically shrink the KV cache (the accelerator memory that holds the attention keys and values of earlier tokens during inference) and keep compute demands nearly flat even as the context grows extremely long.
They invented Engram, a module that pulls off the trick of trading memory for compute.
They developed mHC (Manifold-Constrained Hyper-Connections) to solve the training-stability problems that appear as model size explodes. The list of innovations could go on and on...
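
To make the GRPO item above concrete, here is a minimal sketch of the group-relative advantage at its core: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group, so no separate value (critic) network is needed as in PPO. The function name, the `eps` guard, the use of the population standard deviation, and the example rewards are illustrative assumptions, not DeepSeek's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifiable reward (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage, which is the signal the policy update then reinforces.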

In the classic hero's-journey narrative, the protagonist does not know their ultimate mission at the start. They discover their destiny gradually through trials and tribulations, overcome countless obstacles, and finally fulfill it. Along the way they face mockery and skepticism, run into malicious opponents, and may even carry a fatal flaw, yet they ultimately conquer themselves and reach their goal. They confront seemingly insurmountable challenges, form clever alliances, and make skillful use of scarce resources. That is why audiences instinctively cheer for heroes. It is also why DeepSeek, while winning admiration and respect from countless fans around the world, has attracted considerable controversy.

Next, I'll break down how far DeepSeek has already traveled along this path and what their ultimate destiny seems to be. Their vision is not selling coding subscriptions but catalyzing a $10 trillion Chinese AI hardware ecosystem, and thereby naturally securing a $1 trillion valuation for themselves. In the process, they might even lend a hand to emerging players in the Western hardware ecosystem.

Feel free to discuss and correct me.

Let's Do Some Fun KV Cache Math:

Here is a timely tweet from the well-known semiconductor analysis firm @SemiAnalysis_:

Let's do some fun KV cache math. Don't worry: if you hate math, we'll just use the recently released KV cache calculator to see how much KV cache DeepSeek V4 Pro saves compared with the latest Zhipu GLM and Alibaba Qwen (Tongyi Qianwen) models.

Take a context length of 1 million (1M) tokens as an example, assuming 8-bit KV precision and 16-bit indexer precision. You can also try the numbers yourself on this website:

https://kvcache.ai/tools/kv-cache-calculator/

At a context depth of 1 million:

DeepSeek V4 Pro requires only 5.48 GB of high-bandwidth memory (HBM, the high-speed memory used on top-tier AI GPUs).

GLM5 needs 60 GB of HBM.

Qwen3-235B-A22B requires as much as 89 GB of HBM!

Keep the following context in mind:

DeepSeek V4 Pro is a massive model with 1.6 trillion (1.6T) parameters.

GLM5 has around 700 billion (700B) parameters and has already adopted DeepSeek’s MLA and DSA techniques, though it hasn’t yet implemented the latest compressed attention mechanism.

Qwen3-235B-A22B has only 235 billion parameters and uses the relatively traditional GQA (grouped-query attention).

DeepSeek has made foundational contributions to relieving memory pressure. If these innovations are widely adopted across the industry, they will drastically cut the cost of long-horizon AI agents working on very long tasks and unlock entirely new applications for the next generation.
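
As a rough cross-check of the calculator's GQA figure, the standard KV cache formula can be evaluated by hand: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The Qwen3-235B-A22B values below (94 layers, 4 KV heads, head dimension 128) are assumptions taken from memory of the public model config and should be verified against it. The sketch does not cover DeepSeek's compressed attention, which needs the calculator's model-specific logic.

```python
def gqa_kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=1):
    """KV cache size for standard multi-head or grouped-query attention."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * tokens

# Assumed Qwen3-235B-A22B config: 94 layers, 4 KV heads, head_dim 128, 8-bit KV.
size = gqa_kv_cache_bytes(layers=94, kv_heads=4, head_dim=128, tokens=1_000_000)
print(f"{size / 2**30:.1f} GiB")  # about 89.6 GiB
```

Under these assumptions the result lands close to the 89 GB quoted in the tweet.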

The Precision Behind the Madness:

Being able to compress the KV cache this much without sacrificing model quality gives them the confidence to price long-retained cache at rock-bottom levels: less than 3% of the cache-hit price of Anthropic's Claude Sonnet 4.6, with several hours of retention thrown in for free!

For long-horizon tasks, because the cache is so small, offloading it to solid-state drives (SSDs) and reloading it when needed becomes highly cost-effective. This greatly reduces reliance on HBM. Note that HBM is in severe short supply worldwide, and because it is so difficult to manufacture, it is also one of the core pain points for China's AI hardware industry. Even more impressive, DeepSeek has developed a high-speed technique for reloading KV cache from SSDs, detailed in their paper: https://arxiv.org/pdf/2602.21548
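
A back-of-the-envelope sketch shows why a small cache makes SSD offloading practical. The 7 GB/s sequential read bandwidth is an assumed figure for a single NVMe drive, not a number from the article or the paper, and real reload time also depends on the I/O path and interconnect.

```python
def reload_seconds(cache_gb, read_gb_per_s):
    """Time to stream a KV cache of cache_gb gigabytes back from storage."""
    return cache_gb / read_gb_per_s

ASSUMED_SSD_GB_PER_S = 7.0  # hypothetical single-drive sequential read speed

# Cache sizes at 1M context, as quoted in the tweet above.
for name, cache_gb in [("DeepSeek V4 Pro", 5.48), ("GLM5", 60.0), ("Qwen3-235B-A22B", 89.0)]:
    print(f"{name}: {reload_seconds(cache_gb, ASSUMED_SSD_GB_PER_S):.1f} s")
```

With these assumed numbers, the smallest cache reloads in under a second while the largest takes more than ten times as long.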

Who Benefits Most from This "KV Cache Compression Battle"?

Who supplies SSDs in bulk? Don't forget that YMTC (Yangtze Memory Technologies) is emerging as a global leader in 3D NAND flash. NAND flash lets DeepSeek read cached KV entries back directly instead of wasting compute recalculating them every time. In return, DeepSeek is creating an enormous new market for NAND flash and SSDs, which benefits not only YMTC but the entire supply chain.

But the Vision Goes Beyond NAND and SSD:

Low-power memory (LPDDR) also holds tremendous potential. It can act as a “backbone” tier that stores model weights and streams them into HBM as needed, further easing pressure on HBM capacity. See this blog post: https://www.lmsys.org/blog/2025-09-25-gb200-part-2/ Below is a diagram explaining how this scheme works:

Although DeepSeek has not designed specifically for this scheme, their MoE architecture, with its large number of experts and support for 4-bit weights, fits it perfectly and makes implementation straightforward.

Combined with their remarkably efficient, lossless, compact KV cache technology, this leads to a dramatic drop in HBM throughput and capacity requirements.
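
For a sense of scale, the weight footprint that LPDDR would need to hold can be estimated from two numbers given in this article: 1.6 trillion parameters and 4-bit weights. Treating every parameter as 4-bit and ignoring quantization scales and any higher-precision layers is a simplifying assumption.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for model weights, ignoring quantization metadata."""
    return num_params * bits_per_weight // 8

total = weight_bytes(1_600_000_000_000, 4)
print(f"{total / 1e9:.0f} GB")  # 800 GB
```

At roughly 800 GB, the weights alone far exceed the HBM on a single accelerator, which is why keeping them in a larger, cheaper memory tier and streaming in only what is needed is attractive.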

Who makes LPDDR in China? CXMT (Changxin Memory Technologies). They currently trail the international leaders by only half a generation in speed and one generation in density. The gap is very small! This means that in the near future China's domestic ecosystem will have not only ample NAND flash but also a flood of LPDDR. Can this relieve the pressure on compute chips? Absolutely. Keep reading…

Smartly Leveraging Storage to Lighten the Load on GPUs and ASICs

It's easy to see why: using NAND flash to store KV cache not only extends cache retention time and reduces HBM pressure, it also eliminates repeated computation, effectively freeing up compute units on GPUs and ASICs (application-specific integrated circuits, i.e., custom AI chips). Beyond streaming model weights in real time, can LPDDR help in other ways? Yes.

LPDDR can be used to store massive numbers of “Engrams.” In their paper (https://arxiv.org/pdf/2601.07372), DeepSeek points out that although the mixture-of-experts (MoE) architecture expands model capacity through conditional computation, the Transformer architecture lacks a native mechanism for knowledge lookup and can only clumsily simulate retrieval through expensive computation. To address this, they introduce the Engram module, which modernizes classic N-gram embeddings into hash-based lookups with O(1) time complexity, creating a new axis of sparsity they call “conditional memory.” This dramatically cuts computational overhead, at the cost of enormous memory to hold the massive embedding table. It is a classic case of trading space (storage) for time (computation), and it works because reading from storage is far cheaper than computing: a lookup in LPDDR is much more economical than a full forward pass through a large model. Deployed at scale, it is an extremely good deal. This is their secret sauce for saving compute by throwing massive amounts of memory at the problem!
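
The core idea, a hashed N-gram embedding lookup that costs O(1) per query, can be sketched in a few lines. This is an illustrative toy and not the Engram module from the paper: the table size, embedding width, hash function, and all names are hypothetical, and the real design includes components not shown here.

```python
import random

TABLE_SIZE = 1 << 16   # hypothetical number of embedding slots
EMBED_DIM = 8          # hypothetical embedding width

rng = random.Random(0)
# The large table is what would live in cheap, high-capacity memory.
table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(TABLE_SIZE)]

def ngram_slot(token_ids, table_size=TABLE_SIZE):
    """Map an N-gram of token ids to a table slot with a polynomial hash."""
    h = 0
    for t in token_ids:
        h = (h * 1_000_003 + t) % table_size
    return h

def lookup(token_ids):
    """Constant-time retrieval: one hash plus one memory read, no matrix multiplies."""
    return table[ngram_slot(token_ids)]

vec = lookup((1012, 57, 9001))  # embedding for a 3-gram of token ids
```

The cost of a lookup does not grow with the table, so capacity can be scaled by adding memory rather than compute. Distinct N-grams can collide in the same slot, which is the usual price of hashing.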

This trade-off is absolutely worth it. Without access to extreme ultraviolet (EUV) lithography equipment, Chinese fabs cannot match the transistor density that Western counterparts achieve on a single die. As a result, Chinese GPUs and ASICs are destined to trail top-tier Western GPUs in raw floating-point throughput (FLOPS). Meanwhile, China is still catching up in advanced packaging. So if abundant, low-cost, domestically produced NAND and LPDDR can offset the compute disadvantage, this “play to your strengths, avoid your weaknesses” strategy is a perfect fit.

Taking stock of DeepSeek’s grand strategy:

Looking at these dazzling innovations and the choices they have made (still no multimodal or voice models, let alone video generation; what even is that?), DeepSeek's ambition clearly extends far beyond a mere few tens of millions of dollars in short-term revenue. They are carefully setting up a $10 trillion chess game aimed at nurturing an independent hardware ecosystem outside the West.

This not only elevates Chinese memory chip makers into major players on the global AI hardware stage, it also fundamentally lowers the resources needed for training and inference of large language models. Once the cost of running AI models drops, domestic GPUs, ASICs, and network switching chips that previously fell short on performance all become viable, practical options. Moreover, these open-source innovations will flow back into the Western open-source community and breathe new life into the startups challenging NVIDIA.

All signs point in the same direction. Let’s go through each of their industry-shaking innovations:

MoE and MLA, introduced in DeepSeek V2: MoE cut compute consumption by 40% to 50% when training an extremely capable model, while Multi-head Latent Attention (MLA) cut KV cache usage by 90%, making it highly efficient to offload the cache to SSDs. These ideas were first proposed in their May 2024 paper (https://arxiv.org/pdf/2405.04434). Thanks to these breakthroughs, they later managed to train DeepSeek V3 on just 2,048 cut-down H800 GPUs and match top closed-source models.
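
To see where MLA's savings come from, compare what each scheme caches per token per layer. Standard multi-head attention stores a full key and value vector for every head, while MLA stores one compressed latent vector plus a small decoupled RoPE key. The dimensions below (128 heads of size 128, latent size 512, RoPE key size 64) are taken here as assumptions based on the DeepSeek-V2 paper and should be checked against it. The reduction measured against same-shape full multi-head attention comes out larger than the 90% quoted above, because any headline percentage depends on the baseline being compared.

```python
def mha_cache_elems(n_heads, head_dim):
    """Elements cached per token per layer by standard multi-head attention."""
    return 2 * n_heads * head_dim  # full K and V for every head

def mla_cache_elems(latent_dim, rope_dim):
    """Elements cached per token per layer by MLA: latent KV plus RoPE key."""
    return latent_dim + rope_dim

mha = mha_cache_elems(n_heads=128, head_dim=128)    # 32768
mla = mla_cache_elems(latent_dim=512, rope_dim=64)  # 576
print(f"MLA caches {mla / mha:.2%} of the MHA baseline")
```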

DSA (DeepSeek Sparse Attention): introduced in the paper (https://arxiv.org/pdf/2512.02556), it is designed to reduce the computational load in long-context scenarios while easing HBM bandwidth pressure. It keeps computation from exploding as context length increases. Look at the chart below: the processing time of DeepSeek-V3.2 remains stable even as the context grows.

mHC (Manifold-Constrained Hyper-Connections): first appeared in the paper (https://arxiv.org/pdf/2512.24880). mHC is a major innovation in DeepSeek's macro-architecture that rethinks how signals are passed between layers in large models. Previously, everyone used the standard residual connection inherited from ResNet ($x + F(x)$); mHC widens this residual stream into multiple parallel “information highways” and lets the model learn how to mix them. Most importantly, it mathematically constrains the mixing matrices to be doubly stochastic (by projecting each mixing matrix onto the Birkhoff polytope via the Sinkhorn-Knopp algorithm), which keeps signal strength from decaying at any depth of the network.

This completely solves the catastrophic instability that plagued unconstrained hyper-connections (first invented by ByteDance): previously, at the 27-billion-parameter scale, signal amplification could skyrocket to 3000x and crash the entire training run.

Yet its computational cost is negligible. It does not alter the floating-point operations in the attention or feed-forward network (FFN) blocks; it only changes how outputs are routed between layers, adding a modest 6.7% to training time.

The performance boost, however, is striking: at identical model size and a nearly identical compute budget, a 27B model gained +7.2 points on the BIG-Bench Hard reasoning benchmark, +3.2 on DROP, +2.8 on GSM8K, and +1.4 on MMLU.

In short, mHC gives the network a richer, more expressive inter-layer routing topology, yielding significantly more “intelligence” per parameter for almost no additional compute.
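
The doubly stochastic constraint can be illustrated with the Sinkhorn-Knopp iteration itself: exponentiate to make all entries positive, then alternately normalize rows and columns until both sum to 1. This is a minimal sketch of the projection step only, with a made-up 3×3 matrix and a fixed iteration count as assumptions; it is not the mHC implementation, which applies the constraint to learned mixing matrices inside the network.

```python
import math

def sinkhorn_knopp(logits, iters=100):
    """Turn a square matrix of logits into an approximately doubly stochastic one."""
    m = [[math.exp(v) for v in row] for row in logits]  # make entries positive
    n = len(m)
    for _ in range(iters):
        # Normalize each row to sum to 1.
        normalized = []
        for row in m:
            s = sum(row)
            normalized.append([v / s for v in row])
        # Normalize each column to sum to 1.
        col = [sum(normalized[i][j] for i in range(n)) for j in range(n)]
        m = [[normalized[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m

mix = sinkhorn_knopp([[0.5, -1.0, 0.2], [1.5, 0.3, -0.7], [-0.2, 0.8, 0.1]])
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Because every row and column of a doubly stochastic matrix sums to 1 and all entries are non-negative, mixing the residual streams with it takes weighted averages rather than amplifying them, which is the intuition behind the stability claim.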

CSA & HCA: described in the DeepSeek V4 Pro technical report released in April 2026 (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf). By deeply compressing KV tokens, they cut the already-small KV cache requirement by a further 90%! They also significantly reduced the floating-point operations required, effectively freeing up HBM and GPU/ASIC resources.

Engram: the paper (https://arxiv.org/pdf/2601.07372) was released in Q1 2026. As mentioned earlier, it essentially realizes “trading memory (LPDDR) for compute.” The detailed chart below shows the significant performance gain Engram delivers under an identical total parameter budget.

Maximizing overlap between computation and communication: Techniques like “dual path” appear superficially to be workarounds due to hardware restrictions. But DeepSeek goes further—offering guidance to chip vendors on how to design ASIC architectures to avoid wasting even a sliver of precious silicon. The following screenshot comes directly from the official DeepSeek V4 Pro documentation:

Heavy investment in TileLang: this is an unmistakable signal that their vision has long since moved beyond the constraints of their own limited compute and now aims to give the entire Chinese hardware ecosystem a competitive edge against the West. With TileLang (an open-source language for writing high-performance compute kernels), engineers write a kernel once and deploy it on any hardware platform that has a TileLang backend. I expect other domestic AI labs to join this movement quickly, which will help Chinese hardware vendors get around NVIDIA's formidable “CUDA moat” (NVIDIA's long-established proprietary parallel computing ecosystem and its strongest competitive barrier). It also helps liberate Western hardware vendors such as AMD. _Note: Many domestic AI hardware platforms also offer CUDA compatibility or CUDA compilation layers. Among them, Moore Threads, MUSA, Wallbreak, and TianShu are the Chinese chip companies with the highest level of CUDA compatibility through translation layers, so in theory they do not need TileLang._

Large-scale Reinforcement Learning and Automated Scientific Research:

With computational demands falling dramatically and domestic hardware options multiplying, DeepSeek can finally take on ambitious training plans that were previously considered infeasible, especially in the post-training reinforcement learning stage. Reinforcement learning requires generating enormous numbers of reasoning trajectories, often trillions of tokens in total, which used to be extremely costly. In addition, to train a model that supports a 1-million-token context, you must generate reasoning trajectories of comparable length. Only by exposing the model to such ultra-long trajectories can it truly unlock the ability to solve complex long-horizon tasks.

Moreover, a more diverse choice of hardware will give DeepSeek surplus compute to pursue Recursive Self-Improvement (RSI, in which AI acts as the scientist, designing and running algorithmic experiments to improve itself). This self-evolution approach involves extensive trial and error, which is extremely expensive. But to thoroughly explore the unknown space of algorithm design, RSI is an essential path. On the road to artificial general intelligence (AGI) and artificial superintelligence (ASI), DeepSeek must unlock the RSI branch of the tech tree.

Today’s Test of DeepSeek, Tomorrow’s Textbook of the Industry:

Today, DeepSeek's string of wild innovations around MoE, MLA, and DSA has already been embraced by AI labs worldwide and is being widely emulated.

For example, Zhipu AI, developer of the GLM model series, has already adopted MLA and DSA; Moonshot (Kimi) has openly acknowledged that its latest architecture evolved from DeepSeek's. In turn, DeepSeek has adopted the Muon optimizer for large-scale training, which Kimi's team was the first to prove effective at ultra-large scale.

(_Note:_

_The mixture-of-experts (MoE) architecture was originally proposed in a classic 2017 paper by leading researchers (https://arxiv.org/pdf/1701.06538); DeepSeek's contribution lies in scaling it to unprecedented sizes and integrating numerous innovations of its own._

_The Muon optimizer (which orthogonalizes the momentum update using Newton-Schulz iteration) was invented by machine learning researcher Keller Jordan at the end of 2024, and Kimi's team was the first in the world to apply it to ultra-large model training._)

So, how exactly do you make a killing?

Consider an interesting precedent from OpenAI. OpenAI struck agreements with AMD and Cerebras (a startup challenging NVIDIA with wafer-scale chips) under which, as OpenAI buys and deploys their chips and hits specific milestones, it receives stock warrants or options from them at extremely low prices. It was a brilliant win-win deal for AMD and Cerebras as well: with OpenAI, a beast that devours compute, tightly bound to them, their odds of winning in the long run rose considerably.

According to AMD’s official press release (https://www.amd.com/en/newsroom/press-releases/2025-10-6-amd-and-openai-announce-strategic-partnership-to-d.html): “As part of the agreement, to deeply align strategic interests between both parties, AMD has granted OpenAI warrants for up to 160 million shares of AMD common stock. These equity rights will unlock progressively upon reaching specific milestones. The first phase unlocks when the initial deployment reaches a 1 GW computing center, with subsequent portions unlocking as procurement scales up to 6 GW…”

I boldly predict that DeepSeek is currently signing similar milestone-based, interest-aligning agreements with domestic memory, ASIC, CPU, and network stack vendors. Through deep co-optimization, DeepSeek will help these local hardware platforms match or even surpass Western hardware when running the world's most cutting-edge AI workloads.

Today, AI-related stocks in the West (including its East Asian allies) have a combined market capitalization exceeding $10 trillion. With this clever business model of “trading technology for equity and sharing the pie through the ecosystem,” DeepSeek has the potential not only to help build a similarly massive domestic hardware industry in China but also to carve out the most profitable slice for itself, propelling it into the $100 billion market-cap club.

This would earn them far more real money than selling software subscriptions, while also fulfilling their stated vision of “making artificial general intelligence accessible to everyone.” Liang Wenfeng, a devoted admirer of legendary quant James Simons, is undoubtedly a top-tier capitalist who would never pass up such a grand strategy!

If you step back and connect all of DeepSeek’s unusual actions so far, this is the only underlying logic that perfectly explains everything...

A detailed breakdown of these underlying technological innovations will be published this weekend. Those interested are welcome to follow my Substack column: https://polymath707.substack.com/ ...