宝玉(@dotey)

DeepSeek's $10 Trillion Grand Strategy [Translation]


TL;DR · AI Summary

DeepSeek builds a low-cost, high-efficiency model system through multiple foundational innovations to drive China's $10 trillion AI hardware ecosystem and achieve its own $1 trillion valuation.

Key Takeaways

  • DeepSeek V4 Pro requires only 5.48GB HBM at 1M context length, significantly les
  • Its KV cache compression reduces long-term caching costs below 3% of Claude Sonn
  • By open-sourcing MLA, DSA attention mechanisms and expert load balancers, MoE mo

Outline


  1. DeepSeek focuses not on short-term monetization but on achieving trillion-dollar market cap and trillion-industry impact.

  2. Adopts MoE architecture, GRPO algorithm, RLVR training methods for enhanced performance.

  3. The V4 Pro model needs only 5.48GB HBM under one million context tokens, outperforming GLM5 and Qwen3.

  4. Uses SSDs instead of some HBM storage to reduce reliance on premium memory and boost NAND industry growth.

  5. Opens key technical components to foster ecosystem development and lower deployment/service costs.

Mindmap

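  • DeepSeek's trillion-dollar strategy
    • Pillars of technical innovation
      • MoE (Mixture-of-Experts) models
      • GRPO reinforcement learning algorithm
      • KV cache compression
    • Ecosystem linkages
      • SSD/NAND flash market
      • Domestic HBM supply chain
      • Open-source community collaboration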


Article

Image 1: Image


Author: GDP (@bookwormengr)

Image 2: Article cover image

DeepSeek's 10 trillion USD grand strategy

Have you ever wondered how DeepSeek might make money—and a lot of it?

They haven’t launched competitive coding subscription plans like GLM, MoonShot, or MiniMax. They don’t have multimodal, audio, or video models. To this day, they don’t even have a single harness (though I recently heard they’ve started hiring for that). Moreover, DeepSeek has long been committed to open-source initiatives, enthusiastically sharing their “secret recipes.” Is this madness? Pure money burning? Are those investors preparing to pour $10 billion into them simply throwing money away?

No—in my view, quite the opposite!!!

Here, I want to share my observations about what they’ve done so far and the strategic path they seem to be following. Clearly, DeepSeek founder Liang Wenfeng is aiming at a much bigger ultimate prize—not just achieving a $1 trillion market cap themselves, but also helping China spawn an industry giant worth as much as $10 trillion!

Image 3: Image

DeepSeek always goes against the wind. They disdain tuning models to be only marginally better than everyone else's, and they are in no rush to sell today's applications (like the various coding plans). On January 27, 2025, I posted a widely shared tweet describing what I saw then, and now the story is getting more exciting.

  • While everyone else was struggling with dense models (traditional large model structures where all parameters participate in computation), DeepSeek chose the extremely difficult-to-train Mixture-of-Experts (MoE) architecture.
  • Starting from first principles, they invented a brand-new GRPO algorithm, replacing the dominant yet costly PPO algorithm used in reinforcement learning (RL).
  • They developed Reinforcement Learning with Verifiable Rewards (RLVR) and made it their killer technique for enhancing reasoning capabilities.
  • Through Multi-Token Prediction (MTP), they proposed an ingenious speculative decoding strategy—a method that accelerates generation by predicting subsequent tokens ahead of time—while also making training signals denser.
  • They perfected zero-bubble pipeline parallelism technology, squeezing every last bit out of limited GPU resources.
  • They open-sourced the Expert Load Balancer, enabling anyone to easily deploy MoE models. Particularly through Wide Expert Parallel strategies, these models can operate under large batches, significantly reducing service costs.
  • They invented a series of modified attention mechanisms, including MLA, DSA, CSA, and HCA, that drastically cut the demand for KV cache memory (used during inference to store the context processed so far) and keep compute growth nearly flat even as contexts become extremely long.
  • They created Engram modules, which trade memory for compute.
  • They developed mHC (Manifold-Constrained Hyper-Connections), solving the stability issues that arise when hyper-connections are scaled up to massive models.

This list of innovations could go on...
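
To make the GRPO bullet above concrete, here is a minimal sketch of the group-relative advantage that replaces PPO's learned value baseline: each sampled answer is scored against the mean and standard deviation of its own group. The function name, the `eps` guard, and the example rewards are illustrative assumptions, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to its own group (no value network needed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps is an assumed numerical guard for groups where every reward is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers, scored by a verifiable checker (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, which is the signal the policy update would then amplify.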

In the classic hero’s journey narrative structure, the protagonist initially doesn’t know his ultimate mission. He discovers it gradually through trials and tribulations, eventually overcoming obstacles to fulfill his destiny. Along the way, he faces ridicule but chooses to ignore it; encounters malicious rivals; possesses fatal flaws or weaknesses—but ultimately conquers himself and achieves his goal. He confronts seemingly insurmountable challenges yet cleverly forms alliances and skillfully integrates valuable resources. That’s why audiences instinctively cheer for heroes. It’s also why DeepSeek wins passionate admiration and respect worldwide while simultaneously attracting controversy.

Next, let me break down in detail how far DeepSeek has already traveled along this path, and offer a glimpse of where it ends. Their ambition isn't merely to sell programming subscriptions, but to ignite a $10 trillion Chinese AI hardware ecosystem, which would naturally position DeepSeek itself for a $1 trillion valuation. In doing so, they might even help some emerging players within Western hardware ecosystems.

Discussion and corrections are welcome.

Image 4: Image

Let’s look at a timely tweet released by a well-known semiconductor analysis firm:

Image 5: Image

Let’s start with some fun KV cache math. Don’t worry—if you hate math, we’re just using a recently published KV cache calculator to see exactly how much DeepSeek V4 Pro saves compared to the latest GLM and Alibaba Qwen models.

Using one million (1M) context length as an example, assuming 8-bit precision for KV and 16-bit indexing precision—you can play around with this yourself on the website:

Image 6: Image

At 1 million context depth:

  1. DeepSeek V4 Pro requires only 5.48 GB of high-bandwidth memory (HBM), the fast on-package memory found in top-tier AI accelerators.
  2. GLM5 needs 60 GB of HBM.
  3. Qwen3-235B-A22B demands up to 89 GB of HBM!

Note that this assumes:

  1. DeepSeek V4 Pro is a behemoth with 1.6 trillion (1.6T) parameters.
  2. GLM5 has approximately 700 billion (700B) parameters and has borrowed the MLA and DSA techniques from DeepSeek, but hasn't yet adopted the newest compressed attention mechanisms.
  3. Qwen3-235B-A22B has only 235 billion parameters and uses relatively traditional Grouped Query Attention (GQA).
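
As a rough cross-check of the GQA figure, the arithmetic behind a conventional KV cache can be sketched in a few lines. The layer count, KV-head count, and head dimension below are assumptions taken from Qwen3-235B-A22B's public model card, not from this article, and the helper name is hypothetical. The compressed attention variants used by DeepSeek and GLM5 do not follow this formula.

```python
def gqa_kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=1):
    """Bytes for a plain GQA KV cache: one K and one V vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed shape: 94 layers, 4 KV heads, head dimension 128; 8-bit KV as in the article.
total = gqa_kv_cache_bytes(tokens=1_000_000, layers=94, kv_heads=4, head_dim=128)
gib = total / 2**30
print(f"{total:,} bytes = {gib:.1f} GiB")  # 96,256,000,000 bytes = 89.6 GiB
```

Under these assumptions the result lands at about 89.6 GiB, which would be consistent with the 89 GB quoted above if the calculator reports binary gigabytes.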

DeepSeek has made foundational contributions toward relieving memory pressure. If these innovations are widely adopted across the industry, they will dramatically reduce costs for long-horizon agents tackling ultra-long tasks, unlocking entirely new application scenarios.

Image 7: Image

Being able to compress the KV cache without sacrificing model quality gives them the confidence to slash pricing for long-retained caches to less than 3% of the cache-hit price of Anthropic's Claude Sonnet 4.6, while still offering hours of free retention!

For long-horizon tasks, the cache is so small that offloading it to solid-state drives (SSDs) and reloading it when needed becomes highly cost-effective. This greatly reduces dependence on HBM, which is in short supply globally and is especially hard for China's AI hardware industry to manufacture. Even more impressively, DeepSeek developed technology for reloading the KV cache directly from SSDs at very high speed; details are available in their paper:

Image 8: Image

Who supplies SSDs en masse? Remember that YMTC (Yangtze Memory Technologies Co.) is rising as a global leader in 3D NAND flash storage. Flash lets DeepSeek read cached data back directly instead of recomputing the KV cache, which would be an enormous waste of computing power. In turn, DeepSeek creates a massive new market for NAND flash and SSDs, benefiting not just YMTC but the entire supply chain.
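
The "read it back instead of recomputing it" idea can be sketched as a toy prefix cache backed by files on disk. Everything here is an illustrative assumption (the class name, the SHA-256 key, the pickle format, the fake prefill function); it is not DeepSeek's SSD reload path, which is described in their paper.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class DiskKVCache:
    """Toy prefix cache: KV state keyed by a hash of the token prefix, stored on disk."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, prefix_tokens):
        digest = hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()
        return self.directory / f"{digest}.kv"

    def save(self, prefix_tokens, kv_state):
        self._path(prefix_tokens).write_bytes(pickle.dumps(kv_state))

    def load(self, prefix_tokens):
        path = self._path(prefix_tokens)
        if not path.exists():
            return None  # cache miss: the caller has to recompute the prefix
        return pickle.loads(path.read_bytes())


def get_kv(cache, prefix_tokens, prefill):
    """Return the KV state for a prefix, running the expensive prefill only on a miss."""
    kv = cache.load(prefix_tokens)
    if kv is None:
        kv = prefill(prefix_tokens)
        cache.save(prefix_tokens, kv)
    return kv


calls = []


def fake_prefill(tokens):
    """Stand-in for a real forward pass; records how often it runs."""
    calls.append(len(tokens))
    return [[t * 0.5 for t in tokens]]


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskKVCache(tmp)
    first = get_kv(cache, [1, 2, 3], fake_prefill)   # miss: computes and stores
    second = get_kv(cache, [1, 2, 3], fake_prefill)  # hit: read back from disk
```

The second request never touches the model. The smaller the cache per token, the cheaper that disk round trip is relative to recomputation.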

Image 9: Image

Low-power DDR (LPDDR) holds tremendous potential too—it can serve as a vast backend for storing model weights, streaming them continuously into HBM when needed, further easing HBM capacity pressures. You can refer to this blog post:

Below is a diagram illustrating how this system works:

Although DeepSeek did not specifically develop features tailored for this approach, their MoE architecture—with numerous experts and support for 4-bit weight quantization—perfectly aligns with it, making implementation effortless.

Image 10: Image

Combined with their lossless, ultra-compact KV caching technology, this approach leads to a cliff-like drop in both the throughput and the capacity required of HBM.
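
One way to picture the weight-streaming idea is a small cache of "resident" experts in fast memory, with misses fetched from a larger, slower tier. The sketch below uses a plain LRU policy purely as an assumption; a real system would more likely prefetch based on the router's choices, and all names here are hypothetical.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache standing in for HBM; misses are fetched from a slower tier."""

    def __init__(self, capacity, fetch_from_slow_tier):
        self.capacity = capacity
        self.fetch = fetch_from_slow_tier
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1
        weights = self.fetch(expert_id)  # stand-in for an LPDDR-to-HBM copy
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used expert
        return weights


# Fast memory holds 2 experts; the router asks for this sequence of expert ids.
cache = ExpertCache(capacity=2, fetch_from_slow_tier=lambda eid: f"weights-{eid}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
print(cache.misses, list(cache.resident))  # 4 [2, 1]
```

Because an MoE model activates only a few experts per token, the fast tier only needs to hold the experts in active use, which is why the architecture fits this scheme so naturally.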

Which company produces LPDDR in China? CXMT (ChangXin Memory Technologies). It currently lags international leaders by half a generation in speed and one full generation in density, so the gap is very small! Soon, besides abundant NAND flash, China's domestic ecosystem will be flooded with LPDDR memory. Will this alleviate pressure on compute chips? The answer is absolutely yes. Keep reading…

Image 11: Image

The logic is simple: Using NAND flash to store KV caches extends retention times, relieves HBM pressure, and avoids redundant calculations—effectively freeing up GPU and ASIC (Application-Specific Integrated Circuit) compute units. Beyond acting as a real-time conveyor belt for model weights, can LPDDR assist in other ways? Yes again.

LPDDR can store massive amounts of "Engrams"—memory modules. DeepSeek introduced these in their papers (Computation) to expand model capacity. Traditional Transformer architectures lack natural knowledge retrieval mechanisms, relying clumsily on expensive "computation" to simulate "retrieval." To address this, they introduced Engram modules, upgrading classical N-gram embedding techniques into hash-based methods with O(1) time complexity.

The instant $O(1)$ lookup creates a brand-new sparse dimension they call "Conditional Memory." This dramatically reduces computational load, at the cost of requiring enormous memory space to store the massive embedding table. It's a classic case of "trading space (storage) for time (computation)," and its brilliance lies in the fact that reading from storage is far cheaper than computing: looking something up in LPDDR costs much less than running an entire forward pass of a large model. In large-scale deployments, this trade-off becomes incredibly cost-effective. This is how they quietly save compute power by investing heavily in memory!!!
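
The hash-based lookup can be illustrated with a minimal sketch: the trailing N-gram of token ids is hashed to a slot in a fixed-size table, so retrieval cost does not depend on how many N-grams exist. The table size, the choice of hash function, and the deterministic stand-in for learned embedding rows are all assumptions for illustration, not the Engram paper's actual design.

```python
import hashlib

TABLE_SIZE = 1 << 16  # assumed number of slots in the embedding table
EMBED_DIM = 4         # assumed (tiny) embedding width


def ngram_slot(token_ids, table_size=TABLE_SIZE):
    """Hash an N-gram of token ids to a table slot in constant time."""
    key = ",".join(map(str, token_ids)).encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % table_size


def slot_embedding(slot, dim=EMBED_DIM):
    """Stand-in for a learned embedding row, derived deterministically from the slot."""
    return [((slot * 31 + i * 17) % 1000) / 1000.0 for i in range(dim)]


def lookup(context, n=2):
    """Retrieve the memory vector for the last n tokens of the context."""
    slot = ngram_slot(context[-n:])
    return slot, slot_embedding(slot)


slot_a, vec_a = lookup([5, 7, 9])
slot_b, vec_b = lookup([1, 7, 9])  # same trailing bigram (7, 9), so the same slot
```

In a real deployment the rows would live in a very large table held in cheap memory, which is exactly where LPDDR comes in.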

Image 12: Image

This trade-off is absolutely worth it: due to the lack of extreme ultraviolet lithography (EUV), Chinese GPUs and ASICs are destined to lag behind Western top-tier graphics cards in raw floating-point computing power (FLOPs). At the same time, China is still catching up in advanced packaging technologies. Therefore, if we can use domestically produced, low-cost NAND flash and LPDDR memory to compensate for the compute shortfall, this strategy of leveraging strengths while mitigating weaknesses is a perfect match.

Looking at these dazzling innovations and strategic choices (not pursuing multimodal models or voice models so far, and video generation? What even is that?), DeepSeek’s ambition clearly goes beyond the mere hundreds of millions of dollars in short-term profits. They are patiently playing a long game worth trillions of dollars, aiming to foster an independent "alternative hardware ecosystem" outside of the West.

This not only elevates Chinese storage chip manufacturers into major players on the global AI hardware stage but also fundamentally lowers the resource threshold for training and inference of large models. Once the cost of running AI models drops significantly, previously underperforming domestic GPU/ASIC chips and network switching chips will all become viable, practical options. Moreover, these open-source innovations will also benefit Western open-source communities and offer a lifeline to Western chip startups attempting to challenge NVIDIA.

All the pieces fit perfectly. Let’s take a closer look at the groundbreaking innovations they’ve introduced:

  1. Introduction of Mixture-of-Experts (MoE) and MLA in DeepSeek V2: MoE reduced compute requirements by 40–50% when training an extremely intelligent model; Multi-head Latent Attention (MLA) slashed KV cache usage by 90%, making offloading caches to SSDs highly efficient. These ideas first appeared in their paper published in May 2024 (link).

Using just 2048 downgraded H800 GPUs, they managed to train DeepSeek V3 to rival top closed-source models.

Image 13: Image
  1. DSA (DeepSeek Sparse Attention): Introduced in their paper (link), DSA alleviates bandwidth pressure on HBM and keeps computation from exploding as context length increases. As shown in the chart below, DeepSeek-V3.2 maintains stable processing times even with extended contexts.
Image 14: Image
  1. mHC (Manifold-Constrained Hyper-Connections): Featured in their December 2025 paper (link), mHC is a significant innovation in DeepSeek's macro architecture that rethinks how signals pass between layers. Previously, everyone used the standard residual connection inherited from ResNet ($x + F(x)$), whereas mHC expands this into multiple parallel “information highways” and lets the model learn how to mix signals among them. Crucially, it mathematically enforces double stochasticity (by constraining the mixing matrices to the Birkhoff polytope via Sinkhorn-Knopp projection), so signal strength stays stable at any depth of the network.
  • This solves catastrophic instability issues previously seen in unconstrained hyper-connections (originally invented by ByteDance)—where signal amplification could skyrocket to 3000x at 27B parameters, causing training collapse.
  • Its computational overhead is negligible: since it doesn’t change the original floating-point operations in attention or feed-forward layers, only altering routing among layers, it adds only about 6.7% extra training time.
  • Yet the performance gains are stunning: under equal model size and nearly identical compute budgets, the 27B model with mHC sees dramatic improvements—+7.2 points on complex BIG-Bench Hard reasoning tasks, +3.2 on DROP evaluation, +2.8 on GSM8K math tests, and +1.4 on comprehensive MMLU knowledge assessments.

In short, mHC endows the network with a richer, more expressive topology for routing information between layers, letting each parameter deliver significantly more "intelligence" at virtually no additional compute cost.
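
The double-stochasticity constraint can be illustrated with a small Sinkhorn-Knopp iteration: exponentiate an unconstrained matrix so every entry is positive, then alternately normalize rows and columns until both sum to 1. The Birkhoff polytope is precisely the set of such doubly stochastic matrices. The iteration count and example logits are assumptions, and this sketch does not reproduce the paper's exact parameterization.

```python
import math


def sinkhorn_knopp(logits, iterations=100):
    """Push exp(logits) toward a doubly stochastic matrix by alternating normalization."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iterations):
        # Normalize every row to sum to 1.
        row_sums = [sum(row) for row in m]
        m = [[m[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Normalize every column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Unconstrained mixing logits for three parallel residual streams (made-up values).
logits = [[0.2, -1.0, 0.5], [1.5, 0.0, -0.3], [-0.7, 0.8, 0.1]]
mix = sinkhorn_knopp(logits)
row_totals = [sum(row) for row in mix]
col_totals = [sum(mix[i][j] for i in range(3)) for j in range(3)]
```

Because every row and column sums to 1, mixing the streams only redistributes signal among them rather than amplifying it, which is the intuition behind the stability claim.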

Image 15: Image
  1. CSA & HCA: Released in April 2026 in the DeepSeek V4 Pro technical documentation (link), these techniques deeply compress KV tokens, reducing already minimal KV cache needs by another 90%! They also drastically cut the floating-point operations required, easing both the HBM and the GPU/ASIC bottlenecks.
Image 16: Image
  1. Engram: Introduced in a paper (link) in Q1 2026, Engram achieves, in some sense, "trading memory (LPDDR) for compute." The detailed chart below shows the tremendous performance leap brought by Engram under an identical total parameter budget.
Image 17: Image
  1. Maximizing overlap between computation and communication: Low-level architectural tweaks like "Dual Path" may appear to be workarounds due to hardware constraints, but DeepSeek takes it further—they even begin advising chip vendors on ASIC design, telling them how to avoid wasting even a single silicon resource. The screenshot below comes directly from DeepSeek V4 Pro official documentation:
Image 18: Image
  1. Heavy investment in TileLang: This clearly indicates that their vision extends well beyond addressing internal compute shortages; it aims to empower the entire Chinese hardware ecosystem to compete head-on with the West. With TileLang, an open-source programming language for writing high-performance compute kernels, engineers write kernel code once and run it on any platform that provides a TileLang backend. I expect other domestic AI labs to quickly join this initiative, collectively helping Chinese hardware vendors get around NVIDIA's seemingly impenetrable “CUDA moat” (a nearly two-decade-old proprietary parallel computing ecosystem, and NVIDIA's widest competitive moat). Meanwhile, this also liberates Western hardware players such as AMD. Note: Many domestic AI platforms provide CUDA compatibility or translation layers themselves. Among them, Moore Threads, MetaX, Biren, and Iluvatar CoreX (Tianshu Zhixin) achieve the highest level of CUDA compatibility through conversion layers and theoretically don't require assistance from TileLang.
Image 19: Image

With plummeting compute demands and a growing choice of domestic hardware, DeepSeek can finally pursue ambitious training plans previously deemed unaffordable, especially the reinforcement learning phase of post-training. Reinforcement learning requires generating massive numbers of reasoning trajectories, easily totaling trillions of tokens, a process that has historically been extremely expensive. In addition, to train a model that can handle a 1-million-token context, you must generate trajectories of comparable length. Only after rigorous training on ultra-long sequences can a model truly unlock the ability to solve complex, long-horizon tasks.

Furthermore, diversified hardware choices will give DeepSeek surplus compute capacity to tackle automated AI research, often discussed as recursive self-improvement (RSI): AI acting as a scientist that designs and executes algorithmic experiments autonomously. This mode involves extensive trial and error and incurs astronomical costs. However, exploring the vast unknown space of algorithm design requires mastering RSI. On the path toward AGI, and eventually ASI, DeepSeek must first unlock this branch of the tech tree.

Today, DeepSeek’s series of radical innovations around MoE, MLA, and DSA have already been embraced and replicated by leading AI labs worldwide—including those in China.

For example, Zhipu AI, creator of the GLM series, now uses MLA and DSA; Moonshot AI (Kimi) openly acknowledges that its latest architecture evolved from DeepSeek's. In return, DeepSeek adopted the Muon optimizer, whose effectiveness for ultra-large-scale training was first discovered and proven by the Kimi team.

(Note:

  • The MoE architecture originated from a seminal paper by top researchers in 2017 (link). DeepSeek deserves credit for scaling it to unprecedented size and integrating numerous innovations of its own.
  • The Muon optimizer (momentum orthogonalized via Newton-Schulz iteration) was invented by ML researcher Keller Jordan at the end of 2024, with Kimi being the first in the world to apply it to ultra-large-scale model training.)

Let’s examine an interesting classic case from OpenAI. OpenAI struck agreements with both AMD and Cerebras (a wafer-scale superchip startup challenging NVIDIA): upon reaching specific chip procurement milestones, OpenAI would receive warrants or stock options at very favorable prices. For AMD and Cerebras, this is a brilliant win-win deal—with compute-hungry OpenAI deeply tied in, their chances of success over the long haul increase dramatically.

According to AMD’s official press release (link):

“AMD has granted OpenAI warrants to purchase up to 160 million shares of common stock. These equity grants will unlock progressively as certain milestones are met. The first tranche unlocks upon initial deployment reaching 1 gigawatt (GW) data center scale, with subsequent portions unlocking as purchases expand to 6 GW…”

Image 20: Image

I boldly predict that DeepSeek is currently signing similar milestone-based partnership deals with domestic vendors of storage, ASIC compute chips, CPUs, and networking stacks. Through deep joint optimization, DeepSeek will help these domestic hardware solutions run the world's most cutting-edge AI workloads, and possibly surpass the incumbents at them.

Currently, the combined market cap of all AI-related stocks in the West (including East Asian allies) has already exceeded $10 trillion. Through this ingenious business model of “technology-for-equity swaps and ecosystem-driven value sharing,” DeepSeek won't merely replicate a similarly massive super-hardware industry in China—but also claim the juiciest slice of the pie, propelling itself into the exclusive $1 trillion club.

This approach promises far greater financial returns than selling subscription software alone, and it conveniently aligns with their grand vision of “bringing AGI benefits to every individual.” Liang Wenfeng, a devoted fan of the legendary quant Jim Simons, is undoubtedly one of the smartest capitalists alive, and he definitely wouldn't miss out on such a monumental opportunity!

Once you connect all of DeepSeek’s unusual moves thus far, this underlying logic emerges as the sole explanation that makes everything fall perfectly into place...

Image 21: Image

A detailed breakdown of these underlying technical innovations will be published this weekend. Interested readers are welcome to follow my Substack column:

...
