Xiaomi: Please Call Me the Token Price Butcher

ifanr, May 27, 2026

TL;DR · AI Summary

Xiaomi announced permanent price cuts for its MiMo-V2.5 series API, with a maximum discount of 99%, sparking industry attention.

Key Takeaways

Xiaomi's MiMo-V2.5 series API is permanently discounted by up to 99%
Xiaomi optimizes its Token Plan pricing structure, maintaining prices but increasing available credits by 5 to 8 times
Xiaomi's pricing strategy targets high-frequency, multi-round, and long-context real-world work scenarios through cache hit optimization

Anyone betting on a sharp rise in token prices in 2026 has been proven wrong twice within a week.

On May 22, DeepSeek announced a permanent price cut for its DeepSeek V4 Pro model; at midnight today, Xiaomi's MiMo-V2.5 series followed suit, with cuts of up to 99%.

At the same time, Xiaomi adjusted its Token Plan billing, keeping prices unchanged but increasing the usable quota by 5 to 8 times.

As expected, discussion of the MiMo price cuts quickly heated up on Reddit, X, and various developer forums overseas.

It is not surprising that Xiaomi dared to cut prices against the trend, though. The more important question is what impact this wave of cuts will have on the AI industry.

Token Prices Crash: The AI Industry's Strict Father Arrives

Xiaomi's announcement states that the API for its MiMo-V2.5 series of large models is permanently discounted by up to 99%, regardless of input length. The new prices took effect globally at 0:00 Beijing time on May 27.

A 99% reduction does not mean every call is billed at the lowest price, however. The key variable is whether the input hits the cache.

For example, with MiMo-V2.5-Pro, if the cache hits, the input price drops to approximately 0.025 yuan per million tokens. If the cache misses, the price remains at 3 yuan per million tokens, while the output price is 6 yuan per million tokens.

This means the rock-bottom price only holds when requests hit the cache frequently.

For applications with highly repetitive context, such as frequent agent calls, multi-round coding tasks, and batch inference, this pricing is very attractive. If your workload has a poor cache hit rate, however, the actual cost will not reach that floor.
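The dependence on hit rate can be made concrete with a small calculation. The sketch below uses the three MiMo-V2.5-Pro prices quoted above; the function names and the idea of modelling the hit rate as a simple share of input tokens are illustrative assumptions, not Xiaomi's billing logic.

```python
# Prices in yuan per million tokens, as quoted in the article for MiMo-V2.5-Pro.
CACHE_HIT_INPUT = 0.025
CACHE_MISS_INPUT = 3.0
OUTPUT = 6.0


def blended_input_price(hit_rate: float) -> float:
    """Average input price per million tokens for a given cache hit rate.

    hit_rate is the assumed share of input tokens served from cache (0 to 1).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_HIT_INPUT + (1.0 - hit_rate) * CACHE_MISS_INPUT


def request_cost(input_tokens: int, output_tokens: int, hit_rate: float) -> float:
    """Cost in yuan of one request under the simplified model above."""
    input_cost = input_tokens / 1_000_000 * blended_input_price(hit_rate)
    output_cost = output_tokens / 1_000_000 * OUTPUT
    return input_cost + output_cost


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9, 0.99):
        price = blended_input_price(rate)
        print(f"hit rate {rate:.0%}: {price:.4f} yuan per million input tokens")
```

Under this model, a 90% hit rate still leaves the average input price at roughly 0.32 yuan per million tokens, more than ten times the headline 0.025 yuan, because the uncached tenth dominates the bill.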

The Token Plan also follows a similar logic.

Xiaomi emphasized that prices are unchanged while Credit allowances have risen sharply. The Lite, Standard, Pro, and Max plans still cost 39, 99, 329, and 659 yuan per month respectively, while their Credit allowances rise from 60 million, 200 million, 700 million, and 1.6 billion to 4.1 billion, 11 billion, 38 billion, and 82 billion.

Under the new conversion rates, a cache-hit input token on MiMo-V2.5-Pro costs only 2.5 Credits, a cache-miss input token costs 300 Credits, and an output token costs 600 Credits.
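To see what these conversion rates mean in practice, the following sketch converts a request into Credits and counts how many such requests a budget covers. The rates come from the article; the budget of 1 billion Credits and the request shape are hypothetical values chosen only for illustration.

```python
# Credits charged per token on MiMo-V2.5-Pro, as quoted in the article.
CREDITS_PER_TOKEN = {"cache_hit": 2.5, "cache_miss": 300.0, "output": 600.0}


def credits_used(cache_hit_tokens: int, cache_miss_tokens: int, output_tokens: int) -> float:
    """Credits consumed by one request, split by token type."""
    return (
        cache_hit_tokens * CREDITS_PER_TOKEN["cache_hit"]
        + cache_miss_tokens * CREDITS_PER_TOKEN["cache_miss"]
        + output_tokens * CREDITS_PER_TOKEN["output"]
    )


def requests_covered(budget_credits: float, credits_per_request: float) -> int:
    """Number of whole requests that fit into a Credit budget."""
    return int(budget_credits // credits_per_request)


if __name__ == "__main__":
    budget = 1_000_000_000  # hypothetical budget, not one of the plan tiers
    # Hypothetical request: 100,000 input tokens and 2,000 output tokens.
    mostly_cached = credits_used(90_000, 10_000, 2_000)
    uncached = credits_used(0, 100_000, 2_000)
    print("90% cached:", requests_covered(budget, mostly_cached), "requests")
    print("uncached:  ", requests_covered(budget, uncached), "requests")
```

The Credit rates keep the same 2.5 : 300 : 600 ratio as the API prices of 0.025, 3, and 6 yuan per million tokens, so the Token Plan rewards cache hits in exactly the same proportion as pay-as-you-go billing.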

This approach is similar to DeepSeek's strategy.

Let's quickly review the timeline. On April 24, DeepSeek released the preview version of V4; the next day, V4-Pro went on sale at a 75% discount; on April 26, the cache-hit price plummeted to one-tenth of its launch price; and on May 22, the temporary discount became a permanent cut, leaving V4-Pro at one-quarter of its original price.

After several adjustments, the cache-hit input price for DeepSeek-V4-Pro fell from 0.1 yuan to 0.025 yuan per million tokens. With Xiaomi's MiMo-V2.5-Pro following swiftly, the floor for cache-hit input pricing among Chinese models is now firmly set at this level.

Both DeepSeek and Xiaomi have aimed their most aggressive prices at cache-hit scenarios. The reason is not complicated: large models are moving from chatting to doing real work, and agents are where token consumption truly multiplies.

In chat scenarios, users ask questions and models answer them, so costs are relatively easy to estimate.

In agent scenarios, however, a single task may involve long context, multiple rounds of reasoning, code generation, tool calls, web page reading, file analysis, and result verification. Users see only the final output, but the backend may already have made many requests and read large amounts of context.

This is why cache hits are crucial.

Agents, code assistants, and long-context applications share a common trait: much of their content repeats, including system prompts, project code, API documentation, tool instructions, conversation history, and dependency files. Recomputing this content on every request is costly; with caching, repeated content is billed at the cache-hit rate on later requests, which significantly lowers inference costs.
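A rough simulation shows how much repetition matters over a multi-turn agent session. It assumes an idealized prefix cache in which everything sent on the previous turn is billed as a cache hit on the next one; real cache behavior (expiry, block granularity, eviction) will be less favorable. The session shape and function name are hypothetical, and only the two prices come from the article.

```python
HIT_PRICE = 0.025  # yuan per million cached input tokens (from the article)
MISS_PRICE = 3.0   # yuan per million uncached input tokens (from the article)


def session_input_cost(prefix_tokens: int, new_tokens_per_turn: int,
                       turns: int, use_cache: bool) -> float:
    """Total input cost in yuan of a multi-turn session.

    Every turn resends the fixed prefix (system prompt, code, docs) plus all
    tokens accumulated so far. With use_cache=True, tokens already sent on the
    previous turn are billed at the cache-hit price (idealized assumption).
    """
    total = 0.0
    already_sent = 0  # size of the prompt sent on the previous turn
    for turn in range(1, turns + 1):
        prompt_tokens = prefix_tokens + new_tokens_per_turn * turn
        cached = already_sent if use_cache else 0
        fresh = prompt_tokens - cached
        total += (cached * HIT_PRICE + fresh * MISS_PRICE) / 1_000_000
        already_sent = prompt_tokens
    return total


if __name__ == "__main__":
    # Hypothetical session: 50,000-token prefix, 2,000 new tokens per turn, 20 turns.
    without = session_input_cost(50_000, 2_000, 20, use_cache=False)
    with_cache = session_input_cost(50_000, 2_000, 20, use_cache=True)
    print(f"no cache:   {without:.4f} yuan")
    print(f"with cache: {with_cache:.4f} yuan")
```

For this hypothetical session the input bill drops from about 4.26 yuan to about 0.30 yuan, which is why the cache-hit price matters far more to agent workloads than to single-shot chat.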

Lower cache-hit prices are therefore best suited to high-frequency, multi-round, long-context real-world work. Behind the low prices from DeepSeek and Xiaomi lies an effort to attract developers and high-frequency applications, so that more agents, code assistants, and office automation apps run on their models.

Earlier, through programs such as MiMo Orbit and the Trillion-Token Creator Incentive Program, Xiaomi let more people try MiMo on real problems. The Trillion-Token Incentive Program, launched on April 28, had been fully distributed by 16:08 on May 26.

From a platform perspective, low-cost tokens and free quotas bring in a massive volume of real calls. Real calls generate complex tasks, failure samples, user feedback, agent workflows, coding scenarios, and long-context data, all of which help iterate the model and the inference system.

The "tender shrimp" phenomenon in the community can also be understood through this logic. Users maximize their quota consumption while helping the platform generate load, expose issues, and accumulate call data.

So the math here cannot be judged on per-inference gross margin alone. Short-term revenue is suppressed, but in exchange the platform gains developer migration, call volume, and real-world feedback. For companies aiming to compete in the agent ecosystem, this is a very cost-effective platform investment.

Luo Fuli's Change of Heart, Backed by Brute-Force Engineering

Wanting to cut prices is not enough, however; the key is being able to afford it. What makes Xiaomi's latest price cut unusual is that it appears to contradict Luo Fuli's earlier public statements.

A month ago, Luo Fuli publicly opposed token price wars. At the time, she argued that low-priced tokens combined with open third-party agent frameworks could easily cause a platform to lose control of its costs.

She noted that third-party agent frameworks often manage context loosely. A single user query might trigger multiple low-value tool calls, each request carrying more than 100,000 tokens of long context. If the platform cannot rein in such waste, the actual API cost can be dozens of times the subscription price.

She also believed that global compute supply could not keep pace with the growth in token demand driven by agents. Until the cost structure of coding and agent scenarios is understood, blindly joining a price war would lead to throttling, service downgrades, and declining stability, ultimately hurting the user experience.

Xiaomi's latest price cut does not refute her earlier judgment, however; it changes the premise under which a price war can work. What Luo Fuli opposed was low pricing with no cost structure to support it. Xiaomi now presents an inference engineering solution that it believes can sustain low prices.

According to Xiaomi's announcement, its technical team built full SWA (Sliding Window Attention) support on top of SGLang HiCache, cutting data movement between storage tiers (GPU memory, CPU memory, SSD, and so on) to nearly one-seventh of its pre-optimization level and raising the number of cacheable tokens to nearly five times the pre-optimization figure.
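The announcement gives no implementation details, so the following is only a toy illustration of the general idea behind a multi-level cache: hot entries stay in a small fast tier, colder ones are demoted to larger slow tiers, and every demotion or promotion counts as data movement. It is not SGLang HiCache's API or Xiaomi's design; the `TieredCache` class, its capacities, and the move counter are all hypothetical, and string keys stand in for cached prompt prefixes.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-level LRU cache. Tier 0 is the fastest (think GPU memory),
    later tiers are slower and usually larger (think CPU memory, then SSD)."""

    def __init__(self, capacities):
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]
        self.moves = 0  # number of entries moved between tiers

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value  # key is new in this tier, so it lands at the MRU end
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.tiers):
                self.moves += 1  # demotion to the next, slower tier
                self._insert(level + 1, old_key, old_value)
            # otherwise the entry falls off the slowest tier and is dropped

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                if level == 0:
                    tier.move_to_end(key)  # refresh recency, no data movement
                    return tier[key]
                value = tier.pop(key)
                self.moves += 1  # promotion back to the fastest tier
                self._insert(0, key, value)
                return value
        return None  # cache miss: the prefix would have to be recomputed


if __name__ == "__main__":
    cache = TieredCache([2, 2])
    for name in ("a", "b", "c"):
        cache.put(name, name.upper())
    print(cache.get("a"), cache.moves)
```

In a model like this, raising the share of lookups served from the fast tier and cutting the number of moves are the two levers that reduce serving cost, which is the kind of improvement the announcement describes in its own figures.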

At the same time, Xiaomi optimized its expert-parallelism scheme and input-length bucketing strategy to raise cluster input throughput. Without this layer of engineering, low prices easily become unsustainable subsidies. With sufficiently strong infrastructure, they can become a long-term advantage.

Price wars test both engineering capability and depth of resources.

Unlike pure AI model companies, Xiaomi has smartphone, automotive, IoT, and consumer electronics businesses that allow longer investment cycles and greater strategic patience. It treats large model services as an entry point to the AI ecosystem, which keeps it out of the trap of focusing solely on short-term API revenue.

This is not good news for small and medium-sized model companies. Without a core business to fund them, without robust infrastructure capabilities, and without enough call volume to spread costs, they will struggle to match such cuts.

DeepSeek's low prices already threaten the market positions of many Chinese models. With Xiaomi MiMo following, more large vendors will be forced to adjust prices or redefine their product value, and smaller model service providers may be pushed into narrower vertical scenarios.

This round of price cuts is also a market shakeout that selects for efficient model vendors. Companies with engineering capability, scheduling capability, and ecosystem access can withstand the pressure of lower prices. Those that have only model capability and cannot drive down inference costs will find themselves increasingly on the back foot.

Moreover, room for further cuts is narrowing: as prices approach physical cost, cutting alone delivers less and less value. In the next phase, model quality, agent adaptation, developer tools, ecosystem lock-in, service stability, and enterprise delivery capabilities will become the next battlegrounds.

Model capability sets the ceiling for AI development, while inference cost determines how widely AI is adopted. When truly affordable tokens reach the application layer, we will finally see what the next explosive era of AI looks like.