[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)

Latent Space, May 27, 2026

TL;DR · AI Summary

AI infrastructure companies such as Fireworks, Baseten, and OpenRouter are raising large amounts of funding, indicating strong growth momentum in the field.

Key Takeaways

Fireworks is reportedly in talks on a $1.5 billion round, with its valuation up 3.75x in seven months.
Baseten is reportedly raising a $1.1 billion round, with its valuation up 2.2x in three months.
OpenRouter raised a $113 million Series B as its weekly token volume grew fivefold in six months.

Source URL: https://www.latent.space/p/ainews-new-ai-infra-decacorns-fireworks

Published: 2026-05-27T03:33:53+00:00

_Take the 2026 AI Engineering Survey and get >$2k in credits and AIE workshop tickets!_

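Readers like it when we report that there is no news, but our second favorite thing is simply reinforcing the trends you should be paying attention to. In April we covered the inference inflection, and if today's headline reminds you of last week's headline, "New AI Infra unicorn: Exa", that is exactly the point.

Given the pace of AI fundraising, our usual policy is to cover only startups that reach decacorn valuations (> $10B), and only once the round is confirmed. Today's news of Fireworks' $1.5B round ("in talks", up 3.75x in seven months; we discussed the company on the podcast) and Baseten's $1.1B round ("raising", up 2.2x in three months) is a little early by that standard, but the acceleration of investment in inference and the unicorn-to-decacorn progression are too tempting not to headline today. Add OpenRouter's $113M Series B (volume up 5x in six months) as the cherry on top: if you are going to do multi-model inference, you need a router.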
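To compare step-ups quoted over different time spans, one option is to convert each into an equivalent compounded monthly rate. The sketch below uses only the multiples and month counts quoted above; treating growth as smooth monthly compounding is a simplifying assumption, not something the reports claim.

```python
def monthly_growth(multiple: float, months: float) -> float:
    """Constant month-over-month rate that compounds to `multiple` over `months`."""
    return multiple ** (1.0 / months) - 1.0


# (multiple, months) pairs as quoted in the paragraph above.
reported = {
    "Fireworks valuation": (3.75, 7),
    "Baseten valuation": (2.2, 3),
    "OpenRouter volume": (5.0, 6),
}

for name, (multiple, months) in reported.items():
    rate = monthly_growth(multiple, months)
    print(f"{name}: {multiple}x in {months} months ~ {rate:.1%} per month")
```

On this normalization the three figures land in a similar range, which is one way to read the "acceleration" framing.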

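AI News for 5/23/2026-5/26/2026. We checked 12 subreddits, 544 Twitter accounts, and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can subscribe to or unsubscribe from individual sections on Substack!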

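Agent harnesses, coding benchmarks, and the shift beyond "just a model"

Harness engineering is becoming the main differentiator for coding agents: several posts converged on the same thesis, that the winning stack is now model + harness + eval loop rather than just a stronger base model. A long Zhihu summary argues that DeepSeek is explicitly building an agent team to close the loop between model output, runtime feedback, verification, and correction, and claims that its cached-input cost advantage supports tighter interaction/verification loops. Meanwhile, Google's Gemini managed agents guide frames agent infrastructure as a single API call to a managed agent with sandboxing, persistence, and mounts, while LangChain's updated create_agent docs and dair.ai's summary of the "Agent" paper formalize the same stack: context governance, trusted memory, and dynamic skill routing.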
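The "model + harness + eval loop" idea can be illustrated with a minimal propose/verify/correct loop. This is a toy sketch, not any vendor's harness: `run_harness`, `toy_model`, and `toy_verify` are hypothetical names, and the "model" is a stub that fixes its answer once it receives feedback.

```python
from typing import Callable, Optional


def run_harness(
    propose: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Propose, verify, and feed the failure back until the check passes."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(task, feedback)
        feedback = verify(candidate)  # None means the check passed
        if feedback is None:
            return candidate, round_no
    raise RuntimeError(f"no passing candidate after {max_rounds} rounds: {feedback}")


# Stand-in "model": buggy on the first try, corrected once it sees feedback.
def toy_model(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


# Stand-in "eval": executes the candidate and checks one known input.
def toy_verify(candidate: str) -> Optional[str]:
    scope: dict = {}
    exec(candidate, scope)
    result = scope["add"](2, 3)
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"


answer, rounds = run_harness(toy_model, toy_verify, "write add(a, b)")
print(rounds)
```

The point of the design is that the verifier's message becomes the next prompt's context, so cheaper repeated calls (for example via cached inputs) translate directly into more verification rounds per task.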

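Benchmarks are getting closer to real developer experience: DeepSWE, a new agentic coding benchmark, was strongly endorsed by practitioners; @theo called it "the first code benchmark that actually matches how it feels to program with these models." It also shows more separation at the top end than public SWE leaderboards usually do. Related benchmark signals: Qwen3.7 Max debuted at #4 on the Code Arena frontend leaderboard, roughly on par with Claude Opus 4.6 on agentic web development tasks, and Alibaba amplified the result. Across the toolchain, Anthropic released a security guidance plugin for Claude Code and reported 30–40% fewer security-related PR comments in internal use, while OpenAI highlighted GPT-5.5 in Codex on Databricks for more reliable document parsing.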

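Research agents, long-horizon reasoning, and "sleep" for context compression

Math/science agents show more capability headroom under the right harness: the strongest tweet cluster was about models solving old open problems. A mathematician reported that Claude Mythos solved Erdős problem #90, with follow-up details that the model often converged on a different, cleaner proof path than OpenAI's earlier route. The point was reinforced by @_sholtodouglas, @kimmonismus, and later Sébastien Bubeck: with a suitable harness, both Mythos and GPT-5.5 can reproduce in one shot what the internal model did, implying that a lot of latent capability is not surfaced by the basic chat UX.

Long-horizon memory is again the core bottleneck: the paper "Language Models Need Sleep" drew wide attention. The mechanism is a sleep-like consolidation phase in which recent context is converted into persistent fast weights and the KV cache is then cleared, shifting compute to an offline phase while preserving wake-time latency. dair.ai's summary emphasized the systems angle: this is an alternative to the ever-growing KV cache of long-trajectory agents. The theme ties closely to the discussion of memory systems in agents, including Omar pointing to Anthropic's memory talk and the Dream feature.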
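To make the "consolidate into fast weights, then clear the KV cache" idea concrete, here is a toy sketch. It is not the paper's method, which the summary above does not spell out: the classic outer-product fast-weight update (W += v kᵀ) is swapped in as a stand-in, and `ToyMemory` and its methods are hypothetical names.

```python
class ToyMemory:
    """Toy store: a growing KV cache while awake, fixed-size fast weights after sleep."""

    def __init__(self, dim_k: int, dim_v: int) -> None:
        self.kv_cache: list[tuple[list[float], list[float]]] = []
        # dim_v x dim_k matrix whose size does not depend on context length.
        self.fast_weights = [[0.0] * dim_k for _ in range(dim_v)]

    def observe(self, key: list[float], value: list[float]) -> None:
        """Wake phase: append to the cache, which grows with every item."""
        self.kv_cache.append((key, value))

    def sleep(self) -> None:
        """Offline phase: fold each cached pair into W (W += v k^T), then clear the cache."""
        for key, value in self.kv_cache:
            for i, v_i in enumerate(value):
                for j, k_j in enumerate(key):
                    self.fast_weights[i][j] += v_i * k_j
        self.kv_cache.clear()

    def recall(self, query: list[float]) -> list[float]:
        """Read from fast weights only: returns W @ query."""
        return [sum(w * q for w, q in zip(row, query)) for row in self.fast_weights]


mem = ToyMemory(dim_k=3, dim_v=2)
mem.observe([1.0, 0.0, 0.0], [2.0, 5.0])
mem.observe([0.0, 1.0, 0.0], [7.0, 1.0])
cached_before = len(mem.kv_cache)
mem.sleep()
recalled = mem.recall([1.0, 0.0, 0.0])
print(cached_before, len(mem.kv_cache), recalled)
```

With orthonormal keys the recall is exact; with correlated keys the stored values interfere, which is one reason real consolidation schemes need more than a plain sum of outer products.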

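Open deep research agents and scientific forecasting also advanced: QUEST, an open family of models (2B to 35B) for long-horizon fact finding, citation grounding, and report synthesis, was released as a general deep research agent. On scientific evaluation, the CUSP benchmark from Sakana/Stanford/Oxford/AI2 found that current models can usually identify promising research directions but struggle badly at judging when a breakthrough will occur.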

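Models, optimizers, and architecture updates

Optimizer work remains active, especially around Muon variants and schedule-free training: AMUSE proposes Anytime MUon with Stable gradient Evaluation, combining Muon with schedule-free-style gradient evaluation for stable anytime training without learning-rate decay, and reports gains at 124M / 720M / 1B scale and on ViT/ImageNet fine-tuning. Related implementation discussion came from ClashLuke's SFMuon snippet and kellerjordan's Modded-NanoGPT results on Newton-Muon.

The sparse attention design space continues to diversify: MiniMax previewed that M3 will be open-sourced, and follow-up technical commentary suggested a new block-sparse two-stage attention path. @kimmonismus summarized the reported speedups: 9.7× prefill and 15.6× decode at 1M tokens, relative to M2. @eliebakouch added that M3 appears to be returning to GQA-based sparse attention with block selection over the actual KV, unlike DeepSeek's compressed-attention variants.

Vision/open model releases and ranking updates: PrismML released Bonsai Image 4B, including 1-bit and ternary variants, aimed at running locally on laptops and tablets; a follow-up noted that browser-local execution may be feasible at a memory footprint of roughly 3GB. On the closed side, Microsoft's MAI-Image-2.5 debuted at #3 in Image Arena, breaking into a top-five club dominated by OpenAI and Google, with Arena reporting a score of 1,254. Meanwhile, Artificial Analysis measured Gemini 3.5 Flash at up to ~280 output tokens per second, with stronger performance but roughly 5x the cost of Gemini 3 Flash.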

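Infrastructure, systems, and the semiconductor stack

Huawei's "τ scaling" paper is being read mainly as an engineering roadmap rather than a new law: a very detailed post argued that Huawei's "time-scale theory of multi-layer electronic systems" should be read as a strategic manifesto/white paper. The core proposal is to treat the time constant τ, rather than the process node, as the unifying metric across device, chip, and datacenter scales. The most concrete claims concern logic folding in future Kirin designs, including a 55% density increase, 41% better energy efficiency, and a 13% frequency increase at a fixed node, plus packaging/networking ideas such as a Unified Bus and Hi-ONE optical I/O. The same post was careful to note the missing validation artifacts (wafer photos, SEM, workload details, yield curves) and read the most striking numbers as promising but unverified. Follow-up reactions also stressed that Huawei's path may depend more on packaging and architecture than on lithography catch-up, for example @josiah_leee citing Jensen's point that most of the Hopper→Blackwell gain came from non-node optimizations.

Datacenter power and inference supply constraints are becoming first-order concerns: SemiAnalysis published a piece on the 800VDC transition, endorsed by John Carmack, highlighting the crossover between EV power electronics and datacenter design, including high-voltage SiC parts. Separately, Epoch AI estimated a possible inference compute crunch: demand appears to be growing faster than serving capacity, especially for long-context workloads. Their rough model suggests that, under favorable assumptions, current global Blackwell supply can meet today's market demand, but throughput drops sharply as context length increases, so demand growth may already have outpaced supply.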
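The basic arithmetic behind a higher-voltage DC bus is that current scales as I = P / V and conduction loss as I²R. The sketch below illustrates this with made-up numbers: the 100 kW rack, the 50 V comparison bus, and the 1 mΩ path resistance are illustrative assumptions, not figures from the SemiAnalysis piece.

```python
def bus_current_amps(power_watts: float, bus_volts: float) -> float:
    """Current needed to deliver `power_watts` at `bus_volts` (I = P / V)."""
    return power_watts / bus_volts


def conduction_loss_watts(power_watts: float, bus_volts: float, resistance_ohms: float) -> float:
    """Resistive loss in the distribution path (P_loss = I^2 * R)."""
    current = bus_current_amps(power_watts, bus_volts)
    return current * current * resistance_ohms


RACK_POWER_W = 100_000      # hypothetical rack power
LOW_BUS_V = 50              # hypothetical low-voltage bus, for comparison only
HIGH_BUS_V = 800            # the 800VDC bus discussed above
PATH_RESISTANCE_OHM = 0.001 # hypothetical busbar/cable resistance

for volts in (LOW_BUS_V, HIGH_BUS_V):
    amps = bus_current_amps(RACK_POWER_W, volts)
    loss = conduction_loss_watts(RACK_POWER_W, volts, PATH_RESISTANCE_OHM)
    print(f"{volts} V bus: {amps:.0f} A, {loss:.1f} W lost in the path")
```

For the same power and conductor, raising the voltage 16x cuts current 16x and resistive loss 256x, which is why the power electronics (including high-voltage SiC devices) become the interesting part of the design.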

Production tooling and developer infrastructure

Serving/inference stacks got meaningful performance and observability updates: vLLM merged a Rust frontend as a drop-in alternative to the Python API server, with early numbers showing ~837 req/s vs ~162 req/s on a preprocess-heavy workload in a single process. W&B launched an MCP server to let coding agents inspect experiments and training runs, with a schema-first redesign aimed at avoiding context-window blowups. Unsloth added support for running GPT, Claude, and other APIs inside its local UI, including prompt caching and code execution.

Cloudflare, OpenRouter, and vector/retrieval vendors pushed the “productionization” layer: OpenRouter announced a $113M Series B and said weekly volume had grown from 5T to 25T tokens over six months. Cloudflare relaunched its startups program with up to $350k in credits, while separate posts around Think and agent ergonomics emphasized durable turns, reconnects, stale-state handling, and recovery as key practical differentiators. On retrieval infra, Booking.com discussed scaling to 100M+ embeddings, including filtered vector search, reads-during-writes, concurrency, and human-in-the-loop evals for partner messaging agents.
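As a rough illustration of what a routing layer does, the sketch below tries providers for a model in priority order and falls back when one fails. It is a toy under stated assumptions, not OpenRouter's API: `Router`, `register`, `complete`, and both provider functions are hypothetical.

```python
from typing import Callable

Provider = Callable[[str], str]


class Router:
    """Try each provider registered for a model in order; fall back on failure."""

    def __init__(self) -> None:
        self._routes: dict[str, list[tuple[str, Provider]]] = {}

    def register(self, model: str, provider_name: str, call: Provider) -> None:
        self._routes.setdefault(model, []).append((provider_name, call))

    def complete(self, model: str, prompt: str) -> tuple[str, str]:
        errors = []
        for provider_name, call in self._routes.get(model, []):
            try:
                return provider_name, call(prompt)
            except Exception as exc:  # a real router would distinguish error types
                errors.append(f"{provider_name}: {exc}")
        raise RuntimeError(f"all providers failed for {model!r}: {errors}")


# Hypothetical providers: the primary always times out, the backup echoes the prompt.
def flaky_provider(prompt: str) -> str:
    raise TimeoutError("upstream timed out")


def backup_provider(prompt: str) -> str:
    return f"echo: {prompt}"


router = Router()
router.register("example-model", "primary", flaky_provider)
router.register("example-model", "backup", backup_provider)
served_by, text = router.complete("example-model", "hi")
print(served_by, text)
```

A production router would also weigh price, latency, and rate limits when ordering providers; the fallback loop is only the simplest piece.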

Top tweets (by engagement)

Codex / agentic coding in practice: The highest-signal product-use tweet was @bunkaich showing Codex help reverse-engineer and patch firmware on a cheap MP3 player, with the workflow spanning chip inspection, OS extraction, binary analysis, and flashing a modified image.

DeepSWE benchmark launch: @serenaa_ge’s DeepSWE announcement became the main reference point for “does this match real coding experience?” discussion.

Claude Code security plugin: @ClaudeDevs’ release stood out because it paired a concrete product launch with an internal metric: 30–40% fewer security-related PR comments.

OpenRouter financing + production token growth: @OpenRouter’s $113M Series B is one of the clearer market signals that routing and multi-model infra are now seen as durable platform layers.

vLLM Rust frontend: @vllm_project’s merge announcement mattered for anyone hitting CPU/API-server bottlenecks in high-throughput serving.

[Waiting for Qwen 3.7 open weight... The new King has arrived...](https://www.reddit.com/r/LocalLLaMA/comments/1tjvz6l/waiting_for_qwen_37_open_weight_the_new_king_has/) (Activity: 1217): The [image](https://i.redd.it/j8qkty82qj2h1.png) is a benchmark/marketing comparison from the [Qwen3.7 blog](https://qwen.ai/blog?id=qwen3.7) positioning Qwen3.7-Max as a leading frontier model across agentic coding, software engineering, MCP/tool-use, reasoning, and knowledge evaluations versus Qwen3.6-Plus, DS-V4-Pro Max, GLM-5.1, Kimi K2.6, and Claude Opus-4.6 Max. The technical significance is that the slide frames Qwen3.7-Max as highly competitive with or ahead of Claude-class models on many benchmarks, though Claude Opus-4.6 Max still appears to lead on some tasks such as ClawEval and CoWorkBench. Commenters note that this is the Max model, not necessarily representative of smaller/open-weight releases, and speculate about a potential 3.7-122B-A17B MXFP4 model with 512k context for local hardware such as Strix Halo. The main debate is skepticism around open weights: commenters point out that Qwen has historically not open-weighted the Max series, so the title’s “waiting for open weight” framing may be unrealistic. Others caution not to expect a hypothetical 27B model to match the shown Max-tier benchmark results.

Several commenters distinguish Qwen Max from likely open-weight releases, noting that _“Qwen has never open-weighted the Max series”_ and warning not to expect a smaller 27B variant to match Max-level benchmark performance. The implied technical takeaway is that any public/open-weight Qwen 3.7 release may use a different architecture/scale than the benchmarked flagship model.

One technical wishlist centers on a hypothetical Qwen 3.7 122B-A17B MTP MXFP4 model with 512k context, which commenters argue would be well-suited to Strix Halo-class local hardware. Another user references Qwen 3.5 397B-A17B NVFP4, claiming it fits on 4x RTX 6000 Pro GPUs with enough memory headroom for roughly 10 concurrent 200k-token sessions, positioning it as a potential “Opus at home” if Qwen 3.7 matches reported benchmarks.

A commenter argues that open-weight frontier releases may be less likely because highly capable local models can undermine provider monetization. They claim Qwen’s strategy has shifted from disruption toward monetized frontier competition, which could affect whether large MoE models like 397B-A17B are released openly.

[Qwen3.6 35Ba3 has changed my workflows and even how I use my computer](https://www.reddit.com/r/LocalLLaMA/comments/1tjwrp7/qwen36_35ba3_has_changed_my_workflows_and_even/) (Activity: 567): The post describes a local-agent workflow using Qwen3.6 35B a3 via pi, where the user converts repeatable procedures into “skills” generated/documented by Codex, then reuses them for VPS DevOps, docling PDF→EPUB conversion, Playwright testing, code tickets, and OS-level shell tasks. A concrete example: WhatsApp audio → transcription in AnythingLLM → content.md → locally generated landing page, then a plan.md ticket queue executed by a “manager” pi process spawning fresh-context sub-agents with pi -p @plan.md "Check the first Ticket with Status UNDONE and do it", marking tickets DONE, committing via git, and finally deploying via a VPS skill. Commenters focused on operational concerns: what hardware can run this setup, whether the agent is sandboxed/trustworthy with OS access, and how hard pi is to adopt compared with other agentic tools such as Hermes.
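The ticket-queue part of that workflow can be sketched as a small manager loop: find the first UNDONE ticket, hand it to a worker, mark it DONE, repeat. The post does not show the actual plan.md layout, so the `- [UNDONE] title` line format here is an assumption, and `run_ticket` is a hypothetical stand-in for spawning a fresh-context sub-agent.

```python
import re
from typing import Callable, Optional

# Assumed ticket line format: "- [UNDONE] title" or "- [DONE] title".
TICKET_RE = re.compile(r"^- \[(UNDONE|DONE)\] (.+)$")


def first_undone(plan_text: str) -> Optional[tuple[int, str]]:
    """Return (line index, title) of the first UNDONE ticket, or None if none remain."""
    for line_no, line in enumerate(plan_text.splitlines()):
        match = TICKET_RE.match(line)
        if match and match.group(1) == "UNDONE":
            return line_no, match.group(2)
    return None


def mark_done(plan_text: str, line_no: int) -> str:
    """Flip the status marker on one ticket line."""
    lines = plan_text.splitlines()
    lines[line_no] = lines[line_no].replace("[UNDONE]", "[DONE]", 1)
    return "\n".join(lines)


def run_manager(plan_text: str, run_ticket: Callable[[str], None]) -> str:
    """Process tickets one at a time; each call to run_ticket stands in for one sub-agent."""
    while (ticket := first_undone(plan_text)) is not None:
        line_no, title = ticket
        run_ticket(title)
        plan_text = mark_done(plan_text, line_no)
    return plan_text


plan = "# Plan\n- [DONE] Create repo\n- [UNDONE] Build landing page\n- [UNDONE] Add contact form"
executed: list[str] = []
final_plan = run_manager(plan, executed.append)
print(final_plan)
```

Keeping the queue state in the file rather than in the agent's context is what lets each sub-agent start fresh; a git commit after each `mark_done` would give the per-ticket history the post describes.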

A user reports running unsloth/Qwen3.6-35B-A3B-MTP-GGUF via Unsloth Studio on an MS-02 with a 24GB RTX Pro 4000 Blackwell SFF GPU, consistently seeing >100 tokens/s. They compare performance to “unoptimized GGUFs” on a Mac Studio M2, using the MS-02 as a small remote GPU server for the Mac workstation, and note that future MLX support in Unsloth could improve Mac-side performance.

[110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/) (Activity: 565): The post benchmarks Qwen3.6-35B-A3B MTP using byteshape’s IQ4_XS 4.19 bpw [GGUF](https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF) on an RTX 4070 Super 12GB + Ryzen 7 9700X, comparing upstream llama.cpp vs ik_llama.cpp with --ctx-size 131072, q8_0 KV cache, MTP draft max 3, and p_min=0.75. Using the same mtp-bench.py workload, upstream llama.cpp averaged 89.76 tok/s with aggregate MTP accept rate 0.9393, while ik_llama.cpp averaged 110.24 tok/s over 16.64s, a claimed 23% throughput gain, despite lower aggregate accept rate 0.8749 in the updated results. The OP attributes practical fit to --fit/--fit-margin 1664 on ik_llama.cpp, with OOM mitigation by raising --fit-margin to 1792 or 2048, and notes that running the display on an iGPU frees essentially all 12GB VRAM for inference. Commenters focused on reproducibility: they requested the full upstream llama.cpp command and noted that several MTP-related PRs had merged recently, so benchmark timing may depend strongly on build date. One technical workaround suggested for single-GPU CachyOS/KDE users is a software-rendered Plasma Wayland session using LIBGL_ALWAYS_SOFTWARE=1 and GALLIUM_DRIVER=llvmpipe, reducing idle VRAM from roughly >1024MB to 126MB at the cost of slow/disabled compositor effects.
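The quoted numbers can be sanity-checked with a few lines of arithmetic. The first function reproduces the claimed throughput gain from the two averages. The second applies the standard speculative-decoding estimate of tokens produced per verification step, (1 − a^(n+1)) / (1 − a); treating the post's "aggregate accept rate" as an independent per-token acceptance probability a with draft length n = 3 is an assumption made here for illustration.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Fractional throughput improvement of `candidate` over `baseline`."""
    return candidate / baseline - 1.0


def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification step under i.i.d. per-token acceptance."""
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


upstream_tps = 89.76  # tok/s, upstream llama.cpp (from the post)
ik_tps = 110.24       # tok/s, ik_llama.cpp (from the post)

gain = relative_gain(upstream_tps, ik_tps)
print(f"throughput gain: {gain:.1%}")

for name, rate in (("upstream", 0.9393), ("ik_llama.cpp", 0.8749)):
    tokens = expected_tokens_per_step(rate, draft_len=3)
    print(f"{name}: ~{tokens:.2f} tokens per verification step")
```

Under this simplified model the lower accept rate yields fewer tokens per step, so a higher overall tok/s would have to come from cheaper steps, which is consistent with commenters asking for exact commands and build dates.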

A CachyOS/KDE Wayland user described a VRAM-saving workaround for single-GPU systems: create a custom SDDM session that forces KDE Plasma to render via CPU using LIBGL_ALWAYS_SOFTWARE=1, GALLIUM_DRIVER=llvmpipe, and KWIN_COMPOSE=Q. They reported KDE Wayland idle VRAM dropping from >1024 MB to ~126 MB, freeing nearly a gigabyte of VRAM for running the 35B model, at the cost of disabled or very slow compositor animations.

Several commenters focused on whether the reported 110 tok/s comes from ik_llama.cpp having better MTP/speculative decoding behavior than upstream llama.cpp. One noted that ik_llama.cpp’s acceptance rate was reportedly never below 0.790, while llama.cpp dropped as low as 0.477, asking for the exact llama.cpp command/settings and noting that multiple MTP-related PRs had landed in llama.cpp within the previous 24 hours.

A commenter asked about the IQ4_XS quantization used for Qwen3.6 35B A3B, noting it appears to be the lowest-memory Q4 quant and requesting details on both model quality/intelligence impact and the final VRAM/RAM split. This highlights the key tradeoff for 12 GB VRAM runs: fitting the model via aggressive quantization versus maintaining reasoning quality and avoiding excessive CPU/RAM offload bottlenecks.
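A back-of-the-envelope estimate shows why the VRAM/RAM split matters at this quantization level. The sketch assumes a nominal 35e9 parameters (taken from the model name, not a measured count) at the quoted 4.19 bits per weight, and ignores the KV cache, activations, and any fit margin.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 35e9  # assumption: "35B" read literally from the model name
BITS_PER_WEIGHT = 4.19 # quoted for the IQ4_XS GGUF
VRAM_GB = 12.0         # RTX 4070 Super

weights_gb = weight_size_gb(NOMINAL_PARAMS, BITS_PER_WEIGHT)
min_offload_gb = max(0.0, weights_gb - VRAM_GB)
print(f"weights ~ {weights_gb:.1f} GB, at least {min_offload_gb:.1f} GB must live outside VRAM")
```

Even before the KV cache is counted, the weights alone exceed 12 GB, so part of the model has to sit in system RAM; for an MoE with roughly 3B active parameters per token, which tensors are offloaded is likely to matter more than the total size.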