Latent Space

[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)

Score: 8.5

TL;DR · AI Summary

AI infrastructure companies such as Fireworks, Baseten, and OpenRouter are raising large amounts of funding, indicating strong growth momentum in the field.

Key Takeaways

  • Fireworks is reportedly in talks for a $1.5 billion round, with its valuation up 3.75x in seven months.
  • Baseten is reportedly raising a $1.1 billion round, with its valuation up 2.2x in three months.
  • OpenRouter completed a $113 million Series C round, with volume up fivefold in six months.

Outline


  1. Introduce several AI infrastructure companies raising large amounts of funding.

  2. Fireworks Funding

    Fireworks is reportedly in talks for a $1.5 billion round, with its valuation up 3.75x in seven months.

  3. Baseten Funding

    Baseten is reportedly raising a $1.1 billion round, with its valuation up 2.2x in three months.

  4. OpenRouter Funding

    OpenRouter completed a $113 million Series C round, with volume up fivefold in six months.

  5. Several AI infrastructure companies are raising large amounts of funding, indicating strong growth momentum in the field.

Mindmap

  • AI infrastructure funding
    • Fireworks
      • $1.5 billion round, valuation up 3.75x
    • Baseten
      • $1.1 billion round, valuation up 2.2x
    • OpenRouter
      • $113 million Series C round, up 5x


#AI Infrastructure #Funding #Unicorns #Fireworks #Baseten

Source URL: https://www.latent.space/p/ainews-new-ai-infra-decacorns-fireworks

Published: 2026-05-27T03:33:53+00:00

_Take the 2026 AI Engineering Survey and get >$2k in credits and AIE workshop tickets!_

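Readers love it when we report that there is no news, but our second favorite thing is being able to simply reinforce the trends you should be paying attention to. In April we covered the Inference Inflection, and today's title should remind you of last week's title, New AI Infra Unicorns: Exa, which is exactly our point.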

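Given the pace of AI fundraising, our usual policy is to cover only startups valued above ten billion dollars (> $10B), and only once the round is confirmed. Today's news of Fireworks' $1.5 billion round ("in talks", up 3.75x in seven months; we discussed it on the podcast here) and Baseten's $1.1 billion round ("raising", up 2.2x in three months) is a little early, but the accelerating investment in inference and the unicorn-to-decacorn progression are too tempting not to headline today. Add OpenRouter's $113 million Series C (volume up fivefold in six months) as a garnish: if you plan to do multi-model inference, you need a router.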
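To compare the pace behind the three multiples quoted above, which span different time windows, one rough approach is to convert each into an equivalent constant monthly growth rate. This is a back-of-the-envelope sketch: it assumes smooth compounding (funding rounds are discrete events), the function name `monthly_growth` is illustrative, and the OpenRouter figure is a volume multiple rather than a valuation multiple.

```python
# Multiples and time windows as quoted in the paragraph above.
# Note: the OpenRouter figure is a volume multiple, not a valuation multiple.
REPORTED = {
    "Fireworks (valuation)": (3.75, 7),
    "Baseten (valuation)": (2.2, 3),
    "OpenRouter (volume)": (5.0, 6),
}


def monthly_growth(multiple: float, months: int) -> float:
    """Constant monthly rate that compounds to `multiple` over `months`."""
    return multiple ** (1.0 / months) - 1.0


for name, (multiple, months) in REPORTED.items():
    rate = monthly_growth(multiple, months)
    print(f"{name}: {multiple}x in {months} months ~ {rate:.1%} per month")
```

Under this simplification, the Baseten and OpenRouter multiples both imply roughly 30% per month, and the Fireworks multiple roughly 21% per month.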

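AI News for 5/23/2026-5/26/2026. We checked 12 subreddits, 544 tweets, and no further Discord channels. AINews' website lets you search all past issues. As a reminder, AINews is now part of Latent Space. You can subscribe to or unsubscribe from sections on Substack.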

Agent harnesses, coding benchmarks, and the shift beyond "just a model"

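  • Harness engineering is becoming the main differentiator for coding agents: several posts converged on the same thesis, that the winning stack is now model + harness + eval loop rather than just a stronger base model. A long Zhihu summary argues that DeepSeek is explicitly building a harness team to close the loop between model output, runtime feedback, verification, and correction, and claims that a cached-input cost advantage supports tighter interaction/verification loops. Meanwhile, Google's Gemini managed agents guide frames agent infrastructure as a single API call to a managed agent with sandboxing, persistence, and mounting, while LangChain's updated create_agent docs and dair.ai's summary of the "agent" paper formalize the same stack: context governance, trusted memory, and dynamic skill routing.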
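The closed loop described above (model output, runtime feedback, verification, correction) can be illustrated with a minimal sketch. Everything here is hypothetical: `toy_model` stands in for a real model call, `toy_verifier` stands in for a test runner or sandbox, and `run_harness` is an illustrative name, not an API from DeepSeek, Google, or LangChain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    candidate: str
    passed: bool
    feedback: str


def run_harness(
    task: str,
    model: Callable[[str, Optional[str]], str],
    verifier: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Generate, verify, and feed verifier output back until success."""
    history: List[Attempt] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = model(task, feedback)       # model output
        passed, feedback = verifier(candidate)  # runtime feedback + verification
        history.append(Attempt(candidate, passed, feedback))
        if passed:
            break                               # no correction needed
    return history


def toy_model(task: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in: buggy on the first try, fixed once it sees feedback.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_verifier(candidate: str) -> Tuple[bool, str]:
    # Hypothetical stand-in for a sandboxed test run.
    namespace: dict = {}
    exec(candidate, namespace)
    result = namespace["add"](2, 3)
    if result == 5:
        return True, "ok"
    return False, f"add(2, 3) returned {result}, expected 5"


history = run_harness("write add(a, b)", toy_model, toy_verifier)
for i, attempt in enumerate(history, start=1):
    print(f"round {i}: passed={attempt.passed} feedback={attempt.feedback}")
```

The point of the sketch is that the loop, not the model stub, carries the structure: swapping in a stronger model changes nothing about how feedback is routed back.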

Research agents, long-horizon reasoning, and "sleep" for context compression

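  • Math/science agents show more capability headroom with the right harness: the strongest tweet cluster centered on models solving old open problems. A mathematician reported that Claude Mythos solved Erdős Problem #90, with follow-up details noting that the model often converged on a different, cleaner proof route than OpenAI's earlier one. This view was reinforced by @_sholtodouglas, @kimmonismus, and later Sébastien Bubeck: with the right harness, both Mythos and GPT-5.5 can reproduce in one shot what internal models did, which suggests that a large amount of latent capability is not surfaced by the basic chat UX.
  • Long-horizon memory is again the core bottleneck: the paper "Language Models Need Sleep" drew wide attention. The mechanism is a sleep-like consolidation phase in which recent context is converted into persistent fast weights and the KV cache is then cleared, shifting compute to an offline phase while preserving wake-time latency. dair.ai's summary emphasizes the systems angle: this is an alternative to the ever-growing KV cache of long-trajectory agents. The theme ties closely to discussions of memory systems in agents, including Omar pointing to Anthropic's memory talk and Dream feature.
  • Open deep-research agents and scientific forecasting also advanced: QUEST, an open family of models (2B to 35B) for long-horizon fact finding, citation grounding, and report synthesis, was released as a general-purpose deep research agent. On the science evaluation side, the CUSP benchmark from Sakana/Stanford/Oxford/AI2 found that current models can often identify promising research directions but struggle badly to judge when a breakthrough will occur.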
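The sleep-style consolidation mentioned in the list above (recent context folded into persistent fast weights, then the KV cache cleared) can be illustrated with a toy linear associative memory. This is a sketch of the general idea only, not the paper's method: the class `SleepyMemory`, the outer-product update, and the unnormalized dot-product readout are all assumptions chosen for simplicity.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class SleepyMemory:
    """Toy memory: a bounded KV cache plus a persistent fast-weight matrix."""

    def __init__(self, dim: int, cache_limit: int) -> None:
        self.dim = dim
        self.cache_limit = cache_limit
        self.kv_cache: List[Tuple[Vector, Vector]] = []
        self.fast_weights: List[Vector] = [[0.0] * dim for _ in range(dim)]

    def observe(self, key: Vector, value: Vector) -> None:
        """Wake phase: append to the cache; sleep once the cache is full."""
        self.kv_cache.append((key, value))
        if len(self.kv_cache) >= self.cache_limit:
            self.sleep()

    def sleep(self) -> None:
        """Offline phase: fold cached pairs into fast weights, then clear."""
        for key, value in self.kv_cache:
            for i in range(self.dim):
                for j in range(self.dim):
                    # Outer-product update: W += value * key^T
                    self.fast_weights[i][j] += value[i] * key[j]
        self.kv_cache.clear()

    def recall(self, query: Vector) -> Vector:
        """Read from fast weights plus whatever is still in the cache."""
        out = [dot(row, query) for row in self.fast_weights]
        for key, value in self.kv_cache:
            score = dot(key, query)  # unnormalized linear attention score
            out = [o + score * v for o, v in zip(out, value)]
        return out


memory = SleepyMemory(dim=3, cache_limit=2)
memory.observe([1.0, 0.0, 0.0], [0.5, 0.0, 1.0])
before_sleep = memory.recall([1.0, 0.0, 0.0])
memory.observe([0.0, 1.0, 0.0], [2.0, 2.0, 0.0])  # fills the cache, triggers sleep
after_sleep = memory.recall([1.0, 0.0, 0.0])
print(before_sleep, after_sleep, len(memory.kv_cache))
```

With orthogonal keys, the readout is the same before and after consolidation while the cache shrinks to zero, which is the property the bullet describes. Real keys are not orthogonal, so a real system would face interference that this toy ignores.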

Models, optimizers, and architecture updates

Infrastructure, systems, and the semiconductor stack

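  • Huawei's "τ scaling" paper is mostly being read as an engineering roadmap rather than a new law: a very detailed post argues that Huawei's "time-scale theory of multi-layer electronic systems" should be interpreted as a strategic manifesto/white paper. The core proposal is to treat the time constant τ, rather than the process node, as the unifying metric across device, chip, and data center scales. The most concrete claims concern logic folding in future Kirin designs, including a 55% density increase, a 41% energy-efficiency improvement, and a 13% frequency increase at a fixed node, along with packaging/networking ideas such as a unified bus and Hi-ONE optical I/O. The same post is careful to note the missing verification artifacts (wafer photos, SEM, workload details, yield curves) and reads the most striking numbers as promising but unverified. Follow-up reactions also stressed that Huawei's path may depend more on packaging and architecture than on catching up in lithography, for example @josiah_leee citing Jensen's point that most of the Hopper→Blackwell gains came from non-node optimizations.
  • Data center power and inference supply constraints are becoming first-order concerns: SemiAnalysis published a piece on the 800VDC transition, recommended by John Carmack, that highlights the overlap between EV power electronics and data center design, including high-voltage SiC components. Separately, Epoch AI estimated a possible inference compute crunch: demand appears to be growing faster than serving capacity, especially for long-context workloads. Their rough model suggests that, under favorable assumptions, current global Blackwell supply can meet today's market demand, but throughput falls sharply as context length increases, so demand growth may already have outpaced supply.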
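The appeal of a higher distribution voltage such as the 800VDC transition mentioned above follows from basic circuit arithmetic: for a fixed power, current scales as I = P / V and resistive loss as I²R. The sketch below works one example. The 54 V baseline, the 100 kW load, and the 1 mΩ path resistance are illustrative assumptions, not figures from the SemiAnalysis piece.

```python
def bus_current(power_w: float, voltage_v: float) -> float:
    """Current needed to deliver `power_w` at `voltage_v` (I = P / V)."""
    return power_w / voltage_v


def conduction_loss(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive loss in the distribution path (P_loss = I^2 * R)."""
    return bus_current(power_w, voltage_v) ** 2 * resistance_ohm


# Hypothetical numbers, chosen only to make the scaling visible.
RACK_POWER_W = 100_000.0   # assumed 100 kW load
PATH_RESISTANCE = 0.001    # assumed 1 milliohm distribution path
LOW_V, HIGH_V = 54.0, 800.0

for volts in (LOW_V, HIGH_V):
    amps = bus_current(RACK_POWER_W, volts)
    loss = conduction_loss(RACK_POWER_W, volts, PATH_RESISTANCE)
    print(f"{volts:.0f} V: {amps:,.0f} A, {loss:,.1f} W lost in the path")

loss_ratio = conduction_loss(RACK_POWER_W, LOW_V, PATH_RESISTANCE) / conduction_loss(
    RACK_POWER_W, HIGH_V, PATH_RESISTANCE
)
print(f"loss ratio (low V / high V): {loss_ratio:.0f}x")
```

Because loss scales with the square of current, the ratio depends only on the two voltages, (800 / 54)², regardless of the assumed load or resistance.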

Production tools and developer infrastructure

Top tweets (by engagement)

  • Claude Code security plugin: @ClaudeDevs’ release stood out because it paired a concrete product launch with an internal metric: 30–40% fewer security-related PR comments.
  • OpenRouter financing + production token growth: @OpenRouter’s $113M Series B is one of the clearer market signals that routing and multi-model infra are now seen as durable platform layers.
  • [Waiting for Qwen 3.7 open weight... The new King has arrived...](https://www.reddit.com/r/LocalLLaMA/comments/1tjvz6l/waiting_for_qwen_37_open_weight_the_new_king_has/) (Activity: 1217): The [image](https://i.redd.it/j8qkty82qj2h1.png) is a benchmark/marketing comparison from the [Qwen3.7 blog](https://qwen.ai/blog?id=qwen3.7) positioning Qwen3.7-Max as a leading frontier model across agentic coding, software engineering, MCP/tool-use, reasoning, and knowledge evaluations versus Qwen3.6-Plus, DS-V4-Pro Max, GLM-5.1, Kimi K2.6, and Claude Opus-4.6 Max. The technical significance is that the slide frames Qwen3.7-Max as highly competitive with or ahead of Claude-class models on many benchmarks, though Claude Opus-4.6 Max still appears to lead on some tasks such as ClawEval and CoWorkBench. Commenters note that this is the Max model, not necessarily representative of smaller/open-weight releases, and speculate about a potential 3.7-122B-A17B MXFP4 model with 512k context for local hardware such as Strix Halo. The main debate is skepticism around open weights: commenters point out that Qwen has historically not open-weighted the Max series, so the title’s “waiting for open weight” framing may be unrealistic. Others caution not to expect a hypothetical 27B model to match the shown Max-tier benchmark results.
  • Several commenters distinguish Qwen Max from likely open-weight releases, noting that _“Qwen has never open-weighted the Max series”_ and warning not to expect a smaller 27B variant to match Max-level benchmark performance. The implied technical takeaway is that any public/open-weight Qwen 3.7 release may use a different architecture/scale than the benchmarked flagship model.
  • One technical wishlist centers on a hypothetical Qwen 3.7 122B-A17B MTP MXFP4 model with 512k context, which commenters argue would be well-suited to Strix Halo-class local hardware. Another user references Qwen 3.5 397B-A17B NVFP4, claiming it fits on 4x RTX 6000 Pro GPUs with enough memory headroom for roughly 10 concurrent 200k-token sessions, positioning it as a potential “Opus at home” if Qwen 3.7 matches reported benchmarks.
  • A commenter argues that open-weight frontier releases may be less likely because highly capable local models can undermine provider monetization. They claim Qwen’s strategy has shifted from disruption toward monetized frontier competition, which could affect whether large MoE models like 397B-A17B are released openly.
  • [Qwen3.6 35Ba3 has changed my workflows and even how I use my computer](https://www.reddit.com/r/LocalLLaMA/comments/1tjwrp7/qwen36_35ba3_has_changed_my_workflows_and_even/) (Activity: 567): The post describes a local-agent workflow using Qwen3.6 35B a3 via `pi`, where the user converts repeatable procedures into “skills” generated/documented by Codex, then reuses them for VPS DevOps, `docling` PDF→EPUB conversion, Playwright testing, code tickets, and OS-level shell tasks. A concrete example: WhatsApp audio → transcription in AnythingLLM → `content.md` → locally generated landing page, then a `plan.md` ticket queue executed by a “manager” `pi` process spawning fresh-context sub-agents with `pi -p @plan.md "Check the first Ticket with Status UNDONE and do it"`, marking tickets `DONE`, committing via git, and finally deploying via a VPS skill. Commenters focused on operational concerns: what hardware can run this setup, whether the agent is sandboxed/trustworthy with OS access, and how hard pi is to adopt compared with other agentic tools such as Hermes.
  • A user reports running unsloth/Qwen3.6-35B-A3B-MTP-GGUF via Unsloth Studio on an MS-02 with a 24GB RTX Pro 4000 Blackwell SFF GPU, consistently seeing >100 tokens/s. They compare performance to “unoptimized GGUFs” on a Mac Studio M2, using the MS-02 as a small remote GPU server for the Mac workstation, and note that future MLX support in Unsloth could improve Mac-side performance. Screenshot: preview.redd.it.
  • [110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp](https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/) (Activity: 565): The post benchmarks Qwen3.6-35B-A3B MTP using byteshape’s `IQ4_XS` 4.19 bpw [GGUF](https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF) on an RTX 4070 Super 12GB + Ryzen 7 9700X, comparing upstream `llama.cpp` vs `ik_llama.cpp` with `--ctx-size 131072`, `q8_0` KV cache, MTP draft max `3`, and `p_min=0.75`. Using the same `mtp-bench.py` workload, upstream `llama.cpp` averaged 89.76 tok/s with aggregate MTP accept rate 0.9393, while `ik_llama.cpp` averaged 110.24 tok/s over 16.64s, a claimed 23% throughput gain, despite lower aggregate accept rate 0.8749 in the updated results. The OP attributes practical fit to `--fit`/`--fit-margin 1664` on `ik_llama.cpp`, with OOM mitigation by raising `--fit-margin` to `1792` or `2048`, and notes that running the display on an iGPU frees essentially all 12GB VRAM for inference. Commenters focused on reproducibility: they requested the full upstream llama.cpp command and noted that several MTP-related PRs had merged recently, so benchmark timing may depend strongly on build date. One technical workaround suggested for single-GPU CachyOS/KDE users is a software-rendered Plasma Wayland session using `LIBGL_ALWAYS_SOFTWARE=1` and `GALLIUM_DRIVER=llvmpipe`, reducing idle VRAM from roughly >1024MB to 126MB at the cost of slow/disabled compositor effects.
  • A CachyOS/KDE Wayland user described a VRAM-saving workaround for single-GPU systems: create a custom SDDM session that forces KDE Plasma to render via CPU using LIBGL_ALWAYS_SOFTWARE=1, GALLIUM_DRIVER=llvmpipe, and KWIN_COMPOSE=Q. They reported KDE Wayland idle VRAM dropping from >1024 MB to ~126 MB, freeing nearly a gigabyte of VRAM for running the 35B model, at the cost of disabled or very slow compositor animations.
  • Several commenters focused on whether the reported 110 tok/s comes from ik_llama.cpp having better MTP/speculative decoding behavior than upstream llama.cpp. One noted that ik_llama.cpp’s acceptance rate was reportedly never below 0.790, while llama.cpp dropped as low as 0.477, asking for the exact llama.cpp command/settings and noting that multiple MTP-related PRs had landed in llama.cpp within the previous 24 hours.
  • A commenter asked about the IQ4_XS quantization used for Qwen3.6 35B A3B, noting it appears to be the lowest-memory Q4 quant and requesting details on both model quality/intelligence impact and the final VRAM/RAM split. This highlights the key tradeoff for 12 GB VRAM runs: fitting the model via aggressive quantization versus maintaining reasoning quality and avoiding excessive CPU/RAM offload bottlenecks.
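The numbers reported in the ik_llama.cpp benchmark thread above can be sanity-checked with a little arithmetic. The sketch below recomputes the throughput gain and, as a rough aid to the commenters' question, estimates tokens emitted per verification pass from the accept rates. That second part rests on a simplifying assumption that is not in the post: each drafted token is accepted independently with the aggregate accept rate, which gives the standard geometric-series estimate. The function names are illustrative.

```python
# Figures as reported in the thread summarized above.
UPSTREAM_TOK_S = 89.76
IK_TOK_S = 110.24
IK_SECONDS = 16.64
DRAFT_MAX = 3
ACCEPT_RATE = {"llama.cpp": 0.9393, "ik_llama.cpp": 0.8749}


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return new / old - 1.0


def expected_tokens_per_step(accept_rate: float, draft_max: int) -> float:
    """Expected tokens per verification pass, assuming each drafted token is
    accepted independently with probability `accept_rate` (an assumption).

    This is the geometric sum 1 + a + a^2 + ... + a^draft_max.
    """
    if accept_rate == 1.0:
        return float(draft_max + 1)
    return (1.0 - accept_rate ** (draft_max + 1)) / (1.0 - accept_rate)


gain = relative_gain(IK_TOK_S, UPSTREAM_TOK_S)
print(f"throughput gain: {gain:.1%}")
print(f"approx tokens generated by ik_llama.cpp: {IK_TOK_S * IK_SECONDS:.0f}")

for name, rate in ACCEPT_RATE.items():
    tokens = expected_tokens_per_step(rate, DRAFT_MAX)
    print(f"{name}: ~{tokens:.2f} tokens per verification pass")
```

Under that independence assumption, upstream llama.cpp would emit more tokens per verification pass than ik_llama.cpp, so the reported speedup would have to come from faster passes rather than better draft acceptance. That is consistent with the commenters asking for exact commands and build dates before drawing conclusions.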
