Fireworks AI (@FireworksAI_HQ)

Fireworks AI on X: We ran 720 browser agent tasks with @nottecore across frontier models

Score: 8.5

TL;DR · AI Summary

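Fireworks AI ran 720 browser agent tasks with @nottecore across frontier models. One baseline model produced malformed outputs in roughly 1 of every 5 calls, forcing retries inside multi-step workflows, while Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks showed near-zero retry rates and stable latency. The post ties that gap to cost, latency, and reliability differences in production agent systems.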

Key Takeaways

  • One baseline model produced malformed outputs in ~1 of every 5 calls, causing retries inside multi-step workflows
  • Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks achieved near-zero retry rates with stable latency
  • Execution gaps translate directly into cost, latency, and reliability differences in production agent systems
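
To make the retry mechanism concrete, here is a minimal sketch of how a malformed model output can turn into an extra call inside one step of an agent loop. It is an illustration under stated assumptions, not the harness used in the test: the post does not say what output format the agents expect, so the JSON object with an `"action"` key is assumed, and `call_model`, `call_with_retries`, and `max_attempts` are hypothetical names.

```python
import json
from typing import Callable, Optional, Tuple


def call_with_retries(
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns raw model text
    prompt: str,
    max_attempts: int = 3,             # hypothetical retry budget per step
) -> Tuple[Optional[dict], int]:
    """Return (parsed_action, attempts_used); parsed_action is None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            # Malformed output: the step must be retried, which costs another call.
            continue
        # Assumed schema: a JSON object that names the action to take.
        if isinstance(action, dict) and "action" in action:
            return action, attempt
    return None, max_attempts


if __name__ == "__main__":
    # Simulated model: the first reply is truncated JSON, the second is valid.
    replies = iter(['{"action": ', '{"action": "click"}'])
    result, attempts = call_with_retries(lambda _prompt: next(replies), "open the page")
    print(result, attempts)
```

Each pass through the loop is a billed call and adds its own latency, which is why the malformed-output rate matters more as a workflow gains steps.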

Outline

  1. Fireworks AI ran 720 browser agent tasks with @nottecore across frontier models and found significant execution problems in one baseline model.

  2. That baseline model produced malformed outputs in about 1 of every 5 calls (~20%), which led to retries inside multi-step workflows.

  3. Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks had near-zero retry rates, and latency stayed stable as tasks extended across multiple steps.

  4. With the same workload and the same agent loop, these execution differences show up as cost, latency, and reliability divergence in production agent systems.

  5. The linked report contains the full analysis.

Mindmap

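  • Analysis of execution differences in browser agent tasks
    • Model comparison
      • Baseline model (~20% of calls malformed, requiring retries)
      • Kimi K2.5, GLM-5, MiniMax M2.5 (near-zero retry rate)
    • Performance metrics
      • Retry rate
      • Latency stability
    • Production impact
      • Cost differences
      • Reliability differences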

Highlights

  • One baseline model produced malformed outputs in ~1 out of every 5 calls, leading to retries inside multi-step workflows.

    First paragraph
  • Across Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks, retry rates were near zero and latency stayed stable even as tasks extended across multiple steps.

    Second paragraph
  • That gap is what shows up as cost, latency, and reliability divergence in production agent systems.

    Conclusion section

Tags: Fireworks AI, Browser Agents, Model Execution, Retry Rates, Cost Optimization

Fireworks AI (@FireworksAI_HQ)

We ran 720 browser agent tasks with @nottecore across frontier models. One baseline model produced malformed outputs in ~1 out of every 5 calls, leading to retries inside multi-step workflows. Across Kimi K2.5, GLM-5, and MiniMax M2.5 served on Fireworks, retry rates were near zero and latency stayed stable even as tasks extended across multiple steps. Same workload. Same agent loop. Different execution behavior. That gap is what shows up as cost, latency, and reliability divergence in production agent systems. Read the report: https://fireworks.ai/blog/agent-execution-tax…
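
As a rough illustration of why a ~1 in 5 malformed-output rate compounds over multiple steps, the sketch below works through the arithmetic. It assumes each call fails independently with probability p = 0.2 and that a step is retried until it succeeds. Neither assumption comes from the post, and the 10-step workflow length is hypothetical. Under those assumptions the expected number of calls per step is 1 / (1 - p), and the chance that an n-step run needs no retry at all is (1 - p)^n.

```python
def expected_calls_per_step(p_malformed: float) -> float:
    """Mean calls needed for one successful step (geometric distribution)."""
    return 1.0 / (1.0 - p_malformed)


def prob_no_retry(p_malformed: float, steps: int) -> float:
    """Probability that every step of a workflow succeeds on its first call."""
    return (1.0 - p_malformed) ** steps


if __name__ == "__main__":
    p = 0.2      # ~1 in 5 calls malformed, as reported for the baseline model
    steps = 10   # hypothetical workflow length, not from the post

    per_step = expected_calls_per_step(p)
    print(f"expected calls per step: {per_step:.2f}")
    print(f"expected calls per {steps}-step run: {per_step * steps:.1f}")
    print(f"chance of a retry-free run: {prob_no_retry(p, steps):.3f}")
```

With p = 0.2 this gives 1.25 expected calls per step, about 25% more calls than a model that never needs a retry, and only about an 11% chance that a 10-step run completes with no retries.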

Last edited 6:22 PM · May 20, 2026 · 1,529 Views