Engineering voice agents: Latency, quality, and scale — Rishabh Bhargava, Together AI

AI Engineer · Video · May 31, 2026

Score: 8.5

TL;DR · AI Summary

Building high-quality, low-latency, scalable voice agents is now an engineering challenge. It requires real-time responses (<500ms), complex instruction handling, and tool calling, supported by Together AI’s infrastructure.

Key Takeaways

Voice agents must respond in under 500ms; delays beyond this cause user drop-off, m…
Complex workflows require tool calling and ambiguity resolution, not just base L…
Together AI offers AI-native cloud services for training/inference at scale, ser…

Outline

§Introduction & Speaker Background
Rishabh Bhargava leads Voice AI at Together AI with a decade of AI infrastructure experience.
§Why Voice Agents Matter
Billions of human-handled calls annually can be automated; voice is a natural human-computer interface.
·Real-Time Requirement
Human conversations respond in ~300ms; AI exceeding 500ms causes noticeable lag, >1s leads to hang-ups.
·Intelligence & Tool Calling
Real-world tasks demand handling ambiguity and invoking tools to complete complex workflows.
§Next-Gen Architecture
Future voice agents will integrate multimodality, context memory, and dynamic toolchains for production use.
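
The latency thresholds in the outline (about 300ms for human turn-taking, noticeable lag above 500ms, hang-ups above 1s) can be turned into a simple budget check. The sketch below is illustrative only: the split into speech-to-text, LLM, text-to-speech, and network stages is an assumption, not something the summary specifies, and the names `TurnLatency` and `classify` are hypothetical.

```python
from dataclasses import dataclass

# Thresholds taken from the summary, in milliseconds.
HUMAN_TURN_MS = 300      # typical human response time in conversation
NOTICEABLE_LAG_MS = 500  # above this, users start noticing the delay
HANGUP_RISK_MS = 1000    # above this, callers tend to hang up


@dataclass
class TurnLatency:
    """Per-stage latency for one agent turn (the stage split is an assumption)."""
    stt_ms: float            # speech-to-text
    llm_ms: float            # language model, time to first token
    tts_ms: float            # text-to-speech, time to first audio
    network_ms: float = 0.0  # transport overhead

    @property
    def total_ms(self) -> float:
        return self.stt_ms + self.llm_ms + self.tts_ms + self.network_ms


def classify(total_ms: float) -> str:
    """Map an end-to-end response time onto the thresholds above."""
    if total_ms <= HUMAN_TURN_MS:
        return "human-like"
    if total_ms <= NOTICEABLE_LAG_MS:
        return "acceptable"
    if total_ms <= HANGUP_RISK_MS:
        return "noticeable lag"
    return "hang-up risk"


turn = TurnLatency(stt_ms=120, llm_ms=250, tts_ms=90, network_ms=40)
print(turn.total_ms, classify(turn.total_ms))
```

Summing per-stage times like this shows how little headroom each stage has once the whole turn must fit under 500ms.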

Mindmap

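Voice agent engineering challenges
- Real-time requirements
  - Response <500ms
  - Human conversation baseline: 300ms
- Intelligence and functionality
  - Tool calling
  - Handling ambiguous instructions
- Infrastructure support
  - Together AI cloud platform
  - Serving companies such as Cursor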
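
As a rough illustration of how tool calling and ambiguity handling fit together, the sketch below dispatches a tool call only when all required arguments are present and otherwise asks a clarifying question. Everything here is hypothetical (the `TOOLS` registry, `book_appointment`, and `handle_tool_call` are invented for illustration) and does not reflect any specific API described in the talk.

```python
from typing import Callable


def book_appointment(date: str, time: str) -> str:
    """Hypothetical tool: pretend to book an appointment."""
    return f"Booked for {date} at {time}."


# Hypothetical registry: tool name -> (required argument names, handler).
TOOLS: dict[str, tuple[list[str], Callable[..., str]]] = {
    "book_appointment": (["date", "time"], book_appointment),
}


def handle_tool_call(name: str, args: dict[str, str]) -> str:
    """Run a tool, or ask a follow-up question when the request is ambiguous."""
    if name not in TOOLS:
        return "Sorry, I can't help with that yet."
    required, handler = TOOLS[name]
    missing = [arg for arg in required if not args.get(arg)]
    if missing:
        # Ambiguous request: ask for the missing details instead of guessing.
        return f"Could you tell me the {' and '.join(missing)}?"
    return handler(**{arg: args[arg] for arg in required})


print(handle_tool_call("book_appointment", {"date": "Friday"}))
print(handle_tool_call("book_appointment", {"date": "Friday", "time": "3pm"}))
```

Asking for missing arguments rather than guessing is one simple way to resolve ambiguity before a tool runs; a production agent would likely need richer validation.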

Highlights

When humans converse, we respond to each other’s cues in about 300 milliseconds; if an AI takes over 500ms, you’ll start noticing the delay.
— Paragraph 3:01
Voice agents are no longer sci-fi or research—they’re primarily an engineering problem today, especially for rich, high-quality conversations.
— Paragraph 2:43
Together AI provides reliable compute for model training and inference at scale, serving millions of developers and hundreds of companies.
— Paragraph 0:44

#Voice AI #Latency Optimization #Together AI #Agent Engineering