宝玉(@dotey)

OpenAI Released Three New Voice Models in Realtime API

TL;DR · AI Summary

OpenAI released three new voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, significantly enhancing dialogue, translation, and real-time transcription capabilities.

Key Takeaways

  • GPT-Realtime-2 improved from 81.4% to 96.6% on the Big Bench Audio intelligence test.
  • GPT-Realtime-Translate supports over 70 input languages and 13 output languages, and Deutsche Telekom is already testing it.
  • GPT-Realtime-Whisper is a streaming version of Whisper, suitable for meetings and live broadcasts.

Outline

  1. OpenAI released three new voice models for dialogue, translation, and real-time transcription.

  2. GPT-Realtime-2 has GPT-5-level reasoning capability, with significant performance improvements.

  3. GPT-Realtime-Translate supports multi-language real-time translation, especially suitable for cross-border customer service scenarios.

  4. GPT-Realtime-Whisper is a streaming version of Whisper, suitable for meeting and live broadcast real-time transcription.

  5. The three models have different pricing strategies, allowing developers to choose based on their needs.

Mindmap

  • OpenAI's new voice models
    • GPT-Realtime-2
      • Performance improvements
      • Complex task orchestration
    • GPT-Realtime-Translate
      • Multilingual support
      • Cross-border customer service
    • GPT-Realtime-Whisper
      • Streaming transcription
      • Meetings and live broadcasts

Highlights

  • GPT-Realtime-2 improved from 81.4% to 96.6% on the Big Bench Audio intelligence test, and from 34.7% to 48.5% on Audio MultiChallenge multi-turn instruction following.

  • GPT-Realtime-Translate supports over 70 input languages and 13 output languages, with Deutsche Telekom already using it in testing.

  • GPT-Realtime-Whisper is a streaming version of Whisper, providing subtitles as you speak, suitable for meetings and live broadcasts.

#OpenAI#voice model#API

[1] Main Character: GPT-Realtime-2

It is claimed to have GPT-5-level reasoning capabilities. Compared with the previous generation, GPT-Realtime-1.5, it improved from 81.4% to 96.6% on the Big Bench Audio intelligence test, and from 34.7% to 48.5% on Audio MultiChallenge multi-turn instruction following.

Several practical changes:

  • It speaks a short preamble first. Before executing a long task, it says "Let me check" or "Please wait a moment" so that users do not think it has crashed.
  • Tool calls are transparent. It can call multiple tools simultaneously and narrates the process, for example "Checking your calendar" or "Searching", so users can hear what the agent is doing.
  • The context window expands from 32K to 128K, supporting longer conversations and more complex task orchestration.
  • Developers can choose from five levels of reasoning effort, ranging from minimal to xhigh, with low as the default. Simple questions can use a low setting for low latency, while complex tasks can use a high setting.
  • When an error occurs, it says "I cannot handle this part now" instead of freezing or giving random responses.

[2] Translate and Whisper

GPT-Realtime-Translate provides real-time voice translation for over 70 input languages and 13 output languages, targeting cross-border customer service, education, and live streaming scenarios. Deutsche Telekom is already testing it; BolnaAI reports a 12.5% reduction in error rate compared to other models on Indian languages such as Hindi, Tamil, and Telugu.

GPT-Realtime-Whisper is a streaming version of Whisper that produces subtitles as you speak, targeting meetings, live streaming, and customer service transcription.

[3] Pricing

  • GPT-Realtime-2: $32 per million audio input tokens ($0.40 for cached input), $64 per million output tokens.
  • GPT-Realtime-Translate: $0.034 per minute.
  • GPT-Realtime-Whisper: $0.017 per minute.

All three are available on the Realtime API, and the Playground allows direct testing of GPT-Realtime-2.
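To make the prices above concrete, here is a minimal Python sketch of a cost estimator that uses only the rates quoted in this summary. The function names are hypothetical, and the treatment of cached tokens (billed at $0.40 per million in place of the $32 rate) is an assumption for illustration, not a description of OpenAI's actual billing logic.

```python
# Rates quoted in the summary above (USD). Token rates are per million tokens.
REALTIME2_AUDIO_INPUT_PER_M = 32.00
REALTIME2_CACHED_INPUT_PER_M = 0.40
REALTIME2_OUTPUT_PER_M = 64.00
TRANSLATE_PER_MINUTE = 0.034
WHISPER_PER_MINUTE = 0.017


def realtime2_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate GPT-Realtime-2 cost in USD.

    Assumption: cached_tokens is the portion of input_tokens billed at the
    cached rate instead of the full audio input rate.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * REALTIME2_AUDIO_INPUT_PER_M
        + cached_tokens * REALTIME2_CACHED_INPUT_PER_M
        + output_tokens * REALTIME2_OUTPUT_PER_M
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimate cost in USD for the per-minute models (Translate, Whisper)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to show how the rates combine.
    print(f"Realtime-2, 200K in / 100K out: ${realtime2_cost(200_000, 100_000):.2f}")
    print(f"Translate, 60 minutes: ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
    print(f"Whisper, 60 minutes: ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
```

At these rates, one hour of audio costs about $2.04 with GPT-Realtime-Translate and about $1.02 with GPT-Realtime-Whisper, while GPT-Realtime-2 cost depends on token counts rather than duration.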

Quote

OpenAI

@OpenAI

21h

Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API
