[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

OpenAI released realtime-23 three months ago, but it made relatively little impact since it was still powered by 4o-level intelligence (only a +5% improvement on Big Bench Audio). The confidence behind today’s Realtime-2 release was unmistakable (highlighted by a +15.2% jump in BBA), and it was very well received. As the blog post explains, three models are being released, which can be simplified to “voice-in, voice-out, and voice-to-voice.” The emphasis is less on “voice quality” and more on usability. TLDR:

Preambles: Developers can configure short introductory phrases that precede the main response, such as “Let me check that” or “One moment while I look into it.”

Parallel tool calls and tool transparency: The model supports calling multiple tools simultaneously and can vocalize its actions with phrases like “checking your calendar” or “looking that up now.” This allows agents to remain responsive while performing tasks.

Stronger recovery behavior: The model can handle failures more gracefully by responding with natural statements like “I’m having trouble with that right now,” rather than breaking or producing errors.

Longer context: Increased from 32K to 128K tokens.

Enhanced domain knowledge: The model more effectively preserves specialized terminology, proper nouns, medical concepts, and other domain-specific vocabulary.

The model offers improved control over tone and delivery, allowing it to speak calmly, empathetically, or enthusiastically depending on the context.

Developers can now choose from minimal, low, medium, high, or xhigh reasoning effort levels, with low set as the default.

The demo video demonstrated how the audio model is better tuned to avoid interrupting when the main speaker is talking to another person.
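
To make the parallel tool calls, tool transparency, and recovery behavior above more concrete, here is a minimal Python sketch of the pattern from the application side. It does not use the OpenAI API: `check_calendar`, `look_up_weather`, `run_tools_in_parallel`, and the `say` callback are hypothetical names invented for illustration, and the tools are simulated with `asyncio.sleep`. The sketch assumes each tool call is paired with a short status phrase and that a failed call is replaced by a spoken fallback instead of an error.

```python
import asyncio

# Spoken fallback, taken from the example phrase in the release notes.
FALLBACK = "I'm having trouble with that right now."

# Hypothetical tools; names, arguments, and return values are illustrative only.
async def check_calendar(day: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"calendar for {day}: 2 meetings"

async def look_up_weather(city: str) -> str:
    await asyncio.sleep(0.01)
    raise TimeoutError("weather service timed out")  # simulate a failing tool

async def run_tools_in_parallel(calls, say):
    """Announce and start every tool call, then map failures to the fallback."""
    tasks = []
    for status_phrase, coro in calls:
        say(status_phrase)                       # tool transparency
        tasks.append(asyncio.create_task(coro))  # all calls run concurrently
    # return_exceptions=True keeps one failing tool from cancelling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [FALLBACK if isinstance(r, Exception) else r for r in results]

def demo():
    spoken = []  # collects what the agent would say aloud
    calls = [
        ("checking your calendar", check_calendar("Monday")),
        ("looking that up now", look_up_weather("Berlin")),
    ]
    results = asyncio.run(run_tools_in_parallel(calls, spoken.append))
    return spoken, results

if __name__ == "__main__":
    print(demo())
```

In this sketch the two status phrases are emitted before either tool finishes, which is the sense in which an agent can stay responsive while work is in flight. How Realtime-2 itself schedules tool calls and chooses its recovery phrasing is not specified here.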

  Latent.Space