
[AINews] Thinking Machines' Native Interaction Models – TML-Interaction-Small 276B-A12B – advances SOTA Realtime Voice and kills standard VAD

By pure coincidence, on the same day we released the talk by Neil Zeghidour (CEO of Gradium, the for-profit spin-off of Kyutai, the lab behind the renowned Moshi) on what still needs to be built for real-time voice, Thinking Machines surfaced for only the third time in roughly a year, despite all the drama, to release Interaction Models: A Scalable Approach to Human-AI Collaboration, which immediately pushes the frontier of realtime voice models exactly as Neil described. It revives and vastly improves upon the famously defunct “Her” demo from GPT-4o, with far more detailed and realistic demonstrations that are presumably much closer to actual deployment. The model, TML-Interaction-Small, is a 276B-parameter MoE with 12B active parameters.

The complete blog post includes numerous demos showcasing its high degree of continuous interactivity, centered on streams of “time-aligned microturns” lasting 200ms each. It uses an encoder-free early fusion approach, with images and audio both processed at 30x […]

[…] results in over 3x token usage, cache hit rates of 80–96%, and more than 353x longer time per task. That benchmark was further bolstered by OpenHands’ updated software-engineering benchmark announcement (tweet) and by Claw-Eval’s broader, more agentic task suite spanning office, finance, terminal, and web domains, where MiMo-V2.5-Pro took the lead and DeepSeek V4 Flash showed striking efficiency relative to its size.

Skepticism around TurboQuant is growing, with several posts offering a more measured perspective on the recently hyped quantization and serving method. @_EldarKurtic shared what he called the first in-depth analysis of TurboQuant, examining its accuracy, latency, and throughput. @vllm_project pointed to the Red Hat / vLLM study as a useful reference, while @jbhuang0604 offered a blunt verdict: “it doesn’t really work well.” This is precisely the kind of inference claim where independent reproduction is crucial.

Local and open models are still advancing faster than hardware limits.
As @ClementDelangue pointed out, under the same high-end MacBook Pro memory ceiling, the smartest open-weight model you can actually run has advanced from Llama 3 70B-level performance to DeepSeek V4 Flash mixed-Q2 GGUF-level performance, roughly 4.7×, in just 24 months. At a steady exponential rate, 4.7× over 24 months works out to a doubling roughly every 11 months, which still outpaces Moore’s Law. Supporting evidence came from @victormustar’s observations on the sharp rise in GGUF uploads, along with widespread community reports that Qwen 3.6, Gemma 4, and DeepSeek variants have become viable for running nontrivial agent tasks locally.

Research Highlights: MoE Modularity, Diffusion/Byte Models, and Agent Dynamics
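As a quick sanity check on the local-model growth figures quoted above (4.7× in 24 months), here is a minimal sketch of the doubling-time arithmetic. It assumes steady exponential growth, which the original post does not claim, and the function names `doubling_time` and `implied_growth` are hypothetical helpers introduced only for illustration.

```python
import math


def doubling_time(growth_factor: float, period_months: float) -> float:
    """Months per doubling, assuming steady exponential growth."""
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")
    # growth_factor = 2 ** (period / T)  =>  T = period * ln(2) / ln(growth_factor)
    return period_months * math.log(2) / math.log(growth_factor)


def implied_growth(doubling_months: float, period_months: float) -> float:
    """Total growth factor over a period for a given doubling time."""
    return 2 ** (period_months / doubling_months)


if __name__ == "__main__":
    # Figures quoted in the text: 4.7x improvement over 24 months.
    print(f"{doubling_time(4.7, 24):.1f} months per doubling")
    # For comparison: the growth a 3-month doubling time would imply over 24 months.
    print(f"{implied_growth(3, 24):.0f}x over 24 months")
```

Under this assumption, a 4.7× gain over 24 months corresponds to a doubling time of just under 11 months, whereas a 3-month doubling time sustained for 24 months would imply a 256× gain.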

  Latent.Space