Interaction models
Mira Murati's lab, Thinking Machines, has previewed a model that doesn't wait its turn: it watches you on video, listens, and talks back at the same time, all from one model instead of a stitched-together voice pipeline.
Today's voice assistants are assembled: one component decides you've stopped talking, another transcribes, the language model answers, a fourth speaks. Thinking Machines collapsed that assembly into a single model that reads the room in 200-millisecond slices — perceiving and responding in the same breath, so it can interject, finish your sentence, or comment on what it just saw you do, without waiting for a turn to end.
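To make the contrast concrete, here is a toy sketch in Python. It is not Thinking Machines' code or API. Every name in it (Slice, turn_based, slice_by_slice), the three-slice silence threshold, and the canned replies are illustrative assumptions. Only the 200-millisecond slice length comes from the description above.

```python
from dataclasses import dataclass

SLICE_MS = 200  # slice length quoted in the article


@dataclass
class Slice:
    speech: bool     # is the user talking during this slice?
    event: str = ""  # hypothetical cue the model might notice, e.g. "botched rep"


def turn_based(slices, silence_needed=3):
    """Assembled pipeline: reply only once an end-of-turn detector fires.

    The detector here is a stand-in: it fires after `silence_needed`
    consecutive silent slices (an assumed threshold, not a real product's).
    """
    heard, silent, replies = False, 0, []
    for i, s in enumerate(slices):
        if s.speech:
            heard, silent = True, 0
        else:
            silent += 1
        if heard and silent >= silence_needed:
            replies.append(((i + 1) * SLICE_MS, "reply to whole turn"))
            heard, silent = False, 0
    return replies


def slice_by_slice(slices):
    """Single loop: every slice is both input and a chance to respond."""
    replies = []
    for i, s in enumerate(slices):
        if s.event:  # react as soon as the cue is perceived
            replies.append(((i + 1) * SLICE_MS, f"react to {s.event}"))
    return replies


# 1.2 seconds of interaction: the user talks for three slices, then goes quiet.
stream = [
    Slice(True), Slice(True, "botched rep"), Slice(True),
    Slice(False), Slice(False), Slice(False),
]

print(turn_based(stream))      # first reply lands at 1200 ms
print(slice_by_slice(stream))  # first reaction lands at 400 ms
```

In this toy run the assembled pipeline says nothing until the silence detector fires at 1200 ms, while the slice-by-slice loop reacts at 400 ms, mid-sentence. A real system would replace the canned strings with model output; the sketch only shows where in the timeline a response becomes possible.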
Open systems have done full-duplex audio — interrupting, backchanneling — since Kyutai's Moshi in 2024. The wager that's actually new is the video.
Talking over you is the demo that gets quoted, but it isn't the new part. The real move is twofold: doing it at frontier scale, and folding live video into the same stream, so the model reacts to a raised eyebrow or a botched rep, not just to sound. The lab frames this as a thesis: interactivity should scale alongside intelligence, treated as its own axis rather than a wrapper bolted on after the model is smart.
Worth noticing what's load-bearing: an independent reviewer flagged that the headline numbers lean on a second, slower reasoning model running in the background, which is an easy way to post strong scores. The bet here is less about any one benchmark than about a well-funded lab deciding the frontier isn't only a smarter model, but a model you can actually talk with.
It's a research preview behind closed doors, not something you can use yet.