Summary#
"Full-duplex" = the model perceives and responds at the same time, in a constant two-way exchange — as opposed to half-duplex turn-taking (one party at a time). Interaction Models generalize the audio full-duplex idea across audio, video, and text. The phrase the post uses: an experience that "feels more like collaborating and less like prompting."
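As a toy illustration (names and data invented, not from the post), the difference can be sketched as two ways of merging a user stream and a model stream onto one timeline:

```python
from itertools import zip_longest

def half_duplex(user_turn, model_turn):
    # Turn-taking: the model's first word cannot appear until the
    # user's whole turn has ended.
    return [("user", u) for u in user_turn] + [("model", m) for m in model_turn]

def full_duplex(user_stream, model_stream):
    # Both streams advance in lockstep time slices; either party may be
    # active (or silent, marked None) in any slice, so the model can
    # respond while the user is still mid-turn.
    timeline = []
    for u, m in zip_longest(user_stream, model_stream):
        if u is not None:
            timeline.append(("user", u))
        if m is not None:
            timeline.append(("model", m))
    return timeline
```

The "simultaneous speech" mode below (live Spanish→English translation) is exactly the case where the model slots must be filled while user slots are still arriving.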
The interaction modes it enables#
All of these are special-purpose harnesses today; in an interaction model they're special cases of model behavior (see Time-Aligned Micro-Turns):
- Proactive interjection — "interrupt when I say something wrong"; the model jumps in mid-turn when context warrants, not only at end-of-turn.
- Visual-cue reactions — "tell me when I've written a bug in my code"; "count how many pushups I do"; requires acting on a visual change with no audio cue (audio-only turn-detection harnesses fail this — they say "Sure thing!" then go silent).
- Simultaneous speech — user and model speak concurrently: "translate Spanish→English live."
- Speak-while-watching — "live-commentate this sports game."
- Time-aware speech — "remind me to breathe in and out every 4 seconds until I stop"; "how long did it take me to write this function?"
- Codeswitch correction — "every time I use another language, give me the correct word in the original language" (requires speaking at the same time as the user).
The model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response — no separate dialog-management component.
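A minimal sketch of the underlying mechanism (per Time-Aligned Micro-Turns: continuous streams in ~200ms interleaved chunks, no turn boundaries). `policy` is an invented stand-in for the model's per-chunk forward pass; in a real system it would decide whether to speak from the full multimodal context:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional

CHUNK_MS = 200  # chunk length, per the micro-turns description

@dataclass
class Chunk:
    t_ms: int                  # chunk start time
    frame: str                 # incoming audio/video frame (placeholder)
    speech_out: Optional[str]  # speech emitted in the same chunk, if any

def micro_turn_loop(frames: Iterable[str],
                    policy: Callable[[List[str]], Optional[str]]) -> Iterator[Chunk]:
    # No turn boundaries: every chunk both ingests a frame and asks the
    # model (here, `policy`) whether to speak. Proactive interjection and
    # visual-cue reactions fall out as policies that fire mid-stream.
    context: List[str] = []
    for i, frame in enumerate(frames):
        context.append(frame)
        yield Chunk(t_ms=i * CHUNK_MS, frame=frame, speech_out=policy(context))
```

Under this framing, "tell me when I've written a bug" is just a policy that fires on a visual change with no audio cue, which is precisely what audio-only turn-detection harnesses cannot do.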
Concurrent non-speech action#
While listening and speaking, the model can simultaneously call tools, search, browse, or generate UI, weaving results back into the conversation when appropriate. Deeper or longer-running work is delegated to the background model.
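A hypothetical sketch of that split (event names and the `deep:` convention are invented): the interaction loop answers every event immediately, spawns slow work as a background task, and weaves the result back in once it is ready.

```python
import asyncio

async def interaction_loop(events, quick_reply, deep_task):
    # Interaction model stays present: every event gets an immediate
    # reply; anything slow runs concurrently as a background task.
    background, replies = None, []
    for event in events:
        if event.startswith("deep:") and background is None:
            background = asyncio.create_task(deep_task(event))
            replies.append("on it")          # acknowledge, don't block
        else:
            replies.append(quick_reply(event))
        await asyncio.sleep(0)               # yield so background work runs
        if background is not None and background.done():
            replies.append(background.result())  # weave the result back in
            background = None
    if background is not None:               # pick up a late result
        replies.append(await background)
    return replies
```

The key property is that the "on it" acknowledgement precedes the deep result in the reply stream: the conversation never stalls on the background work.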
Prior art it builds on#
Audio full-duplex models are the existing example of bidirectional/continuous interaction; robotics and autonomous vehicles are cited as domains where real-time perception+action is a given. Interaction models apply the principle across all modalities.
Connections#
- Interaction Models — parent concept
- Time-Aligned Micro-Turns — the mechanism (no turn boundaries) that makes full-duplex possible
- Encoder-Free Early Fusion — joint multimodal reasoning is what lets a visual change trigger speech
- Turn-Based Interface Bottleneck — the half-duplex status quo this replaces
- Interactivity Benchmarks — TimeSpeak / CueSpeak / RepCount-A / ProactiveVideoQA / Charades measure exactly these modes
- Interaction / Background Model Split — where the concurrent deep work goes
Sources#
6 articles link here
- Concept: Encoder-Free Early Fusion
Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
- Concept: Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Concept: Interactivity Benchmarks
FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
- Concept: Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
- Entity: TML-Interaction-Small
TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
- Concept: Turn-Based Interface Bottleneck
Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
Related articles#
- Concept: Interaction Models
Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Concept: Time-Aligned Micro-Turns
The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
- Concept: Interaction / Background Model Split
Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
- Concept: The Bitter Lesson
Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- Entity: Thinking Machines Lab
AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…
