Sources#
Summary#
An interaction model is a model that handles interaction natively — continuously taking in audio, video, and text and thinking, responding, and acting in real time — rather than emulating real-time behavior through external scaffolding (VAD, turn-detection, dialog-management harnesses). Announced by Thinking Machines Lab as a research preview in May 2026, with a first model named TML-Interaction-Small.
The central bet: interactivity should scale alongside intelligence. If interaction is part of the model, scaling the model makes it both smarter and a better collaborator. If interaction lives in a hand-crafted harness, The Bitter Lesson says that harness gets outpaced by general capability growth.
The thesis in one line#
"For interactivity to scale with intelligence, it must be part of the model itself."
This is the harness-shrinkage argument (see Harness Shrinkage as Models Improve) applied to the interaction layer: VAD, turn-boundary prediction, dialog state machines — all "meaningfully less intelligent than the model itself" — should dissolve into model behavior. Once they do, capabilities that those harnesses couldn't support (proactive interjection, speak-while-listening, reaction to visual cues) become special cases of what the model does, and improve with scale.
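To make the "less intelligent than the model" point concrete, here is a minimal sketch of the kind of silence-based end-of-turn detector a harness might use. It is not any specific product's VAD: the function name, the energy threshold, and the frame count are hypothetical values chosen for illustration.

```python
from typing import List


def end_of_turn_index(frame_energies: List[float],
                      energy_threshold: float = 0.1,
                      silence_frames: int = 3) -> int:
    """Return the frame index at which a silence-based harness would
    declare the turn over, or -1 if it never does.

    Both thresholds are hypothetical; real harnesses tune them by hand.
    """
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        if energy < energy_threshold:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return i
        else:
            quiet_run = 0
    return -1


# A speaker who pauses to think mid-sentence, then keeps talking.
energies = [0.8, 0.7, 0.02, 0.01, 0.03, 0.9, 0.8, 0.02, 0.01, 0.02]
cut = end_of_turn_index(energies)
print(cut)  # the harness ends the turn during the thinking pause
```

The rule cannot tell a thinking pause from a yielded turn, so it cuts in before the speaker resumes at frame 5. That distinction is exactly what the thesis wants the model to learn rather than have hard-coded.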
What it unlocks (capabilities, not harness features)#
- Seamless dialog management — model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response. No separate dialog-management component.
- Verbal and visual interjections — model jumps in when context warrants ("interrupt when I say something wrong", "tell me when I've written a bug"), not only at end-of-turn.
- Simultaneous speech — user and model speak concurrently (live translation).
- Time-awareness — direct sense of elapsed time ("how long did it take me to run a mile?").
- Simultaneous tool calls / search / generative UI — while listening and speaking, the model concurrently searches, browses, generates UI, weaves results back in.
See Full-Duplex Interaction and Interactivity Benchmarks for how these are demonstrated and measured.
Why turn-based interfaces are the bottleneck#
Today's models "experience reality in a single thread": they wait, blind, until the user finishes typing/speaking; then they generate, blind, until done or interrupted. This is a narrow channel for collaboration — it limits how much of a person's knowledge, intent, and judgement reaches the model, and how much of the model's work is legible. Analogy from the post: resolving a crucial disagreement over email instead of in person. Full treatment in Turn-Based Interface Bottleneck.
Architecture (three load-bearing ideas)#
- Time-Aligned Micro-Turns — input and output are continuous streams, processed/generated in 200ms chunks; no artificial turn boundaries. Silence, overlap, and interruption stay in context.
- Interaction / Background Model Split — a time-aware interaction model maintains real-time presence; an asynchronous background model handles sustained reasoning, tool use, longer-horizon work. The interaction model delegates with a rich context package (the full conversation, not a standalone query) and interleaves results back at a moment appropriate to what the user is doing. Net effect: "planning, tool-use, and agentic workflows of reasoning models at the response latency of non-thinking ones."
- Encoder-Free Early Fusion — minimal pre-processing instead of large standalone encoders/decoders: audio in as dMel + light embedding; images as 40×40 patches via hMLP; audio out via a flow head. All components co-trained from scratch with the transformer.
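The first two ideas in the list above can be sketched as a toy, tick-driven simulation. This is an illustration under stated assumptions, not TML's implementation: `BackgroundJob`, `Session`, `run`, the trigger phrase, and the "wait for a silent input chunk" delivery rule are all hypothetical. Only the 200ms chunk size and the "delegate the full conversation, interleave the result later" behavior come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CHUNK_MS = 200  # chunk length stated in the text


@dataclass
class BackgroundJob:
    """Hypothetical stand-in for the asynchronous background model."""
    context: List[str]    # the full conversation so far, not a standalone query
    ready_at_chunk: int   # chunk index at which the result becomes available
    result: str


@dataclass
class Session:
    # One shared timeline: (timestamp in ms, "in" or "out", payload).
    timeline: List[Tuple[int, str, str]] = field(default_factory=list)
    job: Optional[BackgroundJob] = None


def run(inputs: List[Optional[str]], delegate_on: str,
        job_latency_chunks: int) -> Session:
    session = Session()
    for i, heard in enumerate(inputs):
        t_ms = i * CHUNK_MS
        # Every input chunk is recorded, silence included, so pauses and
        # overlap stay in context instead of being cut at turn boundaries.
        session.timeline.append((t_ms, "in", heard if heard is not None else "<silence>"))
        if heard == delegate_on and session.job is None:
            context = [p for _, stream, p in session.timeline if stream == "in"]
            session.job = BackgroundJob(context, i + job_latency_chunks,
                                        f"result for {heard!r}")
            out = "on it"
        elif (session.job is not None and i >= session.job.ready_at_chunk
              and heard is None):
            # Assumed delivery rule: weave the result in only while the user is quiet.
            out = session.job.result
            session.job = None
        else:
            out = "<silence>"
        # An output chunk is emitted on the same 200ms grid as the input.
        session.timeline.append((t_ms, "out", out))
    return session


session = run(["hi", "search flights", "also", None, "wait", None],
              delegate_on="search flights", job_latency_chunks=1)
outputs = [payload for _, stream, payload in session.timeline if stream == "out"]
print(outputs)
```

In this run the background result is ready at chunk 2, but the user is still speaking, so the sketch holds it until the silent chunk at 600ms. A real interaction model would make that timing decision itself rather than follow a fixed rule.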
Plus engineering: streaming sessions for low-overhead frequent small prefills/decodes (upstreamed to SGLang); latency-tuned MoE kernels (gather+gemv instead of grouped gemm); bitwise trainer-sampler alignment via batch-invariant kernels (<5% overhead) for stability and debuggability.
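One plausible reading of "gather+gemv instead of grouped gemm" is sketched below in plain Python. It shows only the dataflow of the two strategies and that they produce the same result. It says nothing about real kernel performance, and it assumes top-1 routing and tiny integer weights purely for illustration. All function names are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def gemv(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output value per row of w."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def moe_gather_gemv(experts: List[Matrix], tokens: List[Vector],
                    routes: List[int]) -> List[Vector]:
    """Per token: gather its expert's weights and run one matrix-vector product.
    No sorting or grouping step, which suits very small decode batches."""
    return [gemv(experts[e], x) for x, e in zip(tokens, routes)]


def moe_grouped_gemm(experts: List[Matrix], tokens: List[Vector],
                     routes: List[int]) -> List[Vector]:
    """Bucket tokens by expert, run one matrix-matrix product per expert,
    then scatter the rows back to their original token positions."""
    groups: Dict[int, List[int]] = {}
    for idx, e in enumerate(routes):
        groups.setdefault(e, []).append(idx)
    out: List[Vector] = [[] for _ in tokens]
    for e, idxs in groups.items():
        batch = [tokens[i] for i in idxs]
        # X (n_e x d_in) times W^T (d_in x d_out), written out explicitly.
        result = [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in experts[e]]
                  for x_row in batch]
        for i, y in zip(idxs, result):
            out[i] = y
    return out


experts = [
    [[1, 0], [0, 1]],  # expert 0: identity
    [[2, 0], [0, 2]],  # expert 1: doubles
    [[0, 1], [1, 0]],  # expert 2: swaps the two coordinates
]
tokens = [[1, 2], [3, 4], [5, 6]]
routes = [2, 0, 2]  # assumed top-1 expert choice per token

per_token = moe_gather_gemv(experts, tokens, routes)
grouped = moe_grouped_gemm(experts, tokens, routes)
print(per_token, grouped)
```

The grouped version pays for bucketing and scattering, which amortizes well over large batches. With only a handful of tokens per 200ms step, the per-token path skips that bookkeeping, which is presumably why the post describes it as latency-tuned.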
Safety angle#
Real-time interaction stresses safety differently than turn-based exchange. TML's work focused on two axes:
- Modality-appropriate refusals — TTS-generated refusal / over-refusal training data so spoken refusals are colloquial but no less firm.
- Long-horizon robustness — automated red-teaming harness generating multi-turn refusal data, maintaining behavioral parity with the text model's refusals.
Limitations (per the post)#
- Long sessions — continuous A/V accumulates context fast; streaming-session design handles short/medium well, very long sessions need careful context management (parallels Context Window Smart Zone).
- Compute & connectivity — low-latency A/V streaming needs reliable connection; degrades badly without one.
- Scale — `TML-Interaction-Small` is 276B MoE / 12B active; larger pretrained models too slow to serve in this regime today; larger models promised "later this year".
- Background agents — agentic intelligence acknowledged as essential but under-explored relative to the real-time work.
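Some back-of-the-envelope arithmetic helps size the long-session and scale limitations above. The 200ms chunk length and the 276B / 12B parameter counts come from the text. The tokens-per-chunk figure and the context budget are hypothetical placeholders, since the post (as summarized here) gives neither.

```python
CHUNK_MS = 200  # chunk length stated in the text


def chunks_in_session(minutes: float) -> int:
    """Number of 200ms micro-turn chunks in a session of the given length."""
    return int(minutes * 60 * 1000 // CHUNK_MS)


def minutes_until_budget(budget_tokens: int, tokens_per_chunk: int) -> float:
    """Minutes of continuous streaming before a context budget is used up.

    Both arguments are hypothetical inputs, not published figures.
    """
    chunks = budget_tokens // tokens_per_chunk
    return chunks * CHUNK_MS / 60_000


hour_chunks = chunks_in_session(60)
# Hypothetical: 20 tokens per chunk against a 120,000-token budget.
budget_minutes = minutes_until_budget(120_000, 20)
# Share of parameters active per token for a 276B MoE with 12B active.
active_fraction = 12 / 276

print(hour_chunks, budget_minutes, round(active_fraction, 4))
```

Under those placeholder numbers a one-hour session is 18,000 chunks and the budget lasts 20 minutes, which is the sense in which continuous A/V "accumulates context fast". The active fraction works out to roughly 4.3% of total parameters.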
Connections#
- The Bitter Lesson — the principle the whole approach rests on: scaled general methods beat hand-engineered structure
- Harness Shrinkage as Models Improve — same move, applied to the interaction harness (VAD/turn-detection dissolve into the model)
- Turn-Based Interface Bottleneck — the problem being solved
- Time-Aligned Micro-Turns, Interaction / Background Model Split, Encoder-Free Early Fusion — the three architectural pillars
- Full-Duplex Interaction — the interaction modes this enables
- Interactivity Benchmarks — how intelligence + interactivity are measured jointly
- Thinking Machines Lab, TML-Interaction-Small — who built it, what the model is
- Context Window Smart Zone — the long-session limitation echoes the smart-zone problem
- AI Employee Framing / Human-AI Accountability Redesign — both argue against optimizing purely for autonomy; interaction models are the interface-side answer to keeping humans in the loop
- Design Concept Grilling — collaborative, real-time iteration over spec-and-walk-away; an interaction model is the substrate that would make grilling-style collaboration feel native
- Agent Harness Engineering — the harness-vs-model division-of-labor question, here resolved firmly toward the model for the interaction layer
- Claude Opus 4.7 — `xhigh` effort tier appears as a baseline config (GPT-realtime-2.0 minimal/xhigh)
Open questions#
- Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast and deep enough?
- "Interactivity scales with intelligence" is asserted; the larger-model release later in 2026 is the test.
- Research grant announced for interactivity benchmarks — what becomes the FD-bench equivalent for video proactivity?
