Howardism · vol. 03 · quiet corner of the web
PLATE II · PIECE № 04 · HOWARDISM

Interaction Models

Published May 13, 2026 · Filed Concept · Reading 6 min · Source AI-synthesised

Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via harness; interactivity scales with intelligence only if it's in the model


Summary

An interaction model is a model that handles interaction natively — continuously taking in audio, video, and text while thinking, responding, and acting in real time — rather than emulating real-time behavior through external scaffolding (voice-activity detection (VAD), turn detection, dialog-management harnesses). Thinking Machines Lab announced the approach as a research preview in May 2026, with a first model named TML-Interaction-Small.

The central bet: interactivity should scale alongside intelligence. If interaction is part of the model, scaling the model makes it both smarter and a better collaborator. If interaction lives in a hand-crafted harness, The Bitter Lesson says that harness gets outpaced by general capability growth.

The thesis in one line

"For interactivity to scale with intelligence, it must be part of the model itself."

This is the harness-shrinkage argument (see Harness Shrinkage as Models Improve) applied to the interaction layer: VAD, turn-boundary prediction, dialog state machines — all "meaningfully less intelligent than the model itself" — should dissolve into model behavior. Once they do, capabilities that those harnesses couldn't support (proactive interjection, speak-while-listening, reaction to visual cues) become special cases of what the model does, and improve with scale.

What it unlocks (capabilities, not harness features)

  • Seamless dialog management — model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response. No separate dialog-management component.
  • Verbal and visual interjections — model jumps in when context warrants ("interrupt when I say something wrong", "tell me when I've written a bug"), not only at end-of-turn.
  • Simultaneous speech — user and model speak concurrently (e.g. live translation).
  • Time-awareness — direct sense of elapsed time ("how long did it take me to run a mile?").
  • Simultaneous tool calls / search / generative UI — while listening and speaking, the model concurrently searches, browses, generates UI, weaves results back in.

See Full-Duplex Interaction and Interactivity Benchmarks for how these are demonstrated and measured.

Why turn-based interfaces are the bottleneck

Today's models "experience reality in a single thread": they wait, blind, until the user finishes typing/speaking; then they generate, blind, until done or interrupted. This is a narrow channel for collaboration — it limits how much of a person's knowledge, intent, and judgement reaches the model, and how much of the model's work is legible. Analogy from the post: resolving a crucial disagreement over email instead of in person. Full treatment in Turn-Based Interface Bottleneck.

Architecture (three load-bearing ideas)

  1. Time-Aligned Micro-Turns — input and output are continuous streams, processed/generated in 200ms chunks; no artificial turn boundaries. Silence, overlap, and interruption stay in context.
  2. Interaction / Background Model Split — a time-aware interaction model maintains real-time presence; an asynchronous background model handles sustained reasoning, tool use, longer-horizon work. The interaction model delegates with a rich context package (the full conversation, not a standalone query) and interleaves results back at a moment appropriate to what the user is doing. Net effect: "planning, tool-use, and agentic workflows of reasoning models at the response latency of non-thinking ones."
  3. Encoder-Free Early Fusion — minimal pre-processing instead of large standalone encoders/decoders: audio in as dMel + light embedding; images as 40×40 patches via hMLP; audio out via a flow head. All components co-trained from scratch with the transformer.
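The micro-turn and delegation ideas above can be sketched as a toy event loop. Everything here — `StubModel`, `background_solve`, the chunk-per-iteration abstraction — is a hypothetical illustration of the pattern, not TML's API. The point is structural: the real-time path never blocks, long-horizon work is delegated with the full conversation as context, and results are woven back in when ready.

```python
import asyncio

CHUNK_MS = 200  # micro-turn size from the post

class StubModel:
    """Toy stand-in for the interaction model (an assumption, not TML's API)."""
    def __init__(self):
        self.history = []
    def needs_background(self, chunk):
        return chunk == "hard question"      # toy delegation trigger
    def context_package(self):
        return list(self.history)            # full conversation, not a bare query
    def step(self, chunk):
        self.history.append(chunk)
        return f"ack:{chunk}"                # immediate low-latency response
    def integrate(self, result):
        return f"interleaved:{result}"       # weave background answer back in

async def background_solve(context):
    await asyncio.sleep(0)                   # stands in for slow reasoning / tool use
    return f"answer-for-{len(context)}-chunks"

async def interaction_loop(chunks, model):
    outputs, pending = [], None
    for chunk in chunks:                     # each item = one ~200ms micro-turn
        if pending is None and model.needs_background(chunk):
            pending = asyncio.create_task(
                background_solve(model.context_package()))
        outputs.append(model.step(chunk))    # never block the real-time path
        if pending is not None and pending.done():
            outputs.append(model.integrate(pending.result()))
            pending = None
    if pending is not None:                  # drain any still-running delegation
        outputs.append(model.integrate(await pending))
    return outputs

outs = asyncio.run(
    interaction_loop(["hi", "hard question", "still here"], StubModel()))
```

In a real system the integration point would be chosen by the model to suit what the user is doing; here it simply lands as soon as the result is available.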
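The early-fusion idea can be illustrated with toy projections: raw 40×40 image patches and mel-style audio frames each pass through a light learned mapping into one shared token sequence, with no standalone encoder. The single matrices below stand in for hMLP and the dMel embedding — a deliberate simplification, and the shapes are illustrative.

```python
import numpy as np

D_MODEL = 64
rng = np.random.default_rng(0)

def patchify(image, p=40):
    """Split an HxWxC image into flattened p x p patches."""
    H, W, C = image.shape
    patches = image.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)        # (num_patches, p*p*C)

# "Light embedding" stand-ins: hMLP is really a small hierarchical MLP and
# dMel a discretized mel representation; both reduced to one matrix here.
W_img = rng.normal(0, 0.02, (40 * 40 * 3, D_MODEL))
W_aud = rng.normal(0, 0.02, (80, D_MODEL))       # 80 mel channels per frame

image = rng.normal(size=(120, 160, 3))           # a 3x4 grid of 40x40 patches
mel_frames = rng.normal(size=(25, 80))           # ~25 frames of audio features

img_tokens = patchify(image) @ W_img             # (12, D_MODEL)
aud_tokens = mel_frames @ W_aud                  # (25, D_MODEL)

# Early fusion: both modalities enter the same transformer sequence
# (time-aligned in the real system; plain concatenation here).
tokens = np.concatenate([img_tokens, aud_tokens])
```

Because the projections are co-trained with the transformer from scratch, there is no frozen encoder boundary for gradients to stop at.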

Supporting engineering: streaming sessions for frequent small prefills/decodes at low overhead (upstreamed to SGLang); latency-tuned MoE kernels (gather + gemv instead of grouped GEMM); and bitwise trainer–sampler alignment via batch-invariant kernels (<5% overhead) for stability and debuggability.
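The gather + gemv point can be made concrete with a toy single-token decode: since only the top-k routed experts matter for one token, you can gather just those experts' weights and run small matrix–vector products instead of a grouped GEMM over expert batches. The router and shapes below are illustrative, not TML's kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out, top_k = 8, 16, 32, 2
W = rng.normal(size=(n_experts, d_in, d_out))    # all expert weight matrices
x = rng.normal(size=d_in)                        # one decoding token

router_logits = rng.normal(size=n_experts)
top = np.argsort(router_logits)[-top_k:]         # indices of selected experts
gates = np.exp(router_logits[top])
gates /= gates.sum()                             # softmax over the selected two

# gather: pull only the top-k experts' weights; gemv: one x @ W[e] per expert
y_gemv = sum(g * (x @ W[e]) for g, e in zip(gates, top))

# Reference: mixture over all experts, with zero gate weight elsewhere
dense_gates = np.zeros(n_experts)
dense_gates[top] = gates
y_ref = sum(dense_gates[e] * (x @ W[e]) for e in range(n_experts))
```

At batch size 1 per expert, each of these products really is a gemv, which is why latency-tuned kernels can beat a grouped-GEMM formulation sized for training-style batches.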

Safety angle

Real-time interaction stresses safety differently from turn-based exchange. TML's work focused on two axes:

  • Modality-appropriate refusals — TTS-generated refusal / over-refusal training data so spoken refusals are colloquial but no less firm.
  • Long-horizon robustness — automated red-teaming harness generating multi-turn refusal data, maintaining behavioral parity with the text model's refusals.

Limitations (per the post)

  • Long sessions — continuous A/V accumulates context fast; the streaming-session design handles short and medium sessions well, but very long sessions need careful context management (parallels Context Window Smart Zone).
  • Compute & connectivity — low-latency A/V streaming needs a reliable connection and degrades badly without one.
  • Scale — TML-Interaction-Small is 276B MoE / 12B active; larger pretrained models are too slow to serve in this regime today, with larger models promised "later this year".
  • Background agents — agentic intelligence acknowledged as essential but under-explored relative to the real-time work.
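One hedged sketch of what "careful context management" for long sessions could look like: a summarize-and-trim policy that keeps recent A/V chunks verbatim and collapses older ones. The `summarize` stub, the budget, and the policy itself are assumptions for illustration, not TML's design.

```python
def summarize(chunks):
    """Stub: in a real system this would be a model-generated summary."""
    return f"<summary of {len(chunks)} chunks>"

def trim_context(chunks, keep_recent=100):
    """Collapse everything but the newest `keep_recent` chunks."""
    if len(chunks) <= keep_recent:
        return list(chunks)
    old, recent = chunks[:-keep_recent], chunks[-keep_recent:]
    return [summarize(old)] + recent

# A 250-chunk session trimmed to one summary token plus 100 recent chunks
ctx = trim_context([f"chunk{i}" for i in range(250)], keep_recent=100)
```

The hard part the post gestures at is what such a policy must preserve — timing, overlap, and interruption structure, not just token content.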

Open questions

  • Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast and deep enough?
  • "Interactivity scales with intelligence" is asserted; the larger-model release later in 2026 is the test.
  • Research grant announced for interactivity benchmarks — what becomes the FD-bench equivalent for video proactivity?

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

