Howardism · vol. 03 · quiet corner of the web
PLATE II · PIECE № 04 · HOWARDISM

Interaction Models

Published May 13, 2026 · Filed Concept · Reading 6 min · Source AI-synthesised

Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via harness; interactivity scales with intelligence only if it's in the model


Summary

An interaction model is a model that handles interaction natively — continuously taking in audio, video, and text while thinking, responding, and acting in real time — rather than emulating real-time behavior through external scaffolding (voice-activity detection (VAD), turn detection, dialog-management harnesses). Thinking Machines Lab announced the approach as a research preview in May 2026, with a first model named TML-Interaction-Small.

The central bet: interactivity should scale alongside intelligence. If interaction is part of the model, scaling the model makes it both smarter and a better collaborator. If interaction lives in a hand-crafted harness, The Bitter Lesson says that harness gets outpaced by general capability growth.

The thesis in one line

"For interactivity to scale with intelligence, it must be part of the model itself."

This is the harness-shrinkage argument (see Harness Shrinkage as Models Improve) applied to the interaction layer: VAD, turn-boundary prediction, dialog state machines — all "meaningfully less intelligent than the model itself" — should dissolve into model behavior. Once they do, capabilities that those harnesses couldn't support (proactive interjection, speak-while-listening, reaction to visual cues) become special cases of what the model does, and improve with scale.

What it unlocks (capabilities, not harness features)

  • Seamless dialog management — model implicitly tracks whether the speaker is thinking, yielding, self-correcting, or inviting a response. No separate dialog-management component.
  • Verbal and visual interjections — model jumps in when context warrants ("interrupt when I say something wrong", "tell me when I've written a bug"), not only at end-of-turn.
  • Simultaneous speech — user and model speak concurrently (e.g. live translation).
  • Time-awareness — direct sense of elapsed time ("how long did it take me to run a mile?").
  • Simultaneous tool calls / search / generative UI — while listening and speaking, the model concurrently searches, browses, generates UI, weaves results back in.

See Full-Duplex Interaction and Interactivity Benchmarks for how these are demonstrated and measured.

Why turn-based interfaces are the bottleneck

Today's models "experience reality in a single thread": they wait, blind, until the user finishes typing/speaking; then they generate, blind, until done or interrupted. This is a narrow channel for collaboration — it limits how much of a person's knowledge, intent, and judgement reaches the model, and how much of the model's work is legible. Analogy from the post: resolving a crucial disagreement over email instead of in person. Full treatment in Turn-Based Interface Bottleneck.

Architecture (three load-bearing ideas)

  1. Time-Aligned Micro-Turns — input and output are continuous streams, processed/generated in 200ms chunks; no artificial turn boundaries. Silence, overlap, and interruption stay in context.
  2. Interaction / Background Model Split — a time-aware interaction model maintains real-time presence; an asynchronous background model handles sustained reasoning, tool use, longer-horizon work. The interaction model delegates with a rich context package (the full conversation, not a standalone query) and interleaves results back at a moment appropriate to what the user is doing. Net effect: "planning, tool-use, and agentic workflows of reasoning models at the response latency of non-thinking ones."
  3. Encoder-Free Early Fusion — minimal pre-processing instead of large standalone encoders/decoders: audio in as dMel + light embedding; images as 40×40 patches via hMLP; audio out via a flow head. All components co-trained from scratch with the transformer.
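The micro-turn and delegation ideas above can be sketched as a toy event loop. Everything here — `StubModel`, `background_solve`, the chunk-per-iteration abstraction — is a hypothetical illustration of the pattern, not TML's API. The point is structural: the real-time path never blocks, long-horizon work is delegated with the full conversation as context, and results are woven back in when ready.

```python
import asyncio

CHUNK_MS = 200  # micro-turn size from the post

class StubModel:
    """Toy stand-in for the interaction model (an assumption, not TML's API)."""
    def __init__(self):
        self.history = []
    def needs_background(self, chunk):
        return chunk == "hard question"      # toy delegation trigger
    def context_package(self):
        return list(self.history)            # full conversation, not a bare query
    def step(self, chunk):
        self.history.append(chunk)
        return f"ack:{chunk}"                # immediate low-latency response
    def integrate(self, result):
        return f"interleaved:{result}"       # weave background answer back in

async def background_solve(context):
    await asyncio.sleep(0)                   # stands in for slow reasoning / tool use
    return f"answer-for-{len(context)}-chunks"

async def interaction_loop(chunks, model):
    outputs, pending = [], None
    for chunk in chunks:                     # each item = one ~200ms micro-turn
        if pending is None and model.needs_background(chunk):
            pending = asyncio.create_task(
                background_solve(model.context_package()))
        outputs.append(model.step(chunk))    # never block the real-time path
        if pending is not None and pending.done():
            outputs.append(model.integrate(pending.result()))
            pending = None
    if pending is not None:                  # drain any still-running delegation
        outputs.append(model.integrate(await pending))
    return outputs

outs = asyncio.run(
    interaction_loop(["hi", "hard question", "still here"], StubModel()))
```

In a real system the integration point would be chosen by the model to suit what the user is doing; here it simply lands as soon as the result is available.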
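The early-fusion idea can be illustrated with toy projections: raw 40×40 image patches and mel-style audio frames each pass through a light learned mapping into one shared token sequence, with no standalone encoder. The single matrices below stand in for hMLP and the dMel embedding — a deliberate simplification, and the shapes are illustrative.

```python
import numpy as np

D_MODEL = 64
rng = np.random.default_rng(0)

def patchify(image, p=40):
    """Split an HxWxC image into flattened p x p patches."""
    H, W, C = image.shape
    patches = image.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)        # (num_patches, p*p*C)

# "Light embedding" stand-ins: hMLP is really a small hierarchical MLP and
# dMel a discretized mel representation; both reduced to one matrix here.
W_img = rng.normal(0, 0.02, (40 * 40 * 3, D_MODEL))
W_aud = rng.normal(0, 0.02, (80, D_MODEL))       # 80 mel channels per frame

image = rng.normal(size=(120, 160, 3))           # a 3x4 grid of 40x40 patches
mel_frames = rng.normal(size=(25, 80))           # ~25 frames of audio features

img_tokens = patchify(image) @ W_img             # (12, D_MODEL)
aud_tokens = mel_frames @ W_aud                  # (25, D_MODEL)

# Early fusion: both modalities enter the same transformer sequence
# (time-aligned in the real system; plain concatenation here).
tokens = np.concatenate([img_tokens, aud_tokens])
```

Because the projections are co-trained with the transformer from scratch, there is no frozen encoder boundary for gradients to stop at.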

Supporting engineering: streaming sessions for frequent small prefills/decodes at low overhead (upstreamed to SGLang); latency-tuned MoE kernels (gather + gemv instead of grouped GEMM); and bitwise trainer–sampler alignment via batch-invariant kernels (<5% overhead) for stability and debuggability.
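The gather + gemv point can be made concrete with a toy single-token decode: since only the top-k routed experts matter for one token, you can gather just those experts' weights and run small matrix–vector products instead of a grouped GEMM over expert batches. The router and shapes below are illustrative, not TML's kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out, top_k = 8, 16, 32, 2
W = rng.normal(size=(n_experts, d_in, d_out))    # all expert weight matrices
x = rng.normal(size=d_in)                        # one decoding token

router_logits = rng.normal(size=n_experts)
top = np.argsort(router_logits)[-top_k:]         # indices of selected experts
gates = np.exp(router_logits[top])
gates /= gates.sum()                             # softmax over the selected two

# gather: pull only the top-k experts' weights; gemv: one x @ W[e] per expert
y_gemv = sum(g * (x @ W[e]) for g, e in zip(gates, top))

# Reference: mixture over all experts, with zero gate weight elsewhere
dense_gates = np.zeros(n_experts)
dense_gates[top] = gates
y_ref = sum(dense_gates[e] * (x @ W[e]) for e in range(n_experts))
```

At batch size 1 per expert, each of these products really is a gemv, which is why latency-tuned kernels can beat a grouped-GEMM formulation sized for training-style batches.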

Safety angle

Real-time interaction stresses safety differently from turn-based exchange. TML's work focused on two axes:

  • Modality-appropriate refusals — TTS-generated refusal / over-refusal training data so spoken refusals are colloquial but no less firm.
  • Long-horizon robustness — automated red-teaming harness generating multi-turn refusal data, maintaining behavioral parity with the text model's refusals.

Limitations (per the post)

  • Long sessions — continuous A/V accumulates context fast; the streaming-session design handles short and medium sessions well, but very long sessions need careful context management (parallels Context Window Smart Zone).
  • Compute & connectivity — low-latency A/V streaming needs a reliable connection and degrades badly without one.
  • Scale — TML-Interaction-Small is 276B MoE / 12B active; larger pretrained models are too slow to serve in this regime today, with larger models promised "later this year".
  • Background agents — agentic intelligence acknowledged as essential but under-explored relative to the real-time work.
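One hedged sketch of what "careful context management" for long sessions could look like: a summarize-and-trim policy that keeps recent A/V chunks verbatim and collapses older ones. The `summarize` stub, the budget, and the policy itself are assumptions for illustration, not TML's design.

```python
def summarize(chunks):
    """Stub: in a real system this would be a model-generated summary."""
    return f"<summary of {len(chunks)} chunks>"

def trim_context(chunks, keep_recent=100):
    """Collapse everything but the newest `keep_recent` chunks."""
    if len(chunks) <= keep_recent:
        return list(chunks)
    old, recent = chunks[:-keep_recent], chunks[-keep_recent:]
    return [summarize(old)] + recent

# A 250-chunk session trimmed to one summary token plus 100 recent chunks
ctx = trim_context([f"chunk{i}" for i in range(250)], keep_recent=100)
```

The hard part the post gestures at is what such a policy must preserve — timing, overlap, and interruption structure, not just token content.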

Open questions

  • Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast and deep enough?
  • "Interactivity scales with intelligence" is asserted; the larger-model release later in 2026 is the test.
  • Research grant announced for interactivity benchmarks — what becomes the FD-bench equivalent for video proactivity?

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

