Summary#
The core architectural move in Interaction Models: instead of consuming a complete user turn and emitting a complete response, input and output are treated as continuous streams, processed and generated in ~200ms chunks ("micro-turns") that interleave. There are no artificial turn boundaries the model must adhere to — silence, overlap, and interruption all remain part of the model's context.
How it works#
- The model continuously interleaves: process 200ms of input → generate 200ms of output → process next 200ms of input → … across audio, video, and text.
- To a human, input and output remain concurrent streams; the model instead sees a single interleaved token sequence (input 0, output 0, input 1, output 1, …) that encodes the same timing (see the sketch after this list).
- Because timing lives in the sequence itself, the model has a direct sense of elapsed time and can act during the user's turn, not only after it.
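In code, the loop is just "append the input chunk, decode an output chunk, repeat", all within one sequence. A minimal toy sketch (`Session`, `micro_turn_loop`, and `EchoModel` are illustrative stand-ins, not TML's API):

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # micro-turn size

@dataclass
class Session:
    """One interleaved token sequence: input 0, output 0, input 1, ..."""
    tokens: list = field(default_factory=list)

    def append(self, toks: list) -> None:
        self.tokens.extend(toks)

def micro_turn_loop(model, session: Session, input_chunks: list) -> list:
    """Process ~200ms of input, generate ~200ms of output, repeat.

    Silence stays in the sequence as an explicit token instead of being
    edited out, so elapsed time is encoded directly in the context.
    """
    outputs = []
    for chunk in input_chunks:                        # ~200ms of audio/video/text
        session.append(chunk)                         # prefill the new input tokens
        out = model.decode(session.tokens, CHUNK_MS)  # decode up to 200ms of output
        session.append(out)                           # model output stays in context too
        outputs.append(out)
    return outputs

class EchoModel:
    """Toy stand-in for the model: acknowledges input, stays quiet in silence."""
    def decode(self, tokens: list, budget_ms: int) -> list:
        return ["<quiet>"] if tokens[-1] == "<sil>" else [f"ack:{tokens[-1]}"]

chunks = [["hi"], ["<sil>"], ["how", "are"], ["you?"]]  # <sil> = 200ms of silence
print(micro_turn_loop(EchoModel(), Session(), chunks))
# [['ack:hi'], ['<quiet>'], ['ack:are'], ['ack:you?']]
```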
Contrast: turn-based models see an alternating token sequence with hard turn boundaries; real-time feel is faked by a harness that predicts those boundaries (voice-activity detection, etc.) — see Turn-Based Interface Bottleneck.
Why 200ms#
200ms chunks are small enough for near-real-time concurrency of multiple input/output modalities. The cost: inference must do frequent small prefills and decodes, each under strict latency constraints — and existing LLM inference libraries aren't built for that (significant per-turn overhead).
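The budget this implies is hard: each micro-turn's small prefill plus small decode must finish in under 200ms of wall clock, or the stream falls behind real time. A sketch of that invariant, with placeholder `prefill`/`decode` callables:

```python
import time

CHUNK_MS = 200  # wall-clock budget per micro-turn

def worst_micro_turn_ms(prefill, decode, chunks) -> float:
    """Measure the slowest micro-turn: prefill(new input) + decode(output).

    To keep up with real time this must stay below CHUNK_MS, with headroom
    left for audio I/O and the network.
    """
    worst = 0.0
    for chunk in chunks:
        t0 = time.perf_counter()
        prefill(chunk)   # small prefill of ~200ms of input tokens
        decode()         # small decode of ~200ms of output tokens
        worst = max(worst, (time.perf_counter() - t0) * 1000)
    return worst

# Placeholder workloads; a real check would drive the inference server.
assert worst_micro_turn_ms(lambda c: None, lambda: None, range(50)) < CHUNK_MS
```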
Inference: streaming sessions#
TML's fix for the frequent-small-prefill problem (see the sketches after this list):
- Client sends each 200ms chunk as a separate request.
- Inference server appends chunks into a persistent sequence in GPU memory — avoiding repeated memory reallocation and metadata recomputation.
- Upstreamed a version of this to SGLang.
- Plus latency-tuned kernels for bidirectional-serving shapes: e.g. gather+gemv for MoE kernels instead of the standard grouped GEMM (citing prior work from PyTorch's gpt-fast and Cursor's warp-decode).
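A hedged sketch of the session idea, with NumPy standing in for GPU memory (this shows the shape of the optimization, not SGLang's actual interface):

```python
import numpy as np

class StreamingSession:
    """Persistent per-session buffer: each 200ms request appends in place.

    One allocation up front, then cheap appends; no per-request sequence
    re-allocation or metadata recomputation. NumPy stands in for GPU memory.
    """

    def __init__(self, max_tokens: int, hidden: int):
        self.kv = np.zeros((max_tokens, hidden), dtype=np.float16)
        self.length = 0  # tokens used so far

    def append_chunk(self, token_states: np.ndarray) -> None:
        n = len(token_states)
        if self.length + n > len(self.kv):
            raise RuntimeError("session buffer full")
        self.kv[self.length:self.length + n] = token_states  # in-place append
        self.length += n

sess = StreamingSession(max_tokens=65_536, hidden=128)
for _ in range(2):  # two 200ms chunks arriving as separate requests
    sess.append_chunk(np.random.randn(12, 128).astype(np.float16))
print(sess.length)  # 24
```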
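And an illustration of the gather+gemv shape for MoE at decode-time batch sizes, in plain PyTorch (the real kernels are fused and latency-tuned; this only shows why gather+gemv fits a tiny batch better than a grouped GEMM):

```python
import torch

def moe_gather_gemv(x, expert_w, topk_idx, topk_gate):
    """Decode-shape MoE: gather just the routed experts' weights and do
    per-token matrix-vector products, rather than a grouped GEMM sized
    for large batches.

    x:         [B, D]    activations (B is tiny at decode time)
    expert_w:  [E, D, F] per-expert weights
    topk_idx:  [B, K]    routed expert ids
    topk_gate: [B, K]    routing weights
    """
    out = torch.zeros(x.shape[0], expert_w.shape[-1], dtype=x.dtype)
    for b in range(x.shape[0]):
        w = expert_w[topk_idx[b]]                    # gather: [K, D, F]
        y = torch.einsum("d,kdf->kf", x[b], w)       # K small mat-vec products
        out[b] = (topk_gate[b, :, None] * y).sum(0)  # gate-weighted combine
    return out

x = torch.randn(1, 64)                               # batch of 1, as in decode
w = torch.randn(8, 64, 128)                          # 8 experts
idx, gate = torch.tensor([[2, 5]]), torch.tensor([[0.6, 0.4]])
print(moe_gather_gemv(x, w, idx, gate).shape)        # torch.Size([1, 128])
```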
Trainer-sampler alignment#
Bitwise trainer-sampler alignment is used for training stability and for debugging the system's components. Implemented via batch-invariant kernels with <5% end-to-end overhead. Two highlighted kernels:
- All-reduce / reduce-scatter — NVLS low-latency comm kernels, deterministic on Blackwell, bitwise-aligned across different parallelism strategies (Sequence Parallelism vs Tensor Parallelism).
- Attention — split-KV normally gives decode and prefill different accumulation orders; fixed by splitting the KV cache at fixed absolute boundaries in both paths (e.g. 4096 tokens at a time, left-aligned), keeping efficiency in both (sketched below).
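The attention fix is easy to state in code: make split boundaries a function of absolute KV position only, then merge the partial softmaxes in split order. A simplified fp32 sketch of that logic (not a real flash kernel):

```python
import torch

SPLIT = 4096  # fixed, left-aligned split size: boundaries depend only on
              # absolute KV position, so decode and prefill reduce over the
              # KV cache in exactly the same order.

def split_kv_attention(q, k, v):
    """Single-query attention with deterministic split-KV accumulation.

    Simplified fp32 numerics; what matters is that the split points and the
    merge order never depend on batch shape or decode-vs-prefill mode.
    q: [D], k: [T, D], v: [T, D]
    """
    partials = []
    for start in range(0, k.shape[0], SPLIT):     # left-aligned splits
        ks, vs = k[start:start + SPLIT], v[start:start + SPLIT]
        s = ks @ q                                # scores for this split
        m = s.max()
        p = torch.exp(s - m)
        partials.append((m, p.sum(), p @ vs))     # (local max, denom, numer)
    # Merge partial softmaxes in split order (deterministic).
    m_all = max(m for m, _, _ in partials)
    num = sum(torch.exp(m - m_all) * o for m, _, o in partials)
    den = sum(torch.exp(m - m_all) * z for m, z, _ in partials)
    return num / den                              # == softmax(k @ q) @ v

q, k, v = torch.randn(64), torch.randn(10_000, 64), torch.randn(10_000, 64)
ref = torch.softmax(k @ q, dim=0) @ v
print(torch.allclose(split_kv_attention(q, k, v), ref, atol=1e-4))  # True
```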
What it buys you#
Every interaction mode that needs a special-purpose harness today becomes a special case of model behavior — and improves with model size and training data: proactive interjection, simultaneous speech, visual-cue reactions, time estimation. See Full-Duplex Interaction.
Connections#
- Interaction Models — parent concept
- Turn-Based Interface Bottleneck — what this replaces
- Encoder-Free Early Fusion — the complementary "minimal pre-processing" choice that makes streaming feasible
- Interaction / Background Model Split — micro-turns keep the interaction model present; deep reasoning is delegated out so it doesn't stall the stream
- Full-Duplex Interaction — the capabilities unlocked
- The Bitter Lesson — "no turn boundaries → interaction modes become scalable model behavior" is a direct application
- Context Window Smart Zone — continuous A/V at 200ms granularity accumulates context fast; the open long-session problem
Sources#
8 articles link here
- Concept: Encoder-Free Early Fusion — Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
- Concept: Full-Duplex Interaction — Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
- Concept: Interaction / Background Model Split — Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
- Concept: Interaction Models — Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Concept: Interactivity Benchmarks — FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
- Concept: The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- Entity: TML-Interaction-Small — TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
- Concept: Turn-Based Interface Bottleneck — Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
