Summary#
The core architectural move in Interaction Models: instead of consuming a complete user turn and emitting a complete response, input and output are treated as continuous streams, processed and generated in ~200ms chunks ("micro-turns") that interleave. There are no artificial turn boundaries the model must adhere to — silence, overlap, and interruption all remain part of the model's context.
How it works#
- The model continuously interleaves: process 200ms of input → generate 200ms of output → process next 200ms of input → … across audio, video, and text.
- To a human, input and output remain concurrent streams; the model instead sees a single interleaved token sequence (input 0, output 0, input 1, output 1, …) that encodes the same timing (see the sketch after this list).
- Because timing lives in the sequence itself, the model has a direct sense of elapsed time and can act during the user's turn, not only after it.
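In code, the loop is just "append the input chunk, decode an output chunk, repeat", all within one sequence. A minimal toy sketch (`Session`, `micro_turn_loop`, and `EchoModel` are illustrative stand-ins, not TML's API):

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # micro-turn size

@dataclass
class Session:
    """One interleaved token sequence: input 0, output 0, input 1, ..."""
    tokens: list = field(default_factory=list)

    def append(self, toks: list) -> None:
        self.tokens.extend(toks)

def micro_turn_loop(model, session: Session, input_chunks: list) -> list:
    """Process ~200ms of input, generate ~200ms of output, repeat.

    Silence stays in the sequence as an explicit token instead of being
    edited out, so elapsed time is encoded directly in the context.
    """
    outputs = []
    for chunk in input_chunks:                        # ~200ms of audio/video/text
        session.append(chunk)                         # prefill the new input tokens
        out = model.decode(session.tokens, CHUNK_MS)  # decode up to 200ms of output
        session.append(out)                           # model output stays in context too
        outputs.append(out)
    return outputs

class EchoModel:
    """Toy stand-in for the model: acknowledges input, stays quiet in silence."""
    def decode(self, tokens: list, budget_ms: int) -> list:
        return ["<quiet>"] if tokens[-1] == "<sil>" else [f"ack:{tokens[-1]}"]

chunks = [["hi"], ["<sil>"], ["how", "are"], ["you?"]]  # <sil> = 200ms of silence
print(micro_turn_loop(EchoModel(), Session(), chunks))
# [['ack:hi'], ['<quiet>'], ['ack:are'], ['ack:you?']]
```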
Contrast: turn-based models see an alternating token sequence with hard turn boundaries; real-time feel is faked by a harness that predicts those boundaries (voice-activity detection, etc.) — see Turn-Based Interface Bottleneck.
Why 200ms#
200ms chunks are small enough for near-real-time concurrency of multiple input/output modalities. The cost: inference must do frequent small prefills and decodes, each under strict latency constraints — and existing LLM inference libraries aren't built for that (significant per-turn overhead).
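The budget this implies is hard: each micro-turn's small prefill plus small decode must finish in under 200ms of wall clock, or the stream falls behind real time. A sketch of that invariant, with placeholder `prefill`/`decode` callables:

```python
import time

CHUNK_MS = 200  # wall-clock budget per micro-turn

def worst_micro_turn_ms(prefill, decode, chunks) -> float:
    """Measure the slowest micro-turn: prefill(new input) + decode(output).

    To keep up with real time this must stay below CHUNK_MS, with headroom
    left for audio I/O and the network.
    """
    worst = 0.0
    for chunk in chunks:
        t0 = time.perf_counter()
        prefill(chunk)   # small prefill of ~200ms of input tokens
        decode()         # small decode of ~200ms of output tokens
        worst = max(worst, (time.perf_counter() - t0) * 1000)
    return worst

# Placeholder workloads; a real check would drive the inference server.
assert worst_micro_turn_ms(lambda c: None, lambda: None, range(50)) < CHUNK_MS
```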
Inference: streaming sessions#
TML's fix for the frequent-small-prefill problem (see the sketches after this list):
- Client sends each 200ms chunk as a separate request.
- Inference server appends chunks into a persistent sequence in GPU memory — avoiding repeated memory reallocation and metadata recomputation.
- Upstreamed a version of this to SGLang.
- Plus latency-tuned kernels for bidirectional-serving shapes: e.g. gather+gemv for MoE kernels instead of the standard grouped GEMM (citing prior work from PyTorch's gpt-fast and Cursor's warp-decode).
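A hedged sketch of the session idea, with NumPy standing in for GPU memory (this shows the shape of the optimization, not SGLang's actual interface):

```python
import numpy as np

class StreamingSession:
    """Persistent per-session buffer: each 200ms request appends in place.

    One allocation up front, then cheap appends; no per-request sequence
    re-allocation or metadata recomputation. NumPy stands in for GPU memory.
    """

    def __init__(self, max_tokens: int, hidden: int):
        self.kv = np.zeros((max_tokens, hidden), dtype=np.float16)
        self.length = 0  # tokens used so far

    def append_chunk(self, token_states: np.ndarray) -> None:
        n = len(token_states)
        if self.length + n > len(self.kv):
            raise RuntimeError("session buffer full")
        self.kv[self.length:self.length + n] = token_states  # in-place append
        self.length += n

sess = StreamingSession(max_tokens=65_536, hidden=128)
for _ in range(2):  # two 200ms chunks arriving as separate requests
    sess.append_chunk(np.random.randn(12, 128).astype(np.float16))
print(sess.length)  # 24
```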
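And an illustration of the gather+gemv shape for MoE at decode-time batch sizes, in plain PyTorch (the real kernels are fused and latency-tuned; this only shows why gather+gemv fits a tiny batch better than a grouped GEMM):

```python
import torch

def moe_gather_gemv(x, expert_w, topk_idx, topk_gate):
    """Decode-shape MoE: gather just the routed experts' weights and do
    per-token matrix-vector products, rather than a grouped GEMM sized
    for large batches.

    x:         [B, D]    activations (B is tiny at decode time)
    expert_w:  [E, D, F] per-expert weights
    topk_idx:  [B, K]    routed expert ids
    topk_gate: [B, K]    routing weights
    """
    out = torch.zeros(x.shape[0], expert_w.shape[-1], dtype=x.dtype)
    for b in range(x.shape[0]):
        w = expert_w[topk_idx[b]]                    # gather: [K, D, F]
        y = torch.einsum("d,kdf->kf", x[b], w)       # K small mat-vec products
        out[b] = (topk_gate[b, :, None] * y).sum(0)  # gate-weighted combine
    return out

x = torch.randn(1, 64)                               # batch of 1, as in decode
w = torch.randn(8, 64, 128)                          # 8 experts
idx, gate = torch.tensor([[2, 5]]), torch.tensor([[0.6, 0.4]])
print(moe_gather_gemv(x, w, idx, gate).shape)        # torch.Size([1, 128])
```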
Trainer-sampler alignment#
Bitwise trainer-sampler alignment is used for training stability and for debugging the system's components. Implemented via batch-invariant kernels with <5% end-to-end overhead. Two highlighted kernels:
- All-reduce / reduce-scatter — NVLS low-latency comm kernels, deterministic on Blackwell, bitwise-aligned across different parallelism strategies (Sequence Parallelism vs Tensor Parallelism).
- Attention — split-KV normally gives decode and prefill different accumulation orders; fixed by splitting the KV cache at fixed absolute boundaries in both paths (e.g. 4096 tokens at a time, left-aligned), keeping efficiency in both (sketched below).
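The attention fix is easy to state in code: make split boundaries a function of absolute KV position only, then merge the partial softmaxes in split order. A simplified fp32 sketch of that logic (not a real flash kernel):

```python
import torch

SPLIT = 4096  # fixed, left-aligned split size: boundaries depend only on
              # absolute KV position, so decode and prefill reduce over the
              # KV cache in exactly the same order.

def split_kv_attention(q, k, v):
    """Single-query attention with deterministic split-KV accumulation.

    Simplified fp32 numerics; what matters is that the split points and the
    merge order never depend on batch shape or decode-vs-prefill mode.
    q: [D], k: [T, D], v: [T, D]
    """
    partials = []
    for start in range(0, k.shape[0], SPLIT):     # left-aligned splits
        ks, vs = k[start:start + SPLIT], v[start:start + SPLIT]
        s = ks @ q                                # scores for this split
        m = s.max()
        p = torch.exp(s - m)
        partials.append((m, p.sum(), p @ vs))     # (local max, denom, numer)
    # Merge partial softmaxes in split order (deterministic).
    m_all = max(m for m, _, _ in partials)
    num = sum(torch.exp(m - m_all) * o for m, _, o in partials)
    den = sum(torch.exp(m - m_all) * z for m, z, _ in partials)
    return num / den                              # == softmax(k @ q) @ v

q, k, v = torch.randn(64), torch.randn(10_000, 64), torch.randn(10_000, 64)
ref = torch.softmax(k @ q, dim=0) @ v
print(torch.allclose(split_kv_attention(q, k, v), ref, atol=1e-4))  # True
```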
What it buys you#
Every interaction mode that needs a special-purpose harness today becomes a special case of model behavior — and improves with model size and training data: proactive interjection, simultaneous speech, visual-cue reactions, time estimation. See Full-Duplex Interaction.
Connections#
- Interaction Models — parent concept
- Turn-Based Interface Bottleneck — what this replaces
- Encoder-Free Early Fusion — the complementary "minimal pre-processing" choice that makes streaming feasible
- Interaction / Background Model Split — micro-turns keep the interaction model present; deep reasoning is delegated out so it doesn't stall the stream
- Full-Duplex Interaction — the capabilities unlocked
- The Bitter Lesson — "no turn boundaries → interaction modes become scalable model behavior" is a direct application
- Context Window Smart Zone — continuous A/V at 200ms granularity accumulates context fast; the open long-session problem
Sources#
8 articles link here
- Concept: Encoder-Free Early Fusion — Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…
- Concept: Full-Duplex Interaction — Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…
- Concept: Interaction / Background Model Split — Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…
- Concept: Interaction Models — Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…
- Concept: Interactivity Benchmarks — FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…
- Concept: The Bitter Lesson — Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…
- Entity: TML-Interaction-Small — TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…
- Concept: Turn-Based Interface Bottleneck — Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
