Howardism · vol. 03 · quiet corner of the web
PLATE II · PIECE № 08 · HOWARDISM

Time-Aligned Micro-Turns

Published May 13, 2026 · Filed Concept · Reading 4 min · Source AI-synthesised

The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; streaming-sessions inference (upstreamed to SGLang), latency-tuned MoE kernels, bitwise trainer-sampler alignment


Summary#

The core architectural move in Interaction Models: instead of consuming a complete user turn and emitting a complete response, input and output are treated as continuous streams, processed and generated in ~200ms chunks ("micro-turns") that interleave. There are no artificial turn boundaries the model must adhere to — silence, overlap, and interruption all remain part of the model's context.

How it works#

  • The model continuously interleaves: process 200ms of input → generate 200ms of output → process next 200ms of input → … across audio, video, and text.
  • To a human, input and output remain concurrent streams; the model sees a single interleaved token sequence (input 0, output 0, input 1, output 1, …) that encodes the same timing.
  • Because timing is in the sequence, the model has a direct sense of elapsed time and can act during the user's turn, not only after it.
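
The interleaving can be sketched as a simple flattening step; chunk contents and labels below are illustrative stand-ins, not TML's actual token format:

```python
# Sketch: flatten two concurrent streams into one interleaved token
# sequence, one pair of 200 ms chunks per step. Chunk labels are
# illustrative, not TML's actual format.

CHUNK_MS = 200

def interleave(input_chunks, output_chunks):
    """input_chunks[i] and output_chunks[i] cover the same 200 ms window.

    A silent window is still a chunk, so elapsed time is encoded by
    position: chunk i starts at i * CHUNK_MS milliseconds.
    """
    sequence = []
    for i, (inp, out) in enumerate(zip(input_chunks, output_chunks)):
        sequence.append(("input", i, inp))    # process 200 ms of input
        sequence.append(("output", i, out))   # emit 200 ms of output
    return sequence

# The model can act during the user's turn: its output chunk for
# window 1 is generated while input is still arriving.
seq = interleave(["hi", "<silence>"], ["<silence>", "hello!"])
```

Because silence occupies real chunks rather than being stripped out, position in the sequence doubles as a clock.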

Contrast: turn-based models see an alternating token sequence with hard turn boundaries; real-time feel is faked by a harness that predicts those boundaries (VAD etc.) — see Turn-Based Interface Bottleneck.

Why 200ms#

200ms chunks are small enough for near-real-time concurrency of multiple input/output modalities. The cost: inference must do frequent small prefills and decodes, each under strict latency constraints — and existing LLM inference libraries aren't built for that (significant per-turn overhead).
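
To make the constraint concrete, here is an illustrative latency budget; all numbers are assumptions for the sketch, not TML's measured figures:

```python
# Illustrative latency budget for one 200 ms micro-turn. Numbers are
# assumptions chosen for the sketch, not measured figures.

CHUNK_MS = 200.0

def budget(prefill_ms, decode_ms_per_token, tokens_out, overhead_ms):
    """Return the leftover headroom in one micro-turn, in ms."""
    spent = overhead_ms + prefill_ms + decode_ms_per_token * tokens_out
    return CHUNK_MS - spent

# With 30 ms of per-request overhead (scheduling, allocation, metadata),
# a 20 ms prefill and 10 output tokens at 5 ms each, 100 ms remain:
headroom = budget(prefill_ms=20, decode_ms_per_token=5, tokens_out=10, overhead_ms=30)

# The same 30 ms overhead paid once per long turn is negligible;
# paid five times per second, it consumes 15% of every chunk's budget.
```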

Inference: streaming sessions#

TML's fix for the frequent-small-prefill problem:

  • Client sends each 200ms chunk as a separate request.
  • Inference server appends chunks into a persistent sequence in GPU memory — avoiding repeated memory reallocation and metadata recomputation.
  • Upstreamed a version of this to SGLang.
  • Plus latency-tuned kernels for bidirectional-serving shapes: e.g. gather+gemv for MoE kernels instead of the standard grouped gemm (citing prior work from PyTorch/gpt-fast and Cursor's warp-decode).
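
A toy sketch of the streaming-sessions idea; the class and function names are hypothetical, and the real SGLang implementation manages KV-cache pages in GPU memory rather than Python lists:

```python
# Toy model of streaming sessions: each 200 ms chunk arrives as its own
# request, but the server appends it to a persistent per-session
# sequence instead of rebuilding state from scratch. Interfaces are
# hypothetical, not SGLang's actual API.

class StreamingSession:
    def __init__(self, session_id):
        self.session_id = session_id
        self.tokens = []         # stands in for the KV cache kept on GPU
        self.prefilled_upto = 0  # tokens already prefilled

    def append_chunk(self, chunk_tokens):
        # Append in place: no reallocation of the whole sequence and no
        # metadata recomputation for tokens already processed.
        self.tokens.extend(chunk_tokens)

    def prefill_new(self):
        # Only the newly appended tokens need a (small) prefill.
        new = self.tokens[self.prefilled_upto:]
        self.prefilled_upto = len(self.tokens)
        return new

sessions = {}

def handle_request(session_id, chunk_tokens):
    # Without sessions, every request would re-prefill the full history.
    sess = sessions.setdefault(session_id, StreamingSession(session_id))
    sess.append_chunk(chunk_tokens)
    return sess.prefill_new()

handle_request("s1", [1, 2, 3])  # prefills 3 tokens
handle_request("s1", [4, 5])     # prefills only the 2 new tokens
```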

Trainer-sampler alignment#

Bitwise trainer-sampler alignment is used for training stability and for debugging the system's components. It is implemented via batch-invariant kernels at under 5% end-to-end overhead. Two highlighted kernels:

  • All-reduce / reduce-scatter — NVLS low-latency comm kernels, deterministic on Blackwell, bitwise-aligned across different parallelism strategies (Sequence Parallelism vs Tensor Parallelism).
  • Attention — Split-KV normally causes inconsistent accumulation orders between decode and prefill; fixed by splitting consistently between decode and prefill (e.g. 4096 tokens at a time, left-aligned), keeping efficiency in both.
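
A minimal sketch of why accumulation order matters and how fixed, left-aligned splits restore bitwise agreement (illustrative values, not real attention partials):

```python
# Floating-point addition is not associative, so two kernels that sum
# the same attention partials in different orders can disagree in the
# low bits:
a, b, c = 0.1, 0.2, 0.3
assert (a + b) + c != a + (b + c)

# The fix sketched here: both prefill and decode walk the sequence in
# identical fixed-size, left-aligned splits, so partial sums are formed
# and combined in exactly the same order. The article uses 4096-token
# splits; the size is small here for illustration.
def split_sum(partials, split_size=4):
    total = 0.0
    for start in range(0, len(partials), split_size):
        block = 0.0
        for p in partials[start:start + split_size]:
            block += p   # fixed within-split order
        total += block   # fixed across-split order
    return total

vals = [0.1, 0.2, 0.3, 0.4] * 3
# Identical grouping on both paths guarantees bitwise-equal results;
# a different split size regroups the additions and may flip low bits.
assert split_sum(vals) == split_sum(vals)
```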

What it buys you#

Every interaction mode that needs a special-purpose harness today becomes a special case of model behavior — and improves with model size and training data: proactive interjection, simultaneous speech, visual-cue reactions, time estimation. See Full-Duplex Interaction.


§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

8 articles link here
  • Concept · Encoder-Free Early Fusion

    Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…

  • Concept · Full-Duplex Interaction

    Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…

  • Concept · Interaction / Background Model Split

    Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Concept · Interactivity Benchmarks

    FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…

  • Concept · The Bitter Lesson

    Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…

  • Entity · TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

  • Concept · Turn-Based Interface Bottleneck

    Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…
