Summary
A multimodal design choice in Interaction Models: instead of routing audio and video through large, standalone encoders (and audio out through a separate TTS-like decoder), use minimal pre-processing and a single transformer, with all components co-trained from scratch. "Encoder-free" is relative — there are light embedding layers — but there's no Whisper-scale audio encoder or separate TTS model.
The components (one 200ms micro-turn)
Inputs (any subset of text / frame / audio; a minimal sketch of the three embedders follows this list):
- Text → token embedding (standard).
- Image / video frame → split into 40×40-pixel patches, each encoded by an hMLP stem (Touvron et al. 2022).
- Audio → taken in as dMel (Bai et al. 2024), i.e. mel-filterbank channels discretized into intensity bins, then transformed by a lightweight embedding layer ("bag of embeddings").
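A minimal PyTorch sketch of the three light embedders. All sizes (`D`, `VOCAB`, `N_MEL`, `N_BINS`) are illustrative assumptions, not TML's; the hMLP stem is simplified to a flat per-patch MLP rather than the staged stem of Touvron et al. 2022, and the dMel summation is one plausible reading of "bag of embeddings".

```python
import torch
import torch.nn as nn

D = 1024        # shared transformer width (assumed)
VOCAB = 50_000  # text vocab size (assumed)
N_MEL = 80      # mel channels (assumed)
N_BINS = 16     # dMel intensity bins per channel (Bai et al. 2024 use 16)

# Text: standard learned token embedding.
text_embed = nn.Embedding(VOCAB, D)

class HMLPStem(nn.Module):
    """Simplified stand-in for the hMLP stem (Touvron et al. 2022).
    Each 40x40 RGB patch is embedded independently, so no information
    crosses patch boundaries (the property that makes hMLP mask-friendly)."""
    def __init__(self, patch=40, d=D):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * patch * patch, d), nn.GELU(),
            nn.LayerNorm(d),
            nn.Linear(d, d),
        )
    def forward(self, patches):      # (B, n_patches, 3*40*40) flattened pixels
        return self.net(patches)     # (B, n_patches, D)

class DMelEmbed(nn.Module):
    """dMel 'bag of embeddings': one embedding per (channel, bin) pair,
    summed across channels to give one vector per audio frame."""
    def __init__(self, n_mel=N_MEL, n_bins=N_BINS, d=D):
        super().__init__()
        self.table = nn.Embedding(n_mel * n_bins, d)
        self.n_bins = n_bins
    def forward(self, bins):         # (B, T, n_mel) integer bin indices
        chan = torch.arange(bins.shape[-1], device=bins.device)
        idx = chan * self.n_bins + bins      # unique id per (channel, bin)
        return self.table(idx).sum(dim=-2)   # (B, T, D)
```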
A single shared transformer consumes the fused inputs.
Outputs (both heads are sketched after this list):
- Text → unembedding (standard).
- Audio → a flow-matching head (Lipman et al. 2022) producing mel frames.
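A hedged sketch of the two output heads, reusing the assumed sizes from above. The flow head's architecture, the Euler step count, and the conditioning scheme (concatenating the hidden state) are simple illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

D, VOCAB, N_MEL = 1024, 50_000, 80   # same assumed sizes as above

# Text: standard unembedding back to vocabulary logits.
unembed = nn.Linear(D, VOCAB, bias=False)

class FlowHead(nn.Module):
    """Conditional velocity field v(x_t, t | h) in the style of flow
    matching (Lipman et al. 2022); integrating it from noise at t=0
    to t=1 yields a mel frame."""
    def __init__(self, d=D, mel=N_MEL):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel + 1 + d, d), nn.GELU(),
            nn.Linear(d, mel),
        )
    def forward(self, x_t, t, h):    # (B, mel), (B, 1), (B, D)
        return self.net(torch.cat([x_t, t, h], dim=-1))

@torch.no_grad()
def sample_mel(head, h, steps=8):
    """Euler integration of the learned field: noise -> mel frame."""
    x = torch.randn(h.shape[0], N_MEL, device=h.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((h.shape[0], 1), i * dt, device=h.device)
        x = x + dt * head(x, t, h)
    return x
```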
All components are co-trained from scratch together with the transformer — not stitched from pretrained encoders/decoders.
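Putting it together: a hedged sketch of one joint training step, in which the embedders, the shared transformer, and both heads are all randomly initialized and optimized against a combined loss. The batch keys, the position bookkeeping, and the equal loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(embedders, transformer, unembed, flow_head, batch, opt):
    # Fuse: each modality goes through its light embedder, then the
    # embeddings are concatenated into one sequence for the shared transformer.
    seq = torch.cat([
        embedders["text"](batch["text_tokens"]),     # (B, T_txt, D)
        embedders["image"](batch["frame_patches"]),  # (B, T_img, D)
        embedders["audio"](batch["mel_bins"]),       # (B, T_aud, D)
    ], dim=1)
    h = transformer(seq)                             # (B, T_total, D)

    # Text loss: next-token cross-entropy at the text positions
    # (batch["text_pos"] indexing those positions is an assumption).
    logits = unembed(h[:, batch["text_pos"]])
    loss_text = F.cross_entropy(
        logits.flatten(0, 1), batch["text_targets"].flatten())

    # Audio loss: conditional flow matching. Sample a point on the straight
    # path from noise to the target mel frame and regress the constant
    # velocity (mel - noise) at that point.
    mel = batch["mel_targets"]                       # (B, N_MEL)
    cond = h[:, -1]                                  # condition on last state (assumed)
    noise = torch.randn_like(mel)
    t = torch.rand(mel.shape[0], 1, device=mel.device)
    x_t = (1 - t) * noise + t * mel
    loss_audio = F.mse_loss(flow_head(x_t, t, cond), mel - noise)

    # One optimizer over everything: co-trained from scratch, no frozen parts.
    (loss_text + loss_audio).backward()
    opt.step()
    opt.zero_grad()
    return loss_text.item(), loss_audio.item()
```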
Why this matters
- Avoids the latency and complexity of large standalone encoders/decoders — important when you have to do this every 200ms (see Time-Aligned Micro-Turns).
- Early fusion (everything into one transformer) means the model reasons jointly over modalities rather than over pre-digested encoder outputs — a prerequisite for things like reacting to a visual cue while speaking (see Full-Duplex Interaction).
- Co-training from scratch is consistent with The Bitter Lesson: fewer hand-engineered modular boundaries, more learned end-to-end.
Connections
- Interaction Models — parent concept
- Time-Aligned Micro-Turns — why minimal pre-processing is a hard requirement (200ms budget)
- Interaction / Background Model Split — the other half of the architecture
- Full-Duplex Interaction — joint multimodal reasoning is what makes visual+audio interjection possible
- The Bitter Lesson — "co-train from scratch, drop the modular encoders" is a bitter-lesson move
