Summary
A multimodal design choice in Interaction Models: instead of routing audio and video through large, standalone encoders (and audio out through a separate TTS-like decoder), use minimal pre-processing and a single transformer, with all components co-trained from scratch. "Encoder-free" is relative — there are light embedding layers — but there's no Whisper-scale audio encoder or separate TTS model.
The components (one 200ms micro-turn)
Inputs (any subset of text / frame / audio; a minimal sketch of the three embedders follows this list):
- Text → token embedding (standard).
- Image / video frame → split into 40×40-pixel patches, each encoded by an hMLP stem (Touvron et al. 2022).
- Audio → taken in as dMel (Bai et al. 2024), i.e. mel-filterbank channels discretized into intensity bins, then transformed by a lightweight embedding layer ("bag of embeddings").
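A minimal PyTorch sketch of the three light embedders. All sizes (`D`, `VOCAB`, `N_MEL`, `N_BINS`) are illustrative assumptions, not TML's; the hMLP stem is simplified to a flat per-patch MLP rather than the staged stem of Touvron et al. 2022, and the dMel summation is one plausible reading of "bag of embeddings".

```python
import torch
import torch.nn as nn

D = 1024        # shared transformer width (assumed)
VOCAB = 50_000  # text vocab size (assumed)
N_MEL = 80      # mel channels (assumed)
N_BINS = 16     # dMel intensity bins per channel (Bai et al. 2024 use 16)

# Text: standard learned token embedding.
text_embed = nn.Embedding(VOCAB, D)

class HMLPStem(nn.Module):
    """Simplified stand-in for the hMLP stem (Touvron et al. 2022).
    Each 40x40 RGB patch is embedded independently, so no information
    crosses patch boundaries (the property that makes hMLP mask-friendly)."""
    def __init__(self, patch=40, d=D):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * patch * patch, d), nn.GELU(),
            nn.LayerNorm(d),
            nn.Linear(d, d),
        )
    def forward(self, patches):      # (B, n_patches, 3*40*40) flattened pixels
        return self.net(patches)     # (B, n_patches, D)

class DMelEmbed(nn.Module):
    """dMel 'bag of embeddings': one embedding per (channel, bin) pair,
    summed across channels to give one vector per audio frame."""
    def __init__(self, n_mel=N_MEL, n_bins=N_BINS, d=D):
        super().__init__()
        self.table = nn.Embedding(n_mel * n_bins, d)
        self.n_bins = n_bins
    def forward(self, bins):         # (B, T, n_mel) integer bin indices
        chan = torch.arange(bins.shape[-1], device=bins.device)
        idx = chan * self.n_bins + bins      # unique id per (channel, bin)
        return self.table(idx).sum(dim=-2)   # (B, T, D)
```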
A single shared transformer consumes the fused inputs.
Outputs (both heads are sketched after this list):
- Text → unembedding (standard).
- Audio → a flow-matching head (Lipman et al. 2022) producing mel frames.
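A hedged sketch of the two output heads, reusing the assumed sizes from above. The flow head's architecture, the Euler step count, and the conditioning scheme (concatenating the hidden state) are simple illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

D, VOCAB, N_MEL = 1024, 50_000, 80   # same assumed sizes as above

# Text: standard unembedding back to vocabulary logits.
unembed = nn.Linear(D, VOCAB, bias=False)

class FlowHead(nn.Module):
    """Conditional velocity field v(x_t, t | h) in the style of flow
    matching (Lipman et al. 2022); integrating it from noise at t=0
    to t=1 yields a mel frame."""
    def __init__(self, d=D, mel=N_MEL):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel + 1 + d, d), nn.GELU(),
            nn.Linear(d, mel),
        )
    def forward(self, x_t, t, h):    # (B, mel), (B, 1), (B, D)
        return self.net(torch.cat([x_t, t, h], dim=-1))

@torch.no_grad()
def sample_mel(head, h, steps=8):
    """Euler integration of the learned field: noise -> mel frame."""
    x = torch.randn(h.shape[0], N_MEL, device=h.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((h.shape[0], 1), i * dt, device=h.device)
        x = x + dt * head(x, t, h)
    return x
```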
All components are co-trained from scratch together with the transformer — not stitched from pretrained encoders/decoders.
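Putting it together: a hedged sketch of one joint training step, in which the embedders, the shared transformer, and both heads are all randomly initialized and optimized against a combined loss. The batch keys, the position bookkeeping, and the equal loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(embedders, transformer, unembed, flow_head, batch, opt):
    # Fuse: each modality goes through its light embedder, then the
    # embeddings are concatenated into one sequence for the shared transformer.
    seq = torch.cat([
        embedders["text"](batch["text_tokens"]),     # (B, T_txt, D)
        embedders["image"](batch["frame_patches"]),  # (B, T_img, D)
        embedders["audio"](batch["mel_bins"]),       # (B, T_aud, D)
    ], dim=1)
    h = transformer(seq)                             # (B, T_total, D)

    # Text loss: next-token cross-entropy at the text positions
    # (batch["text_pos"] indexing those positions is an assumption).
    logits = unembed(h[:, batch["text_pos"]])
    loss_text = F.cross_entropy(
        logits.flatten(0, 1), batch["text_targets"].flatten())

    # Audio loss: conditional flow matching. Sample a point on the straight
    # path from noise to the target mel frame and regress the constant
    # velocity (mel - noise) at that point.
    mel = batch["mel_targets"]                       # (B, N_MEL)
    cond = h[:, -1]                                  # condition on last state (assumed)
    noise = torch.randn_like(mel)
    t = torch.rand(mel.shape[0], 1, device=mel.device)
    x_t = (1 - t) * noise + t * mel
    loss_audio = F.mse_loss(flow_head(x_t, t, cond), mel - noise)

    # One optimizer over everything: co-trained from scratch, no frozen parts.
    (loss_text + loss_audio).backward()
    opt.step()
    opt.zero_grad()
    return loss_text.item(), loss_audio.item()
```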
Why this matters
- Avoids the latency and complexity of large standalone encoders/decoders — important when you have to do this every 200ms (see Time-Aligned Micro-Turns).
- Early fusion (everything into one transformer) means the model reasons jointly over modalities rather than over pre-digested encoder outputs — a prerequisite for things like reacting to a visual cue while speaking (see Full-Duplex Interaction).
- Co-training from scratch is consistent with The Bitter Lesson: fewer hand-engineered modular boundaries, more learned end-to-end.
Connections
- Interaction Models — parent concept
- Time-Aligned Micro-Turns — why minimal pre-processing is a hard requirement (200ms budget)
- Interaction / Background Model Split — the other half of the architecture
- Full-Duplex Interaction — joint multimodal reasoning is what makes visual+audio interjection possible
- The Bitter Lesson — "co-train from scratch, drop the modular encoders" is a bitter-lesson move
