Howardism · vol. 03 · quiet corner of the web
PLATE II · PIECE № 01

Encoder-Free Early Fusion

Published May 13, 2026 · Filed Concept · Reading 2 min · Source AI-synthesised

Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch hMLP for frames, flow head for audio out, all co-trained from scratch in one transformer



Summary

A multimodal design choice in Interaction Models: instead of routing audio and video through large, standalone encoders (and audio out through a separate TTS-like decoder), use minimal pre-processing and a single transformer with all components co-trained from scratch. "Encoder-free" is relative, since there are light embedding layers, but there's no Whisper-scale audio encoder or separate TTS model.

The components (one 200ms micro-turn)

Inputs (any subset of text / frame / audio):

  • Text → token embedding (standard).
  • Image / video frame → split into 40×40 patches, encoded by an hMLP (Touvron et al. 2022).
  • Audio → taken in as dMel (Bai et al. 2024), transformed by a lightweight embedding layer ("bag of embeddings"); both non-text embedders are sketched below.
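To make the light embedding layers concrete, here is a minimal PyTorch sketch of the two non-text input paths. Everything in it is an illustrative assumption: the widths and bin counts are invented, the hMLP stem of Touvron et al. 2022 is reduced to a single per-patch MLP, and dMel is read as per-channel intensity bins whose embeddings are summed into one frame vector. Only the roles of the two paths come from the description above.

```python
# Hedged sketch of the light input embedders; shapes and layer sizes are
# illustrative assumptions, not the published configuration.
import torch
import torch.nn as nn

D = 1024        # shared transformer width (assumed)
N_MEL = 80      # mel channels per dMel frame (assumed)
N_BINS = 16     # discretisation bins per channel (assumed)
PATCH = 40      # 40x40-pixel patches, per the description above

class DMelEmbedding(nn.Module):
    """'Bag of embeddings': each discretised mel channel contributes one
    embedding and the frame embedding is their sum (one reading of dMel)."""
    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(N_MEL * N_BINS, D)  # one row per (channel, bin)

    def forward(self, bins):  # bins: (batch, frames, N_MEL) integer bin indices
        offsets = torch.arange(N_MEL, device=bins.device) * N_BINS
        return self.table(bins + offsets).sum(dim=-2)  # (batch, frames, D)

class HMLPPatchEmbedding(nn.Module):
    """Per-patch MLP over flattened 40x40 patches; a flattened simplification
    of the hierarchical-MLP stem of Touvron et al. 2022."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(PATCH * PATCH * 3, D), nn.GELU(),
            nn.LayerNorm(D), nn.Linear(D, D),
        )

    def forward(self, frame):  # frame: (batch, 3, H, W), H and W divisible by 40
        b, c, _, _ = frame.shape
        patches = frame.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * PATCH * PATCH)
        return self.mlp(patches)  # (batch, n_patches, D)

# Example, one micro-turn's worth of inputs (sizes invented):
# audio_tokens = DMelEmbedding()(torch.randint(0, N_BINS, (1, 5, N_MEL)))
# frame_tokens = HMLPPatchEmbedding()(torch.randn(1, 3, 360, 640))
```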

A single shared transformer consumes the fused inputs.
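Early fusion itself is then just sequence concatenation: the micro-turn's per-modality token embeddings become one stream that a single stack attends over jointly. A minimal sketch, assuming plain concatenation order and a tiny stand-in transformer; the source only says the fused inputs go through one shared model.

```python
# Minimal sketch of early fusion; ordering and any modality markers are
# assumptions, and the small transformer is a stand-in for the real model.
import torch
import torch.nn as nn

D = 1024
shared_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=16, batch_first=True),
    num_layers=2,  # stand-in depth
)

def fuse_micro_turn(text_emb, patch_emb, audio_emb):
    """Each input: (batch, n_i, D). Returns one fused token sequence."""
    return torch.cat([text_emb, patch_emb, audio_emb], dim=1)

hidden = shared_transformer(
    fuse_micro_turn(torch.randn(1, 4, D), torch.randn(1, 144, D), torch.randn(1, 5, D))
)  # (1, 153, D): all modalities attended to in one stack
```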

Outputs:

  • Text → unembedding (standard).
  • Audio → a flow head (Lipman et al. 2022) producing mel (sketched below).

All components are co-trained from scratch together with the transformer — not stitched from pretrained encoders/decoders.
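A hedged sketch of the output side, assuming conditional flow matching in the style of Lipman et al. 2022: the head learns the straight-line velocity from noise to a target mel frame, conditioned on the transformer's hidden state. The architecture, conditioning scheme, and loss wiring are assumptions; only "a flow head producing mel, co-trained with the transformer" comes from the text above.

```python
# Hedged sketch of a flow-matching mel head (Lipman et al. 2022 style);
# layer sizes and conditioning are assumptions.
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    def __init__(self, d_model=1024, n_mel=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mel + d_model + 1, d_model), nn.GELU(),
            nn.Linear(d_model, n_mel),
        )

    def forward(self, x_t, t, cond):
        # x_t: (batch, n_mel) noisy mel; t: (batch, 1); cond: (batch, d_model)
        return self.net(torch.cat([x_t, cond, t], dim=-1))  # predicted velocity

def flow_matching_loss(head, mel, cond):
    """Train the head to predict the constant velocity of the straight path
    from Gaussian noise (t=0) to the target mel frame (t=1)."""
    x0 = torch.randn_like(mel)
    t = torch.rand(mel.size(0), 1, device=mel.device)
    x_t = (1 - t) * x0 + t * mel
    return ((head(x_t, t, cond) - (mel - x0)) ** 2).mean()

# loss = flow_matching_loss(FlowHead(), torch.randn(8, 80), torch.randn(8, 1024))
```

At inference the learned velocity field is integrated from t = 0 to t = 1 (a few Euler steps suffice in practice) to turn noise into a mel frame. Because `cond` is the shared transformer's hidden state, the flow-matching loss can simply be added to the text loss so gradients reach the whole stack, which is one plausible wiring for "co-trained from scratch".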

Why this matters

  • Avoids the latency and complexity of large standalone encoders/decoders — important when you have to do this every 200ms (see Time-Aligned Micro-Turns).
  • Early fusion (everything into one transformer) means the model reasons jointly over modalities rather than over pre-digested encoder outputs — a prerequisite for things like reacting to a visual cue while speaking (see Full-Duplex Interaction).
  • Co-training from scratch is consistent with The Bitter Lesson: fewer hand-engineered modular boundaries, more learned end-to-end.



§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

5 articles link here
  • Concept · Full-Duplex Interaction

    Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…

  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Concept · The Bitter Lesson

    Sutton 2019: scaled general methods beat hand-engineered structure; recurring justification across the wiki for dissolv…

  • Concept · Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…

  • Entity · TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

Related articles
  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Concept · Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…

  • Concept · Turn-Based Interface Bottleneck

    Why current AI interfaces limit collaboration: single-thread turn-taking is a bandwidth bottleneck; humans pushed out b…

  • Concept · Interactivity Benchmarks

    FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…

  • Concept · Full-Duplex Interaction

    Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…