H
Howardismvol. 03 · quiet corner of the web
Howardism · Vol. 03Plate II · No. 02

Multimodal, tagged.

Notes6TagMultimodalOldest13 May 2026Newest13 May 2026

Every article tagged multimodal, newest first.

Articles tagged Multimodal, sorted by date, newest first.
TitleSummaryDate
Encoder-Free Early FusionMultimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch hMLP for frames, flow head for audio out, all co-trained from scratch in one transformer
Full-Duplex InteractionPerceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speech, live translation/commentary, time-aware speech — all special cases of model behavior
Interaction / Background Model SplitDual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tools; rich-context-package delegation; "reasoning-model planning at non-thinking latency"
Interaction ModelsThinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via harness; interactivity scales with intelligence only if it's in the model
Interactivity BenchmarksFD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality
Time-Aligned Micro-TurnsThe core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; streaming-sessions inference (upstreamed to SGLang), latency-tuned MoE kernels, bitwise trainer-sampler alignment