Interaction Models

Published: May 13, 2026 | Filed: Concept | Domain: Interaction & Multimodal | Tags: LLM Architecture, Multimodal, Human AI Collaboration | Reading: 7 min | Source: AI-synthesised

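Thinking Machines Lab (May 2026): models that handle real-time audio/video/text interaction natively rather than through a harness. Interactivity scales with intelligence only when it is built into the model itself.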

Sources

Interaction Models: A Scalable Approach to Human-AI Collaboration

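Summary

An interaction model is a model that handles interaction natively: it takes in audio, video, and text continuously, and thinks, responds, and acts in real time, rather than simulating real-time behavior through external scaffolding (VAD, turn detection, a dialogue-management harness). Thinking Machines Lab released it as a research preview in May 2026; the first model is named TML-Interaction-Small.

The core bet: interactivity should scale together with intelligence. If interaction is part of the model, scaling the model makes it both smarter and a better collaborator. If interaction lives in a hand-built harness, The Bitter Lesson predicts that growth in general capability will overtake that harness.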

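The thesis in one line

"For interactivity to scale with intelligence, it must be part of the model itself."

This applies the harness-shrinkage argument (see Harness Shrinkage as Models Improve) to the interaction layer: VAD, turn-boundary prediction, and dialogue state machines, all of which are "clearly less intelligent than the model itself", should dissolve into model behavior. Once they do, capabilities the harness cannot support (proactive interjection, speaking while listening, reacting to visual cues) become special cases of model behavior and improve with scale.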
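To make the contrast concrete, here is a minimal sketch of the kind of hand-built scaffolding the thesis argues against: an energy-threshold voice activity detector with a fixed silence "hangover" that declares the end of a turn. It is a generic textbook rule, not anything taken from the source; the function names, the threshold, and the three-frame hangover are all assumptions chosen for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (frame must be non-empty)."""
    return sum(sample * sample for sample in frame) / len(frame)


def harness_turn_detector(frames, energy_threshold=0.01, hangover_frames=3):
    """Label each frame using a fixed rule (hypothetical harness logic).

    A frame above the threshold is speech. Once speech has started, a run of
    `hangover_frames` quiet frames is declared the end of the turn.
    """
    labels = []
    silent_run = 0
    in_turn = False
    for frame in frames:
        if frame_energy(frame) >= energy_threshold:
            in_turn = True
            silent_run = 0
            labels.append("speech")
        elif in_turn:
            silent_run += 1
            if silent_run >= hangover_frames:
                in_turn = False
                silent_run = 0
                labels.append("end_of_turn")
            else:
                labels.append("pause")
        else:
            labels.append("silence")
    return labels


loud, quiet = [0.5, 0.5], [0.0, 0.0]
labels = harness_turn_detector(
    [loud, quiet, quiet, loud, quiet, quiet, quiet, quiet]
)
print(labels)
```

The rule sees only energy and a counter. A speaker who pauses for three frames to think gets the same "end_of_turn" label as one who is yielding the floor, which is the kind of distinction the text expects the model itself to make.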

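What it unlocks (capabilities, not harness features)

Seamless dialogue management: the model implicitly tracks whether the speaker is thinking, yielding the floor, self-correcting, or inviting a response. No separate dialogue-management component is needed.
Verbal and visual interjection: the model steps in when the situation calls for it ("interrupt me when I say something wrong", "tell me when I write a bug"), not only at the end of a turn.
Simultaneous speech: user and model speak at the same time (live translation).
Time awareness: a direct sense of elapsed time ("How long did it take me to run a mile?").
Concurrent tool calls, search, and generative UI: while listening and speaking, the model searches, browses, and generates UI in parallel, then weaves the results back into the conversation.

For how these are demonstrated and measured, see Full-Duplex Interaction and Interactivity Benchmarks.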

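Why turn-based interfaces are the bottleneck

Today's models "experience reality in a single thread": they wait blindly until the user has finished typing or speaking, then generate blindly until they finish or are interrupted. For collaboration this is a narrow channel. It limits how much of a person's knowledge, intent, and judgment reaches the model, and how much of the model's work is visible. The source's analogy: resolving a critical disagreement over email instead of in person. For the full argument, see Turn-Based Interface Bottleneck.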

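Architecture (three load-bearing ideas)

Time-Aligned Micro-Turns: input and output are both continuous streams, processed and generated in 200ms chunks, with no artificial turn boundaries. Silence, overlap, and interruptions all stay in context.
Interaction / Background Model Split: a time-aware interaction model maintains real-time presence; an asynchronous background model handles sustained reasoning, tool use, and longer-horizon work. The interaction model delegates with a rich context packet (the full conversation, not a standalone query) and interleaves the results back in at moments that fit what the user is doing. Net effect: "the planning, tool use, and agentic workflows of a reasoning model at the response latency of a non-thinking model."
Encoder-Free Early Fusion: minimal pre-processing replaces large standalone encoders and decoders. Audio comes in as dMel plus a lightweight embedding; images are turned into 40×40 patches by an hMLP; audio goes out through a flow head. All components are co-trained from scratch with the transformer.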
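The first two ideas can be sketched together as a toy control loop. This is an illustration under stated assumptions, not TML's implementation: `interaction_loop`, `background_model`, the "look up" trigger, and the three-step job length are hypothetical, and `asyncio.sleep(0)` stands in for waiting on the next 200ms chunk so that the example runs instantly. Only the 200ms chunk size, the time-aligned context that keeps silence, and the context-packet delegation come from the text.

```python
import asyncio

CHUNK_MS = 200  # micro-turn length given in the text


async def background_model(packet: dict, steps: int) -> str:
    """Hypothetical stand-in for the asynchronous background model."""
    for _ in range(steps):
        await asyncio.sleep(0)  # yield to the interaction loop between steps
    events = len(packet["context"])
    return f"answer to {packet['request']!r} using {events} context events"


async def interaction_loop(incoming: list) -> list:
    """Take one input chunk and emit one output chunk per micro-turn."""
    context = []  # time-aligned log of (t_ms, direction, payload)
    pending = []  # background tasks still in flight
    for i, heard in enumerate(incoming):
        t_ms = i * CHUNK_MS
        context.append((t_ms, "in", heard))  # None means silence; it is kept
        said = None
        if heard is not None and heard.startswith("look up"):
            # Delegate with the conversation so far, not a bare query.
            packet = {"request": heard, "context": list(context)}
            task = asyncio.create_task(background_model(packet, steps=3))
            pending.append(task)
            said = "one moment"
        for task in [t for t in pending if t.done()]:
            pending.remove(task)
            said = task.result()  # weave the result into this micro-turn
        context.append((t_ms, "out", said))
        await asyncio.sleep(0)  # a real loop would wait for the next chunk
    for task in pending:  # drain anything unfinished when the stream ends
        context.append((len(incoming) * CHUNK_MS, "out", await task))
    return context


stream = ["hi", "look up flights", None, "to Oslo", None, None, None, None]
log = asyncio.run(interaction_loop(stream))
for event in log:
    print(event)
```

The loop never blocks on the background job: input chunks, including silent ones, keep landing in the log while the job runs, and the answer appears in whichever later micro-turn finds the task finished.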

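Plus the engineering: streaming sessions for low-overhead, high-frequency small prefill/decode steps (upstreamed into SGLang); latency-tuned MoE kernels (gather+gemv in place of grouped gemm); and bitwise alignment between trainer and sampler through batch-invariant kernels (<5% overhead), for stability and debuggability.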
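The source names gather+gemv only as a kernel choice, so the following is a pure-Python sketch of the arithmetic that phrase usually refers to, assuming a single token at decode time: gather the weight matrices of the routed experts, then do one matrix-vector product per expert. The function names, the tiny 2×2 experts, and the routing weights are made up for illustration; a real kernel would run on the GPU.

```python
def gemv(matrix, vector):
    """Matrix-vector product: one dot product per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_decode_gather_gemv(token, expert_weights, routed):
    """Combine the outputs of the routed experts for a single token.

    `routed` is a list of (expert_index, gate_weight) pairs, as a router
    would produce. Experts that were not routed are never touched.
    """
    out = [0.0] * len(expert_weights[0])
    for expert_index, gate in routed:
        weights = expert_weights[expert_index]  # gather: select one expert
        for j, y in enumerate(gemv(weights, token)):  # gemv: matrix x vector
            out[j] += gate * y
    return out


# Three toy experts; the router picks experts 0 and 2 with equal weight.
experts = [
    [[1, 0], [0, 1]],
    [[9, 9], [9, 9]],
    [[2, 0], [0, 2]],
]
result = moe_decode_gather_gemv([1.0, 2.0], experts, [(0, 0.5), (2, 0.5)])
print(result)
```

With one token per step there is no batch of rows to group per expert, so a matrix-vector product over just the selected experts does the same arithmetic with less bookkeeping. That is a plausible reading of why the swap helps latency, not a claim about the actual kernel.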

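Safety angle

Real-time interaction stresses safety differently from turn-based exchange. TML's work focuses on two axes:

Modality-appropriate refusals: TTS-generated refusal and over-refusal training data, so that spoken refusals sound conversational yet remain just as firm.
Long-horizon robustness: an automated red-teaming harness generates multi-turn refusal data, keeping refusal behavior consistent with the text model's.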

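Limitations (per the post)

Long sessions: continuous A/V accumulates context quickly. The streaming-session design handles short to medium sessions well, but very long sessions need careful context management (echoing Context Window Smart Zone).
Compute and connectivity: low-latency A/V streaming needs a reliable connection and degrades badly without one.
Scale: TML-Interaction-Small is a 276B MoE with 12B active parameters. Larger pretrained models are too slow to serve under the current setup; larger models are promised for "later this year".
Background agents: agentic intelligence is acknowledged as necessary but remains underexplored relative to the real-time work.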
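The long-session limitation follows from simple arithmetic on the 200ms chunk size. The sketch below counts micro-turns per session and, given a tokens-per-chunk figure, the resulting context size. The source gives no tokens-per-chunk number, so the value of 20 and the 100,000-token budget used here are purely hypothetical placeholders, as are the function names.

```python
CHUNK_MS = 200  # micro-turn length given in the text


def micro_turns(session_seconds: int) -> int:
    """Number of 200 ms micro-turns in a session of the given length."""
    return session_seconds * 1000 // CHUNK_MS


def context_tokens(session_seconds: int, tokens_per_chunk: int) -> int:
    """Context size if every micro-turn adds a fixed number of tokens."""
    return micro_turns(session_seconds) * tokens_per_chunk


def seconds_until_budget(budget_tokens: int, tokens_per_chunk: int) -> int:
    """Session length at which a token budget is exhausted."""
    return budget_tokens // tokens_per_chunk * CHUNK_MS // 1000


HYPOTHETICAL_TOKENS_PER_CHUNK = 20  # placeholder, not a figure from the post

print(micro_turns(3600))
print(context_tokens(3600, HYPOTHETICAL_TOKENS_PER_CHUNK))
print(seconds_until_budget(100_000, HYPOTHETICAL_TOKENS_PER_CHUNK))
```

Whatever the real per-chunk cost is, the count of micro-turns is fixed at five per second, so context grows linearly with wall-clock time even while nobody is speaking.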

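Open questions

Does the interaction/background split generalize, or is it a transitional artifact until a single model is both fast enough and deep enough?
"Interactivity scales with intelligence" is only an assertion so far; the larger models due later in 2026 are the test of it.
Research funding for interactivity benchmarks has been announced. What will become the FD-bench counterpart for video proactivity?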

