Howardism · vol. 03 · quiet corner of the web

Plate II · machine-translated

TML-Interaction-Small

Published: May 13, 2026 · Filed: Entity · Tags: Type/entity, LLM Model · Reading: 3 min · Source: AI-synthesised

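TML's first interaction model: a 276B-parameter MoE with 12B active parameters; audio + video + text input, text + audio output; 200 ms micro-turns; an asynchronous background agent; the best turn-taking latency of all compared models; research preview, May 2026.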

Figure: TML-Interaction-Small schematic

Sources

Interaction Models: A Scalable Approach to Human-AI Collaboration

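What it is

Thinking Machines Lab's first **interaction model**, released as a research preview in May 2026. The lab positions it as the first model to combine strong intelligence and instruction following with interactivity.

Architecture: 276B-parameter MoE with 12B active parameters. Trained from scratch as an interaction model, rather than bolting interactivity onto a turn-based model.
Modalities: continuous audio + video + text input; text + audio output. Encoder-Free Early Fusion (dMel audio embeddings, a 40×40-patch hMLP for video frames, and a flow head for audio output) feeds a single shared transformer, with all components co-trained from scratch.
Interaction mechanism: Time-Aligned Micro-Turns, that is, 200 ms interleaved input/output chunks with no turn boundaries.
Reasoning: deep reasoning, tool use, and long-horizon tasks are delegated to an asynchronous background model (see Interaction / Background Model Split). Even without the background agent, the model remains competitive on intelligence benchmarks.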
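The micro-turn mechanism and the background split described above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not TML's implementation: `run_session`, `background_task`, the `ask:` prefix, and the "one moment" filler are all hypothetical names invented for the example, and `asyncio.sleep(0)` stands in for the real 200 ms wait so the sketch runs instantly. Only the 200 ms slot length comes from the article.

```python
import asyncio
from dataclasses import dataclass
from typing import List

CHUNK_MS = 200  # micro-turn length stated in the article


@dataclass
class MicroTurn:
    t_ms: int   # start time of the slot
    heard: str  # input chunk perceived during the slot ("" means silence)
    said: str   # output chunk emitted in the same slot ("" means silence)


async def background_task(query: str) -> str:
    """Hypothetical stand-in for the asynchronous background model."""
    await asyncio.sleep(0)  # yield once; real work would span many slots
    return f"result for {query!r}"


async def run_session(input_chunks: List[str]) -> List[MicroTurn]:
    """Pair every input chunk with an output chunk; never block on background work."""
    log: List[MicroTurn] = []
    pending = None  # background task currently in flight, if any
    for i, heard in enumerate(input_chunks):
        said = ""
        if pending is not None and pending.done():
            # The background result is ready: surface it in this slot.
            said = pending.result()
            pending = None
        elif pending is None and heard.startswith("ask:"):
            # Delegate the slow work and keep the conversation moving.
            pending = asyncio.create_task(background_task(heard[4:]))
            said = "one moment"
        log.append(MicroTurn(i * CHUNK_MS, heard, said))
        await asyncio.sleep(0)  # stands in for waiting out the 200 ms slot
    if pending is not None:
        pending.cancel()  # session ended before the result arrived
    return log


if __name__ == "__main__":
    chunks = ["hi", "ask:weather", "and", "also", "thanks"]
    for turn in asyncio.run(run_session(chunks)):
        print(turn)
```

The loop emits an output chunk (possibly silence) for every input chunk, so a slow delegated task never stalls the exchange; the result is surfaced in whichever later slot finds it ready.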

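Key numbers (May 2026)

Turn-taking latency: 0.40 s (FD-bench v1, audio), the best among all compared models.
FD-bench v1.5 average score: 77.8, versus roughly 39–54 for the baselines (including thinking-high models).
FD-bench v3 (audio + tools): 82.8% response quality / 68.0% Pass@1 (with the background agent).
Audio MultiChallenge APR: 43.4%, beating every non-thinking baseline; only GPT-realtime-2.0 xhigh (48.5%) scores higher.
Baselines compared: GPT-realtime-2.0 (minimal/xhigh), GPT-realtime-1.5, Gemini-3.1-flash-live-preview (minimal/high), Qwen 3.5 Omni-plus-realtime. See Interactivity Benchmarks for the full tables.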
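To make the size of these gaps concrete, the short calculation below works out the lead on FD-bench v1.5 and the deficit on Audio MultiChallenge. It uses only the figures listed above; the variable and function names are illustrative, and because the article gives the baseline range only approximately ("roughly 39–54"), the resulting margins are approximate too.

```python
# Figures copied from the list above; names are illustrative only.
FD_BENCH_V15_SCORE = 77.8
BASELINE_RANGE = (39.0, 54.0)      # approximate range given in the article
APR_THIS_MODEL = 43.4              # Audio MultiChallenge APR, percent
APR_GPT_REALTIME_2_XHIGH = 48.5    # the only baseline reported as higher


def margin_range(score: float, low: float, high: float) -> tuple:
    """Return (smallest, largest) lead in points over a baseline range."""
    return round(score - high, 1), round(score - low, 1)


lead_min, lead_max = margin_range(FD_BENCH_V15_SCORE, *BASELINE_RANGE)
apr_gap = round(APR_GPT_REALTIME_2_XHIGH - APR_THIS_MODEL, 1)

print(f"FD-bench v1.5 lead: about {lead_min} to {lead_max} points")
print(f"Audio MultiChallenge gap to GPT-realtime-2.0 xhigh: {apr_gap} points")
```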

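Limitations (acknowledged)

Long continuous audio/video sessions accumulate context quickly; careful context management remains an open problem (echoing Context Window Smart Zone).
Requires a reliable low-latency connection; without one, performance degrades severely.
It is called "Small" because larger pretrained models are currently too slow to run in this mode; larger models are expected later in 2026.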
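The context-accumulation limit can be illustrated with a back-of-the-envelope sketch. The 200 ms chunk length is from the article; everything else is an assumption made for illustration: the article does not say how many tokens a chunk costs or how large the context window is, so `tokens_per_chunk` and `budget_tokens` are hypothetical parameters. The sliding window shown is a deliberately naive strategy, not the lab's approach, which the article describes as an open problem.

```python
from collections import deque

CHUNK_MS = 200  # micro-turn length stated in the article


def chunks_in_session(seconds: float) -> int:
    """Number of 200 ms micro-turns in a session of the given length."""
    return int(seconds * 1000 // CHUNK_MS)


def context_tokens(seconds: float, tokens_per_chunk: int) -> int:
    """Tokens accumulated if nothing is ever dropped.

    tokens_per_chunk is a hypothetical parameter; the article gives no figure.
    """
    return chunks_in_session(seconds) * tokens_per_chunk


class SlidingContext:
    """Naive mitigation: keep only the newest chunks that fit a token budget."""

    def __init__(self, budget_tokens: int, tokens_per_chunk: int):
        self.max_chunks = budget_tokens // tokens_per_chunk
        self.chunks = deque(maxlen=self.max_chunks)  # oldest chunks fall off

    def push(self, chunk) -> None:
        self.chunks.append(chunk)


if __name__ == "__main__":
    print(chunks_in_session(60), "micro-turns per minute")
    print(chunks_in_session(3600), "micro-turns per hour")
    # Assumed cost of 50 tokens per chunk, purely for illustration.
    print(context_tokens(3600, 50), "tokens after one hour at 50 tokens/chunk")
```

Even a modest per-chunk cost grows linearly with session length, which is why a fixed window alone is unlikely to be enough: it discards early context that a long session may still need.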

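Availability

A limited research preview will open "in the coming months", with a broader release "later this year". Feedback is welcome at interaction@thinkingmachines.ai; research grants are open for applications.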

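Related links

Interaction Models: the model class
Thinking Machines Lab: the builder
Time-Aligned Micro-Turns / Encoder-Free Early Fusion / Interaction / Background Model Split: the three architectural pillars
Full-Duplex Interaction: the interaction mode it demonstrates
Interactivity Benchmarks: the full benchmark tables and the baselines it beats
Claude Opus 4.7: a contemporary (mid-2026 frontier); its xhigh effort tier corresponds to GPT-realtime's minimal/xhigh settings, which serve as baseline configurations here
Context Window Smart Zone: the long-session limitation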

About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.
