Howardism

TML-Interaction-Small

Published: May 13, 2026 · Filed: Entity · Tags: Type/entity, LLM Model · Reading: 3 min · Source: AI-synthesised

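TML's first interaction model: 276B MoE with 12B active parameters; audio + video + text input, text + audio output; 200ms micro-turns; asynchronous background agent; the best turn-taking latency among all compared models; research preview, May 2026.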

TML-Interaction-Small diagram

Sources#

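What it is#

Thinking Machines Lab's first **interaction model**, released as a research preview in May 2026. It is positioned as "the first model with both strong intelligence/instruction-following ability and interactivity."

  • Architecture: 276B-parameter MoE with 12B active parameters. Trained from scratch as an interaction model (rather than bolting interactivity onto a turn-based model).
  • Modalities: continuous audio + video + text input; text + audio output. Encoder-Free Early Fusion (dMel audio embedding, a 40×40-patch hMLP for video frames, a flow head for audio output) on a single shared transformer, with all components co-trained from scratch.
  • Interaction mechanism: Time-Aligned Micro-Turns, i.e. 200ms interleaved input/output chunks with no turn boundaries.
  • Reasoning: deep reasoning, tool use, and long-horizon tasks are delegated to an asynchronous background model; see Interaction / Background Model Split. Even without the background agent, it remains competitive on intelligence benchmarks.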
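
The micro-turn and delegation mechanics in the list above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not TML's implementation: `background_task`, `run_session`, and the `"listen"` / `"filler"` / `"speak:"` outputs are hypothetical names, the "models" are stubs, and `asyncio.sleep(0)` stands in for real time passing. Only the 200ms chunk length comes from the article. The sketch shows the scheduling shape: one input chunk and one output chunk per slot, with slow work handed to an asynchronous task whose result is folded into the stream once it is ready.

```python
import asyncio

MICRO_TURN_MS = 200  # chunk length from the article; all names here are hypothetical


async def background_task(query: str, delay_slots: int) -> str:
    """Stub for the asynchronous background model (deep reasoning, tool use)."""
    for _ in range(delay_slots):
        await asyncio.sleep(0)  # yield to the event loop; simulates slow work
    return f"result({query})"


async def run_session(inputs, delegate_at, delay_slots):
    """Consume one input chunk and emit one output chunk per micro-turn slot."""
    log = []
    pending = None  # in-flight background task, if any
    for slot, chunk in enumerate(inputs):
        if slot == delegate_at:
            # Hand the hard request off without blocking the interaction loop.
            pending = asyncio.create_task(background_task(chunk, delay_slots))
        if pending is not None and pending.done():
            out = f"speak:{pending.result()}"  # fold the finished result into the stream
            pending = None
        elif pending is not None:
            out = "filler"  # stay present while the background work runs
        else:
            out = "listen"
        log.append((slot * MICRO_TURN_MS, chunk, out))
        await asyncio.sleep(0)  # stands in for the remainder of the 200ms slot
    return log


if __name__ == "__main__":
    for entry in asyncio.run(run_session(list("abcdef"), delegate_at=1, delay_slots=2)):
        print(entry)
```

The point of the design is that the interaction loop never awaits the background task directly. It only polls `pending.done()` once per slot, so an output chunk is still produced every 200ms while the slower work is in flight.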

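Key figures (May 2026)#

  • Turn-taking latency: 0.40 s (FD-bench v1, audio), the best among all compared models.
  • FD-bench v1.5 average: 77.8 vs. roughly 39–54 for the baselines (including thinking-high models).
  • FD-bench v3 (audio + tools): 82.8% response quality / 68.0% Pass@1 (with the background agent).
  • Audio MultiChallenge APR: 43.4%, beating all non-thinking baselines; only GPT-realtime-2.0 xhigh (48.5%) scores higher.
  • Comparison baselines: GPT-realtime-2.0 (minimal/xhigh), GPT-realtime-1.5, Gemini-3.1-flash-live-preview (minimal/high), Qwen 3.5 Omni-plus-realtime. For the full tables, see Interactivity Benchmarks.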

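Limitations (acknowledged)#

  • Long continuous audio/video sessions accumulate context quickly; careful context management remains an open problem (echoing Context Window Smart Zone).
  • Requires a reliable low-latency connection; without one, performance degrades severely.
  • It is called "Small" because larger pretrained models are currently too slow to run in this mode; larger models are expected later in 2026.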
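
To make the first limitation concrete, here is a back-of-the-envelope sketch of how quickly a continuous session could fill a context window. Only the 200ms chunk length comes from the article. The per-chunk token count and the window size are made-up placeholders, since the article states neither, and the function names are hypothetical.

```python
MICRO_TURN_MS = 200  # chunk length from the article


def chunks_per_minute(chunk_ms: int = MICRO_TURN_MS) -> int:
    """Number of micro-turn slots in one minute of continuous streaming."""
    return 60_000 // chunk_ms


def minutes_until_full(window_tokens: int, tokens_per_chunk: int,
                       chunk_ms: int = MICRO_TURN_MS) -> float:
    """Minutes of streaming before the window is full, with no context management."""
    tokens_per_minute = chunks_per_minute(chunk_ms) * tokens_per_chunk
    return window_tokens / tokens_per_minute


if __name__ == "__main__":
    print(chunks_per_minute())  # 300 slots per minute at 200ms each
    # ASSUMPTION: 20 tokens per slot and a 120,000-token window are illustrative only.
    print(minutes_until_full(window_tokens=120_000, tokens_per_chunk=20))
```

Whatever the real per-chunk cost is, the slot count alone (300 per minute, 18,000 per hour) suggests why long sessions need some form of context pruning or summarisation.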

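Availability#

A limited research preview will open "in the coming months," with a broader release "later this year." Feedback is welcome at interaction@thinkingmachines.ai; research grants are open for applications.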

Related links#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

Cited by 9
  • Claude Opus 4.7

    GA frontier model from Anthropic; direct upgrade to 4.6 at same price; literal instruction following, 1.0–1.35× tokeniz…

  • Encoder-Free Early Fusion

    Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…

  • Full-Duplex Interaction

    Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…

  • Interaction / Background Model Split

    Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

  • Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Interactivity Benchmarks

    FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (vis…

  • Entities — People, Orgs, Tools & Projects

    Map of Content for all 32 entity pages. See Home for concept domains.

  • Thinking Machines Lab

    AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…

  • Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…
