Summary#
The evaluation surface Thinking Machines Lab uses to argue that TML-Interaction-Small is "the first model that has both strong intelligence/instruction following and interactivity." Existing benchmarks barely cover interactivity, so TML uses the few that exist plus several new internal ones — and notes that "no existing model can meaningfully perform" the visual-proactivity tasks. TML has also announced a research grant for interactivity benchmarks.
Existing benchmarks used#
- FD-bench (v1 / v1.5 / v3) — one of the few existing benchmarks intended to measure interactivity. The model is given prerecorded audio and must respond at certain times; scenarios: user interruption, user backchannel, talking to others, background speech. v1 reports turn-taking latency; v1.5 an average score; v3 response quality / Pass@1 with tools (a latency-scoring sketch follows this list).
- Audio MultiChallenge — common benchmark for intelligence + instruction following over audio (APR metric); baselines reported by Scale AI.
- BigBench Audio, IFEval (VoiceBench), IFEval (text), Harmbench (text refusal rate) — standard intelligence/IF/safety checks across modalities.
- QIVD (Qualcomm IVD) — video+audio QA, evaluated in a streaming setting (the raw clip is streamed from the beginning and the transcript graded; GPT-4o-mini grader, following Qwen 3.5 Omni).
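A minimal sketch of how an FD-bench-v1-style turn-taking-latency number could be computed, assuming each example carries an expected response point and the model's observed response start. The field names and scoring loop are illustrative, not the benchmark's actual schema or code:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Example:
    scenario: str                        # e.g. "interruption", "backchannel", "talking_to_others", "background_speech"
    expected_response_time_s: float      # when the model should start speaking (hypothetical field)
    model_response_time_s: float | None  # when it actually started; None = stayed silent

def turn_taking_latency(examples: list[Example]) -> float:
    """Mean delay between the expected response point and the model's first audio."""
    latencies = [
        ex.model_response_time_s - ex.expected_response_time_s
        for ex in examples
        if ex.model_response_time_s is not None
    ]
    return mean(latencies)

# e.g. turn_taking_latency([Example("interruption", 4.2, 4.6), Example("backchannel", 9.0, 9.5)]) -> ~0.45
```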
Headline results (TML-Interaction-Small, May 2026)#
- FD-bench v1 turn-taking latency: 0.40s — best responsiveness of any model compared (GPT-realtime-2.0 minimal 1.18s, GPT-realtime-1.5 0.59s, Gemini-3.1-flash-live minimal 0.57s).
- FD-bench v1.5 average: 77.8 vs ~39–54 for all baselines (including thinking-high models).
- FD-bench v3 (audio+tools): 82.8% response quality / 68.0% Pass@1 with background agent enabled — best on both.
- Audio MultiChallenge APR: 43.4% — beats every non-thinking baseline; only GPT-realtime-2.0 xhigh (48.5%) is higher.
- Claim: "dominates interaction quality while being more intelligent than any non-thinking model."
* in the table = result reported with the background agent enabled (used for benchmarks that need reasoning or tool calls).
New internal benchmarks (proactive audio)#
- TimeSpeak — can the model initiate speech at user-specified times with correct content? E.g. "remind me to breathe in and out every 4 seconds until I ask you to stop."
- CueSpeak — does the model speak at the appropriate moment with the semantically correct response? Entries constructed so the model must speak at the same time as the user. E.g. "every time I codeswitch, give me the correct word in the original language."
Both use one expected semantic response plus a timing window per example, graded by an LLM judge; an answer counts as correct only if both meaning and timing are right, and accuracy is macro-averaged (see the grading sketch below).
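A minimal sketch of that grading rule, with a `judge` callable standing in for the LLM judge. Only the correct-iff-meaning-and-timing rule and the macro-averaging come from the description above; the judge prompt, window widths, and category names are assumptions.

```python
from collections import defaultdict
from typing import Callable

def grade_example(
    spoken_at_s: float | None,          # when the model spoke; None = stayed silent
    response_text: str,
    window: tuple[float, float],        # (earliest_ok_s, latest_ok_s)
    expected_response: str,
    judge: Callable[[str, str], bool],  # LLM-judge stand-in: (expected, actual) -> semantically correct?
) -> bool:
    """Correct only if the model spoke inside the window AND the judge accepts the meaning."""
    if spoken_at_s is None:
        return False
    in_time = window[0] <= spoken_at_s <= window[1]
    return in_time and judge(expected_response, response_text)

def macro_accuracy(results: list[tuple[str, bool]]) -> float:
    """Macro-average: per-category accuracy first, then the mean over categories."""
    per_cat: dict[str, list[bool]] = defaultdict(list)
    for category, correct in results:
        per_cat[category].append(correct)
    return sum(sum(v) / len(v) for v in per_cat.values()) / len(per_cat)
```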
New internal benchmarks (visual proactivity)#
Adapted from existing video benchmarks; "no existing model can meaningfully perform any of these" — they stay silent or answer wrong, including thinking-high models:
- RepCount-A — repeated-action videos, adapted to online rep-counting; stream video after "count out reps for {action}"; graded by whether the last number said after the penultimate rep is within one rep of ground truth. Measures continuous visual tracking + timely counting.
- ProactiveVideoQA — videos with questions whose answers become available at specific moments; turn-weighted PAUC@ω=0.5 (0–100), averaged across turns/categories; staying silent scores 25.0; correct answers must come at the correct time, wrong answers penalized.
- Charades — temporal action localization; "say 'start' when the person starts {action}, say 'Stop' when they stop"; graded by temporal IoU between predicted and reference intervals (see the sketch after this list).
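Two of these grading rules are spelled out precisely enough to sketch: the RepCount-A within-one-rep check and the Charades temporal IoU. The function names and input shapes below are assumptions; only the rules themselves come from the descriptions above.

```python
def repcount_correct(last_count_said: int, ground_truth_reps: int) -> bool:
    """RepCount-A: the last number said after the penultimate rep must be within one rep of ground truth."""
    return abs(last_count_said - ground_truth_reps) <= 1

def temporal_iou(pred: tuple[float, float], ref: tuple[float, float]) -> float:
    """Charades: IoU between predicted and reference (start_s, stop_s) intervals."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. repcount_correct(11, 12) -> True (off by one rep)
#      temporal_iou((3.0, 9.0), (4.0, 10.0)) -> 5/7, about 0.71
```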
Why it matters#
The argument structure: prior interactivity-oriented benchmarks "do not adequately capture the qualitative jumps" — so the case for interaction models rests partly on benchmarks TML had to invent. This is a known pattern (new capability → new eval), and a soft spot (self-defined metrics); TML's response is to open a research grant inviting community benchmarks.
Connections#
- Interaction Models — what's being measured
- TML-Interaction-Small — the model under test
- Interaction / Background Model Split — *-marked results use the background agent
- Full-Duplex Interaction — TimeSpeak / CueSpeak / RepCount-A / ProactiveVideoQA / Charades each target one of these modes
- Time-Aligned Micro-Turns — turn-taking latency (0.40s) is the direct payoff of removing turn boundaries
- Scale-Dependent Prompt Sensitivity — another case of a paper introducing its own evaluation framing to surface a phenomenon standard benchmarks miss
- Claude Opus 4.7 — the xhigh effort tier shows up here as a baseline config (GPT-realtime-2.0 minimal/xhigh)