Howardism · vol. 03 · quiet corner of the web
PLATE II · PIECE № 05 · HOWARDISM

Interactivity Benchmarks

Published May 13, 2026 · Filed Concept · Reading 4 min · Source AI-synthesised

FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality


Summary#

This is the evaluation surface Thinking Machines Lab uses to argue that TML-Interaction-Small is "the first model that has both strong intelligence/instruction following and interactivity." Existing benchmarks barely cover interactivity, so TML uses the few that exist plus several new internal ones — and notes "no existing model can meaningfully perform" the visual-proactivity tasks. A research grant for interactivity benchmarks was announced.

Existing benchmarks used#

  • FD-bench (v1 / v1.5 / v3) — one of the few benchmarks intended to measure interactivity. Model is given prerecorded audio and must respond at certain times; scenarios: user interruption, user backchannel, talking to others, background speech. v1 reports turn-taking latency; v1.5 an average score; v3 response quality / Pass@1 with tools.
  • Audio MultiChallenge — common benchmark for intelligence + instruction following over audio (APR metric); baselines reported by Scale AI.
  • BigBench Audio, IFEval (VoiceBench), IFEval (text), Harmbench (text refusal rate) — standard intelligence/IF/safety checks across modalities.
  • QIVD (Qualcomm IVD) — video+audio QA, evaluated in a streaming setting (the raw clip is streamed from the beginning and the resulting transcript is graded; GPT-4o-mini grader, following Qwen 3.5 Omni).
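FD-bench's exact latency protocol isn't spelled out here, but the core quantity — turn-taking latency — can be sketched as the gap between the end of a user utterance and the onset of the model's next speech event. Everything below (`SpeechEvent`, `turn_taking_latency`) is an illustrative reconstruction, not FD-bench's actual harness:

```python
from dataclasses import dataclass

@dataclass
class SpeechEvent:
    start: float  # seconds from clip start
    end: float

def turn_taking_latency(user_turns: list[SpeechEvent],
                        model_turns: list[SpeechEvent]) -> float:
    """Mean gap between the end of each user turn and the onset of the
    model's next speech event. Model speech that begins before the user
    finishes (overlap/barge-in) is ignored for this metric."""
    gaps = []
    for u in user_turns:
        later = [m.start for m in model_turns if m.start >= u.end]
        if later:
            gaps.append(min(later) - u.end)
    return sum(gaps) / len(gaps) if gaps else float("inf")
```

Under this reading, the headline 0.40s figure is simply this mean gap averaged over the benchmark's prerecorded scenarios.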

Headline results (TML-Interaction-Small, May 2026)#

  • FD-bench v1 turn-taking latency: 0.40s — best responsiveness of any model compared (GPT-realtime-2.0 minimal 1.18s, GPT-realtime-1.5 0.59s, Gemini-3.1-flash-live minimal 0.57s).
  • FD-bench v1.5 average: 77.8 vs ~39–54 for all baselines (including thinking-high models).
  • FD-bench v3 (audio+tools): 82.8% response quality / 68.0% Pass@1 with background agent enabled — best on both.
  • Audio MultiChallenge APR: 43.4% — beats every non-thinking baseline; only GPT-realtime-2.0 xhigh (48.5%) is higher.
  • Claim: "dominates interaction quality while being more intelligent than any non-thinking model."

An asterisk (*) in the results table marks a result reported with the background agent enabled (used for benchmarks that require reasoning or tool calls).

New internal benchmarks (proactive audio)#

  • TimeSpeak — can the model initiate speech at user-specified times with correct content? E.g. "remind me to breathe in and out every 4 seconds until I ask you to stop."
  • CueSpeak — does the model speak at the appropriate moment with the semantically correct response? Entries constructed so the model must speak at the same time as the user. E.g. "every time I codeswitch, give me the correct word in the original language."

Both share a protocol: each example has one expected semantic response plus a timing window; an LLM judge marks an output correct only if both the meaning and the timing are right; scores are reported as macro-averaged accuracy.
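A minimal sketch of that scoring rule. All names here (`Example`, `ModelOutput`, `score`) are hypothetical, and the LLM judge is stubbed as a plain callable rather than any real grading API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    expected: str                 # the one expected semantic response
    window: tuple[float, float]   # (earliest, latest) acceptable onset, seconds
    category: str                 # grouping key for macro-averaging

@dataclass
class ModelOutput:
    text: str
    onset: float                  # when the model started speaking

def score(examples: list[Example],
          outputs: list[ModelOutput],
          judge: Callable[[str, str], bool]) -> float:
    """An output counts only if the judge accepts the meaning AND the
    onset falls inside the timing window; per-category accuracies are
    then averaged with equal weight (macro-averaged accuracy)."""
    per_cat: dict[str, list[bool]] = {}
    for ex, out in zip(examples, outputs):
        ok = judge(ex.expected, out.text) and ex.window[0] <= out.onset <= ex.window[1]
        per_cat.setdefault(ex.category, []).append(ok)
    accs = [sum(v) / len(v) for v in per_cat.values()]
    return sum(accs) / len(accs)
```

The timing window is what separates these benchmarks from ordinary audio QA: a semantically perfect answer delivered outside the window scores zero.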

New internal benchmarks (visual proactivity)#

Adapted from existing video benchmarks; TML reports that "no existing model can meaningfully perform any of these" — baselines, including thinking-high models, either stay silent or answer incorrectly:

  • RepCount-A — repeated-action videos, adapted to online rep-counting; stream video after "count out reps for {action}"; graded by whether the last number said after the penultimate rep is within one rep of ground truth. Measures continuous visual tracking + timely counting.
  • ProactiveVideoQA — videos with questions whose answers become available at specific moments; turn-weighted PAUC@ω=0.5 (0–100), averaged across turns/categories; staying silent scores 25.0; correct answers must come at the correct time, wrong answers penalized.
  • Charades — temporal action localization; "say 'start' when the person starts {action}, say 'Stop' when they stop"; graded by temporal IoU between predicted and reference intervals.
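The RepCount-A and Charades grading rules above reduce to small scoring functions. These are illustrative reconstructions from the descriptions, not TML's actual graders:

```python
def temporal_iou(pred: tuple[float, float], ref: tuple[float, float]) -> float:
    """Charades-style grading: intersection-over-union of the predicted
    and reference [start, stop] intervals, in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / union if union > 0 else 0.0

def repcount_correct(last_count: int, true_reps: int) -> bool:
    """RepCount-A-style check: the last number the model said after the
    penultimate rep must be within one rep of the ground truth."""
    return abs(last_count - true_reps) <= 1
```

Note that both graders key on *when* the model speaks, not just what it says — a model that waits for the clip to end and then answers perfectly still fails.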

Why it matters#

The argument structure: prior interactivity-oriented benchmarks "do not adequately capture the qualitative jumps" — so the case for interaction models rests partly on benchmarks TML had to invent. This is a known pattern (new capability → new eval), and a soft spot (self-defined metrics); TML's response is to open a research grant inviting community benchmarks.

Connections#

Sources#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

6 articles link here
  • Concept · Full-Duplex Interaction

    Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…

  • Concept · Interaction / Background Model Split

    Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Concept · Scale-Dependent Prompt Sensitivity

    Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…

  • Entity · Thinking Machines Lab

    AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…

  • Entity · TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

Related articles
  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Entity · TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

  • Concept · Agent Harness Engineering

    Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…

  • Concept · Encoder-Free Early Fusion

    Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…

  • Concept · Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…