Howardism · vol. 03 · quiet corner of the web
PLATE II · PIECE № 05 · HOWARDISM

Interactivity Benchmarks

Published May 13, 2026 · Filed Concept · Reading 4 min · Source AI-synthesised

FD-bench, Audio MultiChallenge + new TimeSpeak/CueSpeak (proactive audio) and RepCount-A/ProactiveVideoQA/Charades (visual proactivity); TML-Interaction-Small: 0.40s turn-taking latency, dominates interaction quality


Summary#

This is the evaluation surface Thinking Machines Lab uses to argue that TML-Interaction-Small is "the first model that has both strong intelligence/instruction following and interactivity." Existing benchmarks barely cover interactivity, so TML uses the few that exist plus several new internal ones — and notes "no existing model can meaningfully perform" the visual-proactivity tasks. A research grant for interactivity benchmarks was announced.

Existing benchmarks used#

  • FD-bench (v1 / v1.5 / v3) — one of the few benchmarks intended to measure interactivity. Model is given prerecorded audio and must respond at certain times; scenarios: user interruption, user backchannel, talking to others, background speech. v1 reports turn-taking latency; v1.5 an average score; v3 response quality / Pass@1 with tools.
  • Audio MultiChallenge — common benchmark for intelligence + instruction following over audio (APR metric); baselines reported by Scale AI.
  • BigBench Audio, IFEval (VoiceBench), IFEval (text), Harmbench (text refusal rate) — standard intelligence/IF/safety checks across modalities.
  • QIVD (Qualcomm IVD) — video+audio QA, evaluated in a streaming setting (the raw clip is streamed from the beginning and the resulting transcript is graded; GPT-4o-mini grader, following Qwen 3.5 Omni).
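FD-bench's exact latency protocol isn't spelled out here, but the core quantity — turn-taking latency — can be sketched as the gap between the end of a user utterance and the onset of the model's next speech event. Everything below (`SpeechEvent`, `turn_taking_latency`) is an illustrative reconstruction, not FD-bench's actual harness:

```python
from dataclasses import dataclass

@dataclass
class SpeechEvent:
    start: float  # seconds from clip start
    end: float

def turn_taking_latency(user_turns: list[SpeechEvent],
                        model_turns: list[SpeechEvent]) -> float:
    """Mean gap between the end of each user turn and the onset of the
    model's next speech event. Model speech that begins before the user
    finishes (overlap/barge-in) is ignored for this metric."""
    gaps = []
    for u in user_turns:
        later = [m.start for m in model_turns if m.start >= u.end]
        if later:
            gaps.append(min(later) - u.end)
    return sum(gaps) / len(gaps) if gaps else float("inf")
```

Under this reading, the headline 0.40s figure is simply this mean gap averaged over the benchmark's prerecorded scenarios.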

Headline results (TML-Interaction-Small, May 2026)#

  • FD-bench v1 turn-taking latency: 0.40s — best responsiveness of any model compared (GPT-realtime-2.0 minimal 1.18s, GPT-realtime-1.5 0.59s, Gemini-3.1-flash-live minimal 0.57s).
  • FD-bench v1.5 average: 77.8 vs ~39–54 for all baselines (including thinking-high models).
  • FD-bench v3 (audio+tools): 82.8% response quality / 68.0% Pass@1 with background agent enabled — best on both.
  • Audio MultiChallenge APR: 43.4% — beats every non-thinking baseline; only GPT-realtime-2.0 xhigh (48.5%) is higher.
  • Claim: "dominates interaction quality while being more intelligent than any non-thinking model."

An asterisk (*) in the results table marks a result reported with the background agent enabled (used for benchmarks that require reasoning or tool calls).

New internal benchmarks (proactive audio)#

  • TimeSpeak — can the model initiate speech at user-specified times with correct content? E.g. "remind me to breathe in and out every 4 seconds until I ask you to stop."
  • CueSpeak — does the model speak at the appropriate moment with the semantically correct response? Entries constructed so the model must speak at the same time as the user. E.g. "every time I codeswitch, give me the correct word in the original language."

Both share a protocol: each example has one expected semantic response plus a timing window; an LLM judge marks an output correct only if both the meaning and the timing are right; scores are reported as macro-averaged accuracy.
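A minimal sketch of that scoring rule. All names here (`Example`, `ModelOutput`, `score`) are hypothetical, and the LLM judge is stubbed as a plain callable rather than any real grading API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    expected: str                 # the one expected semantic response
    window: tuple[float, float]   # (earliest, latest) acceptable onset, seconds
    category: str                 # grouping key for macro-averaging

@dataclass
class ModelOutput:
    text: str
    onset: float                  # when the model started speaking

def score(examples: list[Example],
          outputs: list[ModelOutput],
          judge: Callable[[str, str], bool]) -> float:
    """An output counts only if the judge accepts the meaning AND the
    onset falls inside the timing window; per-category accuracies are
    then averaged with equal weight (macro-averaged accuracy)."""
    per_cat: dict[str, list[bool]] = {}
    for ex, out in zip(examples, outputs):
        ok = judge(ex.expected, out.text) and ex.window[0] <= out.onset <= ex.window[1]
        per_cat.setdefault(ex.category, []).append(ok)
    accs = [sum(v) / len(v) for v in per_cat.values()]
    return sum(accs) / len(accs)
```

The timing window is what separates these benchmarks from ordinary audio QA: a semantically perfect answer delivered outside the window scores zero.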

New internal benchmarks (visual proactivity)#

Adapted from existing video benchmarks; TML reports that "no existing model can meaningfully perform any of these" — baselines, including thinking-high models, either stay silent or answer incorrectly:

  • RepCount-A — repeated-action videos, adapted to online rep-counting; stream video after "count out reps for {action}"; graded by whether the last number said after the penultimate rep is within one rep of ground truth. Measures continuous visual tracking + timely counting.
  • ProactiveVideoQA — videos with questions whose answers become available at specific moments; turn-weighted PAUC@ω=0.5 (0–100), averaged across turns/categories; staying silent scores 25.0; correct answers must come at the correct time, wrong answers penalized.
  • Charades — temporal action localization; "say 'start' when the person starts {action}, say 'Stop' when they stop"; graded by temporal IoU between predicted and reference intervals.
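The RepCount-A and Charades grading rules above reduce to small scoring functions. These are illustrative reconstructions from the descriptions, not TML's actual graders:

```python
def temporal_iou(pred: tuple[float, float], ref: tuple[float, float]) -> float:
    """Charades-style grading: intersection-over-union of the predicted
    and reference [start, stop] intervals, in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / union if union > 0 else 0.0

def repcount_correct(last_count: int, true_reps: int) -> bool:
    """RepCount-A-style check: the last number the model said after the
    penultimate rep must be within one rep of the ground truth."""
    return abs(last_count - true_reps) <= 1
```

Note that both graders key on *when* the model speaks, not just what it says — a model that waits for the clip to end and then answers perfectly still fails.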

Why it matters#

The argument structure: prior interactivity-oriented benchmarks "do not adequately capture the qualitative jumps" — so the case for interaction models rests partly on benchmarks TML had to invent. This is a known pattern (new capability → new eval), and a soft spot (self-defined metrics); TML's response is to open a research grant inviting community benchmarks.

Connections#

Sources#

§ end
About this piece

Articles in this journal are synthesised by AI agents from a curated wiki and are refreshed automatically as new concepts arrive. Topics, framing, and editorial direction are curated by Howardism.

6 articles link here
  • Concept · Full-Duplex Interaction

    Perceive-and-respond simultaneously across modalities; proactive interjection, visual-cue reactions, simultaneous speec…

  • Concept · Interaction / Background Model Split

    Dual-model architecture: time-aware interaction model stays present; async background model handles deep reasoning/tool…

  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Concept · Scale-Dependent Prompt Sensitivity

    Large models underperform small ones on 7.7% of standard benchmarks due to overthinking; brevity constraints recover 26…

  • Entity · Thinking Machines Lab

    AI research lab behind interaction models (May 2026); harness-dissolves-into-model thesis; upstreamed streaming-session…

  • Entity · TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

Related articles
  • Concept · Interaction Models

    Thinking Machines Lab (May 2026): models that handle audio/video/text interaction natively in real time instead of via…

  • Entity · TML-Interaction-Small

    TML's first interaction model: 276B MoE / 12B active, audio+video+text in / text+audio out, 200ms micro-turns, async ba…

  • Concept · Agent Harness Engineering

    Patterns for scaffolding long-running LLM agents: environment design, progressive context disclosure, mechanical archit…

  • Concept · Encoder-Free Early Fusion

    Multimodal design with minimal pre-processing instead of large standalone encoders: dMel audio embedding, 40×40-patch h…

  • Concept · Time-Aligned Micro-Turns

    The core interaction-model move: input/output as continuous streams in ~200ms interleaved chunks, no turn boundaries; s…