LG, AI · May 9

ProactBench: Beyond What The User Asked For

Sepehr Harfi, Ahmad Salimi, Dongming Shen, Alex Smola

arXiv:2605.09228 · 91.1

AI Analysis

For LLM developers and evaluators, this benchmark fills a gap in measuring proactive conversational abilities beyond explicit requests, with Recovery being a novel and challenging dimension.

ProactBench introduces a benchmark to measure conversational proactivity—the ability of LLMs to notice and act on implied user needs—across three phase-tied types. Across 16 models, the Recovery type is found to be both difficult and poorly predicted by existing benchmarks, offering a new evaluation signal.

Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, Recovery is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.
