CL, HC | Jun 5

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

arXiv:2606.06788 | 31.9

Originality: Incremental advance

AI Analysis

For researchers and developers designing interactive LLM interfaces, this work highlights the need for interface-specific evaluation criteria, but the findings are incremental.

The authors propose an evaluation framework for LLMs' ability to generate multiple responses to a single query with varying language complexity, inspired by direct manipulation interfaces. Across the four models tested, the best (Claude Sonnet 4.5) shifts reliable complexity measures in the correct direction only 46% of the time, indicating inconsistent performance.

Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, for example through live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$ participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for $98$ scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction $46\%$ of the time. Our findings hold with increased sample size and alternative complexity levels.
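
To make the "correct direction" idea concrete, here is a minimal Python sketch of one way such a rate could be computed. It is not the authors' implementation: the abstract does not say which complexity measures were used or how shifts were aggregated. The sketch assumes that responses are ordered from the simplest requested level to the most complex, that each adjacent pair is compared, and that mean word length stands in as a placeholder complexity measure. The names `mean_word_length` and `correct_direction_rate` are hypothetical.

```python
from typing import Callable, Sequence


def mean_word_length(text: str) -> float:
    """Placeholder complexity proxy (an assumption, not the paper's measure)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def correct_direction_rate(
    responses: Sequence[str],
    measure: Callable[[str], float] = mean_word_length,
) -> float:
    """Fraction of adjacent response pairs whose measured complexity increases.

    `responses` must be ordered from the simplest requested level to the
    most complex, so an increase counts as a shift in the correct direction.
    """
    scores = [measure(r) for r in responses]
    pairs = list(zip(scores, scores[1:]))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    correct = sum(1 for lower, higher in pairs if higher > lower)
    return correct / len(pairs)


# Five toy "responses" for one query, ordered by requested complexity level.
toy_responses = ["a b c", "aa bb cc", "a b c", "aaaa bbbb", "aaaaa bbbbb"]
rate = correct_direction_rate(toy_responses)
print(f"correct-direction rate: {rate:.2f}")
```

In this toy input the third response is simpler than the second, so three of the four adjacent pairs move in the intended direction. Any other scalar complexity measure can be passed as `measure` without changing the rest of the function.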
