RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

Imad Aouali, Flavian Vasile, Otmane Sakhi, Alexandre Gilotte, Benjamin Heymann

arXiv:2605.18805


AI Analysis

Provides a more meaningful evaluation framework for shopping recommendation agents, addressing the gap between plausible-sounding and actually useful recommendations.

LLM recommendation agents are often evaluated by semantic plausibility rather than actual utility. RecoAtlas introduces a benchmark with behavior-grounded metrics (relevance, complementarity, diversity) and controlled tool environments, showing that semantic plausibility does not capture behavior-grounded utility and that performance scales with model capacity and tool alignment.

LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, degrades under noisy or misaligned signals, and reveals that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.
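
To make the distinction between per-item scoring and set-level utility concrete, here is a minimal sketch. It is not RecoAtlas's method: the abstract does not say how its learned proxies are computed, so the functions `set_relevance` and `set_diversity`, the cosine-based scoring, and the toy two-dimensional vectors are all hypothetical stand-ins. The sketch only assumes that users and items can be represented as vectors (for example, embeddings learned from interaction data).

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def set_relevance(user_vec, item_vecs):
    """Hypothetical proxy: mean user-item cosine similarity over the recommended set."""
    return sum(cosine(user_vec, v) for v in item_vecs) / len(item_vecs)


def set_diversity(item_vecs):
    """Hypothetical proxy: mean pairwise cosine distance; 0.0 for sets with fewer than two items."""
    pairs = list(combinations(item_vecs, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors, assumed for illustration only.
user = [1.0, 0.0]
redundant_set = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # three near-identical items
mixed_set = [[1.0, 0.0], [0.0, 1.0]]                  # two unrelated items

print(set_relevance(user, redundant_set), set_diversity(redundant_set))
print(set_relevance(user, mixed_set), set_diversity(mixed_set))
```

In this toy setup the redundant set maximizes per-item relevance but has no diversity, which is the kind of gap that a set-level evaluation is meant to expose and that scoring each item in isolation would miss.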
