IR · LG · Jan 28

Less is More: Benchmarking LLM Based Recommendation Agents

arXiv:2601.20316v1
Originality: Incremental advance
AI Analysis

The paper gives practitioners actionable guidelines for reducing inference costs by approximately 88% without sacrificing quality, challenging the existing 'more context is better' paradigm in LLM-based recommendation systems.

The study challenges the assumption that longer user purchase histories improve LLM-based recommendations. It finds no significant quality improvement as context length increases from 5 to 50 items across four state-of-the-art LLMs, with quality scores remaining flat at 0.17-0.23.
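One way to check a "no significant improvement" claim is a paired comparison, where each user is scored under both a short and a long history. The sketch below uses a sign-flip permutation test on per-user score differences. This is an assumption for illustration: the summary does not say which statistical test the paper applies, the function name `paired_permutation_test` is hypothetical, and the scores are synthetic placeholders, not results from the paper.

```python
import random


def _mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def paired_permutation_test(short_scores, long_scores, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-user score differences.

    Returns (mean difference long - short, estimated p-value).
    """
    if len(short_scores) != len(long_scores):
        raise ValueError("each user needs one score per condition")
    diffs = [l - s for s, l in zip(short_scores, long_scores)]
    observed = abs(_mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(_mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return _mean(diffs), (hits + 1) / (n_resamples + 1)


# Synthetic placeholder scores for 50 hypothetical users (NOT data from the paper).
data_rng = random.Random(42)
short_ctx = [data_rng.uniform(0.10, 0.30) for _ in range(50)]
long_ctx = [s + data_rng.gauss(0.0, 0.05) for s in short_ctx]

mean_diff, p_value = paired_permutation_test(short_ctx, long_ctx)
print(f"mean difference: {mean_diff:.4f}, p-value: {p_value:.4f}")
```

A large p-value here would only mean the test found no evidence of a difference. With a small sample it does not by itself show that the two conditions are equivalent.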

Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a systematic benchmark of four state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Qwen2.5-72B, and Gemini 2.5 Flash) across context lengths ranging from 5 to 50 items using the REGEN dataset. Surprisingly, our experiments with 50 users in a within-subject design reveal no significant quality improvement with increased context length. Quality scores remain flat across all conditions (0.17-0.23). Our findings have significant practical implications: practitioners can reduce inference costs by approximately 88% by using short contexts (5-10 items) instead of longer histories (50 items), without sacrificing recommendation quality. We also analyze latency patterns across providers and find model-specific behaviors that inform deployment decisions. This work challenges the existing "more context is better" paradigm and provides actionable guidelines for cost-effective LLM-based recommendation systems.
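The cost argument can be illustrated with a rough input-token model. This is a minimal sketch, assuming the prompt costs a fixed instruction overhead plus a constant number of tokens per history item. The function names and the default values for `tokens_per_item` and `overhead_tokens` are hypothetical and are not taken from the paper, whose own token accounting is not given in the abstract.

```python
def prompt_tokens(n_items, tokens_per_item, overhead_tokens):
    """Estimated input tokens for a prompt containing n_items history entries."""
    return overhead_tokens + n_items * tokens_per_item


def relative_savings(short_items, long_items, tokens_per_item=40, overhead_tokens=60):
    """Fraction of input tokens saved by using short_items instead of long_items.

    tokens_per_item and overhead_tokens are illustrative assumptions.
    """
    short = prompt_tokens(short_items, tokens_per_item, overhead_tokens)
    long_ = prompt_tokens(long_items, tokens_per_item, overhead_tokens)
    return 1.0 - short / long_


for n in (5, 10):
    saved = relative_savings(n, 50)
    print(f"{n} items instead of 50: about {saved:.0%} fewer input tokens")
```

Under this model, with zero overhead the savings are at most 90% for 5 items and 80% for 10 items relative to 50, and any fixed overhead lowers them. The abstract's figure of approximately 88% sits in that range, but the exact value depends on how the paper counts tokens and prices requests.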
