The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A
For researchers and developers of personalized AI systems, this work reveals a structural flaw in current LLM evaluation metrics that penalize meaningful personalized responses, providing a methodological foundation for more robust assessment.
The study found that personalization in an agentic RAG system for student advising improves reasoning and grounding but causes an apparent drop in semantic similarity due to metric limitations, with the fully personalized configuration yielding the highest overall gains.
We used AIVisor, an agentic retrieval-augmented LLM for student advising, to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity. This drop was driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. The finding reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.
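The penalty that reference-based metrics impose on personalized answers can be illustrated with a small sketch. The code below is not the study's evaluation pipeline: it is a simplified, whitespace-tokenized ROUGE-L F1 written from scratch, and the reference and candidate answers are hypothetical strings invented for illustration. It uses a lexical metric because that is easy to compute without external models; the abstract reports the significant negative interaction on the semantic metrics, which compare answers to the same kind of generic reference text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 over the LCS of lowercased whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical advising answers (assumptions, not data from the study).
reference = ("students must complete the prerequisite course "
             "before enrolling in the capstone")
generic = ("students must complete the prerequisite course "
           "before enrolling in the capstone project")
personalized = ("since you already passed the prerequisite course last spring "
                "you can enroll in the capstone this fall")

generic_score = rouge_l_f1(reference, generic)
personalized_score = rouge_l_f1(reference, personalized)

print(f"generic:      {generic_score:.3f}")
print(f"personalized: {personalized_score:.3f}")
```

In this sketch the personalized answer is arguably more useful to the student, yet it scores lower than the generic one because it departs from the wording of the generic reference, which is the kind of metric-dependent shift the abstract describes.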