AI · Apr 27

Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Ashmi Banerjee, Adithi Satish, Wolfgang Wörndl, Yashar Deldjoo

arXiv:2604.24158 · 76.9 · Has Code

Predicted impact: top 40% in AI · last 90 days
Originality: Synthesis-oriented

AI Analysis

For researchers and practitioners in conversational recommendation systems, this work addresses the need for transparent and bias-aware evaluation of LLM-based judges.

The paper tackles the challenge of evaluating conversational travel recommendations across multiple stakeholder-centric dimensions. It proposes a three-phase calibration framework for LLM-as-a-Judge and finds that calibration clarifies the judges' reasoning on each dimension but also reveals that they interpret sustainability differently.

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.
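The observation that judges can agree on overall rankings while diverging on individual dimensions can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the judge names, the 1-5 score scale, the score values, the unweighted mean used to form an overall ranking, and the helper functions `overall_ranking` and `dimension_variance` are hypothetical and are not taken from the paper or its repository. Only the four dimension names come from the abstract.

```python
from statistics import mean, pvariance

# The four dimensions named in the abstract.
DIMENSIONS = ("relevance", "diversity", "sustainability", "popularity_balance")

# Hypothetical 1-5 scores: judge -> recommendation list -> one score per dimension.
scores = {
    "judge_a": {"list_1": (5, 4, 2, 3), "list_2": (3, 3, 4, 2), "list_3": (2, 2, 1, 2)},
    "judge_b": {"list_1": (4, 3, 5, 4), "list_2": (3, 2, 2, 3), "list_3": (1, 2, 3, 1)},
}


def overall_ranking(judge_scores):
    """Rank list ids by their unweighted mean score across dimensions, best first."""
    return sorted(judge_scores, key=lambda lid: mean(judge_scores[lid]), reverse=True)


def dimension_variance(all_scores, dim_index):
    """Average, over lists, of the variance between judges on one dimension."""
    list_ids = list(next(iter(all_scores.values())))
    per_list = [
        pvariance([all_scores[judge][lid][dim_index] for judge in all_scores])
        for lid in list_ids
    ]
    return mean(per_list)


# Overall rankings per judge, and between-judge variance per dimension.
rankings = {judge: overall_ranking(s) for judge, s in scores.items()}
variances = {dim: dimension_variance(scores, i) for i, dim in enumerate(DIMENSIONS)}

print(rankings)
print(variances)
```

With these made-up scores both judges produce the same overall ordering of the three lists, yet their disagreement is concentrated in the sustainability dimension. A per-dimension breakdown like this is one way to surface divergence that an aggregate ranking comparison would hide.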
