CL, AI | Aug 15, 2025

LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Ruiyan Qi, Congding Wen, Weibo Zhou, Jiwei Li, Shangsong Liang, Lingbo Li

arXiv:2508.11280v2 | h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses the problem of costly and hallucination-prone LLM evaluation in tourism for researchers and practitioners, offering a scalable, label-free alternative to annotated benchmarks.

The paper tackles the challenge of evaluating large language models (LLMs) in tourism without labeled data by proposing LETToT, a framework built on expert-derived reasoning structures. It reports relative quality gains of 4.99-14.15% over baselines and offers insights into model scaling and reasoning architectures.

Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLMs on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, rather than labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT, with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) for sub-72B models, explicit reasoning architectures outperform their counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
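To make the two quantities in the abstract concrete, here is a minimal Python sketch of a hierarchical score tree and of a relative gain over a baseline. It is an illustration under stated assumptions, not the paper's implementation: the `ToTNode` class, the weighted-average aggregation, the weights, and all scores are hypothetical; only the dimension names "accuracy" and "conciseness" and the notion of a relative quality gain come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToTNode:
    """One node of a hypothetical expert tree (not the paper's data structure)."""
    name: str
    weight: float = 1.0            # assumed relative importance among siblings
    score: Optional[float] = None  # leaf score in [0, 1]; None for inner nodes
    children: List["ToTNode"] = field(default_factory=list)


def aggregate(node: ToTNode) -> float:
    """Weighted average of child scores, computed recursively.

    Weighted averaging is an assumption made for this sketch; the abstract
    does not say how LETToT combines its hierarchical components.
    """
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    weighted_sum = sum(child.weight * aggregate(child) for child in node.children)
    return weighted_sum / total_weight


def relative_gain(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores for a single model answer.
tree = ToTNode("quality", children=[
    ToTNode("accuracy", weight=2.0, score=0.8),
    ToTNode("conciseness", weight=1.0, score=0.5),
])
overall = aggregate(tree)
```

Under this definition, a reported gain of 4.99-14.15% means the optimized expert ToT scores roughly 1.05 to 1.14 times the baseline's quality score.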
