AI | Oct 14, 2025

PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

Yunuo Liu, Dawei Zhu, Zena Al-Khalili, Dai Cheng, Yanjun Chen, Dietrich Klakow, Wei Zhang, Xiaoyu Shen

arXiv:2510.12409v1 | 5.81 citations | h-index: 10 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable AI in tourism pricing to prevent financial losses and maintain customer trust, though it is incremental as it focuses on benchmarking rather than solving the problem.

The authors tackled the problem of evaluating whether Large Language Models (LLMs) can reliably automate tourism pricing tasks with overlapping fare rules, and found that LLMs show a steep performance drop on harder bundled-tour calculations, exposing systematic failures in rule interpretation and arithmetic reasoning.

We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related pricing when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a range of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.
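To make the two difficulty levels concrete, the following is a minimal sketch of what a tier (i) and a tier (ii) calculation might look like. Every fare, discount, threshold, and function name here is a hypothetical assumption chosen for illustration; none is taken from the PricingLogic dataset or its 42 source policies.

```python
# All prices are integer cents to avoid floating-point rounding.
# Hypothetical per-person fares by customer type (assumed, not from the dataset).
BASE_PRICE = {"adult": 10000, "child": 5000, "senior": 8000}

# Hypothetical bundle rule: booking two or more attractions gives 10% off.
BUNDLE_DISCOUNT_PERCENT = 10
# Hypothetical group rule: parties of 10 or more get 500 cents off per person.
GROUP_THRESHOLD = 10
GROUP_REBATE = 500


def basic_price(party: dict[str, int]) -> int:
    """Tier (i): sum the fare for each customer type in the party."""
    return sum(BASE_PRICE[kind] * count for kind, count in party.items())


def bundled_price(party: dict[str, int], attractions: int) -> int:
    """Tier (ii): apply the bundle discount first, then the group rebate."""
    total = basic_price(party) * attractions
    if attractions >= 2:
        total = total * (100 - BUNDLE_DISCOUNT_PERCENT) // 100
    headcount = sum(party.values())
    if headcount >= GROUP_THRESHOLD:
        total -= GROUP_REBATE * headcount
    return total


family = {"adult": 2, "child": 1}
group = {"adult": 8, "child": 2}
print(basic_price(family))       # 25000
print(bundled_price(family, 2))  # 45000
print(bundled_price(group, 2))   # 157000
```

In this sketch the order of the two rules matters: applying the group rebate before the percentage discount would give a different total. That kind of rule interaction is what the harder tier is described as testing.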
