CL · Sep 17, 2025

Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation

arXiv:2509.14477v1 · 3 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

Ticket-Bench addresses biased agent evaluation by giving developers and researchers a more realistic, multilingual benchmark. The advance is incremental because it builds on existing evaluation frameworks.

The authors address the lack of cultural and linguistic diversity in evaluations of task-oriented LLM agents by introducing Ticket-Bench, a multilingual benchmark for soccer ticket purchases across six languages. They find that reasoning-oriented models such as GPT-5 and Qwen3-235B perform best but still show notable cross-lingual disparities.

Large language models (LLMs) are increasingly deployed as task-oriented agents, where success depends on their ability to generate accurate function calls under realistic, multilingual conditions. However, existing agent evaluations largely overlook cultural and linguistic diversity, often relying on monolingual or naively translated benchmarks. We introduce Ticket-Bench, a benchmark for multilingual agent evaluation in task-oriented scenarios. Ticket-Bench simulates the domain of soccer ticket purchases across six major languages (Portuguese, English, Spanish, German, Italian, and French), using localized teams, cities, and user profiles to provide a higher level of realism. We evaluate a wide range of commercial and open-source LLMs, measuring function-calling accuracy and consistency across languages. Results show that reasoning-oriented models (e.g., GPT-5, Qwen3-235B) dominate performance but still exhibit notable cross-lingual disparities. These findings underscore the need for culturally aware, multilingual benchmarks to guide the development of robust LLM agents.
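To make "function-calling accuracy and consistency across languages" concrete, here is a minimal Python sketch of one way such a comparison could be computed. It is an illustration under stated assumptions, not the paper's actual metric or code: the exact-match scoring rule, the max-minus-min gap, the function names `per_language_accuracy` and `cross_lingual_gap`, the `buy_ticket` call schema, and the toy records are all hypothetical.

```python
from collections import defaultdict


def per_language_accuracy(results):
    """Return exact-match accuracy per language.

    `results` is an iterable of (language, predicted_call, expected_call)
    tuples. A prediction counts as correct only if it equals the expected
    call exactly (an assumed scoring rule, chosen for simplicity).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, expected in results:
        total[language] += 1
        if predicted == expected:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


def cross_lingual_gap(accuracy):
    """Spread between the best and worst language (an assumed disparity measure)."""
    return max(accuracy.values()) - min(accuracy.values())


# Toy, made-up records for illustration only; these are not benchmark results.
toy_results = [
    ("en",
     {"name": "buy_ticket", "arguments": {"team": "Arsenal", "quantity": 2}},
     {"name": "buy_ticket", "arguments": {"team": "Arsenal", "quantity": 2}}),
    ("pt",
     {"name": "buy_ticket", "arguments": {"team": "Flamengo", "quantity": 1}},
     {"name": "buy_ticket", "arguments": {"team": "Flamengo", "quantity": 2}}),
]

accuracy = per_language_accuracy(toy_results)
gap = cross_lingual_gap(accuracy)
```

A gap near zero would indicate that a model behaves consistently across languages, while a large gap would correspond to the kind of cross-lingual disparity the abstract describes.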
