AI CL HCAug 15, 2025

Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps

Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li

arXiv:2508.11452v2h-index: 8

Originality Incremental advance

AI Analysis

This provides a more practical evaluation platform for developers and users of large foundation models, though it is incremental as it builds on existing ranking methods like Bradley-Terry.

The paper tackles the problem that existing benchmarks for large foundation models often fail to reflect real-world performance, and presents Inclusion Arena, a live leaderboard that ranks models based on human feedback from AI applications, demonstrating reliable rankings with higher data transitivity and reduced manipulation risk.

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://www.tbox.cn/about/model-ranking.

View on arXiv PDF

Similar