AI · Dec 24, 2024

GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

arXiv:2412.18426v1 · 13 citations · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses inconsistent evaluation in GUI testing research, a problem for AI developers, though the advance is incremental because it builds on existing GUI automation work.

The paper tackles the lack of a standardized benchmark for evaluating autonomous GUI testing agents. It proposes GTArena, a unified environment that assesses models on three subtasks using real, injected-defect, and synthetic data, and finds that even state-of-the-art models struggle to perform well across all of them.

Research on GUI agents is currently a hot topic in the AI community. However, current research focuses on GUI task automation, which limits the scope of applications across GUI scenarios. In this paper, we propose a formalized and comprehensive environment to evaluate the entire process of automated GUI testing (GTArena), offering a fair, standardized environment for consistent operation of diverse multimodal large language models. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. The benchmark evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data, thoroughly assessing their capabilities in this task. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal large language models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all subtasks of automated GUI testing, highlighting a significant gap between the current capabilities of autonomous GUI testing and its practical, real-world applicability. This gap provides guidance for the future direction of GUI agent development. Our code is available at https://github.com/ZJU-ACES-ISE/ChatUITest.
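The evaluation layout described in the abstract (three subtasks crossed with three data types) can be pictured as a small scoring grid. The following is only a minimal sketch under stated assumptions, not the GTArena implementation: the names `Subtask`, `DataType`, `Sample`, and `evaluate` are hypothetical, the model is any callable that returns a string, and exact-match accuracy is a placeholder for whatever metrics the benchmark actually uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Subtask(Enum):
    # The three subtasks named in the abstract.
    INTENTION_GENERATION = "test intention generation"
    TASK_EXECUTION = "test task execution"
    DEFECT_DETECTION = "GUI defect detection"


class DataType(Enum):
    # The three data types named in the abstract.
    REAL_APP = "real mobile applications"
    INJECTED_DEFECT = "mobile applications with artificially injected defects"
    SYNTHETIC = "synthetic data"


@dataclass(frozen=True)
class Sample:
    subtask: Subtask
    data_type: DataType
    prompt: str    # hypothetical: text stand-in for a screenshot or task description
    expected: str  # hypothetical: reference answer for this sample


Model = Callable[[Subtask, str], str]
Cell = Tuple[Subtask, DataType]


def evaluate(model: Model, samples: List[Sample]) -> Dict[Cell, float]:
    """Return exact-match accuracy for every (subtask, data type) cell seen."""
    totals: Dict[Cell, int] = {}
    hits: Dict[Cell, int] = {}
    for sample in samples:
        cell = (sample.subtask, sample.data_type)
        totals[cell] = totals.get(cell, 0) + 1
        # Placeholder metric (an assumption): a prediction counts only if it
        # matches the reference string exactly.
        if model(sample.subtask, sample.prompt) == sample.expected:
            hits[cell] = hits.get(cell, 0) + 1
    return {cell: hits.get(cell, 0) / count for cell, count in totals.items()}
```

Reporting one score per cell, rather than a single aggregate, makes it visible when a model does well on one subtask or data type and poorly on another, which is the kind of uneven performance the abstract reports.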

