SecAgent: Efficient Mobile GUI Agent with Semantic Context
SecAgent addresses data and efficiency bottlenecks for mobile automation in non-English ecosystems, representing an incremental improvement.
The authors tackle the scarcity of multilingual datasets and inefficient history representation in mobile GUI agents by constructing a Chinese dataset with 18k grounding samples and 121k navigation steps and by proposing a semantic context mechanism. The resulting 3B-scale SecAgent outperforms similar-scale baselines and performs comparably to 7B-8B models on navigation benchmarks.
Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on both our benchmark and public navigation benchmarks. Our dataset is available at https://huggingface.co/datasets/alibabagroup/CMGUI.
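To make the semantic context idea concrete, the following is a minimal sketch of how a text-only history might replace a stack of past screenshots in the model input. It is not the authors' implementation: the class `SemanticContext`, the function `build_model_input`, and the prompt layout are hypothetical names chosen for illustration, and each step summary is assumed to be produced elsewhere (for example, by the agent model itself).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SemanticContext:
    """Hypothetical container: task goal plus natural-language step summaries."""
    task: str
    summaries: List[str] = field(default_factory=list)

    def update(self, summary: str) -> None:
        # Store a short description of the step instead of its screenshot.
        self.summaries.append(summary.strip())

    def render(self) -> str:
        # Serialize the history as plain text for the prompt.
        lines = [f"Task: {self.task}", "History:"]
        if not self.summaries:
            lines.append("  (none)")
        for i, s in enumerate(self.summaries, start=1):
            lines.append(f"  {i}. {s}")
        return "\n".join(lines)


def build_model_input(ctx: SemanticContext, current_screenshot: str) -> Dict[str, object]:
    # Assumption: only the current screenshot is passed as an image;
    # all earlier steps are carried by the text summary.
    return {"text": ctx.render(), "images": [current_screenshot]}


if __name__ == "__main__":
    ctx = SemanticContext(task="Open Wi-Fi settings")
    ctx.update("Tapped the Settings icon on the home screen.")
    ctx.update("Scrolled down to the Network section.")
    print(build_model_input(ctx, "step_3.png")["text"])
```

In this sketch the number of images in the input stays at one regardless of episode length, which is the source of the cost reduction the abstract describes; how the summaries are generated and trained is left unspecified here.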