CL
Nov 16, 2025

MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao, Chaoxu Mu

arXiv:2511.12586v1
2.7

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of integrating GUI-based interactions into task-oriented dialogue systems for real-world deployment, though it is incremental as it builds upon existing datasets and methods.

The paper tackles the gap between traditional task-oriented dialogue systems and real-world applications lacking back-end APIs by introducing MMWOZ, a multimodal dataset extended from MultiWOZ 2.3 with GUI snapshots and operation instructions, and proposes MATE as a baseline model for building practical multimodal agents.

Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct a comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
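The conversion step described in the abstract (dialogue states turned into GUI operation instructions) can be illustrated with a rough sketch. This is not the authors' script: the `GuiOp` schema, the action names `type` and `click`, the `domain/slot` element identifiers, and the per-domain `search` button are all hypothetical, chosen only to show how a turn-level change in a MultiWOZ-style `domain-slot` state might map to a sequence of front-end operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuiOp:
    """One GUI operation (hypothetical schema, not taken from the paper)."""
    action: str      # assumed action names: "type" or "click"
    target: str      # assumed element identifier, e.g. "hotel/area"
    value: str = ""  # text to enter; empty for clicks


def state_delta(prev: dict, curr: dict) -> dict:
    """Return the slots that are new or changed between two dialogue states."""
    return {slot: val for slot, val in curr.items() if prev.get(slot) != val}


def delta_to_ops(delta: dict) -> list:
    """Map changed 'domain-slot' entries to GUI operations.

    Each changed slot becomes a 'type' operation on a form field, followed
    by one 'click' on an assumed search button per affected domain.
    """
    ops = []
    domains = set()
    for slot in sorted(delta):
        domain, name = slot.split("-", 1)
        domains.add(domain)
        ops.append(GuiOp("type", f"{domain}/{name}", delta[slot]))
    for domain in sorted(domains):
        ops.append(GuiOp("click", f"{domain}/search"))
    return ops


if __name__ == "__main__":
    prev_state = {"hotel-area": "north"}
    curr_state = {"hotel-area": "north", "hotel-stars": "4"}
    for op in delta_to_ops(state_delta(prev_state, curr_state)):
        print(op)
```

Emitting operations only for the slots that changed in the current turn keeps each instruction sequence aligned with a single dialogue turn; whether MMWOZ segments its instructions this way is an assumption of the sketch.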
