AI HC | Apr 28, 2024

MMAC-Copilot: Multi-modal Agent Collaboration Operating Copilot

Zirui Song, Yaohang Li, Meng Fang, Yanda Li, Zhenhao Chen, Zecheng Shi, Yuan Huang, Xiuying Chen, Ling Chen

arXiv:2404.18074v39.65 citations | h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses the issue of restricted interaction capabilities for AI agents in real-world applications, though it appears incremental as it builds on existing agent collaboration concepts.

The paper addresses the limited versatility and frequent hallucinations of large language model agents that interact with PC applications. It proposes MMAC-Copilot, a multi-modal agent collaboration framework that achieved an average improvement of 6.8% over existing leading systems on the GAIA benchmark and demonstrated strong performance on a new Visual Interaction Benchmark (VIBench).

Large language model agents that interact with PC applications often face limitations due to their single mode of interaction with real-world environments, leading to restricted versatility and frequent hallucinations. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework that utilizes the collective expertise of diverse agents to enhance their ability to interact with applications. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on its specific domain knowledge, effectively reducing the hallucinations associated with knowledge-domain gaps. We evaluate MMAC-Copilot using the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8% over existing leading systems. VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. MMAC-Copilot also demonstrated remarkable capability on VIBench. We hope this work can inspire further research in this field and provide a more comprehensive assessment of autonomous agents. The anonymous GitHub repository is available at https://anonymous.4open.science/r/ComputerAgentWithVision-3C12
