AI · May 1

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding

arXiv:2505.10887 · 46.26 citations · h-index: 11 · Has Code

Predicted impact: top 12% in AI · last 90 days · Originality: Incremental advance

AI Analysis

For researchers in automated computer interaction, this work provides a modular agent that combines multiple modalities and achieves competitive results on diverse benchmarks.

InfantAgent-Next is a multimodal generalist agent that integrates tool-based and pure vision agents in a modular architecture, achieving 7.27% accuracy on OSWorld, outperforming Claude-Computer-Use.

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. We demonstrate this generality by evaluating not only on pure vision-based real-world benchmarks (i.e., OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
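As a rough illustration of the modular idea described in the abstract, the sketch below dispatches decoupled steps to either a tool-based worker or a vision worker. It is not taken from the InfantAgent repository: the names `Step`, `tool_worker`, `vision_worker`, `WORKERS`, and `run_steps` are hypothetical, and the workers are placeholder functions standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One decoupled sub-task; `kind` names the worker that should handle it."""
    kind: str      # hypothetical labels: "tool" or "vision"
    payload: str   # free-form description of what to do


def tool_worker(payload: str) -> str:
    # Placeholder for a tool-based agent (assumption: e.g. running a command).
    return f"tool handled: {payload}"


def vision_worker(payload: str) -> str:
    # Placeholder for a pure vision agent (assumption: e.g. clicking a GUI element).
    return f"vision handled: {payload}"


# Registry mapping a step kind to the worker responsible for it.
WORKERS: Dict[str, Callable[[str], str]] = {
    "tool": tool_worker,
    "vision": vision_worker,
}


def run_steps(steps: List[Step]) -> List[str]:
    """Dispatch each step, in order, to the worker registered for its kind."""
    results = []
    for step in steps:
        worker = WORKERS.get(step.kind)
        if worker is None:
            raise ValueError(f"no worker registered for kind {step.kind!r}")
        results.append(worker(step.payload))
    return results


if __name__ == "__main__":
    plan = [
        Step("tool", "list files in the project"),
        Step("vision", "click the Save button"),
    ]
    for line in run_steps(plan):
        print(line)
```

Keeping the workers behind a registry means a different model could be swapped in for one kind of step without touching the dispatch loop, which is the property the abstract attributes to a modular design.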
