AI · Jun 17, 2025

FormGym: Doing Paperwork with Agents

arXiv:2506.14079v1 · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses the time-consuming problem of completing paperwork, for both users and agents, but the advance is incremental: it builds on existing agent methods and adds a new tool.

The paper tackles automated form filling in the pure-image domain, without access to OCR, typeset PDF text, or a DOM, where baseline VLAs achieve less than 1% accuracy in most cases. It introduces FieldFinder, a tool that helps LLMs identify where to place text on a form. With FieldFinder, all models perform as well or better in all six study conditions, with a maximum increase from 2% to 56%.

Completing paperwork is a challenging and time-consuming problem. Form filling is especially challenging in the pure-image domain without access to OCR, typeset PDF text, or a DOM. For computer agents, it requires multiple abilities, including multi-modal understanding, information retrieval, and tool-use. We present a novel form-filling benchmark consisting of 432 fields spread across 55 documents and 3 tasks, requiring knowledge of 236 features per user. We find that baseline VLAs achieve less than 1% accuracy in most cases, primarily due to poor localization ability. GUI agents also struggle, scoring between 10.6-68.0% despite high cost and latency. Therefore, we also contribute FieldFinder, a tool to assist LLMs in identifying where to place text on a form. With FieldFinder, all models achieve equal or better performance in all six study conditions, with a maximum increase from 2% to 56%.

