CV, Feb 24

OCR-Agent: Agentic OCR with Capability and Memory Reflection

Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

arXiv:2602.21053v11.5h-index: 27, Has Code

Originality: Incremental advance

AI Analysis

The paper addresses cognitive biases and unstable improvements in VLMs. For AI researchers and practitioners, it offers an incremental enhancement through structured reflection without additional training.

Large vision-language models lack effective self-correction mechanisms, which leads to repetitive errors during multi-turn revisions. The paper proposes an iterative self-correction framework that improves performance on OCRBench v2, outperforming the open-source SOTA by +2.0 on the English subset and +1.2 on the Chinese subset.

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
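
The three-stage loop described in the abstract (Capability Reflection, then Memory Reflection, then re-reasoning) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Model` callable, the prompt wording and tags, the `Attempt` record, the `max_rounds` budget, and the "stop when the answer no longer changes" rule are all assumptions made for the example. A real system would also pass the image to the VLM alongside each prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a text prompt to a text reply.
# In practice this would wrap a VLM call that also receives the input image.
Model = Callable[[str], str]


@dataclass
class Attempt:
    """One past attempt kept in memory: the answer and the plan made for it."""
    answer: str
    plan: str


def reflect_and_revise(
    model: Model, question: str, max_rounds: int = 3
) -> Tuple[str, List[Attempt]]:
    """Iteratively revise an answer using two reflection steps per round.

    The round budget and the early-stop rule are assumptions of this sketch.
    """
    answer = model(f"[answer]\nQuestion: {question}")
    memory: List[Attempt] = []

    for _ in range(max_rounds):
        # Capability Reflection: diagnose errors and draft a correction plan.
        plan = model(
            f"[capability]\nQuestion: {question}\nCurrent answer: {answer}\n"
            "Diagnose likely errors and propose a correction plan."
        )

        # Memory Reflection: review past attempts to avoid repeating them.
        history = "\n".join(
            f"- answer: {a.answer} | plan: {a.plan}" for a in memory
        ) or "(none)"
        guidance = model(
            f"[memory]\nPast attempts:\n{history}\nNew plan: {plan}\n"
            "Point out repetition and suggest an approach not yet tried."
        )
        memory.append(Attempt(answer=answer, plan=plan))

        # Re-reasoning: produce a revised answer from the plan and guidance.
        revised = model(
            f"[revise]\nQuestion: {question}\nPrevious answer: {answer}\n"
            f"Plan: {plan}\nGuidance: {guidance}\nGive the revised answer."
        )
        if revised == answer:
            break  # assumed stopping rule: the answer has stabilized
        answer = revised

    return answer, memory
```

Because the loop only issues prompts to an existing model, it is consistent with the abstract's claim that the method needs no additional training; how the actual framework words its prompts or decides when to stop is not stated in the abstract.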
