CV · May 27

ABot-OCR Technical Report

arXiv:2605.27978 · 62.0 · h-index: 5
AI Analysis

This work addresses document parsing with an end-to-end system, offering a simpler alternative to modular pipelines at competitive performance.

ABot-OCR is an end-to-end vision-language model that transcribes page images directly into clean Markdown in a single forward pass, eliminating modular orchestration. It scores 92.81 on OmniDocBench v1.5 and 93.30 on v1.6, the best reported results among end-to-end systems, narrowing the gap with pipeline baselines.

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.
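The abstract does not say how markup well-formedness is measured or rewarded. As a purely illustrative sketch, and not the authors' method, the Python below shows one way a structural check on generated Markdown could be turned into a binary reward signal. The function names `markdown_well_formed` and `structure_reward`, and the three rules checked (closed code fences, paired `$$` delimiters, consistent table column counts), are assumptions made for this example only.

```python
def markdown_well_formed(md: str) -> bool:
    """Hypothetical check: fences closed, $$ paired, table rows equal width.

    Simplification (assumption): escaped pipes inside table cells are not handled.
    """
    in_fence = False      # inside a ``` code fence
    math_open = False     # inside a $$ ... $$ display-math block
    table_cols = None     # column count of the table currently being read

    for line in md.splitlines():
        stripped = line.strip()

        # A line starting with ``` opens or closes a code fence.
        if stripped.startswith("```"):
            in_fence = not in_fence
            table_cols = None
            continue
        if in_fence:
            continue  # ignore everything inside code fences

        # An odd number of "$$" on a line toggles display-math state.
        if stripped.count("$$") % 2 == 1:
            math_open = not math_open

        # Consecutive pipe-delimited rows must share one column count.
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cols = stripped.count("|") - 1
            if table_cols is None:
                table_cols = cols
            elif cols != table_cols:
                return False
        else:
            table_cols = None  # any other line ends the current table

    return not in_fence and not math_open


def structure_reward(md: str) -> float:
    """Hypothetical binary reward: 1.0 for well-formed output, else 0.0."""
    return 1.0 if markdown_well_formed(md) else 0.0
```

In a reinforcement learning setup, a term like this would typically be combined with a text-accuracy term; how ABot-OCR combines them is not described in the abstract.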
