CVAug 18, 2025

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

arXiv:2508.13238v29 citationsh-index: 3
Originality Incremental advance
AI Analysis

This addresses OCR accuracy issues for document image parsing applications, though it appears incremental as it builds on existing VLM and expert model paradigms.

The paper tackles the problem of hallucinations and limited effectiveness in OCR tasks by large vision-language models, proposing DianJin-OCR-R1, a reasoning-and-tool interleaved framework that combines VLMs with expert models. Experimental results on ReST and OmniDocBench show it consistently outperforms non-reasoning counterparts and expert OCR models.

Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations--generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image by its own OCR capabilities, and then calls other tools (i.e., other expert models) to obtain their results as references, finally "looks again" the image and rethinks about the reasoning process to provide the final recognized content. Since architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method. Additionally, the results indicate that enhancing expert models, which are typically small and easy to iterate, enable performance improvements for VLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes