CV | Jan 26, 2025

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, Jianhua Xu, Zenan Zhou

arXiv:2501.15558v1 | 27.223 citations | h-index: 13 | Has Code

Originality: Highly original

AI Analysis

This paper addresses the weak performance of MLLMs on text-related tasks, which limits their use in OCR applications. It represents a strong, specific gain rather than an incremental improvement.

The paper tackles the insufficient OCR ability of multimodal large language models (MLLMs) by introducing Ocean-OCR, a 3B MLLM that achieves state-of-the-art performance on various OCR benchmarks and outperforms professional OCR models like TextIn and PaddleOCR.

Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ Native Resolution ViT to enable variable resolution input and utilize a substantial collection of high-quality OCR datasets to enhance the model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.
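The abstract mentions variable-resolution input via a Native Resolution ViT. As a rough illustration of what that implies for sequence length, the sketch below computes how many patch tokens an image yields when it is patchified at roughly its own resolution instead of being resized to a fixed square. This is not the paper's implementation: the patch size of 14, the optional `max_patches` budget, and the function names `patch_grid` and `num_visual_tokens` are all assumptions chosen for illustration.

```python
import math


def patch_grid(height, width, patch_size=14, max_patches=None):
    """Return (rows, cols) of the patch grid for an image kept near its native size.

    patch_size and max_patches are illustrative assumptions, not values
    taken from the Ocean-OCR paper.
    """
    if height <= 0 or width <= 0 or patch_size <= 0:
        raise ValueError("height, width and patch_size must be positive")
    if max_patches is not None and max_patches < 1:
        raise ValueError("max_patches must be at least 1")

    # Pad up to the next multiple of the patch size instead of resizing to a square.
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)

    if max_patches is not None and rows * cols > max_patches:
        # Shrink both sides by the same factor so the aspect ratio is roughly kept.
        scale = math.sqrt(max_patches / (rows * cols))
        rows = max(1, math.floor(rows * scale))
        cols = max(1, math.floor(cols * scale))
        # Clamping a side to 1 can leave extreme aspect ratios over budget,
        # so trim the longer side if needed.
        if rows * cols > max_patches:
            if rows >= cols:
                rows = max(1, max_patches // cols)
            else:
                cols = max(1, max_patches // rows)
    return rows, cols


def num_visual_tokens(height, width, patch_size=14, max_patches=None):
    """Number of patch tokens the vision encoder would receive for this image."""
    rows, cols = patch_grid(height, width, patch_size, max_patches)
    return rows * cols


if __name__ == "__main__":
    print(num_visual_tokens(224, 224))    # square thumbnail
    print(num_visual_tokens(1000, 300))   # tall, narrow document crop
    print(patch_grid(1000, 300, max_patches=256))
```

Under these assumptions, a 224x224 image maps to a 16x16 grid (256 tokens), while a tall 1000x300 crop maps to a 72x22 grid (1584 tokens). The point of the sketch is that token count follows the image's shape, so dense or elongated text is not squashed into a fixed square before encoding.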
