CV · Jan 26, 2025

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

arXiv:2501.15558v1 · 23 citations · h-index: 13 · Has Code
Originality: Highly original
AI Analysis

This paper addresses the weak performance of MLLMs on text-related (OCR) tasks, and it represents a strong, specific gain rather than an incremental improvement.

The paper tackles the insufficient OCR ability of multimodal large language models (MLLMs) by introducing Ocean-OCR, a 3B MLLM that achieves state-of-the-art performance on various OCR benchmarks and outperforms professional OCR models like TextIn and PaddleOCR.

Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ Native Resolution ViT to enable variable resolution input and utilize a substantial collection of high-quality OCR datasets to enhance the model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.

Code Implementations: 1 repo
