CV · Feb 10, 2025

Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

arXiv:2502.06445v1 · 12 citations · h-index: 2 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses researchers' and practitioners' need for standardized evaluation of VLMs on video-based OCR tasks, though the contribution is incremental because it builds on existing models and datasets.

The paper evaluates Vision-Language Models (VLMs) on Optical Character Recognition (OCR) in dynamic video environments by benchmarking three state-of-the-art VLMs against traditional OCR systems on a curated dataset of 1,477 annotated frames. The results show that VLMs can outperform conventional models in many scenarios but still face challenges such as hallucinations and sensitivity to occluded text.

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
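The WER and CER metrics named in the abstract are conventionally defined as the Levenshtein edit distance between the predicted and reference text, divided by the reference length (in words or characters, respectively). The following is a minimal sketch of those standard definitions, not the paper's own evaluation code. The function names `edit_distance`, `wer`, and `cer` are illustrative, and the sketch applies no text normalization (case folding, punctuation stripping), since the abstract does not say which normalization the benchmark uses. Accuracy is omitted because the abstract does not define it.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical frame annotation and OCR output, for illustration only.
    ground_truth = "print hello world"
    prediction = "print hallo world"
    print(f"WER = {wer(ground_truth, prediction):.3f}")  # WER = 0.333
    print(f"CER = {cer(ground_truth, prediction):.3f}")  # CER = 0.059
```

A single wrong character costs one of three words in WER but only one of seventeen characters in CER. Both rates can also exceed 1.0 when the prediction contains many inserted tokens, which is relevant to the hallucination failure mode the abstract mentions.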

Code Implementations: 1 repo
