Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity
This work addresses the problem of reliable character-level OCR for applications that require precise text extraction, although its contribution is incremental: it evaluates existing methods under new conditions.
The study investigates how image resolution and visual complexity affect multimodal LLMs in a context-independent OCR task. It finds that they match conventional OCR methods at about 300 ppi but degrade significantly below 150 ppi, and that misrecognitions correlate only very weakly with visual complexity.
Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.
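To make the two quantities in the abstract concrete, the following minimal sketch shows (a) how a resolution in ppi translates into the pixel height of a single-character image and (b) how a correlation between a per-character complexity score and a misrecognition flag could be computed. The 12 pt font size, the function names, and the sample values are illustrative assumptions, not details taken from the study, and the abstract does not specify which complexity measure or correlation statistic was used.

```python
import math
from typing import Sequence


def glyph_height_px(point_size: float, ppi: float) -> int:
    """Pixel height of a glyph's em box, using 1 pt = 1/72 inch."""
    return round(point_size * ppi / 72.0)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples.

    With a binary ys (1 = misrecognized, 0 = correct) this equals the
    point-biserial correlation.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either sample is constant.
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Assumed 12 pt glyph rendered at the resolutions named in the abstract.
    for ppi in (300, 150):
        print(ppi, "ppi ->", glyph_height_px(12, ppi), "px")

    # Hypothetical complexity scores and misrecognition flags (not real data).
    complexity = [0.12, 0.35, 0.50, 0.71, 0.90]
    misread = [0, 0, 1, 0, 1]
    print("r =", pearson(complexity, misread))
```

Under these assumptions, a 12 pt character occupies 50 pixels at 300 ppi and 25 pixels at 150 ppi, which illustrates how quickly the available detail shrinks as resolution drops.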