CL | CV | Sep 22, 2025

Vision Language Models Are Not (Yet) Spelling Correctors

arXiv:2509.17418v1 | 1 citation | Has Code
Originality: Incremental advance
AI Analysis

This work addresses visual spelling correction in real-world settings. It presents a new benchmark and highlights limitations of existing architectures; the improvements it proposes are incremental.

The paper studies spelling correction from visual input with vision language models. It shows that current models fall significantly short of human performance, and that two solution paradigms, a Joint OCR-Correction pipeline and a Background Information enhanced approach, yield consistent gains.

Spelling correction from visual input poses unique challenges for vision language models (VLMs), as it requires not only detecting but also correcting textual errors directly within images. We present ReViCo (Real Visual Correction), the first benchmark that systematically evaluates VLMs on real-world visual spelling correction across Chinese and English. ReViCo contains naturally occurring errors collected from real-world image data and supports fine-grained evaluation at both image and token levels. Through comprehensive experiments on representative cascaded (Qwen) and native (InternVL) open-source models, as well as closed-source systems (GPT-4o, Claude), we show that current VLMs fall significantly short of human performance, particularly in correction. To address these limitations, we explore two solution paradigms: a Joint OCR-Correction pipeline and a Background Information enhanced approach, both of which yield consistent performance gains. Our analysis highlights fundamental limitations of existing architectures and provides actionable insights for advancing multimodal spelling correction.
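
To make the distinction between image-level and token-level evaluation concrete, here is a minimal Python sketch. It is not ReViCo's scoring code: the abstract does not define the metrics, so the sketch assumes the convention common in text spelling correction, where detection and correction are each scored with precision, recall, and F1 over tokens, and an image counts as correct only if the whole output matches the reference. It also assumes that source, predicted, and gold sequences are already aligned and of equal length (substitution errors only). The names `token_level_scores`, `image_level_accuracy`, and `PRF` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PRF:
    precision: float
    recall: float
    f1: float


def _prf(tp: int, n_pred: int, n_gold: int) -> PRF:
    # Guard against empty denominators so the sketch never divides by zero.
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return PRF(p, r, f1)


def token_level_scores(samples):
    """Return (detection, correction) PRF scores.

    samples: iterable of (source, predicted, gold) token sequences.
    Assumption: the three sequences are aligned and equally long.
    """
    det_tp = cor_tp = n_pred = n_gold = 0
    for src, pred, gold in samples:
        if not (len(src) == len(pred) == len(gold)):
            raise ValueError("sketch assumes aligned, equal-length sequences")
        for s, p, g in zip(src, pred, gold):
            changed = p != s  # the model edited this position
            wrong = g != s    # this position really contains an error
            n_pred += int(changed)
            n_gold += int(wrong)
            if changed and wrong:
                det_tp += 1             # error located
                cor_tp += int(p == g)   # error located and fixed correctly
    return _prf(det_tp, n_pred, n_gold), _prf(cor_tp, n_pred, n_gold)


def image_level_accuracy(samples) -> float:
    """Fraction of images whose full predicted text equals the gold text."""
    samples = list(samples)
    if not samples:
        return 0.0
    exact = sum(list(pred) == list(gold) for _, pred, gold in samples)
    return exact / len(samples)


# Toy data (made up for illustration): one tuple per image.
toy_samples = [
    ("teh cat sat".split(), "the cat sat".split(), "the cat sat".split()),
    ("a grean aple".split(), "a green aple".split(), "a green apple".split()),
    ("recieve mail".split(), "receeve mail".split(), "receive mail".split()),
]
detection, correction = token_level_scores(toy_samples)
image_acc = image_level_accuracy(toy_samples)
```

In the toy data, the third image has its error detected but fixed incorrectly, so correction scores come out lower than detection scores, and only the first image is fully correct at the image level. That separation is the kind of gap the abstract refers to when it says models fall short "particularly in correction".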

