Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

arXiv:2605.27750

AI Analysis

For researchers using VLMs for OCR on low-resource historical documents, this work shows that fluent output does not imply visual grounding, motivating interpretability-driven evaluation beyond aggregate accuracy.

Vision-Language Models (VLMs) used for OCR on Ancient Greek critical editions produce fluent but visually unsupported errors, whereas traditional OCR engines produce local recognition noise. Under character-level perturbations, VLMs diverge from the perturbed ground truth while traditional OCR remains comparatively faithful, and token-level analysis reveals that reliance on language priors is model-specific.

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
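The abstract does not give the exact form of the token-level grounding measures, so the following is only a minimal sketch of one plausible reading: at each decoding step, compare the next-token distribution conditioned on the image with the distribution obtained when the image is withheld. The function names (`softmax`, `kl_divergence`, `token_grounding`), the choice of KL divergence and log-probability ratio, and the toy logits are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def token_grounding(cond_logits, prior_logits, token_ids, eps=1e-12):
    """Hypothetical per-token grounding scores.

    cond_logits[i]  : logits at step i when the model sees the image
    prior_logits[i] : logits at step i for the same prefix without the image
    token_ids[i]    : index of the token actually emitted at step i
    """
    scores = []
    for cond, prior, tok in zip(cond_logits, prior_logits, token_ids):
        p_img, p_txt = softmax(cond), softmax(prior)
        scores.append({
            # How much the image changes the whole next-token distribution.
            "kl": kl_divergence(p_img, p_txt, eps),
            # How much the image raises the probability of the emitted token.
            "log_ratio": math.log(p_img[tok] + eps) - math.log(p_txt[tok] + eps),
        })
    return scores


# Toy example over a 3-token vocabulary (values are invented).
cond = [[5.0, 0.0, 0.0], [1.0, 2.0, 0.5]]
prior = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.5]]
scores = token_grounding(cond, prior, token_ids=[0, 1])
```

Under this reading, a step where both scores are close to zero (the second step in the toy example) is one where the model would have produced the same token without looking at the image, which is the pattern the abstract attributes to fluent lexical errors in the OCR-specialist model. Large values indicate that the output remains conditioned on the visual input, whether or not the token is correct.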
