CV · Sep 12, 2025

Detecting Text Manipulation in Images using Vision Language Models

arXiv:2509.10278v1 · h-index: 12 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in detecting text manipulations in images, which is important for applications like fraud detection, but it is incremental as it builds on existing vision language model research.

The study tackles text manipulation detection in images, a problem largely missing from existing vision language model research, by evaluating closed- and open-source models on several text manipulation datasets. It finds that open-source models still trail closed-source ones like GPT-4o, and that models built specifically for image manipulation detection generalize poorly to text manipulation.

Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing in these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are getting closer, but still behind closed-source ones like GPT-4o. Additionally, we benchmark image manipulation detection-specific VLMs for text manipulation detection and show that they suffer from the generalization problem. We benchmark VLMs for manipulations done on in-the-wild scene texts and on fantasy ID cards, where the latter mimic a challenging real-world misuse.

