CV | Jun 4

TextWand: A Unified Framework for Scene Text Editing

arXiv:2606.05730 | 87.6 | Has Code
AI Analysis

For researchers and practitioners in scene text editing, this work provides a unified framework that outperforms existing open-source and closed-source models, though it is an incremental improvement over specialized methods.

TextWand unifies scene text removal, generation, and replacement into a single model using atomic primitives of rendering and erasure, achieving superior performance over existing models across all three tasks.

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation, and replacement tasks.
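The decomposition into rendering and erasure primitives can be illustrated with a small sketch. This is not the authors' implementation: the names `Primitive`, `Task`, `PLAN`, and `decompose` are hypothetical, and the mapping itself (in particular, treating replacement as erasure followed by rendering) is an assumption read from the abstract. In the actual single-model framework the primitives may well be applied jointly rather than in sequence.

```python
from enum import Enum
from typing import List


class Primitive(Enum):
    ERASE = "erase"    # remove existing text and restore the background
    RENDER = "render"  # draw new text into a target region


class Task(Enum):
    REMOVAL = "removal"
    GENERATION = "generation"
    REPLACEMENT = "replacement"


# Hypothetical task-to-primitive mapping (an assumption, not taken from the paper).
PLAN = {
    Task.REMOVAL: [Primitive.ERASE],
    Task.GENERATION: [Primitive.RENDER],
    Task.REPLACEMENT: [Primitive.ERASE, Primitive.RENDER],
}


def decompose(task: Task) -> List[Primitive]:
    """Return a copy of the primitive sequence assumed for an editing task."""
    return list(PLAN[task])
```

Under these assumptions, `decompose(Task.REPLACEMENT)` yields the erase primitive followed by the render primitive, while removal and generation each map to a single primitive.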
