TextWand: A Unified Framework for Scene Text Editing
For researchers and practitioners in scene text editing, this work provides a unified framework that outperforms existing open-source and closed-source models, though its gains over specialized single-task methods are incremental.
TextWand unifies scene text removal, generation, and replacement in a single model built on two atomic primitives, rendering and erasure, and reports better text accuracy, layout and style consistency, and image quality than existing models across all three tasks.
We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. Because existing datasets each target a single task, no comprehensive benchmark exists for general-purpose scene text editing; to fill this gap, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms leading open-source and closed-source models, delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation, and replacement tasks.
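As a rough illustration of the decomposition described above, and not the authors' implementation, the three tasks can be viewed as short sequences of the two primitives. The names `Primitive`, `Task`, `PLAN`, and `decompose` are hypothetical, and the mapping itself (including the erase-then-render ordering for replacement) is an assumption inferred from the task definitions rather than a detail stated in the abstract.

```python
from enum import Enum
from typing import Dict, List


class Primitive(Enum):
    """The two atomic operations named in the abstract."""
    ERASE = "erase"
    RENDER = "render"


class Task(Enum):
    """The three editing tasks the framework unifies."""
    REMOVAL = "removal"
    GENERATION = "generation"
    REPLACEMENT = "replacement"


# Assumed mapping (hypothetical): each task expressed as a sequence of primitives.
# The ordering for REPLACEMENT is a guess; a single model may apply both jointly.
PLAN: Dict[Task, List[Primitive]] = {
    Task.REMOVAL: [Primitive.ERASE],
    Task.GENERATION: [Primitive.RENDER],
    Task.REPLACEMENT: [Primitive.ERASE, Primitive.RENDER],
}


def decompose(task: Task) -> List[Primitive]:
    """Return a copy of the primitive sequence assumed for the given task."""
    return list(PLAN[task])


if __name__ == "__main__":
    for task in Task:
        steps = " + ".join(p.value for p in decompose(task))
        print(f"{task.value}: {steps}")
```

Under this reading, replacement is the only task that needs both primitives, which is one way to see why a model that handles erasure and rendering well could cover all three tasks.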