CV | AI | Oct 15, 2024

Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing

Yoonjeon Kim, Soohyun Ryu, Yeonsung Jung, Hyunkoo Lee, Joowon Kim, June Yong Yang, Jaeryong Hwang, Eunho Yang

arXiv:2410.11374v3 | 8.75 citations | h-index: 7 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work addresses a key evaluation bottleneck for researchers and practitioners in text-guided image editing, though it is an incremental improvement over existing metrics such as Directional CLIP similarity.

The paper tackles context-blindness in existing metrics for text-guided image editing: current evaluation methods fail to balance preservation of source image elements against modifications based on the target text. The proposed AugCLIP metric adaptively coordinates these two aspects using CLIP representations and a multi-modal large language model, achieving strong alignment with human evaluations across five benchmark datasets.

The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks to preserve the core elements of the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem: they indiscriminately apply the same evaluation criteria to completely different pairs of source image and target text, which biases them towards either modification or preservation. Directional CLIP similarity, the only metric that considers both source image and target text, is also biased towards modification aspects and attends to irrelevant editing regions of the image. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects, depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image that preserves the source image while making the modifications necessary to align with the target text. More specifically, using a multi-modal large language model, AugCLIP augments the textual descriptions of the source and target, then calculates a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics. The code is available at https://github.com/augclip/augclip_eval.
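For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal sketch of Directional CLIP similarity as it is commonly formulated: the cosine similarity between the change in image embedding (edited minus source) and the change in text embedding (target minus source). It is not taken from the paper's code. The function names are hypothetical, and the inputs are toy 3-dimensional vectors standing in for real CLIP embeddings, which would normally come from a CLIP image and text encoder.

```python
import math


def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def directional_clip_similarity(src_img, edit_img, src_txt, tgt_txt):
    """Cosine between the image edit direction and the text edit direction.

    All four arguments are assumed to be embeddings in the same CLIP space.
    """
    image_direction = subtract(edit_img, src_img)
    text_direction = subtract(tgt_txt, src_txt)
    return cosine(image_direction, text_direction)


# Toy embeddings (assumptions for illustration only, not real CLIP outputs).
src_img = [1.0, 0.0, 0.0]
src_txt = [1.0, 0.0, 0.0]
tgt_txt = [0.0, 1.0, 0.0]

aligned_edit = [0.0, 1.0, 0.0]    # moves exactly along the text direction
unrelated_edit = [1.0, 0.0, 1.0]  # changes something the text never asked for

score_aligned = directional_clip_similarity(src_img, aligned_edit, src_txt, tgt_txt)
score_unrelated = directional_clip_similarity(src_img, unrelated_edit, src_txt, tgt_txt)
```

In this toy setup the aligned edit scores close to 1 and the unrelated edit scores close to 0. Because the score depends only on the direction of change, it has no explicit term for how much of the source image survives, which is one way to read the abstract's remark that the metric leans towards modification.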
