CV | Jul 9, 2025

Evaluating Attribute Confusion in Fashion Text-to-Image Generation

arXiv:2507.07079v1 | 1 citations | h-index: 5 | ICIAP
Originality: Incremental advance
AI Analysis

This work addresses a specific evaluation bottleneck for researchers and practitioners in fashion AI, offering a more reliable and scalable alternative to subjective assessments. The contribution is incremental, refining existing evaluation frameworks.

The paper tackles the problem of evaluating attribute confusion in fashion text-to-image generation, where attributes are correctly depicted but associated with the wrong entities. It introduces a novel automatic metric, Localized VQAScore, which outperforms state-of-the-art methods in correlation with human judgments on a new dataset.

Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, which involve complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics and struggle with attribute confusion, i.e., cases where attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy that targets a single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA, probing both correct (reflection) and mislocalized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.
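
To make the reflection and leakage idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a VQA model has already been queried on each localized entity region and has returned a "yes" probability for every (entity, attribute) pair. The function names, the input dictionaries, and in particular the way the two quantities are combined into one number are hypothetical choices made for illustration; the abstract does not specify the actual formula.

```python
from statistics import mean


def reflection_and_leakage(prompt_attrs, vqa_yes_prob):
    """Split VQA probes into reflection and leakage (illustrative only).

    prompt_attrs: dict mapping each entity to the set of attributes the
        prompt assigns to it, e.g. {"skirt": {"red"}}.
    vqa_yes_prob: dict mapping (entity, attribute) to an assumed P("yes")
        obtained by probing the localized region of that entity.
    """
    all_attrs = sorted(set().union(*prompt_attrs.values()))
    reflection, leakage = [], []
    for entity, attrs in prompt_attrs.items():
        for attr in all_attrs:
            p = vqa_yes_prob[(entity, attr)]
            if attr in attrs:
                reflection.append(p)  # attribute shown on its own entity
            else:
                leakage.append(p)     # attribute shown on another entity
    return mean(reflection), (mean(leakage) if leakage else 0.0)


def combined_score(reflection, leakage):
    # Hypothetical combination: reward reflection, penalize leakage.
    # This is an assumption, not the L-VQAScore definition.
    return reflection * (1.0 - leakage)


# Prompt: "a red skirt and a striped shirt" (made-up probabilities).
prompt_attrs = {"skirt": {"red"}, "shirt": {"striped"}}
vqa_yes_prob = {
    ("skirt", "red"): 0.9,
    ("shirt", "striped"): 0.8,
    ("skirt", "striped"): 0.1,
    ("shirt", "red"): 0.7,  # red has leaked onto the shirt
}
r, l = reflection_and_leakage(prompt_attrs, vqa_yes_prob)
score = combined_score(r, l)
```

In this toy case both attributes are depicted well (reflection of about 0.85), but red also appears on the shirt, which raises leakage to about 0.4 and lowers the combined value. A global alignment score that ignores localization could miss this kind of confusion.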

