LG, Nov 7, 2025

Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval

arXiv:2511.05325v1
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a security and performance concern in e-commerce product retrieval on platforms that use vision-language models. It is an incremental advance: it repurposes an existing attack concept, the typographic attack, as a retrieval enhancement.

The paper starts from the vulnerability of multimodal product retrieval systems to typographic attacks and reverses it: relevant textual content is rendered onto product images to improve image-text alignment. It reports consistent improvements in retrieval accuracy across three e-commerce datasets and six vision foundation models.

Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
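To make the core idea concrete, the following is a minimal sketch of rendering product metadata onto an image before it is passed to a vision encoder. It is not the authors' implementation. The function name `render_text_on_image`, the banner-below-the-image layout, the default font, and the placeholder product are all assumptions made for illustration, and the abstract does not say where or how the text is placed. The sketch uses the Pillow library.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_on_image(image, text, chars_per_line=40, line_height=14, padding=6):
    """Return a copy of `image` with `text` drawn in a white banner below it.

    Hypothetical helper: the layout (banner under the photo) is an assumption,
    not a detail taken from the paper.
    """
    # Wrap the metadata so each line fits the banner width.
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    banner_height = 2 * padding + line_height * len(lines)

    # New canvas: original image on top, white banner underneath.
    canvas = Image.new("RGB", (image.width, image.height + banner_height), "white")
    canvas.paste(image.convert("RGB"), (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    y = image.height + padding
    for line in lines:
        draw.text((padding, y), line, fill="black", font=font)
        y += line_height
    return canvas


# Hypothetical product: a plain grey placeholder stands in for a real photo.
product_image = Image.new("RGB", (320, 240), (128, 128, 128))
title = "Example brand low-top sneaker, white leather, size 10"
augmented = render_text_on_image(product_image, title)
```

The resulting `augmented` image would then be embedded by the vision model in place of the original photo. Keeping the original pixels untouched and adding the text in a separate banner is one possible design choice; overlaying the text on the photo itself is another.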

