CV | CL | LG | Jun 30, 2020

Modality-Agnostic Attention Fusion for visual search with text feedback

arXiv:2007.00145v1 | 89 citations
Originality: Incremental advance
AI Analysis

This work addresses fine-grained visual search for applications like e-commerce, representing an incremental improvement over existing methods.

The paper tackles image retrieval with natural language feedback by proposing a Modality-Agnostic Attention Fusion (MAAF) model. MAAF outperforms existing approaches on Fashion IQ and CSS and performs competitively on Fashion200k. The paper also introduces two new benchmarks, adapted from Birds-to-Words and Spot-the-Diff, on which MAAF outperforms strong baselines without modification.

Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with modifying phrases, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach without modification outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding "attending" to the image region they refer to.
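
To make the idea of combining image and text features with attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes that "modality-agnostic" fusion can be illustrated by concatenating image and text tokens into one sequence and applying a single self-attention layer that treats both kinds of token identically. The function names (`fuse`, `retrieve`), the single-head design, mean pooling, cosine ranking, and the random stand-in features are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated token sequence.

    Assumption: image and text tokens share one embedding width and are
    processed by the same projections, with no modality-specific branch.
    """
    tokens = np.concatenate([image_tokens, text_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each row sums to 1
    attended = attn @ v
    fused = attended.mean(axis=0)            # pool to one query vector
    return fused, attn


def retrieve(query, catalog):
    """Return catalog indices sorted by descending cosine similarity."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    return np.argsort(-(c @ q))


# Random stand-ins for real features (hypothetical sizes).
rng = np.random.default_rng(0)
d, n_img, n_txt = 8, 4, 3
image_tokens = rng.normal(size=(n_img, d))  # e.g. spatial image features
text_tokens = rng.normal(size=(n_txt, d))   # e.g. word embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = fuse(image_tokens, text_tokens, w_q, w_k, w_v)
catalog = rng.normal(size=(5, d))           # embeddings of candidate images
ranking = retrieve(fused, catalog)

# Rows for text tokens, columns for image tokens: how much each word
# attends to each image region.
text_to_image = attn[n_img:, :n_img]
```

In this sketch, `text_to_image` is the block of the attention matrix one would inspect to see which image regions each word attends to, which is the kind of quantity the visualizations mentioned in the abstract are concerned with.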

Code Implementations: 2 repos
