CV | Jun 11, 2021

Step-Wise Hierarchical Alignment Network for Image-Text Matching

arXiv:2106.06509v1 | 117 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of bridging semantic gaps between vision and language for applications like cross-modal retrieval, though it appears incremental in method.

The paper tackles fine-grained image-text matching by proposing a step-wise hierarchical alignment network (SHAN) that decomposes the task into multi-step reasoning, and it reports superior results on two benchmark datasets.

Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover visual-semantic interactions and therefore cannot exploit multi-level information to locate hierarchical fine-grained relevance. In contrast, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and then global-to-global alignment at the context level. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues for understanding the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of our proposed method.
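
The three alignment steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: the abstract does not specify the layers, so the softmax cross-attention, the `temperature` value, the residual fusion, the mean pooling, and the summed score are all assumptions chosen for illustration, and the names `attend` and `stepwise_match` are hypothetical.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, temperature=10.0):
    """For each query row, return a softmax-weighted average of the key rows.
    Weights come from cosine similarity; `temperature` is an assumed value."""
    sim = l2norm(queries) @ l2norm(keys).T          # (Q, K) cosine similarities
    weights = softmax(temperature * sim, axis=1)    # each row sums to 1
    return weights @ keys                           # (Q, D)

def stepwise_match(regions, words):
    """regions: (R, D) image fragment features; words: (W, D) word features."""
    # Step 1: local-to-local alignment at the fragment level.
    attended_words = attend(regions, words)         # words relevant to each region
    attended_regions = attend(words, regions)       # regions relevant to each word
    s_local = float(np.mean(np.sum(l2norm(regions) * l2norm(attended_words), axis=1)))
    # Assumed residual fusion: fragments enriched with cross-modal context.
    regions_ctx = l2norm(regions + attended_words)
    words_ctx = l2norm(words + attended_regions)

    # Step 2: global-to-local alignment at the context level.
    img_global = l2norm(regions_ctx.mean(axis=0, keepdims=True))  # (1, D)
    txt_global = l2norm(words_ctx.mean(axis=0, keepdims=True))    # (1, D)
    img_refined = l2norm(img_global + attend(img_global, words_ctx))
    txt_refined = l2norm(txt_global + attend(txt_global, regions_ctx))

    # Step 3: global-to-global alignment between the refined global vectors.
    s_global = float(np.sum(img_refined * txt_refined))

    return {"local": s_local, "global": s_global, "total": s_local + s_global}

rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 32))   # 5 hypothetical image regions
matched = stepwise_match(regions, regions.copy())               # identical features
mismatched = stepwise_match(regions, rng.standard_normal((7, 32)))
print(matched["total"] > mismatched["total"])
```

In this sketch each step consumes the output of the previous one, which is the sense in which the alignment is progressive: the global vectors in step 2 are pooled from fragments that already carry cross-modal context from step 1.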

