CV | LG | Jun 18, 2022

Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

arXiv:2206.09114v2 | h-index: 58
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of accurately locating objects in images from natural language descriptions, a multi-modal AI task, and offers incremental improvements over prior methods.

The paper tackles the problem of visual grounding by proposing a Query-conditioned Convolution Module (QCM) to incorporate textual query information into visual feature extraction, achieving state-of-the-art performance on three datasets.

Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual grounding datasets demonstrate that our method achieves state-of-the-art performance. In addition, the query-aware visual features are informative enough to achieve comparable performance to the latest methods when directly used for prediction without further multi-modal fusion.
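
The core idea in the abstract, generating convolutional kernels from the query, can be illustrated with a small sketch. This is not the authors' QCM implementation: it assumes 1x1 kernels produced by a single linear projection of a query embedding, and every name here (`linear`, `query_conditioned_conv1x1`, `proj_w`, `proj_b`) is hypothetical. It only shows that the same visual features yield different outputs under different queries.

```python
def linear(x, weight, bias):
    """Plain-list affine map: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def query_conditioned_conv1x1(feature_map, query_emb, proj_w, proj_b, c_out):
    """Apply a 1x1 convolution whose kernel is generated from the query.

    feature_map: nested list shaped [C_in][H][W]
    query_emb:   list of length D (a pooled text embedding, assumed given)
    proj_w:      list shaped [C_out * C_in][D] (hypothetical kernel generator)
    proj_b:      list of length C_out * C_in
    Returns a nested list shaped [C_out][H][W].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    # The query, not a fixed learned tensor, determines the kernel weights.
    flat = linear(query_emb, proj_w, proj_b)
    kernel = [flat[o * c_in:(o + 1) * c_in] for o in range(c_out)]
    return [[[sum(kernel[o][c] * feature_map[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for o in range(c_out)]


# Toy inputs: 2 channels, 1x2 spatial grid, 2-dimensional query embedding.
features = [[[1, 2]], [[3, 4]]]
proj_w = [[1, 0], [0, 1], [0, 1], [1, 0]]
proj_b = [0, 0, 0, 0]

out_query_a = query_conditioned_conv1x1(features, [1, 0], proj_w, proj_b, c_out=2)
out_query_b = query_conditioned_conv1x1(features, [0, 1], proj_w, proj_b, c_out=2)
```

With these toy weights, the first query generates an identity kernel and the second generates a channel-swapping kernel, so the two outputs differ even though the input features are identical. A real module would presumably learn the projection and use larger kernels inside a visual backbone; those details are not stated in this summary.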
