CVJun 26, 2024

ScanFormer: Referring Expression Comprehension by Iteratively Scanning

arXiv:2406.18048v121 citations
Originality Incremental advance
AI Analysis

This work addresses efficiency issues in vision-language models for researchers and practitioners, though it is incremental as it builds on existing REC methods.

The paper tackled the problem of redundant visual processing in Referring Expression Comprehension by proposing ScanFormer, a coarse-to-fine iterative framework that discards irrelevant patches, achieving a balance between accuracy and efficiency on datasets like RefCOCO.

Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our designed informativeness prediction. Furthermore, we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which can strike a balance between accuracy and efficiency.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes