CV · Aug 31, 2023

Distraction-free Embeddings for Robust VQA

arXiv:2309.00133v1 · h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of robust cross-modal understanding in VQA, though it appears incremental because it builds on existing attention and alignment methods.

The paper tackles the problem of generating effective latent representations for Vision-Language Understanding tasks such as Video Question Answering. It removes distractors in the latent space and reports improved performance on the SUTD-TrafficQA dataset across various query types.

The generation of effective latent representations and their subsequent refinement to incorporate precise information is an essential prerequisite for Vision-Language Understanding (VLU) tasks such as Video Question Answering (VQA). However, most existing methods for VLU focus on sparsely sampling or fine-graining the input information (e.g., sampling a sparse set of frames or text tokens), or adding external knowledge. We present a novel "DRAX: Distraction Removal and Attended Cross-Alignment" method to rid our cross-modal representations of distractors in the latent space. We do not exclusively confine the perception of any input information from various modalities but instead use an attention-guided distraction removal method to increase focus on task-relevant information in latent embeddings. DRAX also ensures semantic alignment of embeddings during cross-modal fusions. We evaluate our approach on a challenging benchmark (SUTD-TrafficQA dataset), testing the framework's abilities for feature and event queries, temporal relation understanding, forecasting, hypothesis, and causal analysis through extensive experiments.
