CV | Jul 9, 2024

Integrating Query-aware Segmentation and Cross-Attention for Robust VQA

arXiv:2407.12055v1 | 2 citations | h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses VQA challenges for visually impaired users, but it is incremental as it builds on existing LVLM and segmentation techniques.

The paper aims to improve visual question answering (VQA) on the VizWiz dataset by integrating query-aware segmentation and cross-attention mechanisms, then ensembling the predictions of the resulting models to raise accuracy.

This paper introduces a method for VizWiz-VQA using a large vision-language model (LVLM) with trainable cross-attention and LoRA finetuning. We train the model under the following conditions: 1) training with the original images; 2) training with enhanced images, using CLIPSeg to highlight or contrast the original image; 3) training that integrates the output features of the Vision Transformer (ViT) with CLIPSeg features of the original images. We then ensemble the results based on Levenshtein distance to improve the prediction of the final answer. In the experiments, we demonstrate and analyze the effectiveness of the proposed method.
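The abstract does not spell out how the Levenshtein-based ensemble picks the final answer. One plausible reading, shown here as a minimal sketch and not as the authors' implementation, is to choose the candidate answer that is closest on average to the answers from the other models (a medoid under edit distance). The function names `levenshtein` and `ensemble_answer`, the lowercase/strip normalization, and the tie-breaking rule (first candidate wins) are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic program)."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ensemble_answer(candidates):
    """Assumed rule: return the answer with the smallest total edit
    distance to all candidate answers (ties go to the earliest one)."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    # Assumed normalization; the paper may normalize differently.
    normalized = [c.strip().lower() for c in candidates]
    totals = [sum(levenshtein(c, other) for other in normalized)
              for c in normalized]
    best = min(range(len(normalized)), key=totals.__getitem__)
    return normalized[best]
```

In this sketch, each of the three training conditions would contribute one predicted answer string per question, and `ensemble_answer` would return the one most consistent with the rest. A frequency vote over exact matches is a simpler alternative; the edit-distance version also rewards near-duplicates such as answers that differ only in punctuation.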

