CV, LG · Sep 11, 2024

Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

arXiv:2409.07331v2 · 2 citations · h-index: 10
AI Analysis

This paper addresses inference-efficiency issues in KB-VQA that matter for practical applications. Its compression technique is an incremental contribution, but it delivers strong, specific gains.

The paper tackles inefficient inference in knowledge-based visual question answering (KB-VQA), which is caused by the large number of input tokens. It proposes a method that compresses retrieved knowledge into a compact Key-Value (KV) cache used to adapt a frozen multimodal large language model (MLLM). The method achieves state-of-the-art performance of 63.92% on OK-VQA and reduces inference latency by 22.0%-59.7% relative to RAVQA-v2.

Multimodal large language models (MLLMs) have demonstrated strong performance on visual question answering (VQA). In knowledge-based visual question answering (KB-VQA), however, MLLMs may lack the specialized domain knowledge needed to answer a question, so the relevant information must be retrieved from external knowledge sources. Previous works such as Retrieval-Augmented VQA-v2 (RAVQA-v2) focus on using as much input information as possible, including image-based textual descriptions and retrieved knowledge, to improve performance. They overlook the fact that inference efficiency drops significantly as the number of input tokens grows, which conflicts with the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLMs with Compressed Contexts (RACC). RACC learns to compress and aggregate the retrieved knowledge for a given image-question pair, generating a compact modulation in the form of a Key-Value (KV) cache that adapts the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves state-of-the-art (SOTA) performance of 63.92% on OK-VQA. Moreover, it reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Extensive experiments show RACC's broad applicability: it is compatible with various off-the-shelf MLLMs and can handle different knowledge sources, including textual and multimodal documents.
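To make the idea of a "compact modulation in the form of a KV cache" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, random stand-ins for learned weights, and a simple attention-pooling compressor; the names `compress_documents`, `attend_with_prefix`, and `num_slots` are hypothetical and do not come from the paper. It only illustrates how a frozen attention layer could read a handful of compressed key/value slots instead of every retrieved token.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_documents(doc_tokens, num_slots, rng):
    """Hypothetical compressor: pool all retrieved tokens into num_slots K/V pairs.

    doc_tokens has shape (num_docs, tokens_per_doc, d). The random matrices
    below stand in for parameters that would be learned in a real system.
    """
    d = doc_tokens.shape[-1]
    tokens = doc_tokens.reshape(-1, d)                # aggregate across documents
    slot_queries = rng.standard_normal((num_slots, d))
    w_k = rng.standard_normal((d, d)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)
    weights = softmax(slot_queries @ tokens.T / np.sqrt(d))
    pooled = weights @ tokens                         # (num_slots, d)
    return pooled @ w_k, pooled @ w_v


def attend_with_prefix(q, k, v, prefix_k, prefix_v):
    """Causal single-head attention over the input, with a KV prefix prepended.

    The prefix slots are visible to every input position; input tokens only
    see themselves and earlier input tokens.
    """
    n, p = q.shape[0], prefix_k.shape[0]
    k_all = np.concatenate([prefix_k, k], axis=0)
    v_all = np.concatenate([prefix_v, v], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    visible = np.concatenate(
        [np.ones((n, p), dtype=bool), np.tril(np.ones((n, n), dtype=bool))],
        axis=1,
    )
    scores = np.where(visible, scores, -np.inf)
    return softmax(scores) @ v_all


rng = np.random.default_rng(0)
d_model = 16
docs = rng.standard_normal((5, 100, d_model))         # 5 retrieved docs, 100 tokens each
prefix_k, prefix_v = compress_documents(docs, num_slots=8, rng=rng)

q = rng.standard_normal((4, d_model))                 # 4 question/image tokens
k = rng.standard_normal((4, d_model))
v = rng.standard_normal((4, d_model))
out = attend_with_prefix(q, k, v, prefix_k, prefix_v)
print(prefix_k.shape, out.shape)                      # (8, 16) (4, 16)
```

In this toy setting each input token attends over 8 + 4 = 12 positions, whereas placing the raw documents in the prompt would mean attending over 500 + 4 = 504. That difference in sequence length is the kind of saving the abstract attributes to compressed contexts; the sketch makes no claim about the actual latency numbers.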

