LG AI · May 23, 2025

Towards General Continuous Memory for Vision-Language Models

Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

arXiv:2505.17670v2 · 15 citations · h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses the need for efficient external memory in vision-language models on complex reasoning tasks. The advance is rated incremental because it builds on existing VLM capabilities.

Vision-language models struggle with complex reasoning tasks that require multimodal or multilingual knowledge. The paper proposes a continuous memory system built from dense embeddings, which improves performance on such tasks while training only 1.2% of the model's parameters on 15.6K self-synthesized samples.

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings, to represent multimodal and multilingual knowledge more effectively and efficiently. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach, CoMEM, utilizes the VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
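To make the idea of a compact continuous memory concrete, the following is a minimal toy sketch, not the paper's implementation. In CoMEM the encoder is the fine-tuned VLM itself; here, as a swapped-in stand-in, plain scaled dot-product attention with randomly initialized query vectors pools a long sequence of token embeddings into 8 vectors, which are then prepended to the question's embeddings. The names `compress_to_memory`, `build_model_input`, and `random_vectors`, the embedding width of 16, and the sequence lengths are all hypothetical choices made for illustration.

```python
import math
import random

NUM_MEMORY_SLOTS = 8  # the abstract's "just 8 continuous embeddings"


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_to_memory(token_embeddings, queries):
    """Pool a variable-length sequence into len(queries) vectors.

    Each query attends over all token embeddings (scaled dot-product
    attention) and yields their attention-weighted average.
    Hypothetical stand-in for a learned memory encoder.
    """
    dim = len(queries[0])
    memory = []
    for q in queries:
        scores = [
            sum(qi * ti for qi, ti in zip(q, tok)) / math.sqrt(dim)
            for tok in token_embeddings
        ]
        weights = softmax(scores)
        pooled = [
            sum(w * tok[d] for w, tok in zip(weights, token_embeddings))
            for d in range(dim)
        ]
        memory.append(pooled)
    return memory


def build_model_input(memory, question_embeddings):
    """Prepend memory vectors to the question embeddings.

    The downstream model is untouched; only its input sequence changes.
    """
    return memory + question_embeddings


def random_vectors(count, dim, rng):
    """Toy embeddings drawn from a standard normal distribution."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 16  # toy embedding width (assumption, far smaller than a real VLM)
queries = random_vectors(NUM_MEMORY_SLOTS, DIM, rng)  # stand-in for learned queries
knowledge = random_vectors(500, DIM, rng)  # e.g. retrieved image and text token embeddings
question = random_vectors(12, DIM, rng)

memory = compress_to_memory(knowledge, queries)
model_input = build_model_input(memory, question)

# Sequence length with naive concatenation vs. with compressed memory.
print(len(knowledge) + len(question), "->", len(model_input))  # 512 -> 20
```

In this toy setting, naive concatenation would feed 512 positions to the model, while the compressed variant feeds 20, regardless of how much retrieved knowledge is pooled. That fixed budget is the property the abstract contrasts with long concatenated token sequences.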

