CV
Oct 8, 2025

Cross-Modal Attention Guided Unlearning in Vision-Language Models

arXiv:2510.07567v1
h-index: 7
Originality: Incremental advance
AI Analysis

The work addresses privacy concerns in vision-language models for tasks such as visual question answering and offers a practical solution for real-world applications. It is incremental, however, because it adapts existing unlearning ideas to a more complex multi-modal setting.

The paper tackles the problem of vision-language models leaking private or sensitive information memorized during training. It proposes a lightweight unlearning framework that uses cross-modal attention to modify visual tokens, achieving performance comparable to finetuning-based methods without altering pre-trained parameters.

Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it at inference time. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better than or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
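
The idea of using cross-modal attention to pick out low-importance visual tokens and transform only those can be illustrated with a small sketch. This is not the authors' implementation: the function names, the importance score (mean attention received from the text tokens), the budget `k`, and the additive `transform` standing in for a learned external module are all assumptions made for illustration. The abstract does not specify any of these details.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def visual_token_importance(attn: Sequence[Sequence[float]]) -> List[float]:
    """Score each visual token by the mean attention it receives.

    attn[i][j] is the (assumed) attention weight from text token i
    to visual token j. Averaging over text tokens is an assumption.
    """
    n_text = len(attn)
    n_vis = len(attn[0])
    return [sum(row[j] for row in attn) / n_text for j in range(n_vis)]


def lowest_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k lowest-scoring tokens, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j])
    return sorted(ranked[:k])


def transform_low_importance_tokens(
    visual_tokens: Sequence[Vector],
    attn: Sequence[Sequence[float]],
    k: int,
    transform: Callable[[Vector], Vector],
) -> List[Vector]:
    """Apply `transform` to the k least-attended visual tokens only.

    `transform` is a hypothetical stand-in for an external module;
    all other tokens are copied through unchanged.
    """
    chosen = set(lowest_k_indices(visual_token_importance(attn), k))
    return [
        transform(tok) if j in chosen else list(tok)
        for j, tok in enumerate(visual_tokens)
    ]


# Toy example: 2 text tokens attending over 3 visual tokens.
attn = [[0.5, 0.1, 0.4],
        [0.6, 0.2, 0.2]]
tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
shift = lambda tok: [x + 1.0 for x in tok]  # placeholder for a learned module
edited = transform_low_importance_tokens(tokens, attn, k=1, transform=shift)
```

In this toy input, the second visual token receives the least attention, so it is the only one passed through the placeholder transform. The model's own parameters are never touched, which mirrors the abstract's claim that the pre-trained weights stay unchanged.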

