LG, AI
Feb 19, 2025

Quantifying Memorization and Parametric Response Rates in Retrieval-Augmented Vision-Language Models

arXiv:2502.13836v2
4 citations
h-index: 22
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of measuring memorization in multimodal retrieval-augmented models. It is an incremental advance: it builds on existing work by adding new metrics and comparisons.

The paper tackles the problem of quantifying memorization versus retrieval in vision-language models, finding that finetuned models rely more on memorization and achieve higher accuracy (72% vs 52% on WebQA), and that image-based questions have 15-25% higher parametric response rates than text-based ones.

Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented VLMs memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics by investigating instances where QA succeeds despite retrieval failing. In line with existing work, we find that finetuned models rely more heavily on memorization than retrieval-augmented VLMs, and achieve higher accuracy as a result (72% vs 52% on WebQA test set). Finally, we present the first empirical comparison of the parametric effect between text and visual modalities. Here, we find that image-based questions have parametric response rates that are consistently 15-25% higher than for text-based questions in the WebQA dataset. As such, our measures pose a challenge for future work, both to account for differences in model memorization across different modalities and more generally to reconcile memorization and generalization in joint Retrieval-QA tasks.
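The abstract describes its proxy metrics only as counting instances where QA succeeds despite retrieval failing. As a rough illustration, the sketch below computes one plausible reading of a parametric response rate: among examples where retrieval failed, the fraction the model still answered correctly, broken down by modality. The `Example` record, its field names, and the function `parametric_response_rate` are hypothetical and are not taken from the paper, whose exact definitions may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Example:
    """One evaluated question (hypothetical record layout, not from the paper)."""
    modality: str            # e.g. "text" or "image"
    retrieval_correct: bool  # did the retriever return the gold sources?
    qa_correct: bool         # was the final answer judged correct?


def parametric_response_rate(examples: Iterable[Example]) -> Optional[float]:
    """Fraction of retrieval failures that were still answered correctly.

    Assumed definition: correct answers given failed retrieval, divided by
    all failed retrievals. Returns None if retrieval never failed.
    """
    failed = [ex for ex in examples if not ex.retrieval_correct]
    if not failed:
        return None
    return sum(ex.qa_correct for ex in failed) / len(failed)


def rate_by_modality(examples: Iterable[Example]) -> Dict[str, Optional[float]]:
    """Compute the same rate separately for each modality."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex.modality].append(ex)
    return {name: parametric_response_rate(group) for name, group in groups.items()}


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy = [
        Example("text", retrieval_correct=False, qa_correct=False),
        Example("text", retrieval_correct=False, qa_correct=True),
        Example("image", retrieval_correct=False, qa_correct=True),
        Example("image", retrieval_correct=True, qa_correct=True),
    ]
    print(rate_by_modality(toy))
```

Conditioning on retrieval failure is the design choice that matters here: a correct answer in that case cannot have come from the retrieved sources, so it is attributed to what the model already stores in its parameters.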
