CV | CR | Feb 26

No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

arXiv:2602.22689v1 | h-index: 7
Originality: Incremental advance
AI Analysis

This work is significant for auditing memorization in latent diffusion models, particularly when ground-truth captions are unavailable, which is a common situation in privacy and intellectual property audits.

The paper introduces MoFit, a caption-free membership inference attack (MIA) framework for latent diffusion models. It addresses the limitation of existing MIAs that require ground-truth captions by constructing synthetic conditioning inputs that are overfitted to the model's generative manifold, enabling effective membership inference without textual annotations.
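A membership inference attack ultimately reduces to thresholding a per-sample score, so it helps to see how such scores are usually evaluated. The sketch below is a minimal, generic illustration and is not taken from the paper: the function names `auc` and `tpr_at_fpr` are hypothetical, the convention that a higher score means "more likely a member" is an assumption, and the score lists are made-up illustrative values rather than reported results.

```python
from typing import Sequence


def auc(member_scores: Sequence[float], holdout_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random hold-out (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for h in holdout_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(member_scores) * len(holdout_scores))


def tpr_at_fpr(member_scores: Sequence[float],
               holdout_scores: Sequence[float],
               max_fpr: float) -> float:
    """Best true-positive rate over thresholds whose false-positive rate is <= max_fpr."""
    best = 0.0
    # Candidate thresholds: every observed score. A sample is flagged if score >= threshold.
    for t in sorted(set(member_scores) | set(holdout_scores)):
        fpr = sum(h >= t for h in holdout_scores) / len(holdout_scores)
        if fpr <= max_fpr:
            tpr = sum(m >= t for m in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best


# Illustrative scores only (assumed convention: higher = more likely a training member).
members = [0.9, 0.8, 0.7, 0.4]
holdouts = [0.6, 0.3, 0.2, 0.1]

print(auc(members, holdouts))              # 0.9375
print(tpr_at_fpr(members, holdouts, 0.0))  # 0.75
```

Reporting the true-positive rate at a low false-positive rate alongside AUC is a common choice for membership inference, because an auditor usually cares about confident identifications rather than average-case ranking.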

Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
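The two stages described in the abstract can be sketched as a small control-flow example. This is only a toy under stated assumptions, not the authors' implementation: the abstract does not give the exact objectives, perturbation budget, optimizer, or decision statistic, so the quadratic `toy_uncond_loss` and `toy_cond_loss`, the helper names `numerical_grad`, `minimize`, and `caption_free_losses`, the unconstrained perturbation, and the use of finite-difference gradients in place of backpropagation through a diffusion model are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Vector = List[float]


def numerical_grad(f: Callable[[Vector], float], v: Vector, eps: float = 1e-5) -> Vector:
    """Central finite-difference gradient; stands in for backprop through a real model."""
    grad = []
    for i in range(len(v)):
        up, down = list(v), list(v)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def minimize(f: Callable[[Vector], float], v0: Vector,
             lr: float = 0.1, steps: int = 200) -> Vector:
    """Plain gradient descent (assumed optimizer; the abstract does not name one)."""
    v = list(v0)
    for _ in range(steps):
        v = [vi - lr * gi for vi, gi in zip(v, numerical_grad(f, v))]
    return v


def caption_free_losses(x: Vector,
                        uncond_loss: Callable[[Vector], float],
                        cond_loss: Callable[[Vector, Vector], float],
                        embed_dim: int) -> Dict[str, object]:
    # Stage (i): optimize a perturbation so the surrogate x + delta has low
    # unconditional loss under the model (assumed objective, no norm bound).
    delta = minimize(lambda d: uncond_loss([xi + di for xi, di in zip(x, d)]),
                     [0.0] * len(x))
    surrogate = [xi + di for xi, di in zip(x, delta)]
    # Stage (ii): fit a conditioning embedding to the surrogate, not to the query.
    embedding = minimize(lambda e: cond_loss(surrogate, e), [0.0] * embed_dim)
    # Evaluate the ORIGINAL query under the mismatched embedding. How these two
    # losses are combined into a membership score is left open here.
    return {
        "surrogate": surrogate,
        "embedding": embedding,
        "cond_loss": cond_loss(x, embedding),
        "uncond_loss": uncond_loss(x),
    }


# Toy stand-ins for the model's losses (hypothetical, for illustration only).
MODE = [1.0, -1.0]


def toy_uncond_loss(z: Vector) -> float:
    return sum((zi - mi) ** 2 for zi, mi in zip(z, MODE))


def toy_cond_loss(z: Vector, e: Vector) -> float:
    return sum((zi - ei) ** 2 for zi, ei in zip(z, e))


result = caption_free_losses([0.0, 0.0], toy_uncond_loss, toy_cond_loss, embed_dim=2)
print(result["surrogate"], result["cond_loss"])
```

In this toy, the surrogate is pulled toward `MODE` and the fitted embedding follows it, so the embedding is deliberately mismatched with the original query. With a real latent diffusion model, the loss callables would wrap the denoising objective, and the membership score would be computed from the returned losses.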

