CR · May 7

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

arXiv:2605.06423 · 68.0
Predicted impact: top 22% in CR · last 90 days
Originality: Incremental advance
AI Analysis

This work exposes persistent privacy vulnerabilities in modern LLMs, enabling adversaries to infer training data membership with high accuracy.

The PopQuiz Attack is a black-box membership inference attack against LLMs that converts target data into multiple-choice questions and infers membership from the model's answers. It attains an average ROC-AUC of 0.873 and outperforms existing methods by 20.6% across six models and four datasets.

Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.
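The quiz-style idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how questions are built or scored, so the scheme below is an assumption (show the first half of a target text, ask which option is its original second half, and use the fraction of correct answers as the membership score). The names `make_quiz`, `membership_score`, `roc_auc`, and the `ask` callable (a stand-in for a black-box model query that returns an option letter) are all hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple

LETTERS = "ABCD"


def make_quiz(target: str, decoys: Sequence[str], rng: random.Random) -> Tuple[str, str]:
    """Return (prompt, correct_letter) for one question about `target`.

    Assumed scheme: the prompt shows the first half of the target and asks
    which option is its original second half. Wrong options are decoy texts.
    """
    words = target.split()
    cut = len(words) // 2
    prefix, truth = " ".join(words[:cut]), " ".join(words[cut:])
    options = [truth] + rng.sample(list(decoys), len(LETTERS) - 1)
    rng.shuffle(options)
    lines = [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    prompt = f"Which option continues this text?\n{prefix}\n" + "\n".join(lines)
    return prompt, LETTERS[options.index(truth)]


def membership_score(
    target: str,
    decoys: Sequence[str],
    ask: Callable[[str], str],
    n_questions: int = 8,
    seed: int = 0,
) -> float:
    """Fraction of quiz questions the model answers correctly.

    `ask` is a hypothetical black-box query: it takes a prompt and returns
    the model's answer, expected to start with an option letter.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        prompt, answer = make_quiz(target, decoys, rng)
        if ask(prompt).strip().upper().startswith(answer):
            correct += 1
    return correct / n_questions


def roc_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5, which matches the area under the ROC curve.
    """
    wins = sum(
        (m > n) + 0.5 * (m == n)
        for m in member_scores
        for n in nonmember_scores
    )
    return wins / (len(member_scores) * len(nonmember_scores))
```

In this sketch a higher score suggests the text was seen during training, and `roc_auc` summarizes how well the scores separate known members from known non-members. A real attack would also need plausible distractors, since trivially different decoys let a model answer correctly without any memorization.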

