LG · CR · Feb 19

Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong

arXiv:2602.16980v1 · 1.4 · h-index: 18

Originality: Highly original

AI Analysis

This work addresses the privacy risks that AI systems pose for users and developers, offering a novel mechanistic-interpretability approach.

The paper tackles the problem of understanding and controlling personally identifiable information (PII) leakage in language models. It discovers universal activation directions that amplify the probability of PII generation across contexts, and steering along them substantially increases leakage compared to existing prompt-based extraction methods.

Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
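
The "linear addition at inference time" mentioned in the abstract can be illustrated with a toy sketch. This is not UniLeak's procedure for finding directions, which the abstract does not detail; it only shows what adding a scaled direction to a hidden state does to next-token probabilities. The vector sizes, the `unembed` matrix, the strength `alpha`, and all function names are hypothetical values chosen for illustration.

```python
import math

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to one hidden-state vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_probs(hidden, unembed):
    """Project a hidden state onto a toy unembedding matrix and normalize."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)

# Toy setup (all values are assumptions): 3-dim hidden state, 2-token vocabulary.
hidden = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, 0.0]      # stands in for a steering direction
unembed = [
    [2.0, 0.0, 0.0],             # token 0: aligned with the direction
    [0.0, 1.0, 1.0],             # token 1: any other token
]

base = token_probs(hidden, unembed)
steered = token_probs(steer(hidden, direction, alpha=1.5), unembed)
print(f"token 0 probability: base={base[0]:.3f}, steered={steered[0]:.3f}")
```

In this toy case, the probability of the token whose unembedding row aligns with the direction rises after steering. A negative `alpha` would push it down instead, which is the intuition behind using the same direction for mitigation.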
