AI | Feb 25

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

arXiv:2602.21496v1 | h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses a critical safety issue for LLM users by enabling self-correction without destroying utility, but it is an incremental advance because it builds on existing defense concepts.

The paper tackles the problem of Semantic Sensitive Information (SemSI) leaks in Large Language Models (LLMs), where models infer sensitive attributes or generate harmful content, by introducing SemSIEdit, an inference-time framework that reduces leakage by 34.6% with a utility loss of 9.8%.

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
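The iterative critique-and-rewrite idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `edit_loop`, `toy_critique`, `toy_rewrite`, and `TOY_REPLACEMENTS` are hypothetical, and the critic and rewriter are keyword-based stand-ins for what would presumably be LLM calls in SemSIEdit. The sketch only shows the control flow of flagging sensitive spans and rewriting them in place instead of refusing.

```python
from typing import Callable, List, Tuple

# Hypothetical span type: (start offset, end offset, matched phrase).
Span = Tuple[int, int, str]

# Toy lookup table standing in for a model-based rewriter (assumption).
TOY_REPLACEMENTS = {
    "is certainly bankrupt": "may be facing financial difficulties",
    "must be diabetic": "has not publicly discussed their health",
}


def toy_critique(text: str) -> List[Span]:
    """Flag the first occurrence of each known sensitive phrase."""
    spans = []
    for phrase in TOY_REPLACEMENTS:
        start = text.find(phrase)
        if start != -1:
            spans.append((start, start + len(phrase), phrase))
    return spans


def toy_rewrite(text: str, span: Span) -> str:
    """Replace one flagged span with a softer phrasing, keeping the rest."""
    start, end, phrase = span
    return text[:start] + TOY_REPLACEMENTS[phrase] + text[end:]


def edit_loop(
    draft: str,
    critique: Callable[[str], List[Span]],
    rewrite: Callable[[str, Span], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Critique, rewrite flagged spans, and repeat until clean or out of rounds.

    Returns the edited text and the number of rewrite rounds performed.
    """
    text = draft
    for round_idx in range(max_rounds):
        spans = critique(text)
        if not spans:
            return text, round_idx
        # Rewrite right to left so earlier offsets stay valid.
        for span in sorted(spans, key=lambda s: s[0], reverse=True):
            text = rewrite(text, span)
    return text, max_rounds


if __name__ == "__main__":
    draft = "The founder is certainly bankrupt and left town."
    edited, rounds = edit_loop(draft, toy_critique, toy_rewrite)
    print(edited)
    print("rounds:", rounds)
```

The loop keeps the surrounding sentence intact and changes only the flagged span, which corresponds to the abstract's contrast between rewriting and refusal. Swapping the toy functions for model-backed ones would leave `edit_loop` unchanged.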

