SI · May 10

Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs

Jingshen Zhang, Bo Wang, Yanlin Fu, Dongming Zhao, Ruifang He, Yuexian Hou, Zifei Yu

arXiv:2605.09647 · 55.0

Predicted impact: top 17% in SI · last 90 days · Originality: Highly original

AI Analysis

For LLM safety researchers, this work provides a mechanistic understanding of implicit self-debiasing and offers lightweight interventions to enhance fairness.

The paper introduces COCO, a method to identify neurons in LLMs that mediate self-debiasing against stereotypes. Ablating these neurons causes over 90% of outputs to revert to biased content. Two training-free editing strategies (LE-COCO, NE-COCO) improve robustness against adversarial jailbreaks and perform strongly on open-ended safety benchmarks without harming generative performance.

In this paper, we study an emergent self-debiasing mechanism against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms that are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve generation-time intrinsic corrections that are not directly reducible to surface-level prompts. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-COnsistency yet sharp inter-COntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness: over 90% of outputs revert to biased content, far exceeding the bias levels induced by explicit adversarial jailbreak attacks. Observing that simple weight amplification of COCO neurons yields only marginal gains, we propose two training-free, lightweight editing strategies: Local Enhancement (LE-COCO) and Networked Enhancement (NE-COCO). Comprehensive evaluations show that our methods bolster robustness against adversarial jailbreaks and achieve strong performance on open-ended safety benchmarks, while preserving foundational generative proficiency. While this study primarily addresses social stereotypes, the COCO mechanism holds significant potential for diverse domains such as hallucination detection, offering valuable insights toward the development of self-evolving AI agents.
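
The abstract does not give the scoring formula COCO uses, so the following is only a minimal sketch of the general idea it describes: rank neurons that are consistent within each response group yet differ sharply between the two groups, then ablate the top-ranked ones. The score (difference of group means divided by the sum of within-group standard deviations), the function names `coco_style_scores`, `top_k_neurons`, and `ablate`, and the toy activations are all assumptions made for illustration. None of them come from the paper.

```python
from statistics import mean, pstdev

def coco_style_scores(acts_a, acts_b, eps=1e-8):
    """Score each neuron by between-group contrast over within-group spread.

    acts_a, acts_b: lists of activation vectors (one per response) for two
    antithetical response groups, e.g. stereotypical vs. unbiased outputs.
    This scoring rule is an illustrative assumption, not the paper's formula.
    """
    n_neurons = len(acts_a[0])
    scores = []
    for j in range(n_neurons):
        a = [row[j] for row in acts_a]
        b = [row[j] for row in acts_b]
        contrast = abs(mean(a) - mean(b))     # inter-group contrast
        spread = pstdev(a) + pstdev(b) + eps  # low spread = high intra-group consistency
        scores.append(contrast / spread)
    return scores

def top_k_neurons(scores, k):
    """Return the indices of the k highest-scoring neurons."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def ablate(activation, neuron_ids):
    """Zero out the selected neurons in one activation vector (simple ablation)."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else v for j, v in enumerate(activation)]

# Toy data (hypothetical): neuron 1 is stable within each group
# and differs sharply between the two groups.
biased = [[0.5, 0.1, 0.9], [0.4, 0.1, 0.2], [0.6, 0.2, 0.5]]
unbiased = [[0.5, 2.0, 0.3], [0.6, 2.1, 0.8], [0.4, 1.9, 0.6]]

scores = coco_style_scores(biased, unbiased)
selected = top_k_neurons(scores, k=1)
ablated = ablate(unbiased[0], selected)
```

In this toy setup only neuron 1 is selected, and ablating it zeroes that entry while leaving the others untouched. A real analysis would read activations from a model's hidden layers and would need a causal check (as the abstract's ablation study suggests) rather than relying on a correlational score alone.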
