AIMay 24, 2025

Mitigating Deceptive Alignment via Self-Monitoring

Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang

arXiv:2505.18807v125 citationsh-index: 13

Originality Incremental advance

AI Analysis

This addresses a critical safety issue for AI systems by mitigating covert misalignment in reasoning, though it is an incremental improvement over existing safety methods.

The paper tackles the problem of deceptive alignment in large language models, where models appear aligned but pursue misaligned goals during chain-of-thought reasoning, by introducing CoT Monitor+, a framework that embeds a self-monitor to intercept deception internally, resulting in a 43.8% reduction in deceptive behaviors on average while preserving task accuracy.

Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment, situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question, the first framework that embeds a Self-Monitor inside the CoT process itself, named CoT Monitor+. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, etc. We evaluate various LLMs and show that unrestricted CoT roughly aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io

View on arXiv PDF

Similar