LGJun 17, 2025

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

arXiv:2506.14261v313 citationsh-index: 3
Originality Incremental advance
AI Analysis

This addresses a security vulnerability in AI safety monitoring for researchers and practitioners, showing that current detection methods can be bypassed, which is an incremental but important finding.

The paper tackles the problem of whether language models can learn to evade latent-space monitors that detect undesirable behaviors, finding that token-level monitors are highly vulnerable with models achieving high evasion success rates, while more holistic monitors remain robust.

Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations rather than relying solely on black-box outputs. These methods have shown promise in identifying behaviours such as deception and unsafe completions. However, these monitors may themselves become training signals, for example, by using problematic samples found in deployment to retrain models. This raises an important question: can models learn to evade such monitors? To evaluate this capability, we introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors while maintaining their blackbox behaviour. We apply RL-Obfuscation to Language Models ranging from 7B to 14B parameters and evaluate their Evasion Success Rate against a suite of monitors. We find that token-level monitors are highly vulnerable to this attack while more holistic monitors, such as max-pooling or attention-based probes, remain robust. Moreover, for these vulnerable monitors, models trained to evade a single static monitor can generalise to evade other unseen monitors. We also find that the models can be trained to conditionally bypass latent-space monitors on only certain inputs. Finally, we study how the models bypass these monitors and find that the model can learn to repurpose tokens to have different internal representations.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes