LGAug 27, 2025

Evaluating Language Model Reasoning about Confidential Information

arXiv:2508.19980v12 citationsh-index: 9
Originality Incremental advance
AI Analysis

This addresses a critical safety concern for deploying language models as autonomous agents in high-stakes settings, but it is incremental as it builds on existing benchmarks and methods.

The paper tackled the problem of whether language models can reliably follow context-dependent safety rules, such as verifying passwords, and found that current models struggle with this task, with reasoning traces often leaking confidential information.

As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes