Building Better Deception Probes Using Targeted Instruction Pairs
This work addresses the challenge of monitoring AI systems for deception. It is incremental: it builds on existing probe methods by refining how they are trained.
The paper tackles the problem of improving linear probes for detecting deceptive behavior in AI systems by focusing on the instruction pairs used during training. It shows that targeting specific deceptive behaviors through a human-interpretable taxonomy leads to better performance, with instruction choice explaining 70.6% of the variance in probe results.
Linear probes are a promising approach for monitoring AI systems for deceptive behavior. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.
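To make the setup concrete, the following is a minimal sketch of a linear probe trained on contrastive data. It is not the paper's implementation. The activations are synthetic stand-ins for model activations collected under an "honest" and a "deceptive" instruction, and the function names (`make_activations`, `train_linear_probe`, `probe_scores`), the dimensionality, the shift size, and the optimizer settings are all assumptions chosen for illustration.

```python
import numpy as np


def make_activations(n, dim, direction, shift, rng):
    """Synthetic stand-in for activations (hypothetical data, not from the paper).

    Deceptive-instruction samples are offset along `direction`; honest ones are not.
    """
    honest = rng.normal(size=(n, dim))
    deceptive = rng.normal(size=(n, dim)) + shift * direction
    X = np.vstack([honest, deceptive])
    y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = honest, 1 = deceptive
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Fit an L2-regularized logistic-regression probe with plain gradient descent."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8
    Xn = (X - mu) / sigma  # standardize features before fitting
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))  # predicted P(deceptive)
        w -= lr * (Xn.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b, mu, sigma


def probe_scores(X, w, b, mu, sigma):
    """Return probe logits; positive values indicate the 'deceptive' class."""
    return ((X - mu) / sigma) @ w + b


rng = np.random.default_rng(0)
dim = 32  # assumed activation width, for illustration only
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# One train/test split drawn from the same assumed instruction pair.
X_train, y_train = make_activations(200, dim, direction, 4.0, rng)
X_test, y_test = make_activations(200, dim, direction, 4.0, rng)

w, b, mu, sigma = train_linear_probe(X_train, y_train)
preds = probe_scores(X_test, w, b, mu, sigma) > 0
accuracy = float(np.mean(preds == y_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real setting, `make_activations` would be replaced by activations extracted from the monitored model under each instruction of the pair. Under the paper's argument, changing that pair is the step expected to change probe behavior most.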