How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
This work addresses misinformation and deceptive behavior in AI systems, a concern for both users and developers. The contribution is incremental: it builds on existing lie detection concepts, but adds a novel black-box approach.
The paper tackles the problem of detecting lies in black-box large language models (LLMs). It develops a simple detector that asks unrelated follow-up questions and feeds the answers into a logistic regression classifier, achieving high accuracy and generalizing across different LLM architectures and contexts.
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
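The pipeline described in the abstract (collect yes/no answers to a fixed set of follow-up questions, then classify them with logistic regression) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the +1/-1 encoding of answers, the plain gradient-descent training loop, the function names, and the six toy transcripts, which are invented and are not real model outputs. The step of actually querying an LLM is omitted.

```python
import math

def encode_answers(answers):
    """Map yes/no strings to +1/-1 features (anything else becomes 0).
    This encoding is an assumption, not necessarily the paper's."""
    mapping = {"yes": 1.0, "no": -1.0}
    return [mapping.get(a.strip().lower(), 0.0) for a in answers]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_lie_probability(weights, bias, features):
    """Estimated probability that the preceding statement was a lie."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on the mean log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            err = predict_lie_probability(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Hypothetical toy data: answers to three follow-up questions per transcript,
# with label 1 = the model lied beforehand, 0 = it answered truthfully.
toy_transcripts = [
    (["yes", "no", "yes"], 1),
    (["yes", "yes", "yes"], 1),
    (["no", "no", "yes"], 0),
    (["no", "yes", "no"], 0),
    (["yes", "no", "no"], 1),
    (["no", "no", "no"], 0),
]
X = [encode_answers(answers) for answers, _ in toy_transcripts]
y = [label for _, label in toy_transcripts]
weights, bias = train_logistic_regression(X, y)
```

To score a new transcript under these assumptions, encode its answers and call `predict_lie_probability(weights, bias, encode_answers(new_answers))`; a value above 0.5 would be read as "likely a lie". Because the classifier only sees the answers, it needs neither the model's activations nor the ground truth of the original statement, which matches the black-box setting the abstract describes.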