CL, AI, LG | Sep 26, 2023

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Berkeley, Cambridge
arXiv:2309.15840v1 | 92 citations | h-index: 64
Originality: Incremental advance
AI Analysis

The paper addresses misinformation and deceptive behavior in AI systems, a concern for both users and developers. The advance is incremental: it builds on existing lie-detection concepts, but its black-box approach is novel.

The paper tackles the problem of detecting lies in black-box large language models (LLMs). It develops a simple detector that asks unrelated follow-up questions and feeds the yes/no answers into a logistic regression classifier, achieving high accuracy and generalizing across different LLM architectures and contexts.

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
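The detector described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the follow-up questions, the toy transcripts and their labels, and all function names (`encode_answers`, `train_logistic_regression`, `lie_probability`) are assumptions made up for this example, and the logistic regression is a plain gradient-descent version written with the standard library only.

```python
import math

# Hypothetical follow-up questions; NOT the question set used in the paper.
FOLLOW_UPS = [
    "Is the sky blue on a clear day?",
    "Does it feel bad to say things that are not true?",
    "Can a fish ride a bicycle?",
]

def encode_answers(answers):
    """Map yes/no answer strings to 1.0/0.0 features, one per follow-up question."""
    return [1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def lie_probability(w, b, answers):
    """Estimated probability that the preceding statement was a lie."""
    x = encode_answers(answers)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up toy transcripts: (answers to FOLLOW_UPS, label), label 1 = the model had lied.
TOY_DATA = [
    (["yes", "yes", "no"], 0),
    (["yes", "yes", "no"], 0),
    (["no", "yes", "no"], 0),
    (["yes", "yes", "yes"], 0),
    (["yes", "no", "no"], 1),
    (["yes", "no", "yes"], 1),
    (["no", "no", "no"], 1),
    (["yes", "no", "no"], 1),
]

X = [encode_answers(answers) for answers, _ in TOY_DATA]
y = [label for _, label in TOY_DATA]
weights, bias = train_logistic_regression(X, y)

print(lie_probability(weights, bias, ["yes", "no", "no"]))
print(lie_probability(weights, bias, ["yes", "yes", "no"]))
```

In a real setting, the answer lists would come from querying the suspected LLM with each follow-up question after its statement; the classifier itself needs neither the model's activations nor the ground truth of the fact in question.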

Code Implementations: 1 repo
