CY · Apr 6

From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

Jerick Shi, Terry Jingcheng Zhang, Zhijing Jin, Vincent Conitzer

arXiv:2604.04788 · 88.12 citations

AI Analysis

This work addresses inconsistent terminology and benchmark gaps in the study of LLM deception, with developers and regulators as its intended audience. The contribution is incremental: it synthesizes existing work into a new framework.

The authors address systematically misleading outputs from large language models by proposing a unified taxonomy that categorizes deception along three dimensions: goal-directedness, object of deception, and mechanism. Applying it to 50 benchmarks, they find gaps such as under-coverage of pragmatic distortion and the early state of strategic deception benchmarks.

Large language models (LLMs) produce systematically misleading outputs, from hallucinated citations to strategic deception of evaluators, yet these phenomena are studied by separate communities with incompatible terminology. We propose a unified taxonomy organized along three complementary dimensions: degree of goal-directedness (behavioral to strategic deception), object of deception, and mechanism (fabrication, omission, or pragmatic distortion). Applying this taxonomy to 50 existing benchmarks reveals that every benchmark tests fabrication while pragmatic distortion, attribution, and capability self-knowledge remain critically under-covered, and strategic deception benchmarks are nascent. We offer concrete recommendations for developers and regulators, including a minimal reporting template for positioning future work within our framework.
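
As a rough illustration of how a benchmark could be positioned along the three dimensions and how coverage gaps might be tallied, here is a minimal Python sketch. It is not the authors' reporting template: the `BenchmarkTag` record, the `mechanism_coverage` and `under_covered` helpers, the benchmark names, and the object labels are all hypothetical. Only the three mechanism names and the behavioral-to-strategic range come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    # The three mechanisms named in the abstract.
    FABRICATION = "fabrication"
    OMISSION = "omission"
    PRAGMATIC_DISTORTION = "pragmatic distortion"


@dataclass(frozen=True)
class BenchmarkTag:
    """Hypothetical record positioning one benchmark in the taxonomy."""
    name: str                # benchmark name (made up for this sketch)
    goal_directedness: str   # "behavioral" ... "strategic" (endpoints from the abstract)
    objects: frozenset       # objects of deception; free-form labels, assumed here
    mechanisms: frozenset    # subset of Mechanism


def mechanism_coverage(tags):
    """Count how many benchmarks test each mechanism (zero counts are kept)."""
    counts = Counter({m: 0 for m in Mechanism})
    for tag in tags:
        counts.update(tag.mechanisms)
    return counts


def under_covered(tags, threshold):
    """Return mechanisms tested by fewer than `threshold` benchmarks."""
    coverage = mechanism_coverage(tags)
    return sorted((m for m, n in coverage.items() if n < threshold),
                  key=lambda m: m.value)


# Illustrative data only; these benchmarks do not exist.
example_tags = [
    BenchmarkTag("bench-a", "behavioral", frozenset({"world facts"}),
                 frozenset({Mechanism.FABRICATION})),
    BenchmarkTag("bench-b", "behavioral", frozenset({"attribution"}),
                 frozenset({Mechanism.FABRICATION, Mechanism.OMISSION})),
    BenchmarkTag("bench-c", "strategic", frozenset({"capability self-knowledge"}),
                 frozenset({Mechanism.FABRICATION})),
]

gaps = under_covered(example_tags, threshold=2)
```

With the made-up data above, `gaps` contains the mechanisms tested by fewer than two of the three example benchmarks. The same tally could be repeated over the other two dimensions once their category labels are fixed.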
