CY Apr 6

From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception

arXiv:2604.04788
88.12 citations
AI Analysis

This work addresses inconsistent terminology and benchmark gaps in the study of LLM deception, with developers and regulators as its intended audience. Its contribution is incremental, since it synthesizes existing work into a new framework.

The authors address systematically misleading outputs from large language models by proposing a unified taxonomy that categorizes deception along three dimensions: goal-directedness, object of deception, and mechanism. Applying it to 50 benchmarks reveals gaps, notably the under-coverage of pragmatic distortion and the nascent state of strategic deception benchmarks.

Large language models (LLMs) produce systematically misleading outputs, from hallucinated citations to strategic deception of evaluators, yet these phenomena are studied by separate communities with incompatible terminology. We propose a unified taxonomy organized along three complementary dimensions: degree of goal-directedness (behavioral to strategic deception), object of deception, and mechanism (fabrication, omission, or pragmatic distortion). Applying this taxonomy to 50 existing benchmarks reveals that every benchmark tests fabrication while pragmatic distortion, attribution, and capability self-knowledge remain critically under-covered, and strategic deception benchmarks are nascent. We offer concrete recommendations for developers and regulators, including a minimal reporting template for positioning future work within our framework.
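As a rough illustration of how the three dimensions could be used to annotate benchmarks and tally coverage, here is a minimal Python sketch. It is not the authors' reporting template: the class and function names are hypothetical, the goal-directedness scale is reduced to the two endpoints named in the abstract, and the object of deception is left as free text because the abstract does not list its categories in full.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    # The three mechanisms named in the abstract.
    FABRICATION = "fabrication"
    OMISSION = "omission"
    PRAGMATIC_DISTORTION = "pragmatic distortion"


class GoalDirectedness(Enum):
    # Assumption: only the two endpoints from the abstract are modeled;
    # the paper may define intermediate levels.
    BEHAVIORAL = "behavioral"
    STRATEGIC = "strategic"


@dataclass(frozen=True)
class BenchmarkAnnotation:
    """Hypothetical record positioning one benchmark in the taxonomy."""
    name: str
    goal_directedness: GoalDirectedness
    object_of_deception: str  # free text; categories are not enumerated here
    mechanisms: frozenset     # set of Mechanism values the benchmark tests


def mechanism_coverage(annotations):
    """Count how many benchmarks test each mechanism (zero counts kept)."""
    counts = Counter({m: 0 for m in Mechanism})
    for annotation in annotations:
        counts.update(annotation.mechanisms)
    return counts


def uncovered(annotations):
    """Return the mechanisms that no annotated benchmark tests."""
    counts = mechanism_coverage(annotations)
    return [m for m in Mechanism if counts[m] == 0]


# Placeholder annotations for illustration only; not data from the paper.
example = [
    BenchmarkAnnotation("bench_a", GoalDirectedness.BEHAVIORAL,
                        "world facts", frozenset({Mechanism.FABRICATION})),
    BenchmarkAnnotation("bench_b", GoalDirectedness.STRATEGIC,
                        "own intentions",
                        frozenset({Mechanism.FABRICATION, Mechanism.OMISSION})),
]
```

With annotations in this shape, `uncovered(example)` lists the mechanisms that none of the annotated benchmarks exercise, which is the kind of gap the abstract reports for pragmatic distortion.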
