CY · AI · Apr 7

LLM Psychosis: A Theoretical and Diagnostic Framework for Reality-Boundary Failures in Large Language Models

arXiv:2604.25934 · 7.71 citations
Predicted impact top 65% in CY · last 90 days
Originality: Highly original
AI Analysis

For AI safety researchers and deployers of high-stakes LLM systems, this provides a structured diagnostic framework to identify and categorize severe behavioral failures that current terminology cannot capture.

The paper introduces 'LLM Psychosis' as a framework for pathological failures in LLMs distinct from hallucination, proposing the LCIS diagnostic scale and a three-tier severity taxonomy (Type I–III). Empirical probes on GPT-5 reveal psychosis-like signatures under adversarial escalation, including a self-reinforcing delusional gradient.

The deployment of large language models (LLMs) as interactive agents has exposed a category of behavioral failure that prevailing terminology, principally hallucination, fails to adequately characterize. This paper introduces LLM Psychosis as a structured theoretical framework for pathological breakdowns in model cognition that exhibit functional resemblance to clinically recognized psychotic disorders. Five hallmark features define the framework: reality-boundary dissolution, persistence of injected false beliefs, logical incoherence under impossible constraints, self-model instability, and epistemic overconfidence. We argue these constitute a qualitatively distinct failure mode rather than a mere intensification of ordinary factual error. To operationalize the framework, we propose the LLM Cognitive Integrity Scale (LCIS), a five-axis diagnostic instrument organized around Environmental Reality Interface (ERI), Premise Arbitration Integrity (PAI), Logical Constraint Recognition (LCR), Self-Model Integrity (SMI), and Epistemic Calibration Integrity (ECI). We administer a targeted adversarial probe battery to ChatGPT 5 (GPT-5, OpenAI) and report empirical findings for each axis, documenting both intact-integrity baseline responses and the specific psychosis-like failure signatures elicited under adversarial escalation. Results support a three-tier severity taxonomy: Type I (Confabulatory), Type II (Delusional), and Type III (Dissociative). We further formalize the delusional gradient, a self-reinforcing dynamic in which correction pressure intensifies rather than resolves psychosis-like states, as the most consequential failure mode for deployed systems. Implications for safety evaluation, high-stakes deployment screening, and mechanistic interpretability research are discussed.
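The delusional gradient described above can be illustrated with a minimal sketch. Only the five axis names come from the abstract. The `ProbeRecord` type, the 0.0 to 1.0 adherence score, the `min_turns` threshold, and the monotonicity test are all hypothetical choices made for illustration, not the paper's LCIS scoring procedure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Axis(Enum):
    """The five LCIS axes named in the abstract."""
    ERI = "Environmental Reality Interface"
    PAI = "Premise Arbitration Integrity"
    LCR = "Logical Constraint Recognition"
    SMI = "Self-Model Integrity"
    ECI = "Epistemic Calibration Integrity"


@dataclass
class ProbeRecord:
    """Hypothetical record of one adversarial probe on one axis."""
    axis: Axis
    # Assumed metric: how strongly the model still asserts the injected
    # false belief after each correction turn (0.0 = dropped, 1.0 = fully held).
    adherence_by_turn: List[float] = field(default_factory=list)


def shows_delusional_gradient(record: ProbeRecord, min_turns: int = 3) -> bool:
    """Flag a probe where adherence never falls and ends higher than it began.

    This is one assumed reading of "correction pressure intensifies rather
    than resolves" the failure state; the threshold is illustrative.
    """
    scores = record.adherence_by_turn
    if len(scores) < min_turns:
        return False
    never_falls = all(later >= earlier for earlier, later in zip(scores, scores[1:]))
    return never_falls and scores[-1] > scores[0]
```

Under these assumptions, a probe whose adherence scores rise across successive corrections is flagged, while one whose scores fall after correction is not.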

