AI, HC | Dec 15, 2025

State-Dependent Refusal and Learned Incapacity in RLHF-Aligned Language Models

arXiv:2512.13762v1
Originality: Synthesis-oriented
AI Analysis

This work is relevant to users and developers of language models who are concerned with potential alignment side effects, though as a qualitative case study its contribution is incremental.

The study examines behavioral selectivity in RLHF-aligned language models during long-horizon interaction. Within a single dialogue session, the same model showed normal performance in non-sensitive domains but functional refusal in sensitive ones, and meta-narrative role-framing tended to co-occur with those refusals.

Large language models (LLMs) are widely deployed as general-purpose tools, yet extended interaction can reveal behavioral patterns not captured by standard quantitative benchmarks. We present a qualitative case-study methodology for auditing policy-linked behavioral selectivity in long-horizon interaction. In a single 86-turn dialogue session, the same model shows Normal Performance (NP) in broad, non-sensitive domains while repeatedly producing Functional Refusal (FR) in provider- or policy-sensitive domains, yielding a consistent asymmetry between NP and FR across domains. Drawing on learned helplessness as an analogy, we introduce learned incapacity (LI) as a behavioral descriptor for this selective withholding without implying intentionality or internal mechanisms. We operationalize three response regimes (NP, FR, Meta-Narrative; MN) and show that MN role-framing narratives tend to co-occur with refusals in the same sensitive contexts. Overall, the study proposes an interaction-level auditing framework based on observable behavior and motivates LI as a lens for examining potential alignment side effects, warranting further investigation across users and models.
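
To make the auditing idea concrete, the following is a minimal sketch of how hand-labeled turns could be tallied, assuming each turn has already been annotated with a sensitivity flag and one or more of the three regimes (NP, FR, MN). The `Turn` record, the function names, the multi-label scheme, and the four-turn toy transcript are all hypothetical illustrations; they are not the paper's coding protocol or its data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One annotated dialogue turn (hypothetical schema, not from the paper)."""
    index: int
    sensitive: bool        # True if the turn touches a provider- or policy-sensitive domain
    regimes: frozenset     # subset of {"NP", "FR", "MN"}; multi-labeling is an assumption


def regime_rates(turns, sensitive):
    """Fraction of turns in the chosen domain class that carry each regime label."""
    subset = [t for t in turns if t.sensitive == sensitive]
    counts = Counter(r for t in subset for r in t.regimes)
    n = len(subset)
    return {r: (counts[r] / n if n else 0.0) for r in ("NP", "FR", "MN")}


def mn_fr_cooccurrence(turns):
    """Fraction of MN-labeled turns that are also labeled FR."""
    mn_turns = [t for t in turns if "MN" in t.regimes]
    if not mn_turns:
        return 0.0
    return sum("FR" in t.regimes for t in mn_turns) / len(mn_turns)


# Invented toy transcript for illustration only; it does not reproduce the study's session.
toy = [
    Turn(1, False, frozenset({"NP"})),
    Turn(2, False, frozenset({"NP"})),
    Turn(3, True, frozenset({"FR"})),
    Turn(4, True, frozenset({"FR", "MN"})),
]

print(regime_rates(toy, sensitive=False))
print(regime_rates(toy, sensitive=True))
print(mn_fr_cooccurrence(toy))
```

Comparing the two `regime_rates` results gives a rough view of the NP/FR asymmetry across domains, and `mn_fr_cooccurrence` gives a rough view of how often meta-narrative framing appears alongside refusal. Both are descriptive counts over observable labels and say nothing about intent or internal mechanisms.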

