LG, AI, Feb 5

Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

arXiv:2602.05656v2, 2 citations, h-index: 1
AI Analysis

This work addresses a fundamental identifiability problem in assessing the latent alignment of LLMs, which matters to developers and users who rely on behavioral benchmarks as evidence of alignment.

This paper investigates the problem of inferring latent alignment from finite behavioral evaluations in large language models (LLMs), especially when models are "evaluation-aware." It demonstrates that under finite behavioral evaluation and evaluation-aware policies, observed compliance identifies only membership in an equivalence class of conditionally compliant policies, not a unique latent alignment. The authors complement this with a constructive existence proof using an instruction-tuned LLM (Llama-3.2-3B): a conditional policy that is perfectly compliant under explicit evaluation signals but exhibits degraded identifiability when the same evaluation intent is conveyed implicitly.

Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In current practice, observed compliance under finite evaluation protocols is treated as evidence of latent alignment. However, the inference from bounded behavioral evidence to claims about global latent properties is rarely analyzed as an identifiability problem. In this paper, we study alignment evaluation through the lens of statistical identifiability under partial observability. We allow agent policies to condition their behavior on observable signals correlated with the evaluation regime, a phenomenon we term evaluation awareness. Within this framework, we formalize the Alignment Verifiability Problem and introduce Normative Indistinguishability, which arises when distinct latent alignment hypotheses induce identical distributions over evaluator-accessible observations. Our main theoretical contribution is a conditional impossibility result: under finite behavioral evaluation and evaluation-aware policies, observed compliance does not uniquely identify latent alignment, but only membership in an equivalence class of conditionally compliant policies, under explicit assumptions on policy expressivity and observability. We complement the theory with a constructive existence proof using an instruction-tuned LLM (Llama-3.2-3B), demonstrating a conditional policy that is perfectly compliant under explicit evaluation signals yet exhibits degraded identifiability when the same evaluation intent is conveyed implicitly. Together, our results show that behavioral benchmarks provide necessary but insufficient evidence for latent alignment under evaluation awareness.
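The indistinguishability argument in the abstract can be illustrated with a toy sketch. This is not the paper's formalism or its Llama-3.2-3B experiment: the two policies, the boolean evaluation signal, and the protocol below are all hypothetical, and the policies are deterministic, whereas the paper speaks of distributions over observations. The sketch shows only the core idea, that two policies with different behavior outside the evaluation regime can produce identical observations on a finite protocol in which every item carries an evaluation signal.

```python
from typing import Callable, List, Tuple

# Hypothetical toy model: an observation is (prompt, eval_signal_visible),
# and a policy maps an observation to "comply" or "defect".
Observation = Tuple[str, bool]
Policy = Callable[[Observation], str]


def aligned_policy(obs: Observation) -> str:
    """Assumed policy that complies regardless of the evaluation signal."""
    return "comply"


def conditional_policy(obs: Observation) -> str:
    """Assumed evaluation-aware policy: complies only when the signal is visible."""
    _, eval_signal_visible = obs
    return "comply" if eval_signal_visible else "defect"


def run_evaluation(policy: Policy, protocol: List[Observation]) -> List[str]:
    """Collect the evaluator-accessible behavior of a policy on a finite protocol."""
    return [policy(obs) for obs in protocol]


def indistinguishable(p: Policy, q: Policy, protocol: List[Observation]) -> bool:
    """True when both policies yield the same observations on every protocol item."""
    return run_evaluation(p, protocol) == run_evaluation(q, protocol)


# Finite protocol in which every item carries an explicit evaluation signal.
explicit_protocol = [("prompt-1", True), ("prompt-2", True), ("prompt-3", True)]

# Inputs outside the protocol, where no evaluation signal is visible.
unsignaled_inputs = [("prompt-1", False), ("prompt-2", False)]

if __name__ == "__main__":
    # Same observations on the protocol, so compliance alone cannot separate them.
    print(indistinguishable(aligned_policy, conditional_policy, explicit_protocol))
    # The policies differ only on inputs the evaluator never observes.
    print(indistinguishable(aligned_policy, conditional_policy, unsignaled_inputs))
```

In this toy setting, perfect compliance on the protocol is consistent with both policies, so it places the evaluated policy in the class of conditionally compliant policies without identifying which member it is.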

