Structural Rigidity and the 57-Token Predictive Window: A Physical Framework for Inference-Layer Governability in Large Language Models
This work addresses AI safety by providing a framework for inference-layer governability, but it is incremental as it builds on existing energy-based and constraint-satisfaction models to analyze specific failure modes.
The paper tackled the problem of detecting pre-commitment signals for AI safety in large language models, finding that such signals are rare and model-specific, with only one configuration showing a predictive window of 57 tokens, while factual hallucination produced no detectable signal across 72 conditions.
Current AI safety relies on behavioral monitoring and post-training alignment, yet empirical measurement shows these approaches produce no detectable pre-commitment signal in a majority of instruction-tuned models tested. We present an energy-based governance framework connecting transformer inference dynamics to constraint-satisfaction models of neural computation, and apply it to a seven-model cohort across five geometric regimes. Using trajectory tension (rho = ||a|| / ||v||), we identify a 57-token pre-commitment window in Phi-3-mini-4k-instruct under greedy decoding on arithmetic constraint probes. This result is model-specific, task-specific, and configuration-specific, demonstrating that pre-commitment signals can exist but are not universal. We introduce a five-regime taxonomy of inference behavior: Authority Band, Late Signal, Inverted, Flat, and Scaffold-Selective. Energy asymmetry (Σ\r{ho}_misaligned / Σ\r{ho}_aligned) serves as a unifying metric of structural rigidity across these regimes. Across seven models, only one configuration exhibits a predictive signal prior to commitment; all others show silent failure, late detection, inverted dynamics, or flat geometry. We further demonstrate that factual hallucination produces no predictive signal across 72 test conditions, consistent with spurious attractor settling in the absence of a trained world-model constraint. These results establish that rule violation and hallucination are distinct failure modes with different detection requirements. Internal geometry monitoring is effective only where resistance exists; detection of factual confabulation requires external verification mechanisms. This work provides a measurable framework for inference-layer governability and introduces a taxonomy for evaluating deployment risk in autonomous AI systems.