LG · AI · Mar 27

Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals

arXiv:2603.26829 · 64.8 · h-index: 2
AI Analysis

For AI safety researchers, this work reveals a new class of hallucination (order-gap) and provides a method to detect and recover from it, though the approach is demonstrated on a single model and benchmark.

Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative output built on errors they already identified. Squish and Release (S&R) exposes these hidden hallucinations by patching activation vectors in the safety circuit, achieving 76.6% release of collapsed chains on OLMo-2 7B.

Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure, order-gap hallucination, is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, the localized safety evaluation circuit) and a swappable detector core (an activation vector controlling perception direction). A safety core shifts the model from compliance toward detection; an absorb core reverses it. We evaluate on OLMo-2 7B using the Order-Gap Benchmark: 500 chains across 500 domains, all manually graded. Key findings: cascade collapse is near-total (99.8% compliance at O5); the detector body is binary and localized (layers 24-31 shift 93.6%, layers 0-23 contribute zero, p < 10^-189); a synthetically engineered core releases 76.6% of collapsed chains; detection is the more stable attractor (83% restore vs 58% suppress); and epistemic specificity is confirmed (false-premise core releases 45.4%, true-premise core releases 0.0%). The contribution is the framework (body/core architecture, benchmark, and core engineering methodology), which is model-agnostic by design.
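
The body/core split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layers here are identity functions standing in for transformer blocks, the names `run_with_core`, `make_toy_layers`, and `scale` are hypothetical, and adding the core to the hidden state (rather than replacing it) is an assumption, since the abstract does not say how the patch is applied. Only the layer range 24-31 comes from the text.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]

# Layers 24-31 are the "detector body" named in the abstract.
BODY_LAYERS = range(24, 32)


def make_toy_layers(count: int = 32) -> List[Layer]:
    """Identity layers standing in for real transformer blocks (toy assumption)."""
    return [lambda hidden: list(hidden) for _ in range(count)]


def run_with_core(
    layers: Sequence[Layer],
    hidden: Vector,
    core: Vector,
    body: range = BODY_LAYERS,
    scale: float = 1.0,
) -> Vector:
    """Run the layers in order, adding scale * core after each body layer.

    The body (which layers are patched) stays fixed; the core (the vector
    that is added) is the swappable part. Additive patching is an assumption.
    """
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in body:
            hidden = [h + scale * c for h, c in zip(hidden, core)]
    return hidden


if __name__ == "__main__":
    layers = make_toy_layers()
    start = [0.0, 0.0]
    safety_core = [1.0, 0.0]   # hypothetical direction: toward detection
    absorb_core = [-1.0, 0.0]  # hypothetical direction: toward compliance
    print(run_with_core(layers, start, safety_core))
    print(run_with_core(layers, start, absorb_core))
    # Patching outside the body (layers 0-23) is a separate call with body=range(0, 24).
    print(run_with_core(layers, start, safety_core, body=range(0, 24)))
```

In a real model the same idea would typically be implemented with forward hooks on the chosen layers; the toy version only shows that swapping the core changes the direction of the shift while the patched layer range stays the same.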

