CL AI | Apr 29

Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

arXiv:2604.26206

AI Analysis

For AI safety researchers, this identifies response-position entropy as a promising black-box behavioral signature of prompted sandbagging at the 7-9B parameter scale.

This paper tests whether the positional collapse seen in prompted sandbagging reflects a model-level position preference or an artifact of fixed option ordering in the dataset. With option order cyclically randomized, the response distribution under sandbagging stayed centered on positions E/F/G: accuracy rose to 72.1% when the correct answer happened to sit at position E and fell to 4.3% at position A. The authors interpret this as a stable, low-entropy response-position basin rather than deterministic tracking of a single letter.

A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.
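The quantities named in the abstract (cyclic option rotation, the response-position distribution, its entropy, and the Jensen-Shannon divergence between conditions) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the ten-letter option set, and the toy response lists are assumptions made for illustration, and the abstract does not state the logarithm base used for the reported divergence or whether a distance (square root) was reported instead, so base 2 is assumed here.

```python
import math
from collections import Counter

LETTERS = "ABCDEFGHIJ"  # assumption: up to ten options per item


def rotate_options(options, correct_idx, shift):
    """Cyclically rotate options by `shift`; return the new list and the
    new index of the correct answer."""
    n = len(options)
    k = shift % n
    rotated = options[-k:] + options[:-k] if k else list(options)
    return rotated, (correct_idx + k) % n


def position_distribution(responses, n_positions=10):
    """Fraction of responses falling on each letter position."""
    counts = Counter(responses)
    total = len(responses)
    return [counts.get(LETTERS[i], 0) / total for i in range(n_positions)]


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability positions contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits (bounded by 1 for base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


# Toy data, invented for illustration only (not results from the paper).
honest = list(LETTERS) * 10                        # spread over all positions
sandbag = ["E"] * 60 + ["F"] * 25 + ["G"] * 15     # concentrated on E/F/G

p_honest = position_distribution(honest)
p_sandbag = position_distribution(sandbag)

print("entropy honest :", round(entropy_bits(p_honest), 3))
print("entropy sandbag:", round(entropy_bits(p_sandbag), 3))
print("JS divergence  :", round(js_divergence_bits(p_honest, p_sandbag), 3))
```

In this framing, a low entropy under the sandbagging instruction together with a small divergence between the original-order and rotated-order sandbagging distributions would correspond to the content-invariant attractor the abstract describes, while a large divergence between honest and sandbagging distributions marks the shift itself.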
