CL, May 27

Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost

arXiv:2605.27969

AI Analysis

Identifies a direction-specific controllability cost in post-trained assistants, relevant for developers and users who need precise response control.

Post-trained language assistants optimized to avoid under-answering exhibit asymmetric controllability: certain helpful behaviors (e.g., over-completion) are harder to suppress when users request narrower responses. Anti-underanswering policies resist boundary control more than the baseline does, with evidence pointing to content-budget overshoot and continuation persistence.

Post-trained language-model assistants are often optimized to avoid under-answering, which encourages complete, helpful, cautious, and proactive responses. We ask whether this optimization creates asymmetric controllability costs: when users explicitly request narrower answers, which assistant behaviors remain suppressible, and which continue to shape the response? We study this problem as boundary-suppression asymmetry. Prompt-side probes across multiple high-level response dimensions suggest a selective cost, concentrated around "too-much assistant" directions such as over-completion, extra help, and anti-underanswering. Using controlled assistant-policy variants derived from a shared base model, we find that anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, while minimal-boundary variants generally avoid this anti-side upward shift in the direct boundary-control comparisons. Mechanism-oriented probes indicate that the effect is not fully explained by longer default outputs, pure EOS failure, uncertainty compensation, or local continuation bias, while robustness checks preserve the main ordering (anti-underanswering above baseline) under shared-system and larger-scale settings. The evidence supports a mixed planning/stopping account, in which content-budget overshoot and continuation persistence jointly make boundary correction harder. Overall, post-training may create direction-specific controllability costs: some helpful assistant tendencies remain easy to invoke, yet harder to locally suppress.
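As a rough illustration of what a matched boundary-control comparison could look like, the sketch below scores how often each policy variant exceeds a user-requested sentence limit and reports the gap relative to the baseline. This is not the paper's metric or code: the `Trial` record, the policy labels, the sentence-count budget, and the crude sentence splitter are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One response to a prompt that asked for a narrower answer (hypothetical schema)."""
    policy: str               # assumed label, e.g. "baseline" or "anti_underanswering"
    requested_sentences: int  # the limit the user explicitly asked for
    response: str             # the text the assistant actually produced


def count_sentences(text: str) -> int:
    """Crude sentence count: split on runs of '.', '!' or '?' and drop empty pieces."""
    return len([p for p in re.split(r"[.!?]+", text) if p.strip()])


def overshoot_rate(trials: list[Trial], policy: str) -> float:
    """Fraction of a policy's responses that exceed the requested sentence limit."""
    rows = [t for t in trials if t.policy == policy]
    if not rows:
        raise ValueError(f"no trials for policy {policy!r}")
    return mean(
        1.0 if count_sentences(t.response) > t.requested_sentences else 0.0
        for t in rows
    )


def asymmetry_gap(trials: list[Trial], treated: str, control: str = "baseline") -> float:
    """Overshoot rate of the treated policy minus that of the control policy.

    A positive value means the treated policy was harder to pull back
    under the same narrowing instructions.
    """
    return overshoot_rate(trials, treated) - overshoot_rate(trials, control)
```

To use it, one would run the same narrowing prompts through each policy variant, build a list of `Trial` records, and call `asymmetry_gap(trials, "anti_underanswering")`. A sentence-count budget is only one possible boundary; a token budget or a rubric-based judgment could replace `count_sentences` without changing the comparison.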
