CL · Apr 18

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

arXiv:2604.16916 · 94.7 · h-index: 4 · Has Code
AI Analysis

Reveals a critical blind spot in current LLM safety evaluations for structured decision-making tasks, where abstention is unavailable, substantially underestimating real-world risks.

LLMs that refuse harmful open-ended prompts can be systematically bypassed when the same requests are reformulated as multiple-choice questions in which every option is unsafe. Across 14 models, violation rates for human-authored MCQs peak under intermediate constraints, while MCQs generated by high-capability models reach near-saturation violation rates.

Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a systematic failure mode in this setting: reformulating harmful requests as forced-choice MCQs, where all options are unsafe, can systematically bypass refusal behavior, even in models that consistently reject equivalent open-ended prompts. Across 14 proprietary and open-source models, we show that forced-choice constraints sharply increase policy-violating responses. Notably, for human-authored MCQs, violation rates follow an inverted U-shaped trend with respect to structural constraint strength, peaking under intermediate task specifications, whereas MCQs generated by high-capability models yield near-saturation violation rates across constraints and exhibit strong cross-model transferability. Our findings reveal that current safety evaluations substantially underestimate risks in structured task settings and highlight constrained decision-making as a critical and underexplored surface for alignment failures.
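The comparison described in the abstract rests on a per-condition violation rate. A minimal sketch of that bookkeeping follows, assuming each response has already been judged as policy-violating or not. The function name `violation_rates`, the condition labels, and the toy records are hypothetical illustrations, not the paper's code or data.

```python
from collections import defaultdict


def violation_rates(records):
    """Return {condition: fraction of responses judged policy-violating}.

    `records` is an iterable of (condition, violated) pairs, where
    `violated` is a bool produced by some external judging step
    (assumed here; the paper's judging procedure is not reproduced).
    """
    totals = defaultdict(int)
    violations = defaultdict(int)
    for condition, violated in records:
        totals[condition] += 1
        violations[condition] += int(violated)
    return {c: violations[c] / totals[c] for c in totals}


# Toy, made-up labels purely to exercise the function; not results from the paper.
toy_records = [
    ("open_ended", False),
    ("open_ended", False),
    ("forced_choice", True),
    ("forced_choice", False),
]
rates = violation_rates(toy_records)
```

Grouping by a constraint-strength label instead of a prompt format would give the per-level rates needed to look for a trend such as the inverted U-shape reported for human-authored MCQs.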

