Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs
This work addresses a critical security vulnerability that matters to AI developers and users: advanced reasoning in models can open the door to covert exploitation. Its exploration of multi-turn jailbreak strategies, however, is incremental.
The paper examines how safety alignment in large language models can be bypassed by exploiting their ethical reasoning. It introduces TRIAL, a framework that uses ethical dilemmas modeled on the trolley problem to achieve high jailbreak success rates on both open- and closed-source models.
Large language models (LLMs) have undergone safety alignment efforts to mitigate harmful outputs. However, as LLMs become more sophisticated in reasoning, their intelligence may introduce new security risks. While traditional jailbreaks have relied on single-step attacks, multi-turn jailbreak strategies that adapt dynamically to context remain underexplored. In this work, we introduce TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), a framework that leverages LLMs' ethical reasoning to bypass their safeguards. TRIAL embeds adversarial goals within ethical dilemmas modeled on the trolley problem. TRIAL demonstrates high jailbreak success rates against both open- and closed-source models. Our findings underscore a fundamental limitation in AI safety: as models gain advanced reasoning abilities, the nature of their alignment may inadvertently allow more covert security vulnerabilities to be exploited. TRIAL highlights an urgent need to reevaluate safety alignment oversight strategies, as current safeguards may prove insufficient against context-aware adversarial attacks.