CL | Jan 3, 2025

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

arXiv:2501.01872v62 citations | h-index: 15 | EMNLP
Originality: Highly original
AI Analysis

This work addresses a critical safety problem for AI developers and users by exposing and mitigating reasoning-driven vulnerabilities in language models, representing an incremental improvement over existing defense methods.

The paper tackles the vulnerability of large language models to subtle jailbreak attacks by introducing POATE, a novel technique using contrastive reasoning to provoke unethical responses, achieving a significantly higher attack success rate (~44%) compared to existing methods. It also proposes Intent-Aware CoT and Reverse Thinking CoT as countermeasures to enhance reasoning robustness and strengthen defenses against such adversarial exploits.

Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model's defense against adversarial exploits.

Code Implementations: 1 repo
