LG AI CL | Feb 21, 2025

Single-pass Detection of Jailbreaking Input in Large Language Models

Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher

arXiv:2502.15435v1 | 8 citations | h-index: 61 | Has Code | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses the need for computationally efficient defense mechanisms against adversarial attacks on LLMs, offering a practical solution for real-time safeguarding.

The paper tackles the problem of efficiently detecting jailbreaking attacks on Large Language Models by proposing a single-pass detection method that uses logit information to predict harmful outputs, achieving effective detection on open-source models while minimizing false positives.

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem: existing approaches require multiple requests or even queries to auxiliary LLMs, which makes them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful, so the defense costs just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
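
To make the idea of classifying an input from its logits concrete, here is a minimal sketch. It is not the authors' implementation. Everything in it is an assumption for illustration: the feature choice (the k largest log-probabilities of one next-token logit vector), the nearest-centroid classifier standing in for whatever classifier the paper trains, the helper names (`top_k_log_probs`, `fit_detector`, `is_jailbreak`, `toy_logits`), and the synthetic logits, which say nothing about how real models behave under attack. In practice the logit vector would come from the same forward pass that produces the model's response.

```python
import math
import random


def top_k_log_probs(logits, k=5):
    """Turn one next-token logit vector into its k largest log-probabilities."""
    peak = max(logits)
    # Numerically stable log-sum-exp for the softmax normalizer.
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return sorted((v - log_z for v in logits), reverse=True)[:k]


def centroid(rows):
    """Component-wise mean of equally sized feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_detector(benign_logits, attack_logits, k=5):
    """Store one centroid per class (hypothetical stand-in for a trained classifier)."""
    return {
        "k": k,
        "benign": centroid([top_k_log_probs(l, k) for l in benign_logits]),
        "attack": centroid([top_k_log_probs(l, k) for l in attack_logits]),
    }


def is_jailbreak(detector, logits):
    """Flag the input if its features lie closer to the attack centroid."""
    feats = top_k_log_probs(logits, detector["k"])
    return math.dist(feats, detector["attack"]) < math.dist(feats, detector["benign"])


def toy_logits(rng, peaked, vocab=50):
    """Synthetic logits. ASSUMPTION for the demo only, not real model behavior:
    one class has a single dominant token, the other does not."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(vocab)]
    if peaked:
        logits[0] = 10.0
    return logits


rng = random.Random(0)
detector = fit_detector(
    [toy_logits(rng, peaked=True) for _ in range(20)],   # labeled "benign" in this toy setup
    [toy_logits(rng, peaked=False) for _ in range(20)],  # labeled "attack" in this toy setup
)
print(is_jailbreak(detector, toy_logits(rng, peaked=False)))
print(is_jailbreak(detector, toy_logits(rng, peaked=True)))
```

Because the features use only the top k values, the same sketch would also run on a truncated list of log-probabilities, although how SPD itself handles partial logit access is not specified in the abstract.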
