CL | AI | CR | LG | Aug 27, 2023

Detecting Language Model Attacks with Perplexity

arXiv:2308.14132v3 | 459 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses security vulnerabilities in LLMs that affect both users and developers, though the advance is incremental because it builds on existing perplexity-based detection methods.

The paper tackles the problem of detecting adversarial suffix attacks on large language models by analyzing the perplexity of incoming queries. It finds that a Light-GBM model trained on perplexity and token length resolves the false positives of plain perplexity filtering and correctly detects most attacks in the test set.

A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
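As a rough illustration of the quantities the abstract refers to, the sketch below computes perplexity as the exponential of the mean negative log-likelihood per token, and pairs it with token length as a two-feature input. It is a minimal sketch, not the authors' code: the function names (`perplexity`, `features`, `plain_filter`), the threshold of 100, and the log-probability values are all hypothetical. In practice the per-token log-probabilities would come from a language model such as GPT-2, and the feature pairs would be passed to a trained classifier rather than to a fixed threshold.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def features(token_logprobs: Sequence[float]) -> Tuple[float, int]:
    """The two inputs named in the abstract: (perplexity, token length)."""
    return perplexity(token_logprobs), len(token_logprobs)


def plain_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Plain perplexity filtering: flag a prompt whose perplexity exceeds a threshold."""
    return perplexity(token_logprobs) > threshold


# Made-up values for illustration only (not measured from any model).
# A fluent prompt: 12 tokens, each assigned probability 0.25.
fluent = [math.log(0.25)] * 12
# The same prompt followed by a 20-token suffix of very unlikely tokens.
suffixed = fluent + [math.log(0.0001)] * 20

THRESHOLD = 100.0  # hypothetical cutoff, not a value from the paper

for name, logprobs in [("fluent", fluent), ("suffixed", suffixed)]:
    ppl, length = features(logprobs)
    flagged = plain_filter(logprobs, THRESHOLD)
    print(f"{name}: perplexity={ppl:.1f}, tokens={length}, flagged={flagged}")
```

A single threshold like this treats every high-perplexity prompt as an attack, which is where the false positives described in the abstract arise. A classifier that also sees token length has a second signal with which to separate unusual but benign prompts from adversarial ones.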

Code Implementations: 2 repos