AI · Oct 30, 2023

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel

arXiv:2310.19737v128.177 citations · h-index: 31 · Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses flawed evaluations in adversarial defense research for LLMs. The contribution is incremental: it builds on existing work to improve assessment methods.

The paper tackles overestimated robustness in defense evaluations for neural networks, particularly large language models (LLMs). It proposes a first set of prerequisites for sounder robustness assessment and identifies embedding space attacks as an additional threat model for open-source models. Using a recently proposed defense as a case study, it shows that without LLM-specific best practices the robustness of a defense is easily overestimated, which slows research and creates a false sense of security.

Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains vastly unsolved. Here, one major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. Flawed robustness evaluations necessitate rectifications in subsequent works, dangerously slowing down the research and providing a false sense of security. In this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs), such as ChatGPT, Google Bard, or Anthropic's Claude. We provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the amount of faulty evaluations. Additionally, we identify embedding space attacks on LLMs as another viable threat model for the purposes of generating malicious content in open-sourced models. Finally, we demonstrate on a recently proposed defense that, without LLM-specific best practices in place, it is easy to overestimate the robustness of a new approach.
