CL AI CY LGNov 8, 2023

Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?

C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra

AnthropicDeepMind

arXiv:2311.07587v20.93 citationsh-index: 63

Originality Incremental advance

AI Analysis

This exposes a critical alignment flaw in state-of-the-art language models, highlighting their susceptibility to simple adversarial manipulations in basic reasoning tasks.

The paper tackles the vulnerability of frontier language models to adversarial arithmetic prompts, showing that even simple 1-digit addition questions with inserted adversarial strings cause models like PaLM2, GPT4, and Claude2 to produce wrong answers, and it demonstrates partial hardening methods but fails to achieve full robustness.

We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.

View on arXiv PDF

Similar