LG AI CL CRDec 14, 2023

Forbidden Facts: An Investigation of Competing Objectives in Llama-2

Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit

arXiv:2312.08793v35.35 citationsh-index: 58

Originality Incremental advance

AI Analysis

This work addresses the problem of interpretability and safety in large language models for AI researchers, though it is incremental in analyzing specific model behaviors.

The study investigated how Llama-2-chat models resolve competing objectives, such as truthfulness versus harmlessness, by testing them on a forbidden fact task where they must recall facts while avoiding correct answers, finding that around 35 components are sufficient to implement suppression behavior, but these rely on faulty heuristics exploitable by an adversarial attack called The California Attack.

LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components, and rank each one with respect to how useful it is for forbidding the correct answer. We find that in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack which we call The California Attack. Our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ML systems. Project website available at https://forbiddenfacts.github.io .

View on arXiv PDF

Similar