CL CR HC | Nov 10, 2023

Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming

Nanna Inie, Jonathan Stray, Leon Derczynski

arXiv:2311.06237v37.128 citations | h-index: 34

Originality: Incremental advance

AI Analysis

This work addresses the need for a formal definition and understanding of LLM red teaming. The advance is incremental: it applies existing qualitative methods to characterize a novel human activity in AI safety.

The paper examines how and why people deliberately attack Large Language Models (LLMs) to generate abnormal outputs. The result is a grounded theory that defines LLM red teaming as a limit-seeking, non-malicious activity and identifies a taxonomy of 12 strategies and 35 techniques.

Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We focused on the research questions of defining LLM red teaming, uncovering the motivations and goals for performing the activity, and characterizing the strategies people use when attacking LLMs. Based on the data, LLM red teaming is defined as a limit-seeking, non-malicious, manual activity, which depends highly on a team effort and an alchemist mindset. It is highly intrinsically motivated by curiosity, fun, and to some degree by concerns for various harms of deploying LLMs. We identify a taxonomy of 12 strategies and 35 different techniques of attacking LLMs. These findings are presented as a comprehensive grounded theory of how and why people attack large language models: LLM red teaming.
