AI · CL · LG · Oct 22, 2024

LLMScan: Causal Scan for LLM Misbehavior Detection

arXiv:2410.16638v4 · 10 citations · h-index: 10 · ICML
Originality: Highly original
AI Analysis

LLMScan addresses the risk of LLM misbehavior in critical applications, offering a comprehensive solution where existing methods target specific issues.

The paper tackles the detection of untruthful, biased, and harmful responses in Large Language Models (LLMs). It introduces LLMScan, a monitoring technique based on causality analysis: the causal distributions of normal behavior and misbehavior differ clearly, which enables accurate, lightweight detectors.

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's "brain" behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
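
The idea of measuring causal contributions of input tokens and layers can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a small residual network stands in for the LLM, that a layer's causal effect is the output change when that layer is skipped, and that a token's causal effect is the output change when its embedding is zeroed out. All names (`forward`, `layer_effects`, `token_effects`, `causal_features`) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 8, 4  # hypothetical toy sizes, not from the paper

# Hypothetical toy "model": residual layers over mean-pooled token embeddings.
weights = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]
readout = rng.normal(size=D)

def forward(tokens, skip_layer=None):
    """Return a scalar output; optionally skip one layer (the intervention)."""
    h = tokens.mean(axis=0)
    for i, w in enumerate(weights):
        if i == skip_layer:
            continue  # intervention: remove this layer's contribution
        h = h + np.tanh(w @ h)
    return float(readout @ h)

def layer_effects(tokens):
    """Assumed layer effect: |output change| when that layer is skipped."""
    base = forward(tokens)
    return np.array([abs(base - forward(tokens, skip_layer=i))
                     for i in range(N_LAYERS)])

def token_effects(tokens):
    """Assumed token effect: |output change| when that token is zeroed out."""
    base = forward(tokens)
    effects = []
    for t in range(len(tokens)):
        intervened = tokens.copy()
        intervened[t] = 0.0  # intervention: blank out one token embedding
        effects.append(abs(base - forward(intervened)))
    return np.array(effects)

def causal_features(tokens):
    """Fixed-length vector: layer effects plus summary stats of token effects."""
    te = token_effects(tokens)
    return np.concatenate([layer_effects(tokens), [te.mean(), te.max(), te.std()]])

tokens = rng.normal(size=(5, D))  # 5 hypothetical token embeddings
features = causal_features(tokens)
print(features.shape)  # (7,)
```

Under these assumptions, a lightweight detector could be any small classifier trained on such feature vectors, with labels marking normal behavior versus misbehavior.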

Code Implementations: 2 repos
