CRNov 19, 2024
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated ExplanationsHuaizhi Ge, Yiming Li, Qifan Wang et al.
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs' behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs' generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only appear in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.
CLJun 25, 2024
How Well Can Knowledge Edit Methods Edit Perplexing Knowledge?Huaizhi Ge, Frank Rudzicz, Zining Zhu
Large language models (LLMs) have demonstrated remarkable capabilities, but updating their knowledge post-training remains a critical challenge. While recent model editing techniques like Rank-One Model Editing (ROME) show promise, their effectiveness may vary based on the nature of the knowledge being edited. We introduce the concept of ``perplexingness'': the degree to which new knowledge conflicts with an LLM's learned conceptual hierarchies and categorical relationships. For instance, editing ``British Shorthair is a kind of cat'' to ``British Shorthair is a kind of dog'' represents a low-perplexingness edit within the same taxonomic level, while editing ``A cat is a kind of animal'' to ``A cat is a kind of plant'' represents a high-perplexingness edit that violates fundamental categorical boundaries. To systematically investigate this phenomenon, we introduce HierarchyData, a carefully curated dataset of 99 hyponym-hypernym pairs across diverse categories. Through controlled experiments across three models and four editing methods, we demonstrate a strong negative correlation between the perplexingness of new knowledge and the effectiveness of knowledge editing. Our analysis reveals that edits involving more abstract concepts (hypernyms) generally exhibit higher perplexingness and are more resistant to modification than their specific counterparts (hyponyms). These findings highlight a fundamental challenge in LLM knowledge editing: the more a new fact contradicts an LLM's learned conceptual hierarchies, the harder it becomes to reliably encode that knowledge.
CLJun 25, 2024
Understanding Language Model Circuits through Knowledge EditingHuaizhi Ge, Frank Rudzicz, Zining Zhu
Recent advances in language model interpretability have identified circuits, critical subnetworks that replicate model behaviors, yet how knowledge is structured within these crucial subnetworks remains opaque. To gain an understanding toward the knowledge in the circuits, we conduct systematic knowledge editing experiments on the circuits of the GPT-2 language model. Our analysis reveals intriguing patterns in how circuits respond to editing attempts, the extent of knowledge distribution across network components, and the architectural composition of knowledge-bearing circuits. These findings offer insights into the complex relationship between model circuits and knowledge representation, deepening the understanding of how information is organized within language models. Our findings offer novel insights into the ``meanings'' of the circuits, and introduce directions for further interpretability and safety research of language models.