Mar 19

WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

arXiv:2603.18474 | 28.2 | h-index: 2
AI Analysis

This work addresses the need for precise behavioral control of LLMs in complex applications. It offers a method that improves explanation quality and controllability, though the contribution appears incremental because it builds on existing neuron-based analysis.

The paper tackles the problem of explaining and controlling large language model behavior by proposing WASD, a framework that identifies sufficient neural conditions for token generation, resulting in more stable, accurate, and concise explanations than conventional methods on SST-2 and CounterFact datasets with the Gemma-2-2B model.

Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on controlling cross-lingual output generation, we validated the practical effectiveness of WASD in controlling model behavior.
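The search described in the abstract (candidate conditions as neuron-activation predicates, reduced iteratively to a small set that preserves the output under perturbation) can be illustrated with a toy sketch. This is not the authors' implementation. The four-neuron "model", the readout weights `W_OUT`, the perturbation grid `PERTURB_VALUES`, the equality-style predicates, and the backward-elimination order are all assumptions made for illustration. The result is irreducible (no single predicate can be dropped), which is a weaker property than a guaranteed global minimum.

```python
from itertools import product

# Hypothetical toy "model": four ReLU neurons feeding a linear readout.
W_OUT = [2.0, -1.0, 0.1, 1.5]
# Assumed perturbation grid: every input coordinate is set to each value.
PERTURB_VALUES = (0.0, 2.0)


def activations(x):
    """ReLU activations of the toy hidden layer."""
    return [max(0.0, v) for v in x]


def predict(h):
    """Emit token 1 if the weighted sum is positive, otherwise token 0."""
    return 1 if sum(w * a for w, a in zip(W_OUT, h)) > 0 else 0


def is_sufficient(predicates, target, n_inputs):
    """Check that holding each predicate (neuron == value) keeps the
    output at `target` for every perturbed input in the grid."""
    for x in product(PERTURB_VALUES, repeat=n_inputs):
        h = activations(x)
        for neuron, value in predicates.items():
            h[neuron] = value  # enforce the predicate on this neuron
        if predict(h) != target:
            return False
    return True


def find_sufficient_set(x):
    """Backward elimination: start with every neuron pinned to its
    observed activation, then drop each predicate whose removal still
    leaves the remaining set sufficient."""
    h = activations(x)
    target = predict(h)
    predicates = dict(enumerate(h))
    for neuron in list(predicates):
        trial = {k: v for k, v in predicates.items() if k != neuron}
        if is_sufficient(trial, target, len(x)):
            predicates = trial
    return target, predicates


target, kept = find_sufficient_set([1.0, 0.2, 0.5, 1.0])
print(target, sorted(kept))
```

In this toy setting the neuron with the largest readout weight (neuron 0) is dropped, because pinning neurons 1 and 3 already fixes the output. That is the sense in which a sufficient condition can differ from an attribution ranking.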

