Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach
This position paper addresses the need for better explainability among AI researchers and practitioners who work with model interventions. Its contribution is incremental in that it builds on existing XAI methods.
The paper tackles the problem of explaining behavioral shifts in large language models by proposing a comparative approach. It formulates a Comparative XAI framework with specific desiderata and supports it with a concrete experiment.
Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning, or in-context learning. While the investigation of these phenomena has recently received attention, explaining their appearance is still overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to justify what changed internally across different checkpoints and which explanatory claims are warranted about that change. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this end, we formulate a Comparative XAI ($\Delta$-XAI) framework with a set of desiderata to be taken into account when designing suitable explanation methods. To illustrate how $\Delta$-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete $\Delta$-XAI experiment.
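As a purely illustrative sketch of what "explaining comparatively" can mean, the toy example below attributes the shift between a reference and an intervened model rather than explaining either model alone. It is not one of the paper's pipelines or its experiment: the two linear "checkpoints", their weights, and the helper names (`linear_output`, `feature_attribution`, `delta_attribution`) are all hypothetical, and input-times-weight attribution is assumed only because it is exact for linear models.

```python
from typing import List, Sequence


def linear_output(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Output of a toy linear model f(x) = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def feature_attribution(weights: Sequence[float], x: Sequence[float]) -> List[float]:
    """Single-model attribution: contribution w_i * x_i of each input feature."""
    return [w * xi for w, xi in zip(weights, x)]


def delta_attribution(
    ref_weights: Sequence[float], new_weights: Sequence[float], x: Sequence[float]
) -> List[float]:
    """Comparative attribution: per-feature change in contribution between
    the reference model and the intervened model on the same input."""
    ref = feature_attribution(ref_weights, x)
    new = feature_attribution(new_weights, x)
    return [n - r for n, r in zip(new, ref)]


# Hypothetical "checkpoints": the intervention changes only the third weight.
REF_WEIGHTS, REF_BIAS = [1.0, -2.0, 0.5], 0.0
NEW_WEIGHTS, NEW_BIAS = [1.0, -2.0, 2.5], 0.0
x = [1.0, 1.0, 2.0]

# The behavioral shift is the change in output on the same input.
shift = linear_output(NEW_WEIGHTS, NEW_BIAS, x) - linear_output(REF_WEIGHTS, REF_BIAS, x)
delta = delta_attribution(REF_WEIGHTS, NEW_WEIGHTS, x)

print("behavioral shift:", shift)
print("delta attribution per feature:", delta)
```

In this toy setting the per-feature deltas (plus any bias change) sum to the output shift, so the explanation targets the change itself. Real language models are nonlinear, so this exact decomposition should not be expected to carry over without further assumptions.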