From Attribution to Action: A Human-Centered Application of Activation Steering
For practitioners of explainable AI, this work shows that activation steering can make interpretability more actionable, but it also highlights risks such as ripple effects and the limited generalization of instance-level corrections.
This work introduces an interactive workflow that combines attribution based on sparse autoencoders (SAEs) with activation steering for instance-level analysis in vision models, implemented as a web-based tool. In semi-structured expert interviews (N=8) built around debugging tasks on CLIP, steering enabled a shift from inspection to intervention-based hypothesis testing for all participants, and most (6/8) grounded their trust in observed model responses rather than in explanation plausibility alone.
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
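To make the attribute-then-steer loop concrete, the following is a minimal sketch on a toy model, not the tool described above. It assumes that "SAE" refers to a sparse autoencoder with a ReLU encoder, that attribution is computed as activation times gradient, and that the downstream head is linear; none of these details are stated in the abstract. All names (`W_enc`, `W_dec`, `W_head`, `sae_encode`, `attribute`, `steer`) and the random weights are hypothetical stand-ins for a real CLIP layer and a trained SAE.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_classes = 8, 16, 3

# Hypothetical toy weights; a real setup would load a trained SAE
# and a CLIP layer instead of sampling random matrices.
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
W_head = rng.normal(size=(d_model, n_classes))


def sae_encode(x):
    """Map a model activation to non-negative SAE latents (ReLU encoder)."""
    return np.maximum(x @ W_enc, 0.0)


def sae_decode(z):
    """Map SAE latents back to the model's activation space."""
    return z @ W_dec


def logits_from_latents(z, residual):
    """Decode latents, add back the SAE reconstruction error, apply the head."""
    return (sae_decode(z) + residual) @ W_head


def attribute(z, target):
    """Activation-times-gradient attribution of each latent to one logit.

    With a linear head, the gradient of the target logit with respect to
    the latents is W_dec @ W_head[:, target].
    """
    grad = W_dec @ W_head[:, target]
    return z * grad


def steer(z, index, scale):
    """Return a copy of z with one latent rescaled (scale=0.0 suppresses it)."""
    z_new = z.copy()
    z_new[index] *= scale
    return z_new


# One instance: attribute, pick the most influential latent, suppress it.
x = rng.normal(size=d_model)
z = sae_encode(x)
residual = x - sae_decode(z)  # keeps the unsteered output identical to the original
target = int(np.argmax(x @ W_head))

scores = attribute(z, target)
top = int(np.argmax(scores))
z_suppressed = steer(z, top, 0.0)

before = logits_from_latents(z, residual)[target]
after = logits_from_latents(z_suppressed, residual)[target]
print(f"latent {top}: attribution={scores[top]:.3f}, logit change={after - before:.3f}")
```

In this linear toy case, suppressing a latent changes the target logit by exactly the negative of its attribution score, so the explanation can be checked directly against the model's response. In a real nonlinear model the two can diverge, and changing one latent can shift other logits as well, which is one way the ripple effects noted by participants could arise.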