Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales
For researchers and practitioners who use natural-language explanations to understand model behavior, this work shows that explanations are not equally effective for simulation, though the results are incremental and model-dependent.
The paper compares verbalized feature attributions and self-generated rationales for question answering models, finding that explanation format and granularity affect how well an explanation supports simulating model behavior in counterfactual settings, with effects that vary across models and formats.
Natural-language explanations are often treated as a unified interface for understanding model behavior, but different explanation sources may support simulation in different ways. This paper compares two families of explanations for question answering models: verbalized feature attributions and self-generated rationales. We evaluate them under a shared counterfactual simulation setting, using an LLM judge as the predictor and measuring whether the judge can better predict a model's answers to follow-up questions when given that model's explanation. Across multiple instruction-tuned models, we analyze how explanation source, verbalization strategy, and feature granularity affect the simulatability of explanations. Our results show that explanation format and granularity affect simulatability: attribution-based explanations and self-generated rationales differ in how much they improve counterfactual prediction, with effects that vary across models and formats.
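As an illustration of the counterfactual simulation setting described above, the following is a minimal sketch of how a simulatability gain could be computed. It assumes the gain is the judge's accuracy at predicting the model's follow-up answers with the explanation minus its accuracy without it; the abstract does not specify the exact metric, and all names here (`SimulationRecord`, `simulation_accuracy`, `simulatability_gain`) are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SimulationRecord:
    """One follow-up (counterfactual) question. All field names are hypothetical."""
    model_answer: str                # what the explained model actually answered
    judge_with_explanation: str      # judge's prediction when shown the explanation
    judge_without_explanation: str   # judge's prediction with no explanation


def _normalize(answer: str) -> str:
    # Assumption: answers are compared by case-insensitive exact match.
    return answer.strip().lower()


def simulation_accuracy(records: Sequence[SimulationRecord], with_explanation: bool) -> float:
    """Fraction of follow-up questions where the judge predicted the model's answer."""
    if not records:
        raise ValueError("records must be non-empty")
    correct = 0
    for r in records:
        predicted = r.judge_with_explanation if with_explanation else r.judge_without_explanation
        if _normalize(predicted) == _normalize(r.model_answer):
            correct += 1
    return correct / len(records)


def simulatability_gain(records: Sequence[SimulationRecord]) -> float:
    """Accuracy with the explanation minus accuracy without it (assumed metric)."""
    return simulation_accuracy(records, True) - simulation_accuracy(records, False)


# Toy data, invented purely for illustration.
toy_records = [
    SimulationRecord("Paris", "Paris", "Paris"),
    SimulationRecord("yes", "Yes", "no"),
    SimulationRecord("1947", "1947", "1947"),
    SimulationRecord("blue", "red", "green"),
]
toy_gain = simulatability_gain(toy_records)
```

Under this assumed definition, a positive gain means the explanation helped the judge simulate the model, and comparing gains across explanation sources, verbalization strategies, and feature granularities would correspond to the comparison the abstract describes.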