CL AI Dec 11, 2025

Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

arXiv:2512.11108v2
Originality: Incremental advance
AI Analysis

The paper addresses mistrust of post-hoc explanations among users of language models, though its contribution is incremental: it improves how bias is evaluated rather than removing the bias itself.

The paper tackles inconsistent and biased feature attribution explanations in language models by developing a model- and method-agnostic framework for evaluating lexical and position biases. It finds a trade-off: models that score high on one bias type score low on the other.

Good-quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are a type of post-hoc explainer that can provide token-level insights. However, explanations of the same input may vary greatly due to the underlying biases of different methods. Users who are aware of this issue may mistrust the explanations' utility, while unaware users may place undue trust in them. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both lexical and position bias (what and where in the input) for two transformers: first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find a trade-off between lexical and position biases in our model comparison: models that score high on one type score low on the other. We also find signs that anomalous explanations are more likely to be biased.
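To make the distinction between "what" and "where" concrete, here is a minimal sketch of how token-level attributions could be summarised by position and by token type. It is an illustration only: the abstract does not define the paper's three metrics, so the functions `normalize`, `position_profile`, and `lexical_profile`, the binning scheme, and the toy data are all assumptions, not the authors' method.

```python
from collections import defaultdict


def normalize(scores):
    """Turn raw attribution scores into shares of total absolute attribution."""
    total = sum(abs(s) for s in scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [abs(s) / total for s in scores]


def position_profile(attributions, n_bins=4):
    """Hypothetical position summary: mean attribution share per relative-position bin."""
    sums = [0.0] * n_bins
    n_examples = 0
    for scores in attributions:
        if not scores:
            continue
        shares = normalize(scores)
        length = len(shares)
        for i, share in enumerate(shares):
            # Map token index i to a bin by its relative position in the input.
            bin_index = min(i * n_bins // length, n_bins - 1)
            sums[bin_index] += share
        n_examples += 1
    if n_examples == 0:
        return sums
    return [s / n_examples for s in sums]


def lexical_profile(tokens, attributions):
    """Hypothetical lexical summary: mean attribution share per token type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for toks, scores in zip(tokens, attributions):
        for tok, share in zip(toks, normalize(scores)):
            totals[tok] += share
            counts[tok] += 1
    return {tok: totals[tok] / counts[tok] for tok in totals}


# Toy data (made up): the same tokens in reversed order, with identical
# attribution scores per position, i.e. an explainer that ignores the words.
toy_tokens = [["a", "b", "c", "d"], ["d", "c", "b", "a"]]
toy_attributions = [[4, 2, 1, 1], [4, 2, 1, 1]]

print(position_profile(toy_attributions, n_bins=2))
print(lexical_profile(toy_tokens, toy_attributions))
```

In this toy case the position profile is skewed toward the first half of the input while the per-token shares follow whichever tokens happened to occupy the early positions, which is the kind of pattern a position-biased explainer would be expected to produce.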

