Approximate Attributions for Off-the-Shelf Siamese Transformers
This work provides interpretability tools for Siamese transformers, a widely used but poorly understood model class, though the contribution is incremental because it builds on the authors' earlier attribution method (Möller et al., 2023).
The paper addresses prediction attribution in Siamese transformers, which established attribution methods cannot handle, by proposing both a model with exact attributions and an approximate technique for off-the-shelf models. These enable analysis of linguistic aspects such as syntactic roles and negation.
Siamese encoders such as sentence transformers are among the least understood deep models. Established attribution methods cannot handle this model class because it compares two inputs rather than processing a single one. To address this gap, we have recently proposed an attribution method specifically for Siamese encoders (Möller et al., 2023). However, it requires models to be adjusted and fine-tuned and therefore cannot be directly applied to off-the-shelf models. In this work, we reassess these restrictions and propose (i) a model with exact attribution ability that retains the original model's predictive performance and (ii) a way to compute approximate attributions for off-the-shelf models. We extensively compare approximate and exact attributions and use them to analyze the models' attention to different linguistic aspects. We gain insights into which syntactic roles Siamese transformers attend to, confirm that they mostly ignore negation, explore how they judge semantically opposite adjectives, and find that they exhibit lexical bias.
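To make concrete what attributing a two-input prediction can mean, the following is a minimal sketch under strong simplifying assumptions; it is not the method of this paper or of Möller et al. (2023). It assumes a toy Siamese encoder that mean-pools fixed token embeddings and scores a sentence pair by a dot product. For such a linear model the score decomposes exactly into one contribution per token pair. The names mean_pool, similarity, and pair_attributions, as well as the embedding values, are hypothetical and chosen for illustration only. Real transformers are nonlinear, so no such closed-form decomposition is available for them.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(u, v))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average the token embeddings into one sentence embedding."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[k] for t in tokens) / n for k in range(dim)]


def similarity(a_tokens: List[Vector], b_tokens: List[Vector]) -> float:
    """Toy Siamese score: dot product of the two pooled embeddings."""
    return dot(mean_pool(a_tokens), mean_pool(b_tokens))


def pair_attributions(a_tokens: List[Vector], b_tokens: List[Vector]) -> List[List[float]]:
    """Token-pair contributions; entry [i][j] belongs to token i of A and token j of B.

    Because pooling and the dot product are linear, the entries sum to the score.
    """
    scale = 1.0 / (len(a_tokens) * len(b_tokens))
    return [[scale * dot(a, b) for b in b_tokens] for a in a_tokens]


# Hypothetical 2-dimensional token embeddings for two short sentences.
sent_a = [[1.0, 0.0], [0.0, 2.0]]
sent_b = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]

score = similarity(sent_a, sent_b)
attr = pair_attributions(sent_a, sent_b)
total = sum(sum(row) for row in attr)

print(f"score = {score:.4f}, sum of pair attributions = {total:.4f}")
for row in attr:
    print(["%.4f" % value for value in row])
```

The result is a matrix over token pairs rather than a vector over tokens of a single input, which reflects the point made above that a Siamese model compares two inputs instead of processing one.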