SignAttention: On the Interpretability of Transformer Models for Sign Language Translation
The paper addresses the problem of understanding and improving transparency in sign language translation systems for real-world applications, but it is incremental because it applies existing interpretability methods to a new domain.
This paper tackles the interpretability of Transformer models for Sign Language Translation by analyzing the attention mechanisms of a model that translates Greek Sign Language to glosses and text. The analysis reveals that the model attends to clusters of frames with a diagonal alignment pattern, and that it shifts its focus from video frames to previously predicted tokens during decoding.
This paper presents the first comprehensive interpretability analysis of a Transformer-based Sign Language Translation (SLT) model, focusing on the translation from video-based Greek Sign Language to glosses and text. Leveraging the Greek Sign Language Dataset, we examine the attention mechanisms within the model to understand how it processes and aligns visual input with sequential glosses. Our analysis reveals that the model pays attention to clusters of frames rather than individual ones, with a diagonal alignment pattern emerging between poses and glosses, which becomes less distinct as the number of glosses increases. We also explore the relative contributions of cross-attention and self-attention at each decoding step, finding that the model initially relies on video frames but shifts its focus to previously predicted tokens as the translation progresses. This work contributes to a deeper understanding of SLT models, paving the way for the development of more transparent and reliable translation systems essential for real-world applications.
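The diagonal alignment pattern described in the abstract can be made concrete with a small numerical sketch. The code below is not the paper's method: it assumes a cross-attention matrix with one row per gloss and one column per video frame (each row summing to 1), and the function name `diagonal_deviation` and the scoring rule are hypothetical choices made for illustration. It compares the attention-weighted frame position of each gloss with the position that gloss would have on a perfect diagonal.

```python
import numpy as np


def diagonal_deviation(attn):
    """Mean distance between each gloss's attended frame position and the diagonal.

    attn: array of shape (num_glosses, num_frames), each row summing to 1.
    Returns a value >= 0; 0 means the attention mass is centred on the diagonal.
    This is an illustrative measure, not the one used in the paper.
    """
    attn = np.asarray(attn, dtype=float)
    num_glosses, num_frames = attn.shape

    # Centre of each frame, normalised to the interval (0, 1).
    frame_pos = (np.arange(num_frames) + 0.5) / num_frames
    # Attention-weighted average frame position for each gloss.
    attended_pos = attn @ frame_pos

    # Position each gloss would have if glosses split the video into equal segments.
    ideal_pos = (np.arange(num_glosses) + 0.5) / num_glosses

    return float(np.mean(np.abs(attended_pos - ideal_pos)))


# Toy example: 3 glosses, 6 frames, each gloss attending to its own cluster of 2 frames.
clustered = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
])
# Baseline: every gloss spreads its attention evenly over all frames.
uniform = np.full((3, 6), 1.0 / 6.0)

clustered_score = diagonal_deviation(clustered)
uniform_score = diagonal_deviation(uniform)
print(clustered_score, uniform_score)
```

Under this measure the clustered matrix scores lower than the uniform one, which matches the intuition that frame clusters arranged along the diagonal indicate a monotonic alignment between poses and glosses. In practice the matrix would come from a trained model's cross-attention weights, typically averaged over heads, which this sketch does not cover.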