Towards Opening the Black Box of Neural Machine Translation: Source and Target Interpretations of the Transformer
This work addresses the need for better interpretability in NMT models. Its contribution is incremental: it extends existing attribution methods to cover the target context as well as the source.
The authors address limited interpretability in Neural Machine Translation by developing a method that attributes each prediction to both source tokens and target-prefix tokens, which yields insights into model behaviour.
In Neural Machine Translation (NMT), each token prediction is conditioned on the source sentence and the target prefix (what has been previously translated at a decoding step). However, previous work on interpretability in NMT has focused mainly on the attributions of source sentence tokens. Therefore, we lack a full understanding of the influence of every input token (source sentence and target prefix) on the model predictions. In this work, we propose an interpretability method that tracks input tokens' attributions for both contexts. Our method, which can be extended to any encoder-decoder Transformer-based model, allows us to better comprehend the inner workings of current NMT models. We apply the proposed method to both bilingual and multilingual Transformers and present insights into their behaviour.
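To make the idea of attributing one prediction to both contexts concrete, the sketch below shows how per-token scores could be combined into a single distribution and summarised as a source share. It is only an illustration and not the method proposed in the paper: the function names, the assumption of non-negative raw scores, and the example numbers are all hypothetical.

```python
from typing import List, Tuple


def normalise_attributions(
    source_scores: List[float], prefix_scores: List[float]
) -> Tuple[List[float], List[float]]:
    """Rescale raw scores so source and target-prefix attributions sum to 1.

    Assumption (not from the paper): raw scores are non-negative, one per
    input token, and were produced by some attribution method upstream.
    """
    if any(s < 0 for s in source_scores + prefix_scores):
        raise ValueError("raw attribution scores must be non-negative")
    total = sum(source_scores) + sum(prefix_scores)
    if total == 0:
        raise ValueError("at least one attribution score must be positive")
    src = [s / total for s in source_scores]
    tgt = [s / total for s in prefix_scores]
    return src, tgt


def source_contribution(
    source_scores: List[float], prefix_scores: List[float]
) -> float:
    """Fraction of the total attribution assigned to source-sentence tokens."""
    src, _ = normalise_attributions(source_scores, prefix_scores)
    return sum(src)


# Hypothetical scores for one decoding step: three source tokens and
# two previously generated target tokens.
example_source = [0.5, 1.5, 1.0]
example_prefix = [0.25, 0.75]
share = source_contribution(example_source, example_prefix)  # 0.75
```

In this toy example the source sentence accounts for three quarters of the attribution at that step and the target prefix for the remaining quarter. Tracking such a share across decoding steps is one way the relative influence of the two contexts could be inspected.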