On Biasing Transformer Attention Towards Monotonicity
This work addresses monotonic alignment in NLP tasks such as grapheme-to-phoneme conversion. Its contribution is incremental: it builds on prior work that encouraged monotonic attention through specialized attention functions or pretraining.
The authors encourage monotonic attention behavior in sequence-to-sequence tasks by introducing a monotonicity loss function that is compatible with standard attention mechanisms. The resulting attention is largely monotonic, but performance results are mixed: gains are larger over RNN baselines, while transformer multihead attention benefits little.
Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining. In this work, we introduce a monotonicity loss function that is compatible with standard attention mechanisms and test it on several sequence-to-sequence tasks: grapheme-to-phoneme conversion, morphological inflection, transliteration, and dialect normalization. Experiments show that we can achieve largely monotonic behavior. Performance is mixed, with larger gains on top of RNN baselines. General monotonicity does not benefit transformer multihead attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.
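To make the idea of a monotonicity loss concrete, here is a minimal sketch of one way such a penalty could be computed from an attention matrix. It is an illustrative assumption, not the formulation used in the paper: the abstract does not define the loss, and the function names `expected_positions` and `monotonicity_penalty` are hypothetical. The sketch takes the attention-weighted mean source position for each target step and penalizes any step where that position moves backwards.

```python
# Illustrative sketch only: the abstract does not specify the loss, so this
# formulation (penalizing backward moves of the expected attended position)
# is an assumption, and both function names are hypothetical.

def expected_positions(attention):
    """Return the attention-weighted mean source index for each target step.

    `attention` is a list of rows, one per target step; each row holds the
    attention weights over source positions and is assumed to sum to 1.
    """
    return [sum(j * w for j, w in enumerate(row)) for row in attention]


def monotonicity_penalty(attention):
    """Average amount by which the expected source position moves backwards."""
    positions = expected_positions(attention)
    if len(positions) < 2:
        return 0.0  # a single target step cannot violate monotonicity
    backward = [max(0.0, prev - cur)
                for prev, cur in zip(positions, positions[1:])]
    return sum(backward) / len(backward)


# A perfectly monotonic alignment and a fully reversed one.
monotonic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reversed_ = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

print(monotonicity_penalty(monotonic))  # 0.0
print(monotonicity_penalty(reversed_))  # 1.0
```

In a training setup, a term like this would presumably be added to the task loss with a weighting factor, and it could be applied to all attention heads or only to a subset of them, which is the distinction the abstract draws for multihead attention.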