Mar 3, 2024

Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation

arXiv:2403.01479v3
82 citations
h-index: 10
LREC
Originality: Incremental advance
AI Analysis

This work addresses efficiency improvements for neural machine translation by enhancing knowledge distillation methods, representing an incremental advance in the field.

The paper tackles the feature mapping problem in knowledge distillation for neural machine translation, where the choice of which teacher layers to distill from is usually made by heuristics. It introduces the Align-to-Distill (A2D) strategy, which adaptively aligns student attention heads with teacher attention heads during training. Reported gains over Transformer baselines reach up to +3.61 BLEU on WMT-2022 De->Dsb and +0.63 BLEU on WMT-2014 En->De.

The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches to Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb and WMT-2014 En->De, respectively, compared to Transformer baselines.
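The head-by-head alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' Attention Alignment Module: the abstract does not give its exact form, so the sketch assumes each teacher head is matched against a learned convex mixture of all student heads and scored with a KL divergence. The names `align_heads`, `alignment_loss`, and `mix_logits` are hypothetical, and the logits are set by hand here rather than trained.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def align_heads(student_maps, mix_logits):
    """Build one mixed attention map per teacher head.

    student_maps: S attention maps (query x key); every row sums to 1.
    mix_logits:   T x S logits (hypothetical learnable parameters).
    A softmax over each logit row gives a convex combination of student
    maps, so every mixed row is still a probability distribution.
    """
    n_q, n_k = len(student_maps[0]), len(student_maps[0][0])
    aligned = []
    for logits in mix_logits:
        weights = softmax(logits)
        mixed = [
            [sum(w * s[q][k] for w, s in zip(weights, student_maps))
             for k in range(n_k)]
            for q in range(n_q)
        ]
        aligned.append(mixed)
    return aligned


def kl_rows(p, q, eps=1e-9):
    """Mean KL(p || q) over the query rows of two attention maps."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p_row, q_row))
    return total / len(p)


def alignment_loss(teacher_maps, student_maps, mix_logits):
    """Average KL between each teacher head and its mixed student map."""
    aligned = align_heads(student_maps, mix_logits)
    return sum(kl_rows(t, a) for t, a in zip(teacher_maps, aligned)) / len(teacher_maps)


# Toy data: 2 student heads, 3 teacher heads, 2 query x 2 key positions.
student = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
teacher = [student[1], student[0], student[0]]

uniform_logits = [[0.0, 0.0]] * 3                                # no preference
matched_logits = [[-20.0, 20.0], [20.0, -20.0], [20.0, -20.0]]   # picks the matching head

loss_uniform = alignment_loss(teacher, student, uniform_logits)
loss_matched = alignment_loss(teacher, student, matched_logits)
```

In this toy setup the loss falls as the mixing weights concentrate on the student head that matches each teacher head. That is the sense in which a fixed layer-mapping heuristic becomes a quantity that training can optimize; in a real system the logits would be updated by gradient descent together with the student model.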

Code Implementations: 1 repo