Fast Training of NMT Model with Data Sorting
This work addresses a computational bottleneck for researchers and practitioners training NMT models, offering an incremental improvement to reduce training time without sacrificing accuracy.
The paper tackles the computational inefficiency of Transformer models in Neural Machine Translation by sorting sentence pairs by length before batching, which minimizes computation wasted on empty tokens. Experiments on English-Korean and English-Luganda language pairs show gains in computational time while maintaining performance.
The Transformer model has revolutionized Natural Language Processing tasks such as Neural Machine Translation, and many studies of the Transformer architecture have improved its efficiency and accuracy. One remaining area for improvement is the computation of empty tokens, which the Transformer computes only to discard later, creating an unnecessary computational burden. To address this, we propose an algorithm that sorts translation sentence pairs by length before batching, minimizing wasted computation. Because extensive sorting could violate the independent and identically distributed (i.i.d.) data assumption, we sort the data only partially. In experiments, we apply the proposed method to English-Korean and English-Luganda machine translation and show gains in computational time while maintaining performance. Our method is independent of the architecture, so it can be easily integrated into any training process with variable-length data.
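The idea of partial sorting before batching can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are not given here. It assumes that "empty tokens" are padding tokens, that partial sorting means sorting inside fixed-size chunks of a shuffled dataset, and that a pair's length is the longer of its source and target. The names `partially_sorted_batches`, `padding_tokens`, `batch_size`, and `chunk_size` are hypothetical.

```python
import random


def partially_sorted_batches(pairs, batch_size, chunk_size, seed=0):
    """Group (source, target) token lists into batches of similar length.

    Assumed scheme (hypothetical, for illustration only):
      1. shuffle the whole dataset,
      2. sort by length only inside chunks of `chunk_size` pairs,
      3. cut each chunk into batches and shuffle the batch order.
    A larger `chunk_size` means more sorting and less padding, but batches
    that are further from a uniformly random sample.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    batches = []
    for start in range(0, len(pairs), chunk_size):
        chunk = pairs[start:start + chunk_size]
        # Sort key: the longer side of the pair (an assumption of this sketch).
        chunk.sort(key=lambda p: max(len(p[0]), len(p[1])))
        for i in range(0, len(chunk), batch_size):
            batches.append(chunk[i:i + batch_size])
    rng.shuffle(batches)
    return batches


def padding_tokens(batches):
    """Count the pad tokens needed to pad every batch to its longest
    source and longest target sentence."""
    total = 0
    for batch in batches:
        max_src = max(len(src) for src, _ in batch)
        max_tgt = max(len(tgt) for _, tgt in batch)
        total += sum((max_src - len(src)) + (max_tgt - len(tgt))
                     for src, tgt in batch)
    return total
```

Calling `padding_tokens` on the batches produced with different `chunk_size` values gives a rough view of the trade-off: setting `chunk_size` equal to `batch_size` leaves batches effectively random, while setting it to the dataset size sorts the data fully.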