CL · May 16, 2023

Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation

arXiv:2305.09312v1224 citations
Originality: Incremental advance
AI Analysis

This work targets translation accuracy in zero-shot directions in machine translation. The advance is incremental, since it builds on known architectural variations.

The paper investigates the impact of layer normalization placement (PreNorm vs. PostNorm) on zero-shot neural machine translation, finding that PostNorm consistently outperforms PreNorm by up to 12.3 BLEU points across multiple datasets and 54 translation directions.

This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may overfit supervised directions and thus have low generalizability for ZST. Through experiments on OPUS, IWSLT, and Europarl datasets for 54 ZST directions, we demonstrate that the original Transformer setting of LayerNorm after residual connections (PostNorm) consistently outperforms PreNorm by up to 12.3 BLEU points. We then study the performance disparities by analyzing the differences in off-target rates and structural variations between PreNorm and PostNorm. This study highlights the need for careful consideration of the LayerNorm setting for ZST.
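The two placements contrasted in the abstract differ only in where LayerNorm sits relative to the residual connection. The sketch below is a minimal illustration, not the paper's implementation. It assumes a LayerNorm without learned gain or bias, and `sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance.

    Assumption: no learned gain or bias, to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or feed-forward: a fixed elementwise map."""
    return [0.5 * v + 1.0 for v in x]


def pre_norm_block(x):
    # PreNorm: LayerNorm at the sublayer input; the residual is added un-normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x):
    # PostNorm: add the residual first, then apply LayerNorm to the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


x = [1.0, 2.0, 4.0]
pre_out = pre_norm_block(x)
post_out = post_norm_block(x)
```

In this toy setting, `post_out` is re-normalized at the block output, while `pre_out` keeps the scale and offset of the incoming residual `x`. The sketch shows only the structural difference and says nothing on its own about zero-shot translation quality.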

