CL · Nov 3, 2024

MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

arXiv:2411.01474v2 · 12 citations · h-index: 2 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

This work addresses the problem of improving byte-based neural machine translation for multilingual scalability, offering an incremental advancement in adaptive contextualization.

The paper tackles the challenge of limited semantic information in byte-level tokenization for multilingual machine translation by proposing MoCE, an adaptive mixture of contextualization experts, which outperforms existing methods and subword-based models with fewer parameters on the Ted-59 dataset.

Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to one or more specific bytes, eliminates unknown words, even in new languages. This avoids out-of-vocabulary risk in multilingual translation and enables broad language scalability. However, byte-level tokenization results in sequences that are hard to interpret due to the limited semantic information per byte. Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension. Nevertheless, variations in encoding rules across languages necessitate an adaptive approach for effective contextualization. To this end, we propose Mixture of Contextualization Experts (MoCE), which adaptively selects and mixes attention heads that are treated as contextualization experts. This enhances the flexibility of contextualization scales and allows models to search for better contextualization combinations. Experimental results show that our method outperforms existing methods without extensive manual adjustment of hyper-parameters and surpasses subword-based models with fewer parameters on the Ted-59 dataset. Our code is available at https://github.com/ictnlp/MoCE.
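To illustrate why encoding rules vary across languages, the following minimal sketch tokenizes a few words into UTF-8 bytes and reports how many bytes each character occupies. It is an illustration only and is not taken from the MoCE repository; the helper names `byte_tokens` and `bytes_per_char` and the sample words are assumptions made for this example.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte values of text; every value lies in 0..255."""
    return list(text.encode("utf-8"))


def bytes_per_char(text: str) -> list[int]:
    """Return how many UTF-8 bytes each character of text occupies."""
    return [len(ch.encode("utf-8")) for ch in text]


# Hypothetical sample words, chosen only to show different byte widths.
samples = {
    "English": "cat",
    "German": "Tür",
    "Russian": "кот",
    "Chinese": "猫",
}

if __name__ == "__main__":
    for lang, word in samples.items():
        # The byte vocabulary never exceeds 256 entries, so no word is unknown,
        # but the number of bytes that form one character differs by script.
        print(lang, word, byte_tokens(word), bytes_per_char(word))
```

Latin letters take one byte, Cyrillic letters two, and most CJK characters three, so a fixed local window covers a different amount of text depending on the language. This is the kind of variation that motivates choosing the contextualization scale adaptively rather than fixing it by hand.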

Code Implementations: 1 repo