Samsung R&D Institute Philippines at WMT 2023
This work addresses parameter-efficient machine translation for a single language pair (English and Hebrew), but it is incremental: it applies known techniques to a constrained setting.
The paper tackles English-to-Hebrew and Hebrew-to-English machine translation with constrained Transformer models that combine data preprocessing, backtranslation, and noisy channel reranking, performing comparably to, and sometimes better than, much larger unconstrained baselines such as mBART50 M2M and NLLB 200 MoE on the FLORES-200 and NTREX-128 benchmarks.
In this paper, we describe the constrained MT systems submitted by Samsung R&D Institute Philippines to the WMT 2023 General Translation Task for two directions: en$\rightarrow$he and he$\rightarrow$en. Our systems consist of Transformer-based sequence-to-sequence models that are trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and the use of noisy channel reranking during online decoding. On two public benchmarks, FLORES-200 and NTREX-128, our models perform comparably to, and sometimes outperform, strong unconstrained baseline systems such as mBART50 M2M and NLLB 200 MoE, despite having significantly fewer parameters.
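
As an illustration of the noisy channel reranking mentioned above, the following is a minimal sketch of the general idea rather than the submitted system's implementation. It assumes each candidate translation already carries three log-probabilities: one from the forward (direct) model, one from a reverse (channel) model, and one from a target-side language model. The names `Candidate`, `noisy_channel_score`, and `rerank`, the weights `channel_weight` and `lm_weight`, and all scores in the example are hypothetical; length normalization, which some formulations include, is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate translation y for a source sentence x (illustrative only)."""
    text: str
    direct_logprob: float   # log P(y | x) from the forward model
    channel_logprob: float  # log P(x | y) from the reverse (channel) model
    lm_logprob: float       # log P(y) from a target-side language model


def noisy_channel_score(c: Candidate,
                        channel_weight: float = 1.0,
                        lm_weight: float = 1.0) -> float:
    """Weighted sum of the direct, channel, and language model log-probabilities."""
    return (c.direct_logprob
            + channel_weight * c.channel_logprob
            + lm_weight * c.lm_logprob)


def rerank(candidates: List[Candidate],
           channel_weight: float = 1.0,
           lm_weight: float = 1.0) -> List[Candidate]:
    """Return candidates sorted from highest to lowest combined score."""
    return sorted(
        candidates,
        key=lambda c: noisy_channel_score(c, channel_weight, lm_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up scores: the forward model prefers "hypothesis A", but the
    # channel and language models prefer "hypothesis B".
    beam = [
        Candidate("hypothesis A", -1.0, -5.0, -4.0),
        Candidate("hypothesis B", -1.5, -2.0, -3.0),
    ]
    for cand in rerank(beam):
        print(cand.text, noisy_channel_score(cand))
```

In this sketch, setting both weights to zero reduces the ranking to the forward model's own ordering, so the weights control how much the reverse and language models can override the beam's original preference. In practice such weights would be tuned on held-out data.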