CL AI Mar 17, 2024

CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models Using Synthetic Back-Translation Data

Kung Yin Hong, Lifeng Han, Riza Batista-Navarro, Goran Nenadic

arXiv:2403.11346v3 1.93 citations h-index: 23 Has Code EAMT

Originality: Synthesis-oriented

AI Analysis

This work addresses translation for the low-resource Cantonese language, providing a platform and models to facilitate research, but it is incremental as it applies standard data augmentation methods to a new language direction.

The authors tackled low-resource neural machine translation for Cantonese-to-English by fine-tuning models with synthetic back-translation data, achieving competitive results as evaluated through automatic metrics like BLEU and embedding-based scores.

Neural Machine Translation (NMT) for low-resource languages remains a challenging task for NLP researchers. In this work, we apply a standard data augmentation methodology, back-translation, to a new translation direction: Cantonese-to-English. We present the models we fine-tuned, including OpusMT, NLLB, and mBART, using the limited amount of real data and the synthetic data we generated using back-translation. We carried out automatic evaluation using a range of metrics, both lexical-based and embedding-based. Furthermore, we create a user-friendly interface for the models included in the CantonMT research project and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit (https://github.com/kenrickkung/CantoneseTranslation).
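As a rough illustration of the back-translation idea mentioned in the abstract, the sketch below builds synthetic parallel pairs from monolingual text. It assumes the textbook setup, in which real target-side (English) sentences are passed through a reverse model to obtain synthetic source-side (Cantonese) sentences; the abstract does not state which side was monolingual, so this direction is an assumption. The names `back_translate`, `mix_training_data`, and `toy_reverse_model` are hypothetical and are not taken from the CantonMT toolkit, and the lookup table stands in for a real NMT model.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each real target sentence with a synthetic source sentence.

    `reverse_translate` is any target-to-source translator (assumed here
    to be English-to-Cantonese); it is supplied by the caller.
    """
    pairs: List[Pair] = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip empty lines in the monolingual corpus
        synthetic_source = reverse_translate(target).strip()
        if synthetic_source:  # drop sentences the reverse model failed on
            pairs.append((synthetic_source, target))
    return pairs


def mix_training_data(real: Iterable[Pair], synthetic: Iterable[Pair]) -> List[Pair]:
    """Concatenate real and synthetic pairs into one fine-tuning set."""
    return list(real) + list(synthetic)


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical stand-in for a reverse NMT model (a tiny lookup table)."""
    lookup = {"thank you": "多謝", "good morning": "早晨"}
    return lookup.get(sentence.lower(), "")


if __name__ == "__main__":
    monolingual_english = ["Thank you", "", "unknown phrase", "Good morning"]
    synthetic = back_translate(monolingual_english, toy_reverse_model)
    real = [("你好", "Hello")]
    for source, target in mix_training_data(real, synthetic):
        print(f"{source}\t{target}")
```

In practice the reverse translator would be a pretrained or fine-tuned model rather than a lookup table, and the ratio of real to synthetic pairs is a design choice that would need to be tuned.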
