CL · Jul 11, 2023

Neural Machine Translation Data Generation and Augmentation using ChatGPT

arXiv:2307.05779v1 · 2.99 citations · h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses data scarcity for machine translation researchers and practitioners, but the advance is incremental because it builds on existing generative models.

The paper tackles the expense and time cost of creating parallel corpora for neural machine translation by using generative language models such as ChatGPT to hallucinate parallel data. It finds that this data improves the translation signal even when its domain does not match that of the original dataset.

Neural models have revolutionized the field of machine translation, but creating parallel corpora is expensive and time-consuming. We investigate an alternative to manual parallel corpora - hallucinated parallel corpora created by generative language models. Although these models are themselves trained on parallel data, they can leverage a multilingual vector space to create data, and may be able to supplement small manually-procured corpora. Our experiments highlight two key findings - despite a lack of diversity in their output, the hallucinated data improves the translation signal, even when the domain clashes with the original dataset.

