CL AI Apr 20, 2022

DaLC: Domain Adaptation Learning Curve Prediction for Neural Machine Translation

Cheonbok Park, Hantae Kim, Ioan Calapodescu, Hyunchang Cho, Vassilina Nikoulina

arXiv:2204.09259v1 31.9638 citations h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses a practical issue for machine translation practitioners by enabling informed decisions on resource investment for dataset creation, though it is incremental, as it builds on existing domain adaptation frameworks.

The paper tackles the problem of predicting domain adaptation performance for neural machine translation without parallel data, proposing a model that uses monolingual source samples and achieves better domain distinction with instance-level features compared to prior corpus-level methods.

Domain Adaptation (DA) of a Neural Machine Translation (NMT) model often relies on a pre-trained general NMT model which is adapted to the new domain on a sample of in-domain parallel data. Without parallel data, there is no way to estimate the potential benefit of DA, nor the amount of parallel samples it would require. Such an estimate is, however, a desirable functionality that could help MT practitioners make an informed decision before investing resources in dataset creation. We propose a Domain adaptation Learning Curve prediction (DaLC) model that predicts prospective DA performance based on in-domain monolingual samples in the source language. Our model relies on the NMT encoder representations combined with various instance-level and corpus-level features. We demonstrate that an instance-level framework is better able to distinguish between different domains than the corpus-level frameworks proposed in previous studies. Finally, we perform in-depth analyses of the results, highlighting the limitations of our approach, and provide directions for future research.
