CL · Mar 24, 2025

PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment

Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung

arXiv:2503.18250v2 · 2.7 · h-index: 7

Originality: Incremental advance

AI Analysis

PAD offers a cost-efficient solution for resource-scarce languages, though the advance is incremental because it builds on existing SMT methods.

The paper tackles data scarcity in transfer learning for non-English languages such as Korean by using Phrase Aligned Data (PAD) from Statistical Machine Translation (SMT). The authors report significant performance improvements and better cost efficiency.

Transfer learning leverages the abundance of English data to address the scarcity of resources in modeling non-English languages, such as Korean. In this study, we explore the potential of Phrase Aligned Data (PAD) from standardized Statistical Machine Translation (SMT) to enhance the efficiency of transfer learning. Through extensive experiments, we demonstrate that PAD synergizes effectively with the syntactic characteristics of the Korean language, mitigating the weaknesses of SMT and significantly improving model performance. Moreover, we reveal that PAD complements traditional data construction methods and enhances their effectiveness when combined. This innovative approach not only boosts model performance but also suggests a cost-efficient solution for resource-scarce languages.
