CL, ML; Dec 5, 2015

PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora

arXiv:1512.01639v14.07 citations, h-index: 15

Originality: Synthesis-oriented

AI Analysis

This work addresses SMT translation quality across a diverse set of language pairs. It is incremental: it builds on existing methods, adding new data and tools.

The paper aims to improve Statistical Machine Translation (SMT) systems for multiple language pairs by using comparable corpora and adaptation techniques. The authors report a positive impact on translation quality as measured by the BLEU, NIST, and TER metrics.

In this paper, we attempt to improve Statistical Machine Translation (SMT) systems on a very diverse set of language pairs (in both directions): Czech-English, Vietnamese-English, French-English and German-English. To accomplish this, we performed translation model training, adapted the training settings for each language pair, and obtained comparable corpora for our SMT systems. Innovative tools and data adaptation techniques were employed. The TED parallel text corpora for the IWSLT 2015 evaluation campaign were used to train language models, and to develop, tune, and test the systems. In addition, we prepared Wikipedia-based comparable corpora for use with our SMT systems. This data was specified as permissible for the IWSLT 2015 evaluation. We explored the use of domain adaptation techniques, symmetrized word alignment models, unsupervised transliteration models and the KenLM language modeling tool. To evaluate the effects of different preparations on translation results, we conducted experiments and used the BLEU, NIST and TER metrics. Our results indicate that our approach had a positive impact on SMT quality.
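
As an illustration of what "symmetrized word alignment" refers to, the sketch below combines two directional word alignments by intersection and by union. It is a minimal toy example, not the authors' pipeline: the abstract does not say which symmetrization heuristic was used, and the function name `symmetrize` and the sample alignments are hypothetical.

```python
from typing import Set, Tuple

# An alignment is a set of (index_a, index_b) word-position pairs.
Alignment = Set[Tuple[int, int]]


def symmetrize(src2tgt: Alignment, tgt2src: Alignment) -> Tuple[Alignment, Alignment]:
    """Return (intersection, union) of two directional alignments.

    src2tgt holds (source_index, target_index) pairs.
    tgt2src holds (target_index, source_index) pairs and is flipped
    so that both sets use the (source, target) orientation.
    """
    flipped = {(s, t) for (t, s) in tgt2src}
    return src2tgt & flipped, src2tgt | flipped


# Hypothetical toy alignments for a three-word sentence pair.
forward = {(0, 0), (1, 1), (2, 1)}   # source -> target
backward = {(0, 0), (1, 1), (2, 2)}  # target -> source

inter, union = symmetrize(forward, backward)
print(sorted(inter))
print(sorted(union))
```

The intersection keeps only links that both directions agree on (higher precision), while the union keeps every link proposed by either direction (higher recall). Heuristics commonly used in SMT toolkits start from the intersection and selectively add links from the union.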
