CL AI · Mar 16, 2022

BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

arXiv:2203.08954v132.1646 citations · h-index: 38

Originality: Incremental advance

AI Analysis

This work addresses data sparsity in NLP for low-resource polysynthetic languages. The advance is incremental because it builds on existing segmentation techniques.

The study compares morphological segmentation methods against BPEs as input for machine translation of polysynthetic languages. It finds that an unsupervised morphological segmentation algorithm outperforms BPEs for three of the four languages, with Nahuatl as the exception.

Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri–Spanish.
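For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of how BPE merges are typically learned from word frequencies and then applied to a new word. It is a generic illustration, not the authors' pipeline: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy English word list are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict (toy sketch)."""
    # Start from characters, with an assumed end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus, not data from the paper.
    toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(toy_freqs, num_merges=10)
    print(merges)
    print(apply_bpe("lowest", merges))
```

The merges here are driven purely by pair frequency, so the resulting subwords need not align with morpheme boundaries. A morphological segmenter, by contrast, aims to split words at those boundaries, which is the distinction the comparison in the abstract rests on.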
