CL Oct 18, 2019

Controlling Utterance Length in NMT-based Word Segmentation with Attention

Pierre Godard, Laurent Besacier, Francois Yvon

arXiv:1910.08418v1 25.6644 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of word segmentation for computational language documentation in under-resourced settings, offering an incremental improvement by leveraging translation data.

The paper tackles unsupervised word segmentation for low-resource languages by using neural machine translation models with bilingual information. It introduces a new loss function for jointly learning alignment and segmentation, and shows that these techniques can effectively control segmentation length on Mboshi data.

One of the basic tasks of computational language documentation (CLD) is to identify word boundaries in an unsegmented phonemic stream. While several unsupervised monolingual word segmentation algorithms exist in the literature, they are challenged in real-world CLD settings by the small amount of available data. A possible remedy is to take advantage of glosses or translations in a foreign, well-resourced language, which often exist for such data. In this paper, we explore and compare ways to exploit neural machine translation models to perform unsupervised boundary detection with bilingual information, notably introducing a new loss function for jointly learning alignment and segmentation. We experiment with an actual under-resourced language, Mboshi, and show that these techniques can effectively control the output segmentation length.
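To make the idea of boundary detection from bilingual alignment more concrete, here is a minimal sketch of one common way to turn a soft alignment matrix into a segmentation: each phoneme is hard-aligned to the translation word it attends to most, and a boundary is placed wherever that word changes. This is an illustrative assumption, not necessarily the procedure or loss used in the paper; the function name `segment_from_attention`, the toy phoneme string, and the attention weights are all hypothetical.

```python
from typing import List, Sequence


def segment_from_attention(
    phonemes: Sequence[str],
    attention: Sequence[Sequence[float]],
) -> List[str]:
    """Split a phoneme sequence wherever the most-attended translation word changes.

    attention[i][j] is a (hypothetical) weight linking phoneme i to
    translation word j, e.g. taken from an NMT attention layer.
    """
    if len(phonemes) != len(attention):
        raise ValueError("need exactly one attention row per phoneme")
    if not phonemes:
        return []

    # Hard-align each phoneme to the translation word with the largest weight.
    aligned = [max(range(len(row)), key=row.__getitem__) for row in attention]

    words = [phonemes[0]]
    for i in range(1, len(phonemes)):
        if aligned[i] != aligned[i - 1]:
            words.append(phonemes[i])  # alignment changed: start a new unit
        else:
            words[-1] += phonemes[i]   # same translation word: extend the unit
    return words


# Toy example: five phonemes, two translation words (invented weights).
toy_attention = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.6, 0.4],
]
print(segment_from_attention(list("abcde"), toy_attention))
```

Under this kind of rule, the number of output units depends directly on how often the hard alignment switches between translation words, which is one way to see why shaping the alignment during training can influence segmentation length.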
