CL, Oct 5, 2022

Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation

arXiv:2210.02509v1, 580 citations, h-index: 13
Originality: Incremental advance
AI Analysis

This work aims to improve language modelling and machine translation for low-resource and highly synthetic languages. Its contribution is incremental, since it revisits an existing linguistic unit, the syllable.

The study addresses the underuse of syllables in language modelling and machine translation. It reports that, at comparable perplexity, syllables outperform characters and other subwords in open-vocabulary language modelling across 21 languages. In low-resource neural machine translation for Spanish and Shipibo-Konibo, syllables also outperform unsupervised subwords and morphological segmentation methods when translating into Shipibo-Konibo.

Language modelling and machine translation tasks mostly use subword or character inputs, but syllables are seldom used. Syllables provide shorter sequences than characters, require less specialised extraction rules than morphemes, and their segmentation is not affected by corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy. With a comparable perplexity, we show that syllables outperform characters and other subwords. Moreover, we study the impact of syllables on neural machine translation for an unrelated, low-resource language pair (Spanish–Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform unsupervised subwords, as well as morphological segmentation methods, when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we perform a human evaluation and discuss limitations and opportunities.
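
To illustrate why syllable segmentation yields shorter sequences than character segmentation, here is a minimal sketch of a rule-based splitter. It is not the syllabification method used in the study: the function name `toy_syllabify`, the five-vowel inventory, and the rule that every consonant run attaches to the following vowel group are all assumptions made for illustration, and the rule is not linguistically accurate for any particular language.

```python
import re

# Assumption: a toy five-vowel inventory; real syllabifiers use
# language-specific rules or hyphenation patterns.
VOWELS = "aeiou"

# One toy syllable = optional consonant run followed by a vowel run.
_SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+")


def toy_syllabify(word: str) -> list[str]:
    """Split a word into toy syllables (hypothetical helper, illustration only)."""
    word = word.lower()
    pieces = _SYLLABLE.findall(word)
    if not pieces:
        # No vowel found: keep the word whole (or nothing for an empty string).
        return [word] if word else []
    # Any consonants left after the last vowel run join the final syllable.
    consumed = sum(len(piece) for piece in pieces)
    pieces[-1] += word[consumed:]
    return pieces


word = "banana"
characters = list(word)
syllables = toy_syllabify(word)  # ['ba', 'na', 'na']
print(len(characters), len(syllables))  # 6 3
```

Because the split depends only on the word itself, the same word is always segmented the same way regardless of how much training text is available, which is the property the abstract contrasts with corpus-dependent subword methods.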

