Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
This work addresses the challenge of enabling NLP pipelines for low-resource languages, particularly Indigenous American languages, by improving segmentation without extensive labeled data. Its contribution is incremental, since it builds on existing models and datasets.
The paper tackles unsupervised sequence segmentation for extremely low-resource languages by pre-training a Masked Segmental Language Model multilingually. It shows that transfer from typologically similar languages yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6 of 10 experimental settings and reaching a zero-shot F1 of 20.6.
We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
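To make the reported F1 figures easier to interpret, the following is a minimal sketch of one common way to score a predicted segmentation against a gold one: precision, recall, and F1 over internal boundary positions. The abstract does not specify how F1 is computed, so this boundary-level formulation is an assumption, and the function names `boundaries` and `boundary_f1` and the toy strings are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import List, Set


def boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets of the internal boundaries of a segmentation.

    The end of the final segment is not counted, since every segmentation
    of the same string trivially shares it.
    """
    offsets: Set[int] = set()
    position = 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_f1(gold: List[str], predicted: List[str]) -> float:
    """Boundary-level F1 between two segmentations of the same string.

    Assumption: F1 is the harmonic mean of boundary precision and recall.
    The paper's exact metric may differ.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same string")

    gold_bounds = boundaries(gold)
    pred_bounds = boundaries(predicted)
    true_positives = len(gold_bounds & pred_bounds)

    precision = true_positives / len(pred_bounds) if pred_bounds else 0.0
    recall = true_positives / len(gold_bounds) if gold_bounds else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up strings (not real K'iche' data).
gold_segmentation = ["ab", "cd", "e"]       # boundaries at offsets 2 and 4
predicted_segmentation = ["ab", "c", "de"]  # boundaries at offsets 2 and 3
score = boundary_f1(gold_segmentation, predicted_segmentation)
```

In this toy case one of the two predicted boundaries matches one of the two gold boundaries, so precision and recall are both 0.5. Reported scores such as 20.6 are this kind of quantity scaled to a 0-100 range, under the stated assumption about the metric.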