CL · Oct 13, 2022

Tone prediction and orthographic conversion for Basaa

Ilya Nikitin, Brian O'Connor, Anastasia Safonova

arXiv:2210.06986v1 · 1 citation · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses orthographic conversion for Basaa, a low-resource language. It is an incremental contribution to natural language processing, aimed at a specific linguistic application.

The paper tackles the problem of transliterating missionary Basaa orthographies into the official orthography using a seq2seq approach with mT5, achieving a character error rate (CER) of 12.6747 and a word error rate (WER) of 40.1012.
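
For readers unfamiliar with the two metrics, here is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between the reference and the model output, divided by the reference length, at character and word level respectively. The source does not say how the authors computed their scores or whether the reported values are percentages, so the percentage scaling and the function names `edit_distance`, `cer`, and `wer` are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage (scaling is an assumption)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Placeholder strings, not Basaa data.
    print(cer("abcd", "abcf"))
    print(wer("a b c d", "a b x d"))
```

Because a single wrong character makes the whole word count as an error, WER is normally much higher than CER on the same output, which is consistent with the gap between the two figures reported above.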

In this paper, we present a seq2seq approach for transliterating missionary Basaa orthographies into the official orthography. Our model uses pre-trained Basaa missionary and official orthography corpora using BERT. Since Basaa is a low-resource language, we chose the mT5 model for our project. Before training our model, we pre-processed our corpora by eliminating one-to-one correspondences between spellings and by unifying characters that are variably encoded as either one or two characters into a single-character form. Our best mT5 model achieved a CER of 12.6747 and a WER of 40.1012.
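
The character-unification step in the abstract can be illustrated with Unicode normalization. The abstract does not say how the authors did it, so the following is only a sketch under the assumption that the variation is between a precomposed letter and a base letter plus combining mark; `unify_characters` is a hypothetical helper name.

```python
import unicodedata


def unify_characters(text: str) -> str:
    """Compose base letter + combining mark pairs into single code points.

    Uses Unicode NFC normalization. This is an assumed stand-in for the
    paper's unification step, not the authors' documented procedure.
    """
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by a combining acute accent
    composed = unify_characters(decomposed)
    print(len(decomposed), len(composed))
```

NFC only composes sequences for which Unicode defines a precomposed code point. Letter and tone-mark combinations without one stay as multiple code points, so a pipeline that needs strictly one character per unit would need its own mapping table on top of this.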
