CL | Dec 31, 2024

Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches

Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, Surangika Ranathunga

arXiv:2501.00529v1 | 20.917 citations | h-index: 14 | Has Code | COLING Workshops

Originality: Synthesis-oriented

AI Analysis

This work addresses transliteration for low-resource languages like Sinhala, where Romanization is common because of convenience and limited tech literacy. The contribution is incremental, however, as it applies existing methods to a new domain.

The study tackled Romanized Sinhala transliteration by comparing a rule-based baseline with a Transformer-based sequence-to-sequence method, finding that the Transformer method captured more ad-hoc patterns in the scripts.

Due to convenience and a lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in low-resource languages such as Sinhala, which have their own writing scripts. In this study, we focus on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which we compare against a second method that treats transliteration as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method captured more of the ad-hoc patterns within the Romanized scripts than the rule-based method. The code base associated with this paper is available on GitHub: https://github.com/kasunw22/Sinhala-Transliterator/
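
To make the rule-based idea concrete, here is a minimal sketch of a greedy longest-match transliterator. It is not the authors' rule set: the mapping tables, the function name `transliterate`, and the assumption that the task runs from Romanized text to Sinhala script are all illustrative choices, and the tables cover only a handful of letters.

```python
# Hypothetical, tiny mapping tables (NOT the paper's rules).
# Code points are written as escapes so they can be checked against Unicode.
CONSONANTS = {
    "k": "\u0d9a", "g": "\u0d9c", "n": "\u0db1", "m": "\u0db8",
    "l": "\u0dbd", "r": "\u0dbb", "s": "\u0dc3",
}
# Roman vowel -> (independent letter, dependent vowel sign).
# The inherent vowel "a" needs no sign after a consonant.
VOWELS = {
    "a": ("\u0d85", ""),
    "aa": ("\u0d86", "\u0dcf"),
    "i": ("\u0d89", "\u0dd2"),
    "u": ("\u0d8b", "\u0dd4"),
    "e": ("\u0d91", "\u0dd9"),
    "o": ("\u0d94", "\u0ddc"),
}
HAL = "\u0dca"  # al-lakuna: suppresses the inherent vowel


def _match(text, pos, table):
    """Return the longest key of `table` that starts at `pos`, or None."""
    for length in sorted({len(k) for k in table}, reverse=True):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate(text):
    """Convert Romanized text to Sinhala script with fixed rules."""
    text = text.lower()
    out = []
    i = 0
    while i < len(text):
        cons = _match(text, i, CONSONANTS)
        if cons:
            i += len(cons)
            vowel = _match(text, i, VOWELS)
            if vowel:
                # Consonant followed by a vowel: attach the vowel sign.
                out.append(CONSONANTS[cons] + VOWELS[vowel][1])
                i += len(vowel)
            else:
                # Bare consonant: mark it with the al-lakuna.
                out.append(CONSONANTS[cons] + HAL)
            continue
        vowel = _match(text, i, VOWELS)
        if vowel:
            # Vowel with no preceding consonant: use the independent letter.
            out.append(VOWELS[vowel][0])
            i += len(vowel)
            continue
        out.append(text[i])  # pass through anything the rules do not cover
        i += 1
    return "".join(out)


print(transliterate("kiri"))  # ka + i-sign, ra + i-sign
```

Fixed rules like these assume one consistent Romanization scheme. Writers who drop vowels or spell the same sound in several ways fall outside the table, which is the kind of ad-hoc variation the abstract says the sequence-to-sequence model handled better.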
