CL · AI · Dec 5, 2024

AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer

arXiv:2412.03877v1 · 1 citation · h-index: 3
Originality: Highly original
AI Analysis

This work addresses the challenge of accurate Thai-Latin transliteration for applications such as cross-lingual information retrieval and identity verification. The authors present it as a significant advance in bridging linguistic gaps while respecting the cultural dimensions of name romanization.

This study tackled the problem of transliterating Thai proper names into Latin script by developing AyutthayaAlpha, a transformer-based model that achieved state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy while maintaining a low character error rate of 0.0047.

This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.
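The abstract reports a character error rate and two accuracy figures but does not define them here. The sketch below shows one common way such metrics are computed, and it is not the authors' evaluation code. It assumes the character error rate is the total Levenshtein edit distance divided by the total reference length, and that the "first-token" and "first-three-token" figures can be read as exact-match accuracy over the top 1 and top 3 ranked candidates. That reading is an assumption, as are the function names and the romanized example names, which are made up for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Total edits over total reference characters (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def top_k_accuracy(refs, candidate_lists, k):
    """Share of references that appear among the first k ranked candidates."""
    hits = sum(ref in cands[:k] for ref, cands in zip(refs, candidate_lists))
    return hits / len(refs)


# Illustrative data only: each reference has a ranked list of model candidates.
refs = ["somchai", "siriporn"]
candidates = [["somchai", "somchay"], ["siriphon", "siriporn"]]

top1_hyps = [cands[0] for cands in candidates]
cer = character_error_rate(refs, top1_hyps)
top1 = top_k_accuracy(refs, candidates, k=1)
top3 = top_k_accuracy(refs, candidates, k=3)
print(f"CER={cer:.4f} top-1={top1:.2f} top-3={top3:.2f}")
```

Under this reading, a gap between the top-1 and top-3 figures is expected for names, since a single Thai name can have several accepted romanizations and the preferred one may be ranked second or third.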

