CL, May 5, 2020

Phonetic and Visual Priors for Decipherment of Informal Romanization

arXiv:2005.02517v1995 citations
AI Analysis

This paper addresses the challenge of processing informal digital communication in non-Latin-script languages for NLP applications. The contribution is incremental: it builds on prior decipherment methods by adding new priors.

The paper tackles the problem of deciphering informal romanization (non-Latin-script languages encoded into Latin characters) by proposing an unsupervised noisy-channel WFST cascade model with phonetic and visual priors on character mappings. Trained directly on romanized Egyptian Arabic and Russian data, the model improves substantially on both languages when the priors are added, bringing results much closer to the supervised skyline.

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages: namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.
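To make the noisy-channel idea concrete, here is a minimal Python sketch, not the paper's WFST implementation. It assumes a strictly one-to-one character substitution channel (no insertions or deletions), a toy bigram language model over a few Russian words, and a hand-written table of similar character pairs that stands in for the phonetic and visual priors. All names (`CORPUS`, `PRIOR_PAIRS`, `emission_logprobs`, `bigram_logprobs`, `decode`) and all numeric values are hypothetical and chosen for illustration only; no unsupervised training is performed.

```python
import math
from collections import defaultdict

# Hypothetical toy data, not taken from the paper.
CORPUS = ["да", "дом", "мода", "вода"]  # text in the original script
LATIN = "abdmov"                        # observed romanized alphabet

# Assumed prior: Latin characters that sound or look like each Cyrillic one.
PRIOR_PAIRS = {"д": ["d"], "а": ["a"], "о": ["o"], "м": ["m"], "в": ["v", "b"]}


def emission_logprobs(prior_weight=5.0, smoothing=0.1):
    """log p(latin | cyrillic), built from prior pseudo-counts only."""
    table = {}
    for cyr, similar in PRIOR_PAIRS.items():
        counts = {
            lat: smoothing + (prior_weight if lat in similar else 0.0)
            for lat in LATIN
        }
        total = sum(counts.values())
        table[cyr] = {lat: math.log(c / total) for lat, c in counts.items()}
    return table


def bigram_logprobs(corpus, alphabet, smoothing=0.1):
    """Smoothed character bigram model; '^' marks the start of a word."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in corpus:
        prev = "^"
        for ch in word:
            counts[prev][ch] += 1
            prev = ch
    lm = {}
    for prev in ["^"] + list(alphabet):
        total = sum(counts[prev].values()) + smoothing * len(alphabet)
        lm[prev] = {
            ch: math.log((counts[prev][ch] + smoothing) / total)
            for ch in alphabet
        }
    return lm


def decode(romanized, lm, emit):
    """Viterbi search for argmax over o of p(o) * p(romanized | o).

    Expects a non-empty string made only of characters in LATIN.
    """
    alphabet = list(emit)
    # best[ch] = (log score, sequence) of the best path ending in ch
    best = {ch: (lm["^"][ch] + emit[ch][romanized[0]], ch) for ch in alphabet}
    for lat in romanized[1:]:
        new_best = {}
        for ch in alphabet:
            score, seq = max(
                (best[p][0] + lm[p][ch], best[p][1]) for p in alphabet
            )
            new_best[ch] = (score + emit[ch][lat], seq + ch)
        best = new_best
    return max(best.values())[1]


if __name__ == "__main__":
    emit = emission_logprobs()
    lm = bigram_logprobs(CORPUS, list(emit))
    print(decode("voda", lm, emit))
```

In this sketch the prior enters only as pseudo-counts on the emission table. Setting `prior_weight=0.0` makes every emission uniform, so the decoder falls back on the language model alone, which gives a rough sense of why an inductive bias on character mappings can matter when no parallel data is available.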

Code Implementations: 1 repo
