CL LG Mar 31, 2020

A Clustering Framework for Lexical Normalization of Roman Urdu

Abdul Rafae Khan, Asim Karim, Hassan Sajjad, Faisal Kamiran, Jia Xu

arXiv:2004.00088v1

AI Analysis

This work addresses a domain-specific challenge in natural language processing for South Asian online content, but appears incremental as it builds on existing normalization techniques.

The authors address lexical normalization for Roman Urdu, an informal variant of Urdu with non-standard spelling. They propose a feature-based clustering framework that combines a phonetic algorithm (UrduPhone), a string matching component, a feature-based similarity function, and a clustering algorithm (Lex-Var). The abstract shown here does not quantify the resulting normalization improvements.

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
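The pipeline described in the abstract (phonetic encoding, string matching, a combined similarity score, then clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the consonant table is a generic Soundex-style grouping and not the actual UrduPhone encoding, `difflib.SequenceMatcher` stands in for the paper's string matching component, and the greedy single-pass grouping is a simple substitute for Lex-Var. The names `toy_phonetic_code`, `similarity`, and `greedy_cluster`, along with the weight and threshold values, are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical Soundex-style consonant groups (NOT the real UrduPhone table).
GROUPS = {
    "b": "1", "p": "1", "f": "1", "v": "1", "w": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "j": "2", "s": "2", "x": "2", "z": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
}


def toy_phonetic_code(word, length=6):
    """Keep the first letter, then encode consonants as group digits.

    Unmapped letters (vowels, h, y) are skipped, and a digit is not
    repeated for directly adjacent letters of the same group.
    """
    word = word.lower()
    if not word:
        return ""
    code = word[0]
    prev = GROUPS.get(word[0], "")
    for ch in word[1:]:
        digit = GROUPS.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    # Truncate or pad with zeros to a fixed length.
    return code[:length].ljust(length, "0")


def similarity(a, b, phonetic_weight=0.5):
    """Weighted mix of phonetic-code equality and character-level similarity."""
    phonetic = 1.0 if toy_phonetic_code(a) == toy_phonetic_code(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return phonetic_weight * phonetic + (1 - phonetic_weight) * string


def greedy_cluster(words, threshold=0.75):
    """Assign each word to the first cluster whose first member is similar enough."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    variants = ["zindagi", "zindagee", "kitab", "kitaab", "zndgi"]
    for group in greedy_cluster(variants):
        print(group)
```

With these assumed settings, the spelling variants of "zindagi" end up in one group and the variants of "kitab" in another, because each pair shares a toy phonetic code and has sufficient character overlap. A real system would replace the toy table with language-specific encoding rules and tune the weight and threshold on annotated data.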
