CL, LG | Mar 31, 2020

A Clustering Framework for Lexical Normalization of Roman Urdu

arXiv:2004.00088v1
AI Analysis

This work addresses a domain-specific challenge in natural language processing for South Asian online content, but appears incremental as it builds on existing normalization techniques.

The authors address lexical normalization for Roman Urdu, an informal language variant with non-standard spelling. Their clustering framework combines a phonetic algorithm (UrduPhone), a string matching component, a feature-based similarity function, and a clustering algorithm (Lex-Var). No quantitative normalization results are stated in the abstract excerpt given here.

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
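
To make the pipeline described in the abstract concrete, here is a minimal Python sketch of the general idea: encode each word phonetically, combine the phonetic match with a string similarity score, and group spelling variants into clusters. Everything in it is an assumption made for illustration. The consonant groups in `_GROUPS` are a Soundex-style toy mapping, not the UrduPhone encoding; `similarity` is a hypothetical two-feature weighting, not the paper's feature-based similarity function; and `greedy_cluster` is a simple threshold-based pass, not Lex-Var. The example words are common Roman Urdu spellings chosen by hand.

```python
from difflib import SequenceMatcher

# Hypothetical Soundex-style consonant groups (NOT the UrduPhone mapping).
# Letters absent from the table (vowels, h, y) carry no code.
_GROUPS = {
    "b": "1", "p": "1", "f": "1", "v": "1", "w": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "j": "2",
    "s": "2", "x": "2", "z": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
}


def toy_phonetic_key(word, length=6):
    """Encode a word as its first letter plus consonant-group digits."""
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    prev = _GROUPS.get(word[0], "")
    for ch in word[1:]:
        code = _GROUPS.get(ch, "")
        # Skip uncoded letters and immediate repeats of the same group.
        if code and code != prev:
            key += code
        prev = code
    # Truncate or zero-pad to a fixed length.
    return key[:length].ljust(length, "0")


def similarity(a, b, phon_weight=0.5):
    """Assumed weighting of a phonetic match and a character-level ratio."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return phon_weight * phonetic + (1.0 - phon_weight) * string


def greedy_cluster(words, threshold=0.75):
    """Put each word in the first cluster whose first member is similar
    enough; otherwise start a new cluster. A stand-in, not Lex-Var."""
    clusters = []
    for w in words:
        for members in clusters:
            if similarity(w, members[0]) >= threshold:
                members.append(w)
                break
        else:
            clusters.append([w])
    return clusters
```

Calling `greedy_cluster(["zindagi", "kitab", "zindagee", "kitaab", "zindgi"])` groups the three spellings of "zindagi" together and the two spellings of "kitab" together. The threshold and weight are arbitrary choices for this sketch and would need tuning on real data.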

Code Implementations: 1 repo