CL Jun 25, 2024

Script-Agnostic Language Identification

arXiv:2406.17901v1
Originality: Incremental advance
AI Analysis

This work addresses a challenge for low-resource and closely related languages, especially those of the Indian Subcontinent, by making language identification more accurate across scripts. It is incremental in that it builds on existing representation learning methods.

The paper tackles language identification for languages written in multiple scripts, such as those of the Indian Subcontinent, by learning script-agnostic representations through strategies such as word-level script randomization. The results show that this approach is valuable for downstream script-agnostic identification while maintaining competitive performance on naturally occurring text.

Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing) focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
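
To make the idea of word-level script randomization concrete, here is a minimal sketch rather than the authors' implementation. It assumes that shifting code points between the parallel Unicode blocks for Tamil, Telugu, Kannada, and Malayalam is an acceptable stand-in for a real transliterator, which is only roughly true (Tamil, for instance, lacks many consonant slots). The names `BLOCK_START`, `shift_word`, and `randomize_scripts` are hypothetical and introduced for illustration only.

```python
import random
import unicodedata

# Start code points of the four Unicode blocks; each block is 0x80 wide.
# ASSUMPTION: a block-offset shift approximates transliteration.
BLOCK_START = {
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def source_script(ch):
    """Return the name of the block containing ch, or None if outside all four."""
    cp = ord(ch)
    for name, start in BLOCK_START.items():
        if start <= cp < start + BLOCK_SIZE:
            return name
    return None


def shift_word(word, target):
    """Crudely map a word into the target script by block offset."""
    out = []
    for ch in word:
        src = source_script(ch)
        if src is None:
            out.append(ch)  # punctuation, digits, Latin text stay unchanged
            continue
        candidate = chr(ord(ch) - BLOCK_START[src] + BLOCK_START[target])
        # Keep the original character when the target slot is unassigned.
        out.append(candidate if unicodedata.name(candidate, "") else ch)
    return "".join(out)


def randomize_scripts(sentence, rng):
    """Give each whitespace-separated word an independently sampled script."""
    scripts = sorted(BLOCK_START)
    return " ".join(shift_word(w, rng.choice(scripts)) for w in sentence.split())
```

Calling `randomize_scripts(sentence, random.Random(0))` on a Telugu sentence yields a sentence whose words are spread across the four scripts. A classifier trained on such text cannot rely on the script alone to predict the language label, which is the intuition behind the script-agnostic representations described above.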

Code Implementations: 1 repo
