CL, Jun 25, 2024

Script-Agnostic Language Identification

Milind Agarwal, Joshua Otten, Antonios Anastasopoulos

arXiv:2406.17901v1, 1.0, Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a challenge for low-resource and closely related languages, especially those of the Indian Subcontinent, by enabling more accurate language identification across different scripts. The advance is incremental, since it builds on existing representation learning methods.

The paper tackles language identification for languages written in multiple scripts, such as those of the Indian Subcontinent, by proposing script-agnostic representations learned with strategies such as word-level script randomization. The results show that this approach is valuable for downstream script-agnostic identification while maintaining competitive performance on naturally occurring text.

Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing), focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
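
To make the idea of word-level script randomization concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a crude transliteration that shifts each character by the offset between Unicode blocks, relying on the fact that the Tamil, Telugu, Kannada, and Malayalam blocks follow a largely parallel layout. The function names (`detect_script`, `to_script`, `randomize_scripts`) are hypothetical, and the offset trick can produce unassigned code points, particularly when the target is Tamil, which has fewer letters than the other three scripts.

```python
import random

# Start of each script's Unicode block; every block spans 0x80 code points.
BLOCK_START = {
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def detect_script(ch):
    """Return the name of the script block containing ch, or None."""
    cp = ord(ch)
    for name, start in BLOCK_START.items():
        if start <= cp < start + BLOCK_SIZE:
            return name
    return None


def to_script(word, target):
    """Crudely transliterate a word by shifting code points between blocks.

    Assumption: parallel block layout. Characters outside the four
    blocks (digits, punctuation, Latin) are left unchanged.
    """
    out = []
    for ch in word:
        src = detect_script(ch)
        if src is None:
            out.append(ch)
        else:
            offset = ord(ch) - BLOCK_START[src]
            out.append(chr(BLOCK_START[target] + offset))
    return "".join(out)


def randomize_scripts(sentence, rng):
    """Rewrite each whitespace-separated word in a randomly chosen script."""
    scripts = sorted(BLOCK_START)
    words = sentence.split()
    return " ".join(to_script(w, rng.choice(scripts)) for w in words)
```

Calling `randomize_scripts(sentence, random.Random(0))` on a training sentence would yield text whose words appear in a mix of the four scripts while the language label stays the same. A real pipeline would likely use a proper transliteration tool rather than the block-offset shortcut assumed here.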
