CL, IR, LG · Dec 20, 2024

A Breadth-First Catalog of Text Processing, Speech Processing and Multimodal Research in South Asian Languages

arXiv:2501.00029v1 · 2 citations
Originality: Synthesis-oriented
AI Analysis

The paper provides a catalog for NLP researchers interested in South Asian languages, but its contribution is incremental: it reviews existing work without introducing new methods or results.

The paper reviews recent literature (2022-2024) on text, speech, and multimodal processing for South Asian languages, identifying trends and challenges, with a focus on 21 low-resource languages.

We review the recent literature (January 2022 to October 2024) in South Asian languages on text-based language processing, multimodal models, and speech processing, and provide a spotlight analysis focused on 21 low-resource South Asian languages, namely Saraiki, Assamese, Balochi, Bhojpuri, Bodo, Burmese, Chhattisgarhi, Dhivehi, Gujarati, Kannada, Kashmiri, Konkani, Khasi, Malayalam, Meitei, Nepali, Odia, Pashto, Rajasthani, Sindhi, and Telugu. We identify trends, challenges, and future research directions, using a step-wise approach that incorporates relevance classification and clustering based on large language models (LLMs). Our goal is to provide a breadth-first overview of the recent developments in South Asian language technologies to NLP researchers interested in working with South Asian languages.

