Mar 10, 2022

IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

Microsoft
arXiv:2203.05437v2
304 citations
h-index: 41
Originality: Incremental advance
AI Analysis

The benchmark provides a foundational resource for NLG research in Indic languages, filling a gap for researchers and practitioners working on multilingual AI.

The paper tackles the scarcity of natural language generation (NLG) datasets for non-English languages by introducing the IndicNLG Benchmark, a collection of datasets covering 11 Indic languages and five tasks, with approximately 8 million examples in total. Baseline experiments show strong performance from multilingual pre-trained models.

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models are publicly available at https://ai4bharat.iitm.ac.in/indicnlg-suite.
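
The "pivoting through machine translation data" step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a parallel corpus of (English, Indic) sentence pairs and treats distinct Indic sentences aligned to the same English sentence as paraphrase candidates. The helper name `pivot_paraphrases`, the whitespace and lowercase normalization, and the toy Hindi data are illustrative assumptions, not the authors' actual pipeline, which may apply further filtering.

```python
from collections import defaultdict
from itertools import combinations


def pivot_paraphrases(parallel_pairs):
    """Return candidate paraphrase pairs from (english, target) sentence pairs.

    Hypothetical helper: target-language sentences that share the same
    English pivot sentence are paired up as paraphrase candidates.
    """
    by_pivot = defaultdict(list)
    for english, target in parallel_pairs:
        # Light normalization of the pivot: lowercase, collapse whitespace.
        key = " ".join(english.lower().split())
        target = " ".join(target.split())
        # Skip empty targets and exact duplicates under the same pivot.
        if target and target not in by_pivot[key]:
            by_pivot[key].append(target)

    pairs = []
    for targets in by_pivot.values():
        # Every unordered pair of distinct targets is one candidate.
        pairs.extend(combinations(targets, 2))
    return pairs


# Toy English-Hindi data, made up for illustration only.
corpus = [
    ("How are you?", "आप कैसे हैं?"),
    ("how are  you?", "तुम कैसे हो?"),
    ("How are you?", "आप कैसे हैं?"),
    ("Thank you.", "धन्यवाद।"),
]
paraphrases = pivot_paraphrases(corpus)
```

Here the two distinct Hindi renderings of "How are you?" form one candidate pair, while "Thank you." contributes none because it has only one translation. A real pipeline would likely add quality filters (for example length ratios or overlap thresholds) before using such pairs for training.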
