CL AI · Sep 7, 2021

IndicBART: A Pre-trained Model for Indic Natural Language Generation

Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar

arXiv:2109.02903v2 · 30.8643 citations · Has Code

Originality: Incremental advance

AI Analysis

IndicBART addresses the challenge of limited resources for Indic language processing by offering a compact model for multilingual generation tasks.

The paper tackles natural language generation for Indic languages by developing IndicBART, a pre-trained sequence-to-sequence model covering 11 Indic languages and English. The model is competitive with larger models such as mBART50 on neural machine translation and extreme summarization, especially in low-resource scenarios.

In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a model specific to related languages like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well on very low-resource translation scenarios where languages are not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model capacity contribute to the good performance of the compact IndicBART model.
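The abstract credits script sharing, which relies on the orthographic similarity between Indic scripts. As a rough illustration only, one way such sharing could be approximated is to map text from several scripts into a single script, using the fact that the Unicode blocks for the major Indic scripts follow a largely parallel layout. The sketch below is an assumption-laden example, not the authors' pipeline: the function name `to_devanagari`, the table `SCRIPT_BLOCK_START`, and the choice of Devanagari as the common script are hypothetical choices made for this example, and a plain offset mapping ignores code points that have no counterpart in the target block.

```python
# Naive script unification by Unicode offset (illustrative sketch only).
# Assumption: each script occupies a 128-code-point block whose layout is
# roughly parallel to Devanagari, so a fixed offset maps similar letters.

# Start of the Unicode block for each script.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each block spans 128 code points


def to_devanagari(text: str, source_script: str) -> str:
    """Shift characters from the source script's block into the Devanagari block.

    Characters outside the source block (spaces, digits, Latin letters,
    punctuation) are passed through unchanged.
    """
    start = SCRIPT_BLOCK_START[source_script]
    target = SCRIPT_BLOCK_START["devanagari"]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE:
            out.append(chr(target + (cp - start)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Bengali letter KA (U+0995) maps to Devanagari letter KA (U+0915).
    print(to_devanagari("\u0995", "bengali"))
```

After a mapping of this kind, words that are cognates across languages can end up sharing subword units in a single vocabulary, which is one plausible reason a compact model can transfer well between related languages.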
