CL · Apr 30, 2020

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

arXiv:2005.00085v15.4109 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides essential resources for NLP research in Indian languages, though its contribution is incremental: it applies existing methods to new data.

The authors tackled the lack of large-scale monolingual corpora for Indian languages by creating the IndicNLP corpus with 2.7 billion words across 10 languages, and showed that their pre-trained word embeddings significantly outperform existing ones on evaluation tasks.

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
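To make the evaluation idea concrete, the following is a minimal sketch of how word embeddings might be scored on a news category classification task. It is not the authors' pipeline: the summary does not say which classifier they used, so a nearest-centroid classifier over averaged word vectors is assumed here purely for illustration. The toy vectors, the English tokens, and all function names (`embed_document`, `train_centroids`, `classify`, `accuracy`) are hypothetical.

```python
import math

# Toy 2-dimensional vectors. These values are hypothetical and are NOT
# taken from the IndicNLP embeddings; real vectors would be loaded from file.
TOY_EMBEDDINGS = {
    "match": [0.9, 0.1],
    "team": [0.8, 0.2],
    "goal": [1.0, 0.0],
    "election": [0.1, 0.9],
    "minister": [0.2, 0.8],
    "vote": [0.0, 1.0],
}


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_document(tokens, embeddings):
    """Average the vectors of in-vocabulary tokens; None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return mean_vector(vectors) if vectors else None


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(labelled_docs, embeddings):
    """Compute one mean document vector per category label."""
    grouped = {}
    for tokens, label in labelled_docs:
        vec = embed_document(tokens, embeddings)
        if vec is not None:
            grouped.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def classify(tokens, centroids, embeddings):
    """Return the label whose centroid is most similar to the document vector."""
    vec = embed_document(tokens, embeddings)
    if vec is None:
        return None
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def accuracy(test_docs, centroids, embeddings):
    """Fraction of test documents assigned their gold label."""
    correct = sum(
        1 for tokens, label in test_docs
        if classify(tokens, centroids, embeddings) == label
    )
    return correct / len(test_docs)


# Tiny hypothetical train/test split of tokenised "articles".
train = [(["match", "team"], "sports"), (["election", "vote"], "politics")]
test = [(["goal", "team"], "sports"), (["minister", "vote"], "politics")]

centroids = train_centroids(train, TOY_EMBEDDINGS)
print(accuracy(test, centroids, TOY_EMBEDDINGS))
```

Under this setup, comparing two sets of embeddings amounts to swapping the `embeddings` argument and comparing the resulting accuracies on the same test split. A stronger classifier could replace the centroid step without changing the rest of the sketch.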
