AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages
This work provides essential resources for NLP research in Indian languages, though its methodological contribution is incremental, since it applies existing methods to new data.
The authors address the lack of large-scale monolingual corpora for Indian languages by building the IndicNLP corpus, which contains 2.7 billion words across 10 languages from two language families, and show that word embeddings pre-trained on it significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks.
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.
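To make the idea of evaluating embeddings through news category classification concrete, the following is a minimal sketch, not the authors' evaluation pipeline. It assumes a document is represented by the average of its word vectors and is assigned the label of the most similar labelled document by cosine similarity. The function names, the toy 2-dimensional vectors, and the category labels are all hypothetical and chosen only for illustration.

```python
import math


def average_vector(tokens, embeddings):
    """Mean of the vectors of in-vocabulary tokens; None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(tokens, embeddings, labelled_docs):
    """Return the label of the labelled document most similar to `tokens`."""
    query = average_vector(tokens, embeddings)
    if query is None:
        return None
    best_label, best_score = None, -2.0  # cosine is always >= -1
    for doc_tokens, label in labelled_docs:
        doc_vec = average_vector(doc_tokens, embeddings)
        if doc_vec is None:
            continue
        score = cosine(query, doc_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy embeddings and labels, made up for illustration (assumption, not real data).
toy_embeddings = {
    "match": [1.0, 0.1],
    "team": [0.9, 0.2],
    "election": [0.1, 1.0],
    "vote": [0.2, 0.9],
}
train = [(["match", "team"], "sports"), (["election", "vote"], "politics")]

# "win" is out of vocabulary and is simply skipped when averaging.
print(classify(["team", "win"], toy_embeddings, train))
```

Under this setup, better embeddings place documents of the same category closer together, so classification accuracy on held-out articles serves as an indirect measure of embedding quality.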