CL | Apr 30, 2020

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

arXiv:2005.00085v1 | 109 citations | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper provides essential resources for NLP research in Indian languages, though its contribution is incremental: it applies existing methods to new data.

The authors tackled the lack of large-scale monolingual corpora for Indian languages by creating the IndicNLP corpus with 2.7 billion words across 10 languages, and showed that their pre-trained word embeddings significantly outperform existing ones on evaluation tasks.

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.

Code Implementations: 2 repos
