CL LG | Nov 26, 2024

Non-Contextual BERT or FastText? A Comparative Analysis

Abhay Shanbhag, Suramya Jadhav, Amogh Thakurdesai, Ridhima Sinare, Raviraj Joshi

arXiv:2411.17661v31.91 citations | h-index: 12 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of limited data for NLP in low-resource languages and offers a practical alternative for embedding selection, though the advance is incremental because it builds on existing BERT and FastText methods.

The study tackled the problem of selecting effective embeddings for low-resource languages like Marathi by comparing non-contextual BERT embeddings (from MuRIL and MahaBERT) with FastText embeddings (IndicFT and MahaFT) on tasks such as news classification, sentiment analysis, and hate speech detection, finding that non-contextual BERT embeddings outperformed FastText embeddings.

Natural Language Processing (NLP) for low-resource languages, which lack large annotated datasets, faces significant challenges due to limited high-quality data and linguistic resources. The selection of embeddings plays a critical role in achieving strong performance in NLP tasks. While contextual BERT embeddings require a full forward pass, non-contextual BERT embeddings rely only on table lookup. Existing research has primarily focused on contextual BERT embeddings, leaving non-contextual embeddings largely unexplored. In this study, we analyze the effectiveness of non-contextual embeddings from BERT models (MuRIL and MahaBERT) and FastText models (IndicFT and MahaFT) for tasks such as news classification, sentiment analysis, and hate speech detection in one such low-resource language, Marathi. We compare these embeddings with their contextual and compressed variants. Our findings indicate that non-contextual BERT embeddings extracted from the model's first embedding layer outperform FastText embeddings, presenting a promising alternative for low-resource NLP.
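
To make the "table lookup" idea concrete, here is a minimal sketch of a non-contextual embedding in plain Python. Everything in it is an assumption for illustration: the toy vocabulary, the random table standing in for a BERT model's first embedding layer, the whitespace tokenizer, and the mean pooling of token vectors (the abstract does not say how token vectors are combined). It is not the authors' code.

```python
import random

# Hypothetical toy vocabulary; a real BERT model ships its own subword vocabulary.
VOCAB = {"[UNK]": 0, "news": 1, "sports": 2, "cricket": 3, "match": 4}
DIM = 8  # toy embedding width, chosen only for readability

# Stand-in for the model's first (input) embedding matrix: one row per token id.
# Random values are used here; a real table would be loaded from a trained model.
rng = random.Random(0)
EMBEDDING_TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in VOCAB]


def non_contextual_embedding(text):
    """Look up each token's row and average them; no attention layers are run."""
    # Whitespace tokenization is a simplification of subword tokenization.
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]
    if not ids:
        return [0.0] * DIM
    rows = [EMBEDDING_TABLE[i] for i in ids]
    # Mean pooling over tokens (an assumption, not taken from the abstract).
    return [sum(col) / len(rows) for col in zip(*rows)]


sentence_vector = non_contextual_embedding("cricket match")
```

Because each token always maps to the same row, the vector for a word does not depend on its neighbours, which is what separates this from a contextual embedding produced by a full forward pass. The resulting fixed-size vector could then be passed to any downstream classifier.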
