CL, LG | Oct 26, 2020

Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model

arXiv:2010.13404v3 | 12 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for reliable word embeddings in Bangla NLP, but it is incremental as it applies an existing method to a new language.

The study tackled the problem of generating effective word embeddings for the Bangla language by fine-tuning the Word2Vec model, finding that 300-dimensional vectors from the skip-gram method with a window size of 4 provided the most robust representations.

A word embedding, or vector representation of a word, captures the syntactic and semantic characteristics of that word and can serve as an informative feature for any machine learning-based model in natural language processing. There are several deep learning-based models and toolkits for the vectorization of words, such as word2vec, fastText, Gensim, and GloVe. In this study, we analyze the word2vec model for learning word vectors by tuning different hyper-parameters, and we present the most effective word embedding for the Bangla language. To test the performance of the different word embeddings generated by fine-tuning the word2vec model, we perform both intrinsic and extrinsic evaluations. For the intrinsic evaluation, we cluster the word vectors to examine the relational similarity of words; for the extrinsic evaluation, we use the different word embeddings as features of a news article classifier. From our experiments, we find that word vectors with 300 dimensions, generated by the "skip-gram" method of the word2vec model with a sliding window size of 4, give the most robust vector representations for the Bangla language.
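
To make the reported setting concrete, the following is a minimal sketch of how a skip-gram model with a sliding window of 4 forms its (center, context) training pairs. It is an illustration and not the authors' code: the function name `skipgram_pairs`, the `BEST_SETTINGS` dictionary keys, and the placeholder tokens are all assumptions, and it uses a fixed window, whereas the reference word2vec implementation samples a smaller effective window at each position.

```python
from typing import Iterator, List, Tuple

# Hyper-parameter values reported as best in the abstract. The dictionary
# and its key names are illustrative, not taken from the paper.
BEST_SETTINGS = {"dimensions": 300, "method": "skip-gram", "window": 4}


def skipgram_pairs(tokens: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for tokens at most `window` positions apart."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)                # clip window at sentence start
        stop = min(len(tokens), i + window + 1)   # clip window at sentence end
        for j in range(start, stop):
            if j != i:                            # a word is not its own context
                yield center, tokens[j]


# Hypothetical tokenized sentence (placeholder tokens, not from the paper's corpus).
sentence = ["w0", "w1", "w2", "w3", "w4", "w5"]
pairs = list(skipgram_pairs(sentence, BEST_SETTINGS["window"]))
```

With a window of 4, `w0` is paired with `w1` through `w4` but not with `w5`, which lies five positions away. A wider window admits more distant, more topical contexts; a narrower one restricts training to closer neighbors.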
