An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
This work addresses the need for better natural language processing tools for the Sindhi language, but it is incremental as it applies existing methods to new data.
The authors tackled the problem of creating and evaluating word embeddings for Sindhi by building a new corpus of over 61 million words and applying standard embedding algorithms. The results showed that continuous-bag-of-words and skip-gram outperformed GloVe and existing fastText embeddings in both intrinsic and extrinsic evaluations.
In this paper, we propose a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. Afterwards, the cleaned vocabulary is fed to the state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. To evaluate the pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
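As an illustration of the intrinsic evaluation named in the title, the sketch below scores embeddings on semantic analogy questions of the form "a is to b as c is to d" using the common vector-offset rule (choose the word closest to b - a + c by cosine similarity). It is a minimal sketch, not the evaluation code used in this work: the function names, the toy three-dimensional vectors, and the English placeholder words are all assumptions made for illustration, and real Sindhi embeddings would be loaded from trained models instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def solve_analogy(emb, a, b, c):
    """Return the word whose vector is closest to b - a + c.

    The three question words are excluded from the candidates,
    as is usual for this kind of test.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_accuracy(emb, questions):
    """Fraction of analogy questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)

# Hypothetical toy embeddings, hand-picked for illustration only.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

# Hypothetical question set: (a, b, c, expected d).
toy_questions = [("man", "king", "woman", "queen")]

if __name__ == "__main__":
    print(solve_analogy(toy_embeddings, "man", "king", "woman"))
    print(analogy_accuracy(toy_embeddings, toy_questions))
```

Running the same accuracy function over each model's vectors with a shared question set gives one comparable number per model, which is the kind of comparison an intrinsic evaluation relies on.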