Tariq Rahim Soomro

1.0 · CL · Aug 28, 2024

An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks

Wazir Ali, Saifullah Tumrani, Jay Kumar et al.

In this paper, we present a new corpus for training Sindhi word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. The cleaned vocabulary is then fed to the state-of-the-art continuous-bag-of-words (CBOW), skip-gram, and GloVe word embedding algorithms. We evaluate the pretrained embeddings with popular intrinsic and extrinsic evaluation approaches. The results show that CBOW and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
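The abstract does not spell out how the semantic analogy evaluation is computed. As an illustration only, the following minimal sketch shows one common way such an intrinsic test is scored: for a question "a is to b as c is to ?", pick the vocabulary word whose vector is closest in cosine similarity to b - a + c. The function names, the toy vectors, and the English placeholder words are all assumptions made for this example; they are not taken from the paper or its Sindhi data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' by vector offset (b - a + c).

    The three question words are excluded from the candidate set,
    which is the usual convention for this kind of test.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(1 for a, b, c, d in questions if solve_analogy(emb, a, b, c) == d)
    return correct / len(questions)

# Hypothetical 3-dimensional toy embeddings (placeholders, not real data).
TOY_EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

# Hypothetical analogy questions in (a, b, c, expected) form.
TOY_QUESTIONS = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_EMBEDDINGS, "man", "woman", "king"))
    print(analogy_accuracy(TOY_EMBEDDINGS, TOY_QUESTIONS))
```

With real embeddings, the dictionary would be replaced by vectors loaded from a trained CBOW, skip-gram, or GloVe model, and the question list by an analogy test set for the target language.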
