Development of Word Embeddings for Uzbek Language
This work provides a resource for NLP researchers and practitioners working with Uzbek, a low-resource language, though the contribution is incremental: it applies existing methods to new data.
The authors address the lack of word embeddings for Uzbek by developing the first publicly available set for the Cyrillic variant of the language, trained with word2vec, GloVe, and fastText on a custom web corpus, for use in downstream NLP tasks.
In this paper, we share the process of developing word embeddings for the Cyrillic variant of the Uzbek language. The result of our work is the first publicly available set of word vectors trained with the word2vec, GloVe, and fastText algorithms on a high-quality web crawl corpus developed in-house. The developed word embeddings can be used in many downstream natural language processing tasks.
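
As a rough illustration of how such vectors might be consumed downstream, the sketch below parses embeddings stored in the plain-text layout commonly used for word2vec and fastText output (an optional "vocabulary size, dimension" header, then one word and its values per line) and looks up the nearest neighbour of a word by cosine similarity. The abstract does not state how the released vectors are packaged, so the file layout here is an assumption. The helper names (`load_text_vectors`, `cosine`, `nearest`) are hypothetical, and the three-dimensional toy vectors are invented for the example, not taken from the released embeddings.

```python
import io
import math


def load_text_vectors(handle):
    """Parse text-format word vectors: one word per line, followed by its floats.

    A leading "<vocab_size> <dim>" header (word2vec/fastText text style) is
    skipped if present; GloVe-style text files have no such header.
    """
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line, not a vector
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, word, k=1):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy stand-in for an embeddings file; the values are made up for illustration.
TOY_FILE = """3 3
китоб 0.9 0.1 0.0
дафтар 0.8 0.2 0.1
сув 0.0 0.1 0.9
"""

vectors = load_text_vectors(io.StringIO(TOY_FILE))
neighbours = nearest(vectors, "китоб", k=1)
print(neighbours)
```

With real embeddings, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"`, which matters for Cyrillic text. A brute-force scan like this is adequate only for small vocabularies; a vectorised or indexed search would be the usual choice at scale.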