Open foundation models for Azerbaijani language
This work addresses the scarcity of open-source AI resources for Azerbaijani speakers. Its contribution is incremental, since it builds on existing multilingual model efforts.
The paper tackles the lack of open foundation models for Azerbaijani by introducing a large text corpus, encoder-only language models, labeled datasets, and extensive benchmarking, which together yield a comprehensive evaluation of the major open-source models with Azerbaijani support.
The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systematic benchmarking. This paper brings together several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) an extensive evaluation that covers all major open-source models with Azerbaijani support.