100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
For researchers working on low-resource and code-switched sentiment analysis, this paper provides a new dataset and baselines. The contribution is incremental, however, because it applies existing methods to a new domain.
The paper introduces a new multilingual movie review dataset from Kazakhstan with 100,502 reviews and benchmarks two sentiment analysis tasks. Transformer models outperform classical baselines on polarity classification, but score classification remains difficult because of severe class imbalance and subtle distinctions between adjacent rating levels.
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
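To illustrate why severe class imbalance makes five-class score classification hard to evaluate, the following minimal sketch compares accuracy with macro-averaged F1 for a trivial majority-class predictor. The label distribution is a hypothetical toy example, not the corpus's actual rating distribution, and the abstract does not state which metric the paper reports, so the choice of macro-F1 here is an assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0.0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy ratings (1-5), heavily skewed toward the top score.
# These counts are made up for illustration only.
y_true = [5] * 70 + [4] * 15 + [3] * 8 + [2] * 4 + [1] * 3

# A degenerate model that always predicts the most frequent rating.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, labels=[1, 2, 3, 4, 5])

print(f"majority class: {majority}")
print(f"accuracy:       {acc:.3f}")
print(f"macro-F1:       {mf1:.3f}")
```

On this toy data the degenerate predictor reaches an accuracy of 0.70 while its macro-F1 stays near 0.16, because four of the five classes receive an F1 of zero. A class-balanced metric therefore exposes failures on rare rating levels that accuracy alone would hide.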