Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains
This work addresses the evaluation of multimodal models for the Ukrainian language. The contribution is incremental: it extends existing benchmarking approaches to a new linguistic context.
The authors address the lack of multimodal benchmarks for low-resource languages by introducing ZNO-Vision, a Ukrainian-centric benchmark with over 4,300 questions across 12 disciplines. They find that only a few models perform above baseline, and they report performance degradation on tasks translated from English.
While the evaluation of English-centric multimodal models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks and evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from the standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and the humanities. We evaluate both open-source models and API providers and find that only a handful of models perform above baseline. Alongside the new benchmark, we perform the first evaluation study of multimodal text generation for the Ukrainian language: we measure caption generation quality on the Multi30K-UK dataset, translate the VQA benchmark into Ukrainian, and measure performance degradation relative to the original English versions. Lastly, we test a few models from a cultural perspective, probing their knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language, and that our approach could be useful for other low-resource languages.
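As an illustration of the two comparisons mentioned above (model accuracy against a baseline, and degradation of a translated benchmark relative to its English original), the following is a minimal sketch and not the authors' evaluation code. It assumes that "baseline" means uniform random guessing over the answer options, which the abstract does not specify. The `Item` record and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical record layout)."""
    n_options: int  # number of answer choices offered
    gold: str       # correct answer label
    pred: str       # label predicted by the model


def accuracy(items: list[Item]) -> float:
    """Fraction of questions where the prediction matches the gold label."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(it.pred == it.gold for it in items) / len(items)


def random_baseline(items: list[Item]) -> float:
    """Expected accuracy of uniform random guessing (an assumed baseline)."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(1.0 / it.n_options for it in items) / len(items)


def relative_degradation(score_en: float, score_uk: float) -> float:
    """Relative drop of the translated score with respect to the English score."""
    if score_en <= 0:
        raise ValueError("score_en must be positive")
    return (score_en - score_uk) / score_en


# Toy data for illustration only; these are not results from the paper.
items = [
    Item(4, "A", "A"),
    Item(4, "B", "C"),
    Item(5, "D", "D"),
    Item(5, "E", "E"),
]
above_baseline = accuracy(items) > random_baseline(items)
```

Under this assumption, a model counts as "above baseline" when its accuracy exceeds the expected accuracy of guessing, and a positive `relative_degradation` value indicates that the translated version scored lower than the English original.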