LLMzSzŁ: a comprehensive LLM benchmark for Polish
LLMzSzŁ addresses the need for robust evaluation of LLMs in Polish, a low-resource language, by giving researchers and practitioners a large-scale benchmark. The contribution is incremental, however, because it applies existing evaluation methods to new data.
The authors introduce LLMzSzŁ, the first comprehensive benchmark for the Polish language based on national exams, comprising nearly 19k questions across 154 domains. They find that multilingual LLMs can outperform monolingual ones, which points to knowledge transfer between languages, though monolingual models may be the better choice when model size is constrained.
This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams from 154 domains and consists of almost 19k closed-ended questions in total. We investigate the performance of open-source multilingual, English, and Polish LLMs to verify their ability to transfer knowledge between languages. We also examine the correlation between LLM and human results in terms of model accuracy and exam pass rate. We show that multilingual LLMs can obtain superior results over monolingual ones; however, monolingual models may be beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting with exam validation, particularly in identifying anomalies or errors in examination tasks.
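The comparison between model accuracy and human exam results mentioned in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `accuracy` and `pearson`, the choice of Pearson's coefficient as the correlation measure, and all numbers in the usage example are hypothetical and do not come from the benchmark.

```python
from math import sqrt


def accuracy(predictions, answers):
    """Share of closed-ended questions where the predicted option matches the key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equally long")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up data for three exams: model answers, answer keys, and
    # human pass rates. None of these values are taken from LLMzSzŁ.
    exams = [
        (["A", "B", "C", "D"], ["A", "B", "C", "A"], 0.80),
        (["A", "A", "C", "D"], ["B", "A", "D", "A"], 0.45),
        (["D", "B", "C", "A"], ["D", "B", "C", "A"], 0.90),
    ]
    model_accuracy = [accuracy(pred, key) for pred, key, _ in exams]
    human_pass_rate = [rate for _, _, rate in exams]
    print("per-exam accuracy:", model_accuracy)
    print("correlation with pass rate:", pearson(model_accuracy, human_pass_rate))
```

In a setup like this, an exam where model accuracy and human pass rate diverge sharply would be a natural candidate for manual review, which matches the exam-validation use the abstract suggests.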