CL · AI · May 30, 2025

Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

arXiv:2505.24616v3 · h-index: 2 · Has Code
Originality: Incremental advance
AI Analysis

POLLUX gives developers of Russian-speaking LLMs a scalable and interpretable evaluation tool, but the advance is incremental: it adapts existing benchmark and LLM-as-a-Judge concepts to a new language.

The authors address the evaluation of Russian-speaking large language models in two ways. They introduce POLLUX, a benchmark of 2,100 prompts across 35 task types, and they develop an evaluation methodology built on LLM-as-a-Judge evaluators that improves interpretability and scalability, effectively replacing costly human judgments.

We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
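The criteria-driven protocol described in the abstract, where each response receives a score and a justification per criterion, could be represented roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `CriterionJudgment` record, the criterion names, the integer score scale, and the `mean_score_by_criterion` helper are all hypothetical and are not taken from the POLLUX codebase.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class CriterionJudgment:
    """One judge verdict on one response for one criterion (hypothetical schema)."""
    task_type: str      # one of the benchmark's task types, e.g. "code generation"
    difficulty: str     # "easy", "medium", or "hard"
    criterion: str      # hypothetical criterion name
    score: int          # hypothetical integer scale; the real scale may differ
    justification: str  # the judge's written rationale for the score


def mean_score_by_criterion(judgments):
    """Average the scores for each (task_type, criterion) pair."""
    buckets = defaultdict(list)
    for j in judgments:
        buckets[(j.task_type, j.criterion)].append(j.score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Illustrative judgments only; these are not results from the paper.
judgments = [
    CriterionJudgment("code generation", "easy", "correctness", 2,
                      "Runs and returns the expected value."),
    CriterionJudgment("code generation", "hard", "correctness", 0,
                      "Fails on empty input."),
    CriterionJudgment("code generation", "easy", "readability", 1,
                      "Works, but variable names are unclear."),
]

summary = mean_score_by_criterion(judgments)
for (task_type, criterion), avg in sorted(summary.items()):
    print(f"{task_type} / {criterion}: {avg}")
```

Keeping the justification next to each score is what makes such a report inspectable: a low average on a criterion can be traced back to the individual rationales that produced it.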
