SE | AI | Aug 19, 2025

COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models

arXiv:2508.13757v1 | 1 citation | h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the need of developers and researchers for a more comprehensive evaluation of code generation models. The advance is incremental: it builds on existing benchmarks by adding efficiency and quality dimensions.

The authors evaluate code generation in large language models beyond functional correctness. They introduce COMPASS, a multi-dimensional benchmark that assesses correctness, efficiency, and quality using 50 competitive programming problems and real human baselines. Their results show that models such as Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI O4-Mini-High often produce inefficient or low-quality code despite high correctness scores.

Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility's Multi-dimensional Programming ASSessment), a comprehensive evaluation framework that assesses code generation across three dimensions: correctness, efficiency, and quality. COMPASS consists of 50 competitive programming problems from real Codility competitions, providing authentic human baselines from 393,150 submissions. Unlike existing benchmarks that treat algorithmically inefficient solutions identically to optimal ones provided they pass test cases, COMPASS systematically evaluates runtime efficiency and code quality using industry-standard analysis tools. Our evaluation of three leading reasoning-enhanced models, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI O4-Mini-High, reveals that models achieving high correctness scores do not necessarily produce efficient algorithms or maintainable code. These findings highlight the importance of evaluating more than just correctness to truly understand the real-world capabilities of code generation models. COMPASS serves as a guiding framework, charting a path for future research toward AI systems that are robust, reliable, and ready for production use.
