CL | SE | Oct 27, 2025

MATCH: Task-Driven Code Evaluation through Contrastive Learning

arXiv:2510.23169v2 | 2 citations | h-index: 3 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses a critical challenge in AI-based code generation by giving developers a scalable, accurate evaluation method, though it builds incrementally on prior reference-free approaches such as ICE-Score.

The paper tackles the problem of evaluating AI-generated code's alignment with developer intent without requiring reference code, introducing MATCH, a reference-free metric using contrastive learning that achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.

AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. Reference-free evaluation has few alternatives, ICE-Score among them; to address this gap, this paper introduces MATCH, a novel reference-free metric. MATCH uses contrastive learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
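The idea of scoring generated code by embedding similarity can be illustrated with a minimal sketch. It is not the paper's implementation: the toy vectors stand in for the outputs of learned encoders, and `contrastive_loss` is a generic cosine-based contrastive objective chosen here as an assumption, since the abstract does not specify the exact loss or architecture. All names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(task_emb, code_emb, is_match, margin=0.0):
    """Generic cosine contrastive loss (assumed, not the paper's objective).

    Matching (task, code) pairs are pulled together: loss = 1 - similarity.
    Non-matching pairs are pushed apart until similarity drops to the margin.
    """
    sim = cosine_similarity(task_emb, code_emb)
    if is_match:
        return 1.0 - sim
    return max(0.0, sim - margin)


# Hypothetical embeddings; a real system would obtain these from trained
# encoders for the task description and for the candidate code.
task = [1.0, 0.0, 1.0]
good_code = [0.9, 0.1, 1.1]   # code that implements the task
bad_code = [-1.0, 1.0, 0.0]   # code that does not

# Reference-free score: similarity between task and code embeddings.
score_good = cosine_similarity(task, good_code)
score_bad = cosine_similarity(task, bad_code)
print(score_good > score_bad)
```

Under these assumptions, training minimizes the loss over labeled pairs, and at evaluation time only the similarity score is needed, so no reference solution is required.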

