SE, AI · Dec 1, 2025

LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost

arXiv:2512.01232v1 · 1 citation · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses a bottleneck in QA pipelines for software development and offers a scalable, cost-effective solution, though it appears incremental: it applies existing LLM methods to a specific domain problem.

The paper tackles the problem of evaluating software test coverage at scale by introducing LLM-as-a-Judge (LAJ), a framework that uses large language models to assess Gherkin acceptance tests. It shows that a smaller model, GPT-4o Mini, achieves the best accuracy (6.07 MAAE) with high reliability (96.6% ECR@1) and low cost ($1.01 per 1K evaluations), a 78x cost reduction compared to GPT-5 (high reasoning).

Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while improving accuracy. Reasoning effort is model-family dependent: GPT-5 benefits from increased reasoning (with predictable accuracy-cost tradeoffs), whereas open-weight models degrade across all dimensions as reasoning increases. Overall, cost spans 175x ($0.45-$78.96 per 1K). We release the dataset, framework, and code to support reproducibility and deployment.
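
The reliability and cost figures in the abstract can be checked with a few lines of arithmetic. The sketch below is an illustration, not the authors' released code: the function names are hypothetical, the outcome list is made up to match the reported 96.6% over 500 evaluations, and the retry model (independent attempts, each costing the same, retried until success) is an assumption. The abstract does not say whether the reported per-1K costs already include retries.

```python
from typing import Sequence


def ecr_at_1(first_attempt_ok: Sequence[bool]) -> float:
    """Share of evaluations that completed successfully on the first attempt."""
    if not first_attempt_ok:
        raise ValueError("need at least one evaluation outcome")
    return sum(first_attempt_ok) / len(first_attempt_ok)


def expected_cost_with_retries(cost_per_1k_attempts: float, ecr1: float) -> float:
    """Expected cost per 1K *completed* evaluations.

    Assumption (not from the paper): attempts are independent with success
    probability ecr1, so the expected number of attempts is 1 / ecr1.
    """
    if not 0.0 < ecr1 <= 1.0:
        raise ValueError("ecr1 must be in (0, 1]")
    return cost_per_1k_attempts / ecr1


def cost_ratio(expensive_per_1k: float, cheap_per_1k: float) -> float:
    """How many times more expensive one configuration is than another."""
    return expensive_per_1k / cheap_per_1k


# Hypothetical outcomes: 483 of 500 first attempts succeed.
outcomes = [True] * 483 + [False] * 17
print(f"ECR@1: {ecr_at_1(outcomes):.1%}")

# Ratios built from the per-1K costs quoted in the abstract.
print(f"GPT-5 (high) vs GPT-4o Mini: {cost_ratio(78.96, 1.01):.1f}x")
print(f"Most vs least expensive:     {cost_ratio(78.96, 0.45):.1f}x")
```

Under the retry assumption above, a configuration at the low end of the reported reliability range (85.4% ECR@1) would need about 1 / 0.854 ≈ 1.17 attempts per completed evaluation, which is one way reliability turns into cost.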

