ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
ACE-Bench gives developers and CI systems a practical way to assess SDK usage correctness without cloud resources. The contribution is incremental, since it builds on existing benchmarks and evaluation methods.
The paper addresses how to evaluate whether LLM-based coding agents use Azure SDKs correctly. It introduces ACE-Bench, a lightweight, execution-free benchmark that applies deterministic regex checks and LLM-judge checks to tasks derived from documentation examples, and it reports consistent gains from retrieval-augmented settings across models.
We present ACE-Bench (Azure SDK Coding Evaluation Benchmark), an execution-free benchmark that provides fast, reproducible pass/fail signals for whether large language model (LLM)-based coding agents use Azure SDKs correctly, without provisioning cloud resources or maintaining fragile end-to-end test environments. ACE-Bench turns official Azure SDK documentation examples into self-contained coding tasks and validates solutions with task-specific atomic criteria: deterministic regex checks that enforce required API usage patterns, and reference-based LLM-judge checks that capture semantic workflow constraints. This design makes SDK-centric evaluation practical in day-to-day development and CI: it reduces evaluation cost, improves repeatability, and scales to new SDKs and languages as documentation evolves. Using a lightweight coding agent, we benchmark multiple state-of-the-art LLMs and quantify the benefit of retrieval in an MCP-enabled augmented setting, showing consistent gains from documentation access while highlighting substantial cross-model differences.
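To make the idea of task-specific atomic criteria concrete, the following is a minimal sketch of how a regex check and a reference-based judge check might be combined into one pass/fail signal without executing the solution. It is not ACE-Bench's actual implementation: the class names, the `evaluate` helper, the rule that a task passes only when every criterion passes, and the `toy_judge` stand-in (a string heuristic used here in place of a real LLM call) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RegexCriterion:
    """Deterministic check: the solution text must match a pattern."""
    name: str
    pattern: str

    def check(self, solution: str) -> bool:
        return re.search(self.pattern, solution, flags=re.MULTILINE) is not None


@dataclass
class JudgeCriterion:
    """Reference-based check delegated to a judge callable.

    `judge` is hypothetical: (question, reference, solution) -> bool.
    In practice it would wrap an LLM call; here it is injected so the
    sketch stays self-contained.
    """
    name: str
    question: str
    reference: str
    judge: Callable[[str, str, str], bool]

    def check(self, solution: str) -> bool:
        return self.judge(self.question, self.reference, solution)


def evaluate(solution: str, criteria: List) -> Tuple[bool, Dict[str, bool]]:
    """Assumed aggregation: a task passes only if every criterion passes."""
    results = {c.name: c.check(solution) for c in criteria}
    return all(results.values()), results


def toy_judge(question: str, reference: str, solution: str) -> bool:
    # Stand-in for an LLM judge: checks that the container client is
    # obtained before the upload call appears in the solution text.
    get_idx = solution.find("get_container_client")
    upload_idx = solution.find("upload_blob")
    return get_idx != -1 and upload_idx != -1 and get_idx < upload_idx


REFERENCE = '''
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("demo")
container.upload_blob(name="hello.txt", data=b"hi")
'''

CRITERIA = [
    RegexCriterion("creates_service_client",
                   r"BlobServiceClient\.from_connection_string\("),
    RegexCriterion("calls_upload_blob", r"\.upload_blob\("),
    JudgeCriterion("client_before_upload",
                   "Is the container client obtained before uploading?",
                   REFERENCE, toy_judge),
]

# The candidate solution is only inspected as text; it is never executed,
# so no Azure resources or credentials are needed.
SOLUTION = '''
from azure.storage.blob import BlobServiceClient
service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("demo")
container.upload_blob(name="hello.txt", data=b"hi")
'''

passed, results = evaluate(SOLUTION, CRITERIA)
print(passed, results)
```

Keeping each criterion atomic means a failed task reports exactly which required API pattern or workflow constraint was missed, which is what makes the signal usable in CI.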