SE AI Apr 7

CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

arXiv:2604.05755 29.7
AI Analysis

This paper addresses the problem of assessing LLMs as software architecture co-pilots for developers and researchers. Its contribution is incremental: it introduces a new benchmark rather than a novel method.

The authors tackle the lack of benchmarks for evaluating large language models' understanding of cloud-native software architecture by creating CAKE, a benchmark of 188 expert-validated questions spanning several cognitive levels and topics. They find that multiple-choice accuracy plateaus above 3B parameters, with the best model reaching 99.2%, while free-response scores scale steadily with model size and differentiate models better.

Large language models (LLMs) increasingly serve as software architecture co-pilots. However, no benchmark currently exists to evaluate their actual understanding of cloud-native software architecture. For this reason we present CAKE, a benchmark of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. We evaluate 22 model configurations (0.5B to 70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-response (FR) questions. The evaluation yields four notable findings. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge: MCQ accuracy approaches a ceiling while free-response scores continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
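
The three-run majority voting used for MCQs can be illustrated with a minimal Python sketch. The abstract does not describe the authors' implementation, so everything here is an assumption made for illustration: the function names `majority_vote` and `mcq_accuracy` are hypothetical, answers are assumed to be single option letters, and a question with no strict majority (for example, three different answers) is assumed to count as incorrect.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option picked by a strict majority of runs, else None.

    Assumption: a three-way split has no winner; the abstract does not
    say how the benchmark resolves such ties.
    """
    if not answers:
        return None
    choice, count = Counter(answers).most_common(1)[0]
    return choice if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question: Sequence[Sequence[str]],
                 gold: Sequence[str]) -> float:
    """Fraction of questions whose majority-voted answer matches the key."""
    if len(runs_per_question) != len(gold):
        raise ValueError("one list of runs is required per gold answer")
    correct = sum(
        majority_vote(runs) == key
        for runs, key in zip(runs_per_question, gold)
    )
    return correct / len(gold)


# Hypothetical data: three questions, three runs each.
runs = [["A", "A", "B"], ["C", "D", "B"], ["B", "B", "B"]]
gold = ["A", "C", "B"]
accuracy = mcq_accuracy(runs, gold)  # questions 1 and 3 are counted correct
```

Voting over repeated runs reduces the effect of sampling noise on a single answer, which matters most for smaller models whose outputs vary more from run to run.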
