Dayuan Jiang

CL
h-index: 1
4 papers
2 citations
Novelty: 40%
AI Score: 44

4 Papers

SE Jun 3
SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

Natalia Tarasova, Enrique Balp-Straffon, Aleksei Iancheruk et al.

Building infrastructure-as-code (IaC) in cloud computing is a critical task, underpinning the reliability, scalability, and security of modern software systems. Despite the remarkable progress of large language models (LLMs) in software engineering -- demonstrated across many dedicated benchmarks -- their capabilities in developing IaC remain underexplored. Unlike existing IaC benchmarks that predominantly center on declarative paradigms such as Terraform and involve generating entire codebases from scratch, our benchmark reflects the incremental code edits common in enterprise development with imperative tools like the AWS CDK. We present SWE-InfraBench, a diverse evaluation dataset sourced from dozens of real-world IaC codebases that challenges LLMs to perform realistic code modifications in AWS CDK repositories. Each example requires models to implement changes to existing codebases based on natural language instructions, with success determined by passing provided test cases. These tasks demand sophisticated reasoning about cloud resource dependencies and implementation patterns beyond conventional code generation challenges. Our evaluation results reveal significant limitations in current LLMs, showing that even state-of-the-art systems struggle with many tasks -- the best model, Sonnet 3.7, succeeds in only 34% of cases, while specialized reasoning models like DeepSeek R1 achieve just 24% success. The SWE-InfraBench dataset is available at: https://www.kaggle.com/datasets/64e59070fd51c0278560b01eb5dc4f3c447d5268cdabe5a350d2969e4413fea5

CL May 21
Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto et al.

Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, yet public resources remain limited for evaluating generation that must simultaneously satisfy industry-standard XML and domain vocabulary constraints. This paper presents Ishigaki-IDS-Bench, a benchmark for evaluating the ability to generate Information Delivery Specification (IDS) XML from Building Information Modeling (BIM) information requirements. The benchmark contains 166 BIM/IDS expert-authored and verified examples created by expanding 83 practical scenarios into Japanese and English, corresponding gold IDS files, and metadata for input format, language, turn setting, IFC version, and construction domain. Its evaluation combines IDSAuditTool-based Processability, Structure, and Content audits with content-agreement evaluation against gold IDS files. In zero-shot evaluation over 10 LLMs, the best model reaches 65.6% macro F1 for content agreement, while only 27.7% of outputs pass the Content audit. These results show that current LLMs can express part of the information requirements as IDS, but still struggle to reliably generate XML that satisfies the IDS standard and IFC vocabulary constraints. Ishigaki-IDS-Bench supports comparative evaluation, failure analysis, and the development of constrained structured generation methods that conform to domain standards. We release the evaluation scripts and benchmark data under the CC BY 4.0 license on GitHub and Hugging Face.

GR Jan 8
GenAI-DrawIO-Creator: A Framework for Automated Diagram Generation

Jinze Yu, Dayuan Jiang

Diagrams are crucial for communicating complex information, yet creating and modifying them remains a labor-intensive task. We present GenAI-DrawIO-Creator, a novel framework that leverages Large Language Models (LLMs) to automate diagram generation and manipulation in the structured XML format used by draw.io. Our system integrates Claude 3.7 to reason about structured visual data and produce valid diagram representations. Key contributions include a high-level system design enabling real-time diagram updates, as well as specialized prompt engineering and error checking to ensure well-formed XML outputs. We demonstrate a working prototype capable of generating accurate diagrams (such as network architectures and flowcharts) from natural language or code, and even replicating diagrams from images. Simulated evaluations show that our approach significantly reduces diagram creation time and produces outputs with high structural fidelity. Our results highlight the promise of Claude 3.7 in handling structured visual reasoning tasks and lay the groundwork for future research in AI-assisted diagramming applications.
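
The abstract mentions error checking to ensure well-formed XML outputs but gives no implementation details. As a rough illustration only, a well-formedness gate on model output might look like the sketch below. The function name `check_well_formed` and the root-element test are assumptions made for this example, not the paper's method.

```python
import xml.etree.ElementTree as ET

# Assumption: draw.io documents use <mxfile> or <mxGraphModel> as the root element.
EXPECTED_ROOTS = {"mxfile", "mxGraphModel"}


def check_well_formed(xml_text: str) -> tuple[bool, str]:
    """Return (ok, message) for a candidate diagram XML string (hypothetical helper)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Unbalanced tags, bad attributes, and similar errors land here.
        return False, f"parse error: {exc}"
    if root.tag not in EXPECTED_ROOTS:
        return False, f"unexpected root element: {root.tag}"
    return True, "ok"
```

In a generation loop, a failing message could be fed back to the model as a repair hint; whether the described system does this is not stated in the abstract.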

CL Nov 27, 2025
STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanghui Wang, Jinze Yu, Xing Zhang et al.

Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance ($0.86$ to $0.90$ similarity for semantic equivalents, $0.0$ for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures ($T=0.9$), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.
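
To make the idea of aggregating similarity measurements across repeated generations concrete, here is a minimal sketch. It does not implement STED: `toy_similarity` is a deliberately crude stand-in that scores exact leaf matches and returns 0.0 on any structural mismatch, and averaging over all pairs is an assumed aggregation rule, since the abstract does not specify one.

```python
import json
from itertools import combinations


def toy_similarity(a, b) -> float:
    """Stand-in similarity in [0, 1]; NOT the STED metric from the paper."""
    if isinstance(a, dict) and isinstance(b, dict):
        if set(a) != set(b):
            return 0.0  # differing key sets are treated as a structural break
        if not a:
            return 1.0
        return sum(toy_similarity(a[k], b[k]) for k in a) / len(a)
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return 0.0  # differing lengths are treated as a structural break
        if not a:
            return 1.0
        return sum(toy_similarity(x, y) for x, y in zip(a, b)) / len(a)
    # Leaves (and mismatched container types) compare by plain equality.
    return 1.0 if a == b else 0.0


def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity over repeated JSON generations (assumed aggregation)."""
    parsed = [json.loads(o) for o in outputs]
    pairs = list(combinations(parsed, 2))
    if not pairs:
        return 1.0  # a single output is trivially consistent with itself
    return sum(toy_similarity(a, b) for a, b in pairs) / len(pairs)
```

For three runs where two agree exactly and the third differs in one of two fields, the pairwise scores are 1.0, 0.5, and 0.5, giving a consistency score of about 0.67. A semantic metric such as STED would presumably score near-equivalent values above 0 where this stand-in does not.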