SE · AI · Apr 27

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

Sivajeet Chand, Kevin Nguyen, Peter Kuntz, Alexander Pretschner

arXiv:2604.24678 · 15.9

Predicted impact: top 85% in SE · last 90 days
Originality: Synthesis-oriented

AI Analysis

For enterprise DSL development, this work demonstrates that fine-tuned LLMs can reliably generate repository-scale multi-file changes, addressing a practical bottleneck in industrial code generation.

This paper adapts LLMs for multi-file DSL code generation in an industrial setting at BMW, achieving high exact-match accuracy and perfect structural fidelity (1.00) on held-out multi-file outputs through fine-tuning.

Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
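The abstract does not spell out the JSON schema or the definition of structural fidelity, so the following is only a minimal sketch of what a path-preserving encoding and a path-level fidelity score could look like. The nesting scheme (folders as objects, files as strings), the Jaccard-style score, the function names, and the example file paths are all assumptions made for illustration, not the authors' method.

```python
import json
from pathlib import PurePosixPath


def encode_tree(files: dict[str, str]) -> dict:
    """Nest {"a/b.dsl": "text"} into dicts: folders are dicts, files are strings (assumed schema)."""
    root: dict = {}
    for path, content in files.items():
        parts = PurePosixPath(path).parts
        node = root
        for folder in parts[:-1]:
            node = node.setdefault(folder, {})
        node[parts[-1]] = content
    return root


def flatten_tree(tree: dict, prefix: str = "") -> dict[str, str]:
    """Inverse of encode_tree: recover the flat path -> content mapping."""
    flat: dict[str, str] = {}
    for name, value in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_tree(value, path))
        else:
            flat[path] = value
    return flat


def structural_fidelity(expected: dict, generated: dict) -> float:
    """Hypothetical metric: Jaccard overlap of file paths, ignoring file contents."""
    exp_paths = set(flatten_tree(expected))
    gen_paths = set(flatten_tree(generated))
    union = exp_paths | gen_paths
    if not union:
        return 1.0
    return len(exp_paths & gen_paths) / len(union)


# Hypothetical DSL project; file names and contents are invented for the example.
reference = encode_tree({
    "model/entities/Vehicle.dsl": "entity Vehicle {}",
    "model/services/Fleet.dsl": "service Fleet {}",
})
# A single JSON string like this could serve as the model's one-shot response.
response_text = json.dumps(reference, indent=2)
partial = encode_tree({"model/entities/Vehicle.dsl": "entity Vehicle {}"})
print(structural_fidelity(reference, json.loads(response_text)))  # 1.0
print(structural_fidelity(reference, partial))  # 0.5
```

Under this reading, a score of 1.00 would mean the generated output contains exactly the expected set of file paths; whether the file contents are correct would be left to separate measures such as exact match or edit similarity.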
