SE | AI | Sep 15, 2025

Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML

arXiv:2509.12395v13.4 | h-index: 15 | Has Code | ASE

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of applying LLMs to proprietary industrial codebases, a concern for software developers in specialized settings. Its contribution is incremental, since it extends existing evaluation methods to a new domain.

The study investigated the performance of large language models in generating functional and maintainable code for ASML's proprietary industrial software, finding that prompting techniques and model size significantly impact output quality, with few-shot and chain-of-thought prompting achieving the highest build success rates.

Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored. We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. We developed an evaluation framework tailored to ASML's proprietary codebase and introduced a new benchmark. Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates. The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.
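The abstract names build@k but does not define it here. As a rough illustration only, the sketch below assumes build@k follows the same unbiased estimator commonly used for pass@k, with "the sample builds and integrates" replacing "the sample passes its tests". The function names `build_at_k` and `mean_build_at_k`, and the (n, c) input format, are hypothetical and are not taken from the paper.

```python
from math import comb


def build_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples builds.

    ASSUMPTION: build@k mirrors the pass@k estimator, 1 - C(n-c, k) / C(n, k).
    n: samples generated for one task
    c: how many of those samples compiled and integrated
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a building one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_build_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average build@k over tasks, given one (n, c) pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(build_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: three tasks, ten samples each.
    per_task = [(10, 3), (10, 0), (10, 10)]
    print(mean_build_at_k(per_task, k=1))
    print(mean_build_at_k(per_task, k=5))
```

Under this assumed definition, a task where no sample builds contributes 0 and a task where every sample builds contributes 1, so the benchmark score is the per-task average of the estimated build probability.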
