SEAI · Aug 15, 2025

Hallucination in LLM-Based Code Generation: An Automotive Case Study

arXiv:2508.11257v1 · 2 citations · h-index: 8 · 2025 3rd International Conference on Foundation and Large Language Models (FLLM)
Originality: Synthesis-oriented
AI Analysis

This paper addresses unreliable code generation for safety-critical automotive software. It is an incremental study confined to a single domain.

The paper investigates hallucination in LLM-based code generation for automotive software. It finds that the state-of-the-art models GPT-4.1, Codex, and GPT-4o produce syntax violations, invalid reference errors, and API knowledge conflicts at high frequency. Only GPT-4.1 and GPT-4o reached a correct solution, and only when given the most context-rich prompt.

Large Language Models (LLMs) have shown significant potential in automating code generation tasks, offering new opportunities across software engineering domains. However, their practical application remains limited due to hallucinations: outputs that appear plausible but are factually incorrect, unverifiable, or nonsensical. This paper investigates hallucination phenomena in the context of code generation, with a specific focus on the automotive domain. A case study is presented that evaluates multiple code LLMs at three levels of prompt complexity: a minimal one-line prompt, a prompt with the COVESA Vehicle Signal Specification (VSS) as additional context, and a prompt with an additional code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors, and API knowledge conflicts in the state-of-the-art models GPT-4.1, Codex, and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o were able to produce a correct solution when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM-generated code, especially in safety-critical domains such as automotive software systems.
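The three prompt levels described in the abstract could be assembled roughly as follows. This is a minimal sketch, not the authors' setup: the task text, the VSS excerpt, the code skeleton, and the `build_prompt` helper are all hypothetical and chosen only to show how each level adds context to the previous one.

```python
# Hypothetical task, context, and skeleton; none of these come from the paper.
TASK = "Write a function that raises a warning when the vehicle speed exceeds 130 km/h."

# Illustrative excerpt in the spirit of a VSS signal listing (format is an assumption).
VSS_CONTEXT = "Vehicle.Speed: sensor, float, unit km/h"

# Illustrative skeleton the model would be asked to complete.
CODE_SKELETON = '''def should_warn(speed_kmh: float) -> bool:
    """Return True if a speed warning should be raised."""
    # TODO: implement
'''


def build_prompt(level: int) -> str:
    """Build a prompt at one of three complexity levels.

    1: minimal one-line task description
    2: task plus VSS signal context
    3: task plus VSS signal context plus a code skeleton
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    parts = [TASK]
    if level >= 2:
        parts.append("Relevant VSS signals:\n" + VSS_CONTEXT)
    if level >= 3:
        parts.append("Complete this code skeleton:\n" + CODE_SKELETON)
    return "\n\n".join(parts)


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(f"--- level {level} ---")
        print(build_prompt(level))
```

Each level is a strict superset of the one before it, so any difference in model output between levels can be attributed to the added context rather than to a reworded task.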

