SE, AI | Sep 13, 2025

When the Code Autopilot Breaks: Why LLMs Falter in Embedded Machine Learning

arXiv:2509.10946v1 | 3.4 | h-index: 4

Originality: Synthesis-oriented

AI Analysis

The paper addresses reliability issues in LLM-based code generation for embedded ML. The contribution is incremental: it builds on existing work by categorizing failures rather than proposing new methods.

The paper investigates failure modes in LLM-powered embedded machine learning pipelines. It shows how prompt format and model behavior lead to silent or unpredictable errors that standard validation often misses, and it derives a taxonomy of error-prone behaviors from empirical analysis.

Large Language Models (LLMs) are increasingly used to automate software generation in embedded machine learning workflows, yet their outputs often fail silently or behave unpredictably. This article presents an empirical investigation of failure modes in LLM-powered ML pipelines, based on an autopilot framework that orchestrates data preprocessing, model conversion, and on-device inference code generation. We show how prompt format, model behavior, and structural assumptions influence both success rates and failure characteristics, often in ways that standard validation pipelines fail to detect. Our analysis reveals a diverse set of error-prone behaviors, including format-induced misinterpretations and runtime-disruptive code that compiles but breaks downstream. We derive a taxonomy of failure categories and analyze errors across multiple LLMs, highlighting common root causes and systemic fragilities. Though grounded in specific devices, our study reveals broader challenges in LLM-based code generation. We conclude by discussing directions for improving reliability and traceability in LLM-powered embedded ML systems.
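To make "code that compiles but breaks downstream" concrete, the following is a minimal sketch of the kind of gap the abstract describes. It is not the paper's autopilot framework: the generated snippet, the `preprocess` name, the int8 output contract, and both check functions are hypothetical and chosen only for illustration.

```python
import ast

# Hypothetical LLM output (an assumption, not taken from the paper):
# a preprocessing function meant to feed an int8-quantized model.
# It is syntactically valid but returns floats in [0, 1] instead of int8 values.
GENERATED_SOURCE = '''
def preprocess(pixels):
    return [p / 255.0 for p in pixels]
'''


def syntax_check(source: str) -> bool:
    """Shallow validation: does the generated code parse at all?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def behavioral_check(source: str, sample: list, expected_len: int) -> list:
    """Run the generated function on a sample and check its output contract."""
    namespace = {}
    # Executing generated code is only acceptable inside a sandbox.
    exec(source, namespace)
    try:
        out = namespace["preprocess"](sample)
    except Exception as exc:
        return [f"runtime error: {exc!r}"]

    problems = []
    if len(out) != expected_len:
        problems.append("unexpected output length")
    if not all(isinstance(v, int) and -128 <= v <= 127 for v in out):
        problems.append("output is not int8-range integers")
    return problems


syntax_ok = syntax_check(GENERATED_SOURCE)
problems = behavioral_check(GENERATED_SOURCE, [0, 128, 255], expected_len=3)
print(syntax_ok, problems)
```

In this sketch the syntax check passes while the behavioral check reports a contract violation, which is the sort of silent failure a parse-only or compile-only validation step would not surface.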

