AI · Jul 11, 2025

Agentic Large Language Models for Conceptual Systems Engineering and Design

arXiv:2507.08619v2 · 11 citations · h-index: 3 · DAC
Originality: Incremental advance
AI Analysis

This work addresses the challenge of maintaining task continuity in LLM-driven engineering design workflows, a concern for both researchers and practitioners. It appears incremental, since it compares variations of existing multi-agent approaches.

The researchers tackled early-stage engineering design by comparing a structured multi-agent system (MAS) with a simpler two-agent system (2AS) on requirements extraction, functional decomposition, and simulator code generation. The MAS generated more granular design graphs (5-6 nodes on average), while the 2AS reached up to 100% code compatibility under specific settings. Both systems had low requirement coverage (less than 20%).

Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can manage requirements extraction, functional decomposition, and simulator code generation more effectively than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges (requirements specification). We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process to a Generator-Reflector loop. Together, the two systems are run in 60 experiments (2 LLMs, Llama 3.3 70B and reasoning-distilled DeepSeek R1 70B, × 2 agent configurations × 3 temperatures × 5 seeds). We report JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20%). Code compatibility peaked at 100% under specific 2AS settings but averaged below 50% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes), whereas the 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail, and the reasoning-distilled LLM improved completion rates, yet low requirement coverage and fidelity gaps in the generated code persisted.
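
To make the Design-State Graph idea concrete, here is a minimal Python sketch of a JSON-serializable graph whose nodes bundle requirements, a physical embodiment, and physics-model source code. The abstract does not give the DSG schema, so every name here is an assumption made for illustration: the classes `DSGNode` and `DesignStateGraph`, their fields, the requirement IDs, and the node contents. The `requirement_coverage` method is one plausible reading of the metric (the share of listed requirements referenced by at least one node), not necessarily the paper's definition.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class DSGNode:
    """One graph node. Field names are illustrative, not the paper's schema."""
    node_id: str
    requirements: list[str]   # IDs of the requirements this node addresses
    embodiment: str           # physical embodiment, e.g. a component type
    physics_model: str        # Python source of the physics model, kept as text


@dataclass
class DesignStateGraph:
    """Hypothetical container for nodes and directed edges."""
    nodes: dict[str, DSGNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def add_node(self, node: DSGNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str) -> None:
        # Reject edges that point at nodes the graph does not contain.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in edge {src!r} -> {dst!r}")
        self.edges.append((src, dst))

    def to_json(self) -> str:
        payload = {
            "nodes": [asdict(n) for n in self.nodes.values()],
            "edges": [list(e) for e in self.edges],
        }
        return json.dumps(payload, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "DesignStateGraph":
        data = json.loads(text)
        graph = cls()
        for raw in data["nodes"]:
            graph.add_node(DSGNode(**raw))
        for src, dst in data["edges"]:
            graph.add_edge(src, dst)
        return graph

    def requirement_coverage(self, all_requirements: list[str]) -> float:
        # Assumed metric: fraction of listed requirements that some node references.
        if not all_requirements:
            return 0.0
        referenced = {r for n in self.nodes.values() for r in n.requirements}
        return len(referenced & set(all_requirements)) / len(all_requirements)


# Hypothetical two-node graph for a solar-powered water filtration system.
dsg = DesignStateGraph()
dsg.add_node(DSGNode(
    node_id="pv_panel",
    requirements=["REQ-01"],
    embodiment="photovoltaic panel",
    physics_model=(
        "def power_w(irradiance_w_m2, area_m2, efficiency):\n"
        "    return irradiance_w_m2 * area_m2 * efficiency\n"
    ),
))
dsg.add_node(DSGNode(
    node_id="pump",
    requirements=["REQ-02"],
    embodiment="DC diaphragm pump",
    physics_model=(
        "def flow_l_h(power_w, litres_per_wh):\n"
        "    return power_w * litres_per_wh\n"
    ),
))
dsg.add_edge("pv_panel", "pump")

# Round-trip through JSON, then score coverage against four made-up requirements.
restored = DesignStateGraph.from_json(dsg.to_json())
coverage = restored.requirement_coverage(["REQ-01", "REQ-02", "REQ-03", "REQ-04"])
```

Storing the physics model as source text keeps the whole graph serializable with the standard `json` module. The trade-off is that the code must be compiled and checked separately before it can run in a simulator.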

