SEAI, Nov 21, 2024

Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture

arXiv:2411.13768v3, 7 citations, h-index: 20
Originality: Incremental advance
AI Analysis

The paper addresses the problem of systematic evaluation for LLM agents, offering a framework for continuous adaptation and governance. Its advance is incremental: it builds on existing evaluation practices.

The paper tackles the challenge of evaluating LLM agents, which have open-ended and probabilistic behaviors, by proposing an evaluation-driven development and operations (EDDOps) approach that embeds continuous evaluation into a process model and reference architecture to support safer and traceable evolution.

Large Language Models (LLMs) have enabled the emergence of LLM agents, systems capable of pursuing under-specified goals and adapting after deployment. Evaluating such agents is challenging because their behavior is open-ended, probabilistic, and shaped by system-level interactions over time. Traditional evaluation methods, built around fixed benchmarks and static test suites, fail to capture emergent behaviors or support continuous adaptation across the lifecycle. To ground a more systematic approach, we conduct a multivocal literature review (MLR) synthesizing academic and industrial evaluation practices. The findings directly inform two empirically derived artifacts: a process model and a reference architecture that embed evaluation as a continuous, governing function rather than a terminal checkpoint. Together they constitute the evaluation-driven development and operations (EDDOps) approach, which unifies offline (development-time) and online (runtime) evaluation within a closed feedback loop. By making evaluation evidence drive both runtime adaptation and governed redevelopment, EDDOps supports safer, more traceable evolution of LLM agents aligned with changing objectives, user needs, and governance constraints.
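The closed feedback loop described in the abstract can be pictured with a small sketch. This is not the paper's process model or reference architecture. It is a minimal illustration, assuming a toy agent, exact-match scoring, and two made-up thresholds. Every name here (`EvalResult`, `evaluate`, `decide`, `toy_agent`) is hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalResult:
    """Evaluation evidence from one stage (hypothetical structure)."""
    stage: str    # "offline" (development-time) or "online" (runtime)
    score: float  # fraction of cases passed, in [0, 1]


def evaluate(agent: Callable[[str], str],
             cases: List[Tuple[str, str]],
             stage: str) -> EvalResult:
    """Score an agent as the fraction of (prompt, expected) pairs it matches exactly."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return EvalResult(stage=stage, score=passed / len(cases))


def decide(result: EvalResult,
           adapt_below: float = 0.9,
           redevelop_below: float = 0.5) -> str:
    """Map evaluation evidence to an action. Thresholds are illustrative assumptions."""
    if result.score < redevelop_below:
        return "redevelop"  # hand back to governed redevelopment
    if result.score < adapt_below:
        return "adapt"      # adjust at runtime, e.g. tighten a guardrail
    return "continue"


def toy_agent(prompt: str) -> str:
    """Stand-in for an LLM agent: upper-cases its input."""
    return prompt.upper()


# Offline cases are fixed at development time; online cases stand in for
# runtime traffic, including one input the offline suite never covered.
offline_cases = [("a", "A"), ("b", "B")]
online_cases = [("a", "A"), ("b", "B"), ("c", "c"), ("d", "D")]

offline = evaluate(toy_agent, offline_cases, "offline")
online = evaluate(toy_agent, online_cases, "online")

for result in (offline, online):
    print(result.stage, result.score, decide(result))
```

In this toy run the agent passes the offline suite but misses one of the four runtime cases, so the same decision rule that lets it through development flags it for adaptation once online evidence arrives. A real system would replace exact-match scoring with richer evaluators and would record each decision for traceability.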

