A Pattern Language for Resilient Visual Agents
For enterprise architects integrating visual AI agents, this work offers a structured way to balance latency against determinism, but it is an incremental contribution and provides no empirical validation.
The paper addresses the challenge of integrating multimodal foundation models into enterprise systems. It proposes an architectural pattern language of four design patterns that separates fast, deterministic reflexes from slow, probabilistic supervision. It reports no concrete performance numbers.
Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and non-determinism of vision-language-action (VLA) models versus the strict determinism and real-time performance required by enterprise control loops. In this study, we propose an architectural pattern language for visual agents that separates fast, deterministic reflexes from slow, probabilistic supervision. It consists of four architectural design patterns: (1) Hybrid Affordance Integration, (2) Adaptive Visual Anchoring, (3) Visual Hierarchy Synthesis, and (4) Semantic Scene Graph.
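The split between fast, deterministic reflexes and slow, probabilistic supervision can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not reproduce any of the four named patterns. Every name in it (ReflexLayer, Supervisor, VisualAgent, fake_vla) is hypothetical, and the model call is replaced by a stub. The sketch assumes the reflex path is a plain lookup of previously confirmed screen coordinates, and that the supervisor is consulted only on a miss.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[int, int]


@dataclass
class ReflexLayer:
    """Fast path (hypothetical): deterministic lookup, no model call."""
    anchors: Dict[str, Coords] = field(default_factory=dict)

    def act(self, target: str) -> Optional[Coords]:
        # Returns the cached coordinates, or None on a miss.
        return self.anchors.get(target)


@dataclass
class Supervisor:
    """Slow path (hypothetical): wraps a probabilistic model call."""
    locate: Callable[[str], Coords]  # stand-in for a VLA model
    calls: int = 0

    def resolve(self, target: str) -> Coords:
        self.calls += 1  # count how often the slow path is taken
        return self.locate(target)


class VisualAgent:
    """Tries the reflex first and escalates to the supervisor on a miss."""

    def __init__(self, reflex: ReflexLayer, supervisor: Supervisor) -> None:
        self.reflex = reflex
        self.supervisor = supervisor

    def click(self, target: str) -> Coords:
        coords = self.reflex.act(target)
        if coords is None:
            # Slow path: ask the model, then cache the answer so that
            # later requests for the same target stay deterministic.
            coords = self.supervisor.resolve(target)
            self.reflex.anchors[target] = coords
        return coords


def fake_vla(target: str) -> Coords:
    """Stub for a model call; a real system would run inference here."""
    return {"submit": (120, 48)}[target]


if __name__ == "__main__":
    supervisor = Supervisor(locate=fake_vla)
    agent = VisualAgent(ReflexLayer(), supervisor)
    agent.click("submit")    # miss: escalates to the supervisor
    agent.click("submit")    # hit: served by the reflex layer
    print(supervisor.calls)  # slow-path invocations so far
```

In this sketch the supervisor runs synchronously on a miss. A real-time control loop would more plausibly run it off the critical path and let the reflex layer keep serving cached anchors in the meantime, but that design choice is an assumption here, not something the abstract specifies.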