Operational Memory Architecture for Kubernetes:Preserving Causal Context Across the Evidence Horizon
For Kubernetes operators debugging crash loops, OMA preserves diagnostically valuable context that is otherwise lost, though the problem is domain-specific and the solution is incremental.
The paper introduces the Operational Memory Architecture (OMA) to preserve causal failure evidence in Kubernetes that is lost due to native event retention, achieving causal edge construction with mean latency below 1 ms and collector processing ~2.8 events/sec with under 10 MB memory.
Kubernetes clusters generate rich operational events during pod lifecycle transitions, yet the platform's native event retention model discards the most diagnostically valuable context. The LastTerminationState field, which records a container's last failure, is overwritten shortly after a pod restart. We define this as the evidence horizon. During high-frequency crash loops, this horizon may be crossed multiple times before inspection, permanently losing critical evidence. This paper introduces the Operational Memory Architecture (OMA) to preserve causal failure evidence before event rotation. OMA encodes evidence retention and causal reconstruction as explicit architectural requirements. It captures operational events into causal chains using three patterns: P001 (OOMKill chain), P002 (ConfigMap variable misconfiguration), and P003 (ConfigMap volume mount propagation). We implement OMA as an open-source system with a Go-based Kubernetes watcher, SQLite operational memory store, and a simple query interface. Experiments on Minikube and AKS include a 30-run latency analysis and stress tests with up to 20 crash-looping pods. Causal edges are built with mean latency below 1 ms. The collector processes ~2.8 events/sec while using under 10 MB memory, showing minimal overhead and effective evidence preservation.