VerificAgent: Domain-Specific Memory Verification for Scalable Oversight of Aligned Computer-Use Agents
This work addresses safety and alignment issues for computer-using agents in domain-specific tasks. It is an incremental improvement that focuses on memory verification as a scalable oversight mechanism.
The paper tackles the problem of unvetted memories in computer-using agents, which can lead to unsafe heuristics and drift from user intent. It introduces VerificAgent, a framework that improves task reliability and reduces hallucination-induced failures without additional model fine-tuning.
Continual memory augmentation lets computer-using agents (CUAs) learn from prior interactions, but unvetted memories can encode domain-inappropriate or unsafe heuristics: spurious rules that drift from user intent and safety constraints. We introduce VerificAgent, a scalable oversight framework that treats persistent memory as an explicit alignment surface. VerificAgent combines (1) an expert-curated seed of domain knowledge, (2) iterative, trajectory-based memory growth during training, and (3) a post-hoc human fact-checking pass to sanitize accumulated memories before deployment. Evaluated on OSWorld productivity tasks and additional adversarial stress tests, VerificAgent improves task reliability, reduces hallucination-induced failures, and preserves interpretable, auditable guidance, all without additional model fine-tuning. Because humans correct high-impact errors once, before deployment, the verified memory acts as a frozen safety contract that future agent actions must satisfy. Our results suggest that domain-scoped, human-verified memory offers a scalable oversight mechanism for CUAs, complementing broader alignment strategies by limiting silent policy drift and anchoring agent behavior to the norms and safety constraints of the target domain.
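The three stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `MemoryEntry`, `grow_memory`, `verify_memory`, `extract`, and `reviewer` are hypothetical, the `extract` callback stands in for whatever trajectory-based reflection the agent performs, and the `reviewer` callback stands in for a human fact-checker who may keep, correct, or drop each entry. Passing seed entries through the same review is also an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class MemoryEntry:
    """One piece of guidance stored in the agent's persistent memory."""
    text: str
    source: str  # "seed" (expert-curated) or "trajectory" (learned)


def grow_memory(
    seed: Iterable[str],
    trajectories: Iterable[list],
    extract: Callable[[list], Iterable[str]],
) -> list:
    """Stages 1 and 2: start from the expert seed, then append new,
    non-duplicate rules extracted from each training trajectory."""
    memory = [MemoryEntry(text, "seed") for text in seed]
    seen = {entry.text for entry in memory}
    for trajectory in trajectories:
        for rule in extract(trajectory):
            if rule not in seen:
                seen.add(rule)
                memory.append(MemoryEntry(rule, "trajectory"))
    return memory


def verify_memory(
    memory: Iterable[MemoryEntry],
    reviewer: Callable[[MemoryEntry], Optional[str]],
) -> tuple:
    """Stage 3: a post-hoc review pass. The reviewer returns the
    (possibly corrected) text to keep, or None to drop the entry.
    The result is an immutable tuple, mirroring a frozen memory."""
    verified = []
    for entry in memory:
        decision = reviewer(entry)
        if decision is not None:
            verified.append(MemoryEntry(decision, entry.source))
    return tuple(verified)


# Hypothetical data, for illustration only.
seed = ["Save spreadsheets with Ctrl+S before closing."]
trajectories = [
    ["Use Format > Cells to change number formats.",
     "Always click the first button."],
    ["Use Format > Cells to change number formats."],
]


def extract(trajectory):
    # Stand-in: treat each trajectory as a list of already-proposed rules.
    return trajectory


def reviewer(entry):
    # Stand-in for a human fact-checker: reject one spurious heuristic.
    if entry.text.startswith("Always click"):
        return None
    return entry.text


memory = grow_memory(seed, trajectories, extract)
verified = verify_memory(memory, reviewer)
```

In this sketch the review happens once, after memory growth, and the agent would only ever read from `verified` at deployment time. Returning a tuple of frozen dataclass instances is one simple way to express the idea that the verified memory is not modified afterwards.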