AI, Oct 17, 2025

AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

Georgia Tech
arXiv:2510.15261v1, 2 citations, h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient multimodal memory in AI agents. Its approach improves both performance and speed, though it is an incremental advance over existing agent systems.

The paper tackles multimodal memory in agent systems by introducing AUGUSTUS, which stores information in a graph-structured multimodal contextual memory organized around semantic tags. On ImageNet classification, it outperforms traditional multimodal RAG while being 3.5 times faster.

Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been a growing interest in augmenting agent systems with external memory databases. However, existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of four stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: performing the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach on ImageNet classification while being 3.5 times faster, and it outperforms MemGPT on the MSC benchmark.
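
The four-stage loop and the tag-based memory described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ContextNode`, `TagGraphMemory`, `encode`, and `agent_step` are hypothetical, the encode stage is a toy keyword splitter standing in for whatever LLM or multimodal model the real system uses, and images are represented only by a caption string. It shows only the general idea of linking semantic tags to stored context and retrieving by tag overlap rather than by vector similarity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ContextNode:
    """One stored memory item (hypothetical schema, not from the paper)."""
    node_id: int
    modality: str  # e.g. "text" or "image"
    payload: str   # raw text, or a caption standing in for an image


class TagGraphMemory:
    """Bipartite graph: semantic tags on one side, context nodes on the other."""

    def __init__(self):
        self.nodes = {}                       # node_id -> ContextNode
        self.tag_to_nodes = defaultdict(set)  # tag -> ids of linked nodes

    def store(self, modality, payload, tags):
        node_id = len(self.nodes)
        self.nodes[node_id] = ContextNode(node_id, modality, payload)
        for tag in tags:
            self.tag_to_nodes[tag.lower()].add(node_id)
        return node_id

    def retrieve(self, query_tags, top_k=3, exclude=None):
        # Score each node by how many query tags point to it.
        scores = defaultdict(int)
        for tag in query_tags:
            for node_id in self.tag_to_nodes.get(tag.lower(), ()):
                if node_id != exclude:
                    scores[node_id] += 1
        ranked = sorted(scores, key=lambda n: (-scores[n], n))
        return [self.nodes[n] for n in ranked[:top_k]]


def encode(text):
    """Toy encode stage (assumption): words longer than 3 characters become tags."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def agent_step(memory, modality, payload):
    tags = encode(payload)                               # (i) encode
    node_id = memory.store(modality, payload, tags)      # (ii) store in memory
    context = memory.retrieve(tags, exclude=node_id)     # (iii) retrieve
    reply = f"seen {len(context)} related item(s)"       # (iv) act (placeholder)
    return reply, context
```

For example, calling `agent_step(memory, "image", "golden retriever playing fetch")` and then `agent_step(memory, "text", "My retriever loves fetch.")` on the same `TagGraphMemory` makes the second call retrieve the earlier image node, because the two inputs share the tags `retriever` and `fetch`. Lookup touches only the nodes linked to the query tags, which is one plausible reading of what "concept-driven retrieval" buys over scanning a vector index.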

