OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
For LLM agents operating in long-horizon interactive settings, OCR-Memory provides a memory framework that increases effective capacity while preserving evidence fidelity, addressing a key bottleneck in agent autonomy.
OCR-Memory addresses two limitations of existing LLM agent memory systems, high token cost and information loss, by using the visual modality to store long histories as annotated images, enabling retrieval via visual anchors and verbatim text transcription. Experiments show consistent gains on long-horizon agent benchmarks under strict context limits.
Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a locate-and-transcribe paradigm that selects relevant regions through visual anchors and returns the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.
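The locate-and-transcribe flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `OpticalMemorySketch`, its methods, and the anchor format `A0001` are hypothetical names chosen for illustration. No image is rendered here, and a simple word-overlap score stands in for the vision model that would select anchors from the annotated image. The only property the sketch aims to show is that retrieval returns stored text unchanged, keyed by anchor, rather than generating it.

```python
from dataclasses import dataclass, field


@dataclass
class OpticalMemorySketch:
    """Toy anchor-indexed memory store (hypothetical, not the paper's code)."""

    # Maps each anchor ID to the verbatim text of one trajectory segment.
    segments: dict = field(default_factory=dict)

    def write(self, text: str) -> str:
        # Assign a unique anchor ID. In the described system this identifier
        # would be drawn onto the rendered image next to the segment.
        anchor = f"A{len(self.segments) + 1:04d}"
        self.segments[anchor] = text
        return anchor

    def locate(self, query: str, k: int = 1) -> list:
        # Assumption: word overlap stands in for the visual locator that
        # would pick relevant anchors by inspecting the annotated image.
        query_words = set(query.lower().split())
        scored = []
        for anchor, text in self.segments.items():
            overlap = len(query_words & set(text.lower().split()))
            if overlap > 0:
                scored.append((overlap, anchor))
        # Highest overlap first; ties broken by anchor ID for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [anchor for _, anchor in scored[:k]]

    def transcribe(self, anchors: list) -> list:
        # Verbatim lookup: stored text is returned as written, never regenerated.
        return [self.segments[a] for a in anchors]


memory = OpticalMemorySketch()
memory.write("step 1: opened drawer 2, found a key")
memory.write("step 2: door 3 is locked")
memory.write("step 3: used key on door 3, door opened")

hits = memory.locate("where was the key found", k=1)
evidence = memory.transcribe(hits)
print(hits, evidence)
```

Separating `locate` from `transcribe` mirrors the split in the abstract: the model's output is limited to a short list of anchor IDs, and the evidence itself comes from a lookup, which is how the paradigm avoids free-form generation.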