CV · Dec 25, 2025

TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant

Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, Fan Zhou

arXiv:2512.21616v1 · 13.15 citations · h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses the need for personalized AI assistants that can handle extended conversations. The advance is incremental, since it builds on existing MLLM personalization methods.

The paper tackles long-context personalized dialogue for multimodal large language models. It introduces LCMP, which the authors describe as the first benchmark for this task, and TAME, a training-free framework that they report achieves the best performance on LCMP in long-context scenarios.

Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., "A yellow puppy" -> "Your puppy Mochi"), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.
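To make the abstract's two ideas concrete (a double memory per personalized concept, and a retrieve-then-align step before generation), here is a minimal toy sketch. It is not the authors' implementation: the names `ConceptMemory`, `retrieve`, `align`, and `build_prompt` are hypothetical, the memory is text-only, and the align step uses simple word overlap as a stand-in for whatever alignment TAME actually performs.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptMemory:
    """Hypothetical two-part memory for one personalized concept."""
    persistent: dict = field(default_factory=dict)  # long-lived attributes
    temporal: list = field(default_factory=list)    # (dialogue turn, transient state)

    def update_persistent(self, key: str, value: str) -> None:
        self.persistent[key] = value  # a newer value overwrites the old one

    def update_temporal(self, turn: int, state: str) -> None:
        self.temporal.append((turn, state))


def tokens(text: str) -> set:
    """Lowercased words longer than two characters, punctuation stripped."""
    words = (w.strip(".,?!:").lower() for w in text.split())
    return {w for w in words if len(w) > 2}


def retrieve(memory: ConceptMemory, recent: int = 3) -> list:
    """Retrieve step: every persistent fact plus the most recent temporal states."""
    facts = [f"{k}: {v}" for k, v in memory.persistent.items()]
    states = [state for _, state in sorted(memory.temporal)[-recent:]]
    return facts + states


def align(question: str, retrieved: list) -> list:
    """Align step (toy stand-in): keep entries that share a word with the question."""
    q = tokens(question)
    return [entry for entry in retrieved if q & tokens(entry)]


def build_prompt(question: str, memory: ConceptMemory) -> str:
    """Assemble the text that would be passed to the MLLM (model call omitted)."""
    context = align(question, retrieve(memory))
    return "Context:\n" + "\n".join(context) + "\nQuestion: " + question


memory = ConceptMemory()
memory.update_persistent("breed", "Shiba Inu")
memory.update_persistent("collar", "red collar")
memory.update_temporal(12, "Mochi hurt a paw yesterday")
print(build_prompt("Why is Mochi limping on her paw?", memory))
```

In this sketch the retrieve step pulls from both memories, and the align step drops the breed and collar entries because they share no words with the question, so only the recent paw injury reaches the prompt. A real system would presumably use embedding or model-based relevance rather than word overlap.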
