CV · Feb 4, 2025

Personalization Toolkit: Training Free Personalization of Large Vision Language Models

arXiv:2502.02452v3 · 5 citations · h-index: 21
AI Analysis

The paper addresses a practical obstacle to personalizing LVLMs for specific users and objects: existing methods require training for each one, which makes them impractical for real-world deployment.

The paper tackles the problem of personalizing Large Vision-Language Models (LVLMs) without requiring time-consuming test-time training, introducing a training-free approach that uses pre-trained vision models, retrieval-augmented generation, and visual prompting to achieve state-of-the-art results.

Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users and object instances, and to generate contextually tailored responses. Existing approaches typically rely on time-consuming test-time training for each user or object, making them impractical for real-world deployment, a limitation reflected in current personalization benchmarks, which are focused on object-centric, single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization and introduce a comprehensive real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Our method leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.
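The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that each personalized concept is stored as a reference feature vector and that regions detected in a new image are matched to concepts by cosine similarity. The `Concept` class, the `retrieve` and `build_prompt` helpers, the 0.8 threshold, and the toy three-dimensional vectors are all hypothetical and are not taken from the paper, which may use a different matching rule and prompt format.

```python
import math
from dataclasses import dataclass


@dataclass
class Concept:
    name: str               # hypothetical identifier, e.g. a user's pet
    embedding: list[float]  # reference feature vector (toy values here)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(region_embeddings, memory, threshold=0.8):
    """Match each detected region to its most similar stored concept.

    The threshold is an assumed value; regions whose best similarity
    falls below it are left unmatched.
    """
    if not memory:
        return []
    matches = []
    for idx, emb in enumerate(region_embeddings):
        best = max(memory, key=lambda c: cosine(emb, c.embedding))
        if cosine(emb, best.embedding) >= threshold:
            matches.append((idx, best.name))
    return matches


def build_prompt(matches, question):
    """Turn matches into text hints that would accompany a marked-up image."""
    hints = "; ".join(f"region {i} is <{name}>" for i, name in matches)
    return f"{hints}. {question}" if hints else question


# Toy usage: two stored concepts, two detected regions.
memory = [Concept("bo", [1.0, 0.0, 0.0]), Concept("mug", [0.0, 1.0, 0.0])]
regions = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = retrieve(regions, memory)
prompt = build_prompt(matches, "What is <bo> doing?")
```

In this toy run only the first region is close enough to a stored concept, so the prompt names a single region. Because adding a new concept only means appending a vector to `memory`, no model weights change, which is the sense in which such a pipeline is training-free.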

