CV · AI · Mar 17, 2025

Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs

arXiv:2503.12999v3 · 2 citations · h-index: 8
Originality: Highly original
AI Analysis

This work addresses the data limitations faced by researchers and practitioners who want to personalize VLMs. It is an incremental improvement whose main contribution is a new synthetic data pipeline.

The paper tackles personalization in Vision-Language Models (VLMs), where user-provided positive samples are scarce and retrieved negative samples are of low quality. It introduces the Concept-as-Tree (CaT) framework to generate controllable synthetic data, and reports that this significantly enhances VLM performance across personalization benchmarks.

Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for existing techniques. To reveal the relationship between samples and model performance, we systematically investigate how the amount and diversity of positive and negative (easy and hard) samples affect VLM personalization tasks. Based on this analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the generation of positive and negative samples with varying difficulty and diversity, and which can be easily extended to multi-concept scenarios. With a well-designed data filtering strategy, our CaT framework can ensure the quality of the generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess how effectively the pipeline alleviates the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the capabilities of VLMs across personalization benchmarks. To the best of our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code will be released.
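
The abstract does not spell out the tree layout or how samples are drawn from it, so the following is only an illustrative sketch of what "a concept as a tree" could look like, not the authors' implementation. It assumes a three-level tree (concept, then dimensions, then attributes), text prompts as the generated output, and a hard negative defined as the same attributes applied to a visually similar concept. The names `ConceptNode`, `build_tree`, `sample_positive`, and `sample_hard_negative` are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConceptNode:
    """One node of a hypothetical concept tree."""
    name: str
    children: List["ConceptNode"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the names of all leaf nodes below (or at) this node."""
        if not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.leaves())
        return names


def build_tree(concept: str, dimensions: Dict[str, List[str]]) -> ConceptNode:
    """Assumed layout: root = concept, level 2 = dimensions, leaves = attributes."""
    return ConceptNode(
        concept,
        [ConceptNode(dim, [ConceptNode(a) for a in attrs])
         for dim, attrs in dimensions.items()],
    )


def sample_positive(tree: ConceptNode, rng: random.Random) -> str:
    """Pick one attribute per dimension and describe the target concept."""
    parts = [rng.choice(dim.children).name for dim in tree.children]
    return f"a photo of {tree.name}, " + ", ".join(parts)


def sample_hard_negative(tree: ConceptNode, similar: str, rng: random.Random) -> str:
    """Assumption: a hard negative reuses the attributes but swaps in a similar concept."""
    parts = [rng.choice(dim.children).name for dim in tree.children]
    return f"a photo of {similar}, " + ", ".join(parts)


rng = random.Random(0)
tree = build_tree("<bo>", {"pose": ["sitting", "running"],
                           "background": ["park", "beach"]})
positive = sample_positive(tree, rng)
hard_negative = sample_hard_negative(tree, "a corgi", rng)
```

In this sketch, diversity is controlled by how many dimensions and attributes the tree holds, and difficulty by how close the swapped-in concept is to the target. A data filter like the one the abstract mentions would sit after generation and is not modeled here.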
