CV · AI · Mar 17, 2025

From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

arXiv:2503.12821v4 · 10 citations · h-index: 14 · CVPR
Originality: Incremental advance
AI Analysis

This paper addresses data imbalance in LVLMs for tasks such as Visual Question Answering and Visual Reasoning, offering an incremental improvement over existing methods.

The paper tackles the long-tail data imbalance problem in Large Vision-Language Models (LVLMs) by proposing an Adaptive Data Refinement Framework (ADR) that rebalances and synthesizes data, resulting in a 4.36% relative improvement in average performance on LLaVA 1.5 across eleven benchmarks without increasing training data volume.

Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and on specific tasks such as recognition and classification. LVLMs (e.g., LLaVA) and more general tasks (e.g., Visual Question Answering and Visual Reasoning) remain under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on this observation, we propose an Adaptive Data Refinement Framework (ADR), which consists of two stages: Data Rebalancing (DR) and Data Synthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, ADR effectively mitigates the long-tail problem in the training data, yielding a 4.36% relative improvement in the average performance of LLaVA 1.5 without increasing the training data volume.
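
The abstract says only that the DR stage rebalances redundant data based on entity distributions; it does not give the procedure. As a rough illustration of what frequency-based rebalancing can look like, here is a minimal Python sketch. The function name `rebalance`, the `threshold` parameter, and the keep-probability rule are all assumptions made for this example, not the paper's actual method.

```python
import random
from collections import Counter


def rebalance(samples, threshold, seed=0):
    """Downsample training samples that contain only head (very frequent) entities.

    samples:   list of (sample_id, entities) pairs, where entities is a list of strings.
    threshold: hypothetical frequency cap; entities seen at most this many
               times are treated as non-head and their samples are always kept.
    Returns the list of retained sample ids.
    """
    rng = random.Random(seed)

    # Count in how many samples each entity appears (once per sample).
    counts = Counter(e for _, entities in samples for e in set(entities))

    kept = []
    for sample_id, entities in samples:
        if not entities:
            # No entity information: keep the sample unchanged.
            kept.append(sample_id)
            continue
        # Assumed rule: the rarest entity in a sample decides its fate, so a
        # sample that mentions any tail concept is never dropped.
        rarest = min(counts[e] for e in entities)
        keep_prob = min(1.0, threshold / rarest)
        if rng.random() < keep_prob:
            kept.append(sample_id)
    return kept


if __name__ == "__main__":
    # Toy data: 100 samples about a head concept, 2 about a tail concept.
    data = [(i, ["person"]) for i in range(100)]
    data += [(100, ["okapi"]), (101, ["person", "okapi"])]
    print(len(rebalance(data, threshold=10)), "of", len(data), "samples kept")
```

Under this assumed rule, head-only samples are thinned in proportion to how far their entity count exceeds the cap, while every sample containing a tail entity survives. The DS stage (diffusion-based synthesis for tail concepts) is not sketched here because the abstract gives no detail beyond the model family.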

