LG · Aug 27, 2025

SCAR: A Characterization Scheme for Multi-Modal Dataset

Ri Su, Zhao Chen, Caleb Chen Cao, Nan Tang, Lei Chen

arXiv:2508.19659v1 · h-index: 16 · Has Code

Originality: Incremental advance

AI Analysis

This work gives researchers and practitioners in multi-modal AI theoretical insight into data quality. It is an incremental advance, since it builds on existing data-centric methods.

The paper asks how data properties affect generalization in foundation models. It introduces SCAR, a scheme that characterizes datasets along four structural measures, and uses it to identify a minimal "Foundation Data" subset that preserves the generalization behavior of the full dataset. Experiments validate its effectiveness in predicting data utility and guiding data acquisition.

Foundation models exhibit remarkable generalization across diverse tasks, largely driven by the characteristics of their training data. Recent data-centric methods like pruning and compression aim to optimize training but offer limited theoretical insight into how data properties affect generalization, especially how data characteristics behave under sample scaling. Traditional perspectives further constrain progress by focusing predominantly on data quantity and training efficiency, often overlooking structural aspects of data quality. In this study, we introduce SCAR, a principled scheme for characterizing the intrinsic structural properties of datasets across four key measures: Scale, Coverage, Authenticity, and Richness. Unlike prior data-centric measures, SCAR captures stable characteristics that remain invariant under dataset scaling, providing a robust and general foundation for data understanding. Leveraging these structural properties, we introduce Foundation Data, a minimal subset that preserves the generalization behavior of the full dataset without requiring model-specific retraining. We model single-modality tasks as step functions and estimate the distribution of the foundation data size to capture step-wise generalization bias across modalities in the target multi-modal dataset. Finally, we develop a SCAR-guided data completion strategy based on this generalization bias, which enables efficient, modality-aware expansion of modality-specific characteristics in multi-modal datasets. Experiments across diverse multi-modal datasets and model architectures validate the effectiveness of SCAR in predicting data utility and guiding data acquisition. Code is available at https://github.com/McAloma/SCAR.
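As a rough illustration of the "minimal subset that preserves generalization behavior" idea, the sketch below picks the smallest subset size at which a step-shaped score curve has already reached the full-dataset score. It is not the paper's estimator and is not taken from the SCAR repository: the function name `minimal_preserving_size`, the tolerance `tol`, and the sample numbers are all hypothetical, chosen only to make the step-function intuition concrete.

```python
from typing import Sequence


def minimal_preserving_size(
    sizes: Sequence[int],
    scores: Sequence[float],
    tol: float = 0.02,
) -> int:
    """Return the smallest subset size whose score, and every score at a
    larger size, lies within `tol` of the full-dataset score.

    Hypothetical helper for illustration; not the SCAR method itself.
    """
    if not sizes or len(sizes) != len(scores):
        raise ValueError("sizes and scores must be non-empty and equal in length")

    # Sort by subset size so the last entry is the full dataset.
    pairs = sorted(zip(sizes, scores))
    full_score = pairs[-1][1]
    best = pairs[-1][0]

    # Walk from the largest size down; stop at the first size whose
    # score falls outside the tolerance band around the full score.
    for size, score in reversed(pairs):
        if abs(score - full_score) > tol:
            break
        best = size
    return best


# Made-up, step-shaped curve: the score jumps between 200 and 400 samples.
example_sizes = [100, 200, 400, 800, 1600]
example_scores = [0.52, 0.55, 0.90, 0.91, 0.91]
print(minimal_preserving_size(example_sizes, example_scores))
```

On this made-up curve the function returns 400, the first size after the jump. In a multi-modal setting one could run the same check per modality and compare the resulting sizes, which is loosely the kind of per-modality gap the abstract calls generalization bias.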
