Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs
This work addresses scalable data valuation for practitioners using pretrained LLMs and VLMs, offering an incremental improvement over existing methods by reducing computational cost.
The paper tackles the computational cost of data valuation for large language and vision-language models by introducing For-Value, a forward-only framework that estimates influence scores from a single forward pass. For-Value matches or outperforms gradient-based methods in identifying impactful fine-tuning examples and in detecting mislabeled data.
Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression that requires only a single forward pass, eliminating the need for costly gradient computations. Our theoretical analysis shows that For-Value accurately estimates per-sample influence by capturing the alignment of hidden representations and of prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and in detecting mislabeled data.
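
To make the idea of a forward-only, closed-form score concrete, the following is a minimal sketch and not necessarily the exact For-Value expression, which the abstract does not spell out. It assumes a single-label softmax output and scores each training sample by the product of two alignments with the validation samples: the inner product of final hidden representations and the inner product of prediction errors (one-hot label minus predicted probabilities). For a cross-entropy loss this product equals the inner product of the last-layer weight gradients, yet it needs only quantities available from a forward pass. The function name `forward_only_scores` and all argument names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def forward_only_scores(h_train, logits_train, y_train,
                        h_val, logits_val, y_val):
    """Illustrative influence score per training sample (hypothetical helper).

    h_*      : (n, d) final hidden representations from a forward pass
    logits_* : (n, c) output logits from the same forward pass
    y_*      : (n,)   integer class labels
    Returns an (n_train,) array; larger means better aligned with validation.
    """
    num_classes = logits_train.shape[1]
    one_hot = np.eye(num_classes)

    # Prediction errors: one-hot label minus predicted distribution.
    err_train = one_hot[y_train] - softmax(logits_train)
    err_val = one_hot[y_val] - softmax(logits_val)

    # Pairwise alignment of hidden representations and of prediction errors.
    rep_align = h_train @ h_val.T        # (n_train, n_val)
    err_align = err_train @ err_val.T    # (n_train, n_val)

    # Sum the elementwise product over all validation samples.
    return (rep_align * err_align).sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h_tr, lg_tr = rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
    h_va, lg_va = rng.normal(size=(2, 4)), rng.normal(size=(2, 3))
    y_tr, y_va = rng.integers(0, 3, size=5), rng.integers(0, 3, size=2)
    scores = forward_only_scores(h_tr, lg_tr, y_tr, h_va, lg_va, y_va)
    # Rank training samples from most to least aligned with validation.
    print(np.argsort(-scores))
```

Under these assumptions, low or negative scores flag training samples whose errors point away from the validation set, which is one plausible way such a score could surface mislabeled data. Extending the sketch to autoregressive LLMs or VLMs would require aggregating over token positions, which is not attempted here.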