CL · Aug 13, 2025

Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

Wenlong Deng, Jiaming Zhang, Qi Zeng, Christos Thrampoulidis, Boying Gong, Xiaoxiao Li

arXiv:2508.10180v2 · 1 citation · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the problem of scalable data valuation for practitioners using pretrained LLMs and VLMs, offering an incremental improvement over existing methods by reducing computational costs.

The paper tackles the computational cost of data valuation for large language and vision-language models. It introduces For-Value, a forward-only framework that estimates influence scores from a single forward pass. According to the authors, it matches or outperforms gradient-based methods at identifying impactful fine-tuning examples and detecting mislabeled data.

Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
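The abstract describes the score only in words: alignment in hidden representations and in prediction errors between a training and a validation sample. The sketch below illustrates one quantity with that shape. It is not taken from the paper. It relies on a standard identity: for a linear output layer trained with cross-entropy, the inner product of two per-sample gradients equals the inner product of the hidden states times the inner product of the prediction errors (one-hot label minus softmax probabilities). The names `alignment_score` and `rank_training_samples`, the tuple layout `(hidden, logits, label)`, and the use of a single class distribution per sample are all assumptions made for illustration. For-Value's actual closed-form expression may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_error(logits, label):
    """One-hot label minus predicted probabilities."""
    probs = softmax(logits)
    return [(1.0 if k == label else 0.0) - p for k, p in enumerate(probs)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_score(train, val):
    """Hypothetical forward-only score for one (train, val) pair.

    Each sample is a tuple (hidden, logits, label), all obtained from a
    forward pass. The score multiplies hidden-state alignment by
    prediction-error alignment; no backward pass is needed.
    """
    h_t, logits_t, y_t = train
    h_v, logits_v, y_v = val
    err_t = prediction_error(logits_t, y_t)
    err_v = prediction_error(logits_v, y_v)
    return dot(h_t, h_v) * dot(err_t, err_v)


def rank_training_samples(train_set, val_set):
    """Return training indices sorted from highest to lowest total score."""
    totals = [sum(alignment_score(t, v) for v in val_set) for t in train_set]
    return sorted(range(len(train_set)), key=lambda i: totals[i], reverse=True)
```

In this toy setup, a training sample that shares a validation sample's representation and label scores positive, and one with the same representation but a conflicting label scores negative. That sign behavior is what would make such a score usable for flagging mislabeled data.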
