LG · AI · Apr 24

Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

arXiv:2604.22893 47.31 citations
AI Analysis

For LLM data market participants, this paper offers a fairer, auditable pricing method that captures nonlinear data contributions, though the framework is domain-specific and its contribution is incremental, since it combines existing techniques.

This paper proposes a dynamic data valuation framework for LLMs that moves from static row-count pricing to utility-based pricing using token-level information density, empirical training gain (via influence functions and Data Shapley), and cryptographic verifiability. Experiments on three domains show proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, outperforming row-count and token-count baselines.
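"Ranking alignment" between a pricing signal and realized utility can be illustrated with a rank correlation. The sketch below assumes Spearman's coefficient as the alignment measure, which the summary does not confirm, and every number in it is invented for illustration; none of it comes from the paper's experiments.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for four datasets (illustrative only).
realized_utility = [0.1, 0.5, 0.3, 0.9]   # e.g. measured downstream gain
proxy_gain = [1.0, 4.0, 2.0, 8.0]         # proxy-model estimate
row_count = [900, 100, 500, 300]          # static row-count signal

print(spearman(realized_utility, proxy_gain))  # same ordering as utility
print(spearman(realized_utility, row_count))   # ordering disagrees with utility
```

In this toy setup the proxy signal orders the datasets exactly as realized utility does, while row count does not, which is the kind of gap the summary describes.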

Traditional data valuation methods based on "row-count × quality coefficient" paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
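Two of the building blocks named in the abstract, Shannon entropy over tokens (layer 1) and a Merkle-tree commitment (layer 3), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `token_entropy` and `merkle_root` are hypothetical, tokens are assumed to come from a plain whitespace split, and the tree uses SHA-256 with the last node duplicated on odd levels.

```python
import hashlib
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy, in bits per token, of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def merkle_root(records):
    """Hex Merkle root over a non-empty list of byte strings (SHA-256).

    Simplified: no leaf/node domain separation, and an odd level is padded
    by repeating its last node. A production ledger would harden both.
    """
    if not records:
        raise ValueError("need at least one record")
    level = [hashlib.sha256(r).digest() for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd level with its last node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical records; whitespace tokenization is an assumption.
repetitive = "yes yes yes yes".split()
varied = "prove the lemma by induction".split()
print(token_entropy(repetitive))  # low: a single repeated token
print(token_entropy(varied))      # higher: all tokens distinct

dataset = [b"record-0", b"record-1", b"record-2"]
commitment = merkle_root(dataset)
tampered = merkle_root([b"record-0", b"record-X", b"record-2"])
print(commitment != tampered)     # any change to a record changes the root
```

A seller could publish the root before training as a commitment to the dataset; a later audit recomputes it from the delivered records and compares.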
