GT · LG · Apr 7, 2025

Do Data Valuations Make Good Data Prices?

arXiv:2504.05563v2 · 2 citations · h-index: 28
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of compensating data owners fairly and efficiently in AI data markets, which is crucial for enabling robust data sharing and model development. Its contribution is incremental, however, in that it applies existing mechanism design principles to a new context.

The paper tackles the problem of designing payments for data contributors in markets for large language models, showing that popular valuation methods like Leave-One-Out and Data Shapley lead to inefficient outcomes by failing to ensure truthful cost reporting, and proposes adapting Myerson and VCG payment rules to achieve incentive compatibility and market efficiency.
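To make the two valuation methods named above concrete, here is a minimal sketch of Leave-One-Out and Data Shapley on a toy problem. The three sellers, their quality scores, and the diminishing-returns utility `value` are hypothetical choices for illustration, not the paper's setup. Both valuations are computed from the buyer's utility alone, so in this sketch the sellers' costs never enter.

```python
from itertools import permutations
from math import prod

# Hypothetical toy data: per-seller "quality" scores (an assumption).
QUALITY = {"a": 0.5, "b": 0.4, "c": 0.1}


def value(subset):
    """Hypothetical buyer utility with diminishing returns in the data used."""
    return 10.0 * (1.0 - prod(1.0 - QUALITY[j] for j in subset))


def leave_one_out(value_fn, sellers):
    """LOO valuation: utility lost when a single seller is removed."""
    full = frozenset(sellers)
    return {i: value_fn(full) - value_fn(full - {i}) for i in sellers}


def data_shapley(value_fn, sellers):
    """Exact Shapley valuation: marginal contribution averaged over all
    orderings. Brute force, so only practical for a handful of sellers."""
    sellers = list(sellers)
    totals = {i: 0.0 for i in sellers}
    orderings = list(permutations(sellers))
    for order in orderings:
        seen = frozenset()
        for i in order:
            totals[i] += value_fn(seen | {i}) - value_fn(seen)
            seen = seen | {i}
    return {i: totals[i] / len(orderings) for i in sellers}


loo = leave_one_out(value, QUALITY)
shap = data_shapley(value, QUALITY)
print("LOO:", loo)
print("Shapley:", shap)
```

The Shapley values sum to the utility of the full dataset, while the LOO values generally do not. Exact enumeration grows factorially with the number of sellers, so a larger market would need sampling-based approximations.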

As large language models increasingly rely on external data sources, compensating data contributors has become a central concern. But how should these payments be devised? We revisit data valuations from a $\textit{market-design perspective}$ where payments serve to compensate data owners for the $\textit{private}$ heterogeneous costs they incur for collecting and sharing data. We show that popular valuation methods, such as Leave-One-Out and Data Shapley, make for poor payments. They fail to ensure truthful reporting of the costs, leading to $\textit{inefficient market}$ outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We show that Myerson payment is the minimal truthful mechanism, optimal from the buyer's perspective. Additionally, we identify a condition under which both data buyers and sellers are utility-satisfied, and the market achieves efficiency. Our findings highlight the importance of incorporating incentive compatibility into data valuation design, paving the way for more robust and efficient data markets. Our data market framework is readily applicable to real-world scenarios. We illustrate this with simulations of contributor compensation in an LLM-based retrieval-augmented generation (RAG) marketplace tasked with challenging medical question answering.
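The idea of a truthful payment can be illustrated with a threshold (critical-cost) rule: each selected seller is paid the highest cost it could have reported and still been selected. The sketch below is not the paper's implementation. It assumes a buyer who selects the seller set that maximizes utility minus reported costs, nonnegative costs, and a hypothetical three-seller utility `value`. All names and numbers are illustrative.

```python
from itertools import combinations
from math import prod

# Hypothetical toy market: quality scores and privately known costs.
QUALITY = {"a": 0.5, "b": 0.4, "c": 0.1}
COSTS = {"a": 2.0, "b": 1.0, "c": 3.0}


def value(subset):
    """Hypothetical buyer utility with diminishing returns."""
    return 10.0 * (1.0 - prod(1.0 - QUALITY[j] for j in subset))


def best_welfare(value_fn, costs, allowed):
    """Brute-force search for the subset of `allowed` that maximizes
    value minus reported costs. Ties favor the set found first."""
    best_set, best = frozenset(), value_fn(frozenset())
    for r in range(1, len(allowed) + 1):
        for combo in combinations(sorted(allowed), r):
            s = frozenset(combo)
            w = value_fn(s) - sum(costs[j] for j in s)
            if w > best:
                best_set, best = s, w
    return best_set, best


def threshold_payments(value_fn, reported):
    """Pay each selected seller its critical cost: the largest report
    at which the welfare-maximizing allocation would still select it."""
    sellers = set(reported)
    chosen, _ = best_welfare(value_fn, reported, sellers)
    payments = {}
    for i in chosen:
        # Best welfare if seller i were free (assumes nonnegative costs).
        _, with_i = best_welfare(value_fn, {**reported, i: 0.0}, sellers)
        # Best welfare if seller i were absent from the market.
        _, without_i = best_welfare(value_fn, reported, sellers - {i})
        payments[i] = with_i - without_i
    return payments


truthful = threshold_payments(value, COSTS)
inflated = threshold_payments(value, {**COSTS, "a": 2.9})
print("truthful reports:", truthful)
print("seller a inflates its cost:", inflated)
```

In this toy market, seller `a` receives the same payment whether it reports its true cost or a moderately inflated one, and it drops out of the allocation entirely if it reports above its threshold. The payment does not depend on the seller's own report, which is what removes the incentive to misreport.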
