LG · May 29, 2025

Daunce: Data Attribution through Uncertainty Estimation

arXiv:2505.23223v1 · 2 citations · h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses the need for scalable, accurate data attribution in machine learning, which enables applications such as data debugging and valuation. The advance is incremental, however, because it builds on existing connections between uncertainty and influence functions.

The paper tackles the problem of identifying which training examples most influence a model's predictions. It introduces Daunce, a scalable data attribution method based on uncertainty estimation. The authors report that it attributes more accurately than existing methods and that it is, to their knowledge, the first such method applied to proprietary large language models such as OpenAI's GPT models.

Training data attribution (TDA) methods aim to identify which training examples most influence a model's predictions on specific test data. By quantifying these influences, TDA supports critical applications such as data debugging, curation, and valuation. Gradient-based TDA methods rely on gradients and second-order information, limiting their applicability at scale. While recent random projection-based methods improve scalability, they often suffer from degraded attribution accuracy. Motivated by connections between uncertainty and influence functions, we introduce Daunce, a simple yet effective data attribution approach through uncertainty estimation. Our method operates by fine-tuning a collection of perturbed models and computing the covariance of per-example losses across these models as the attribution score. Daunce is scalable to large language models (LLMs) and achieves more accurate attribution than existing TDA methods. We validate Daunce on tasks ranging from vision tasks to LLM fine-tuning, and further demonstrate its compatibility with black-box model access. Applied to OpenAI's GPT models, our method achieves, to our knowledge, the first instance of data attribution on proprietary LLMs.
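The covariance step described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `loss_covariance_scores`, the array layout, and the unbiased (K - 1) normalization are assumptions made for illustration. The sketch also assumes the perturbed models have already been fine-tuned and evaluated, so that the per-example losses are available as matrices. The abstract does not say how the perturbations are generated or how the score is signed or scaled, so those details are left out.

```python
import numpy as np


def loss_covariance_scores(train_losses, test_losses):
    """Covariance of per-example losses across an ensemble of perturbed models.

    Hypothetical helper, not taken from the paper.

    train_losses: array of shape (K, N), loss of model k on training example i.
    test_losses:  array of shape (K, M), loss of model k on test example j.
    Returns an (N, M) matrix whose entry (i, j) is the sample covariance,
    taken over the K models, between the two loss values.
    """
    train_losses = np.asarray(train_losses, dtype=float)
    test_losses = np.asarray(test_losses, dtype=float)

    if train_losses.ndim != 2 or test_losses.ndim != 2:
        raise ValueError("both inputs must be 2-D: (models, examples)")
    if train_losses.shape[0] != test_losses.shape[0]:
        raise ValueError("both inputs must cover the same K models")

    num_models = train_losses.shape[0]
    if num_models < 2:
        raise ValueError("need at least two models to estimate a covariance")

    # Center each example's losses around its mean over the ensemble.
    train_centered = train_losses - train_losses.mean(axis=0, keepdims=True)
    test_centered = test_losses - test_losses.mean(axis=0, keepdims=True)

    # Cross-covariance between every training column and every test column.
    # Dividing by K - 1 is an assumed normalization (unbiased estimate).
    return train_centered.T @ test_centered / (num_models - 1)


# Toy example: 3 perturbed models, 2 training examples, 1 test example.
toy_train = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
toy_test = [[2.0], [4.0], [6.0]]
scores = loss_covariance_scores(toy_train, toy_test)
# scores[0, 0] is 2.0: the first training loss moves together with the test loss.
# scores[1, 0] is 0.0: the second training loss is constant across models.
```

Because this computation needs only loss values, never gradients or model weights, it is consistent with the black-box setting the abstract mentions, provided the losses of each fine-tuned model can be queried.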

