Imperfect Influence, Preserved Rankings: A Theory of TRAK for Data Attribution
This work provides theoretical grounding for a widely used tool for interpreting AI models, though its contribution is incremental: it analyzes an existing method rather than introducing a new technique.
The paper addresses the lack of theoretical understanding of the TRAK algorithm for data attribution in AI models. It shows that although TRAK's approximations introduce significant errors, the estimated influence remains highly correlated with the original influence and so largely preserves the relative ranking of data points.
Data attribution, tracing a model's prediction back to specific training data, is an important tool for interpreting sophisticated AI models. The widely used TRAK algorithm addresses this challenge by first approximating the underlying model with a kernel machine and then leveraging techniques developed for approximating the leave-one-out (ALO) risk. Despite its strong empirical performance, the theoretical conditions under which the TRAK approximations are accurate as well as the regimes in which they break down remain largely unexplored. In this paper, we provide a theoretical analysis of the TRAK algorithm, characterizing its performance and quantifying the errors introduced by the approximations on which the method relies. We show that although the approximations incur significant errors, TRAK's estimated influence remains highly correlated with the original influence and therefore largely preserves the relative ranking of data points. We corroborate our theoretical results through extensive simulations and empirical studies.
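The idea of an influence estimate that is numerically off but still ranks training points well can be illustrated on a toy problem. The sketch below is not the TRAK implementation and not the paper's experimental setup; it assumes a plain ridge regression model, where the exact leave-one-out influence on a test prediction has a closed form, and compares it with a cruder score computed from randomly projected features with the leverage correction dropped. All function names, the projection scheme, and the problem sizes are hypothetical choices made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge regression weights: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def loo_influence_refit(X, y, x_test, lam):
    """Exact influence by brute force: refit without point i and
    record the change in the test prediction."""
    theta = ridge_fit(X, y, lam)
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        theta_i = ridge_fit(X[keep], y[keep], lam)
        out[i] = x_test @ (theta - theta_i)
    return out


def loo_influence_closed_form(X, y, x_test, lam):
    """Same quantity via the Sherman-Morrison identity:
    x_test^T H^{-1} x_i * r_i / (1 - h_i), with H = X^T X + lam I."""
    d = X.shape[1]
    H_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    theta = H_inv @ (X.T @ y)
    resid = y - X @ theta                         # r_i
    lev = np.einsum("ij,jk,ik->i", X, H_inv, X)   # h_i
    return (X @ H_inv @ x_test) * resid / (1.0 - lev)


def projected_influence(X, y, x_test, lam, k, rng):
    """Crude approximation (an assumption of this sketch): project the
    features to k dimensions with a Gaussian matrix and drop the
    1 / (1 - h_i) correction. Residuals come from the full model."""
    d = X.shape[1]
    P = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
    Z, z_test = X @ P, x_test @ P
    resid = y - X @ ridge_fit(X, y, lam)
    G_inv = np.linalg.inv(Z.T @ Z + lam * np.eye(k))
    return (Z @ G_inv @ z_test) * resid


def spearman(a, b):
    """Spearman rank correlation (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]


rng = np.random.default_rng(0)
n, d, k, lam = 200, 50, 20, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
x_test = rng.normal(size=d)

exact = loo_influence_closed_form(X, y, x_test, lam)
approx = projected_influence(X, y, x_test, lam, k, rng)

print("max abs error :", np.max(np.abs(exact - approx)))
print("rank corr     :", spearman(exact, approx))
```

Comparing the two printed quantities shows how pointwise error and rank agreement can be measured separately. How closely this toy behavior mirrors TRAK on real models depends on the conditions the paper analyzes, which this sketch does not reproduce.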