AI · May 7

DataDignity: Training Data Attribution for Large Language Models

arXiv:2605.05687 · 15.3
Predicted impact: top 46% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For auditors and developers needing to trace knowledge in LLM outputs, this work provides a more robust evaluation framework and a method that significantly improves provenance ranking, especially under adversarial query conditions.

The authors tackle the problem of pinpoint provenance—identifying which source document supports a language model's response—and introduce FakeWiki, a controlled benchmark that weakens lexical shortcuts. Their proposed ScoringModel improves mean Recall@10 from 35.0 to 52.2 across nine LLMs and five query conditions, outperforming baselines in 41 of 45 settings.
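The headline metric can be made concrete with a small sketch. It assumes each query has exactly one ground-truth source document, so Recall@k reduces to the percentage of queries whose gold document appears in the top k ranked candidates. The function name `recall_at_k`, the query and document ids, and the toy rankings are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 10) -> float:
    """Percentage of queries whose gold document appears in the top-k ranking.

    Assumes one gold document per query (an assumption of this sketch).
    """
    if not rankings:
        raise ValueError("no queries to score")
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return 100.0 * hits / len(rankings)


# Toy data: three queries with hypothetical document ids, best match first.
rankings = {
    "q1": ["d3", "d7", "d1"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d8", "d5", "d6"],
}
gold = {"q1": "d7", "q2": "d4", "q3": "d0"}

score_at_2 = recall_at_k(rankings, gold, k=2)  # only q1 is a hit
score_at_3 = recall_at_k(rankings, gold, k=3)  # q1 and q2 are hits
```

The 45 settings correspond to nine LLMs times five query conditions; a per-cell score like the one above would be computed once for each model and condition pair before averaging.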

Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
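The contrastive objective named in the abstract can be illustrated with a minimal sketch of a per-example InfoNCE loss. It is not the authors' implementation: the cosine scoring, the temperature of 0.1, the two-dimensional toy vectors, and the names `cosine` and `info_nce` are all assumptions made for illustration. The negatives list stands in for the three kinds the abstract mentions (in-batch, retrieval-mined, and anti-document).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(response: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE for one response: -log softmax score of the true source document.

    The temperature value is an assumption of this sketch.
    """
    logits = [cosine(response, positive) / temperature]
    logits += [cosine(response, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings in a shared 2-D space (hypothetical values).
response = [1.0, 0.0]
source_doc = [0.9, 0.1]   # true supporting document
in_batch_neg = [0.0, 1.0]  # unrelated document from the same batch
mined_neg = [0.5, 0.5]     # retrieval-mined negative
anti_doc = [0.8, 0.3]      # topically close, answer-critical facts removed

easy_loss = info_nce(response, source_doc, [in_batch_neg])
hard_loss = info_nce(response, source_doc, [in_batch_neg, mined_neg, anti_doc])
```

In this toy setup the loss is larger when the anti-document is among the negatives, because its embedding lies close to the response. That matches the intuition for including hard negatives: they force the ranker to separate true answer support from topical resemblance.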

