LG · AI · Jul 18, 2025

Influence Functions for Preference Dataset Pruning

arXiv:2507.14344v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses dataset quality issues for researchers and practitioners in reinforcement learning fine-tuning, but it is incremental as it builds on existing influence function methods.

The paper tackles the problem of noisy human preference datasets in language model fine-tuning by using influence function approximations to prune harmful training examples, resulting in a 1.5% accuracy uplift after removing 10% of examples.

Language models are commonly fine-tuned via reinforcement learning to alter their behavior or elicit new capabilities. Datasets used for these purposes, and particularly human preference datasets, are often noisy. The relatively small size of post-training datasets, combined with parameter-efficient fine-tuning methods, enables the use of influence function approximations to detect and prune training examples that are harmful to performance on a validation set. In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient approximated influence functions can be used to filter datasets. In our experiments, influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples. We also show that gradient similarity outperforms influence functions for detecting helpful training examples. This suggests that local curvature is important for detecting harmful training examples, but less so for identifying helpful examples.
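To make the idea concrete, here is a minimal sketch of conjugate-gradient influence scoring. It is not the paper's implementation: it assumes a plain logistic-regression model as a stand-in for the reward model, a damped Hessian, and illustrative names (`influence_scores`, `gradient_similarity`, `prune_most_harmful`, `damping`) that do not come from the source. The score for training example i is g_i^T H^{-1} g_val, which under the standard first-order approximation is proportional to the change in validation loss if example i were removed, so the most negative scores mark the examples estimated to be most harmful. The curvature-free baseline is taken here to be a plain dot product of gradients, which is also an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_per_example(w, X, y):
    # Per-example gradient of the logistic loss: (sigmoid(x.w) - y) * x
    return (sigmoid(X @ w) - y)[:, None] * X

def hvp(w, X, v, damping):
    # Hessian-vector product of the mean logistic loss, plus damping * v,
    # computed without materialising the Hessian.
    p = sigmoid(X @ w)
    return X.T @ (p * (1.0 - p) * (X @ v)) / len(X) + damping * v

def conjugate_gradient(matvec, b, iters=100, tol=1e-12):
    # Solve A x = b for symmetric positive-definite A given only v -> A v.
    x = np.zeros_like(b, dtype=float)
    r = np.array(b, dtype=float)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def influence_scores(w, X_train, y_train, X_val, y_val, damping=1e-2):
    # One CG solve for s = H^{-1} g_val, then one dot product per example.
    g_val = grad_per_example(w, X_val, y_val).mean(axis=0)
    s = conjugate_gradient(lambda v: hvp(w, X_train, v, damping), g_val)
    return grad_per_example(w, X_train, y_train) @ s

def gradient_similarity(w, X_train, y_train, X_val, y_val):
    # Curvature-free baseline (assumed form): g_i . g_val, no Hessian inverse.
    g_val = grad_per_example(w, X_val, y_val).mean(axis=0)
    return grad_per_example(w, X_train, y_train) @ g_val

def prune_most_harmful(scores, fraction):
    # Return indices to keep after dropping the lowest-scoring fraction.
    k = int(len(scores) * fraction)
    return np.argsort(scores)[k:]
```

In this sketch the cost is dominated by the Hessian-vector products inside the single CG solve, which is one reason the approach is more practical when the number of trainable parameters is small, as with parameter-efficient fine-tuning. Calling `prune_most_harmful(scores, 0.1)` mirrors the 10% removal rate quoted above, after which the model would be retrained on the kept indices.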
