On Support Samples of Next Word Prediction
This work addresses interpretability for language model researchers, offering incremental insights into how training data influences model behavior.
The paper addresses the challenge of understanding language model decisions by investigating data-centric interpretability in next-word prediction. It finds that being a support sample (one that promotes or deters a prediction) is an intrinsic property that is predictable before training, and that non-support samples, though less influential on direct predictions, help prevent overfitting and shape generalization, with their importance increasing in deeper layers.
Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates \emph{data-centric interpretability} in language models, focusing on the next-word prediction task. Using the representer theorem, we identify two types of \emph{support samples}: those that promote specific predictions and those that deter them. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation formation. These insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.
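
To make the role of the representer theorem concrete, the following is a sketch of the standard representer-style decomposition for an $\ell_2$-regularized linear output layer. The abstract does not state the exact formulation used, so the symbols here ($f_i$, $W$, $\lambda$, $\alpha_i$, $\Phi$) and the setup itself are assumptions made for illustration. Let $f_i$ be the final hidden feature of training sample $x_i$, let $\Phi(x_i) = W f_i$ be its logits over the vocabulary, and suppose $W$ is a stationary point of $\frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i) + \lambda \lVert W \rVert_F^2$. Setting the gradient with respect to $W$ to zero gives
\[
\frac{1}{n}\sum_{i=1}^{n} \frac{\partial L}{\partial \Phi(x_i)}\, f_i^{\top} + 2\lambda W = 0
\quad\Longrightarrow\quad
W = \sum_{i=1}^{n} \alpha_i f_i^{\top},
\qquad
\alpha_i = -\frac{1}{2\lambda n}\,\frac{\partial L}{\partial \Phi(x_i)},
\]
so that the logits for a test context $x_t$ with feature $f_t$ decompose over the training samples:
\[
\Phi(x_t) = W f_t = \sum_{i=1}^{n} \alpha_i \,\bigl(f_i^{\top} f_t\bigr).
\]
For the softmax cross-entropy loss, $\partial L / \partial \Phi(x_i) = p_i - y_i$, where $p_i$ is the predicted distribution and $y_i$ the one-hot target, so the entry of $\alpha_i$ for the target word is non-negative and the entries for all other words are non-positive. Under this reading, and again as an assumption rather than the paper's stated definition, a training sample whose term $\alpha_{i,j}\,(f_i^{\top} f_t)$ is non-negligible acts as a support sample for word $j$: a positive term promotes the prediction, a negative term deters it, and samples with $\alpha_i \approx 0$ are non-support samples.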