CL | LG | Nov 15, 2023

Data Similarity is Not Enough to Explain Language Model Performance

arXiv:2311.09006v1134 citations | h-index: 14
Originality: Incremental advance
AI Analysis

This result challenges the common machine-learning assumption that similarity between pretraining data and task data drives model performance; the relationship appears to be more complex.

The study tested whether similarity between pretraining data and downstream task data correlates with language model performance. The correlation holds for multilingual datasets, but on other benchmarks the similarity metrics often correlate neither with accuracy nor with each other.

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's pretraining data is assumed to be easier for that model. We test whether distributional and example-specific similarity measures (embedding-, token- and model-based) correlate with language model performance through a large-scale comparison of the Pile and C4 pretraining datasets with downstream benchmarks. Similarity correlates with performance for multilingual datasets, but in other benchmarks, we surprisingly find that similarity metrics are not correlated with accuracy or even each other. This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
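
As a rough illustration of what a token-based, distributional similarity check of this kind could look like, the sketch below scores each task by the Jensen-Shannon divergence between its unigram distribution and that of a pretraining sample, then computes a Spearman rank correlation against per-task accuracy. This is not the authors' implementation: whitespace tokenization, the choice of Jensen-Shannon divergence, and every function name here are assumptions made for illustration, and the texts and accuracies in the demo are invented placeholders.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Unigram distribution over lowercased, whitespace-split tokens (assumed tokenizer)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        return sum(prob * math.log2(prob / m[t]) for t, prob in a.items() if prob > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy demo: all texts and accuracies below are invented placeholders.
pretraining_sample = ["the cat sat on the mat", "stocks rose as markets opened"]
tasks = {
    "task_a": (["the cat sat on a chair"], 0.81),
    "task_b": (["markets fell as stocks dropped"], 0.64),
    "task_c": (["integrate the function over the interval"], 0.70),
}
pretrain_dist = unigram_dist(pretraining_sample)
similarity = [1.0 - js_divergence(pretrain_dist, unigram_dist(texts)) for texts, _ in tasks.values()]
accuracy = [acc for _, acc in tasks.values()]
rho = spearman(similarity, accuracy)
```

A rank correlation is used here because it only asks whether more similar tasks tend to score higher, without assuming a linear relationship. The same `spearman` helper could also compare two different similarity metrics with each other, which is the second kind of comparison the abstract describes.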

Code Implementations: 1 repo