Correlated Errors in Large Language Models
The paper reveals a potential algorithmic monoculture problem in LLMs that affects tasks such as hiring and evaluation. Its contribution is incremental: it provides empirical evidence for error correlation.
The study examines error correlation across large language models (LLMs) through a large-scale evaluation of over 350 models. On one leaderboard dataset, models agree 60% of the time when both err. Larger, more accurate models show highly correlated errors even when they have distinct architectures and providers.
Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors -- on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring -- the latter reflecting theoretical predictions regarding algorithmic monoculture.
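As a minimal sketch of the kind of statistic quoted above (agreement when both models err), the following assumes a multiple-choice setting in which each model outputs one label per question. The function name `agreement_when_both_wrong` and the toy data are hypothetical, and the paper's exact metric definition may differ.

```python
from typing import Optional, Sequence


def agreement_when_both_wrong(
    preds_a: Sequence[str],
    preds_b: Sequence[str],
    gold: Sequence[str],
) -> Optional[float]:
    """Among items that BOTH models answer incorrectly, return the fraction
    on which they give the same (wrong) answer.

    Returns None when there is no item that both models get wrong.
    Illustrative helper only; not taken from the paper.
    """
    # Keep only the items where both predictions differ from the gold label.
    both_wrong = [
        (a, b) for a, b, g in zip(preds_a, preds_b, gold) if a != g and b != g
    ]
    if not both_wrong:
        return None
    # Count how often the two wrong answers coincide.
    same = sum(1 for a, b in both_wrong if a == b)
    return same / len(both_wrong)


# Toy data (invented for illustration): six multiple-choice questions.
gold = ["A", "B", "C", "D", "A", "B"]
model_1 = ["A", "C", "D", "D", "B", "A"]
model_2 = ["B", "C", "D", "A", "B", "C"]

# Both models err on four questions and pick the same wrong option on three.
print(agreement_when_both_wrong(model_1, model_2, gold))  # 0.75
```

For context, if two models chose independently and uniformly among the k - 1 incorrect options of a k-way question, they would agree on only 1/(k - 1) of the questions that both get wrong, so a value well above that baseline indicates correlated errors.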