William F. Bradley

h-index: 1
2 papers

CL, Nov 3, 2024
LLMs and the Madness of Crowds

William F. Bradley

We investigate the patterns of incorrect answers produced by large language models (LLMs) during evaluation. These errors exhibit highly non-intuitive behaviors unique to each model. By analyzing these patterns, we measure the similarities between LLMs and construct a taxonomy that categorizes them based on their error correlations. Our findings reveal that the incorrect responses are not randomly distributed but systematically correlated across models, providing new insights into the underlying structures and relationships among LLMs.

CL, Nov 3, 2024
Enhancing LLM Evaluations: The Garbling Trick

William F. Bradley

As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative abilities of these models, particularly highlighting the differences between base LLMs and more recent "reasoning" models.