PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks
PhyloLM gives researchers and developers a new tool for assessing LLM relationships and predicting performance without needing transparent training data, though the contribution is incremental in that it applies existing phylogenetic methods to a new domain.
The authors tackled the problem of understanding relationships and predicting performance among Large Language Models (LLMs) by introducing PhyloLM, a method that adapts phylogenetic algorithms to analyze LLM output similarity. The result was a phylogenetic distance metric that successfully captured known relationships across 156 models and predicted performance in benchmarks, offering a cost-effective tool for evaluating LLM capabilities.
This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' output. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time- and cost-effective estimation of LLM capabilities. To sum up, by translating population genetics concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
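
The abstract does not spell out how the distance is computed from model outputs. The following is a minimal sketch only, assuming a Nei-style genetic distance from population genetics in which each prompt plays the role of a gene and each sampled continuation token plays the role of an allele. That mapping, the function names `token_frequencies` and `nei_distance`, and the toy data are all assumptions made for illustration, not details taken from the text above.

```python
import math
from collections import Counter


def token_frequencies(completions):
    """Turn {prompt: [sampled tokens]} into {prompt: {token: relative frequency}}.

    Hypothetical helper: each prompt is treated as a "gene" and each sampled
    token as an "allele" (an assumed analogy, not stated in the abstract).
    """
    freqs = {}
    for prompt, tokens in completions.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        freqs[prompt] = {tok: n / total for tok, n in counts.items()}
    return freqs


def nei_distance(freqs_a, freqs_b):
    """Nei-style distance: -log of the normalized overlap of token frequencies.

    Only prompts answered by both models are compared. Returns math.inf when
    the two models never produce a common token on any shared prompt.
    """
    shared = freqs_a.keys() & freqs_b.keys()
    cross = sum(
        p_a * freqs_b[g].get(tok, 0.0)
        for g in shared
        for tok, p_a in freqs_a[g].items()
    )
    self_a = sum(p * p for g in shared for p in freqs_a[g].values())
    self_b = sum(p * p for g in shared for p in freqs_b[g].values())
    if cross == 0.0:
        return math.inf
    return -math.log(cross / math.sqrt(self_a * self_b))


# Toy data (invented for illustration): tokens sampled from three
# hypothetical models on two prompts.
model_a = token_frequencies({"p1": ["the", "the", "a", "the"], "p2": ["x", "y"]})
model_b = token_frequencies({"p1": ["the", "a", "a", "a"], "p2": ["x", "x"]})
model_c = token_frequencies({"p1": ["zzz"], "p2": ["qqq"]})

d_ab = nei_distance(model_a, model_b)  # finite and positive: partial overlap
d_ac = nei_distance(model_a, model_c)  # infinite: no shared tokens
```

Under these assumptions, a model compared with itself gets a distance of zero, and models whose outputs diverge get larger distances. The pairwise distance matrix could then be passed to any standard tree-building or hierarchical clustering routine to draw a dendrogram.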