The Oracle Has Spoken: A Multi-Aspect Evaluation of Dialogue in Pythia
This work addresses the need for fine-grained evaluation of dialogue in LLMs, but its contribution is incremental: it builds on existing methods and datasets rather than introducing new paradigms.
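The study evaluated how model size and supervised fine-tuning affect specific dialogue abilities in Pythia models. It found that fine-tuning quickly saturates scores for all but the smallest models tested, whereas increased model size brings only mild improvements on most metrics. Because many metrics follow very similar trends, especially those rooted in the same evaluator model, the study also raised concerns about whether each metric reliably measures its intended dimension.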
Dialogue is one of the landmark abilities of large language models (LLMs). Despite its ubiquity, few studies actually distinguish specific ingredients underpinning dialogue behavior emerging during post-training. We employ a comprehensive suite of model-based metrics, each targeting a distinct fine-grained aspect of dialogue, motivated by linguistic theory. We evaluate how the performance of pre-trained Pythia models changes with respect to each of those dimensions, depending on model size and as a result of supervised fine-tuning on conversational datasets. We observe only a mild impact of raw model size on most metrics, whereas fine-tuning quickly saturates the scores for all but the smallest models tested. Somewhat contrary to our expectations, many metrics show very similar trends, especially if they are all rooted in the same evaluator model, which raises the question of their reliability in measuring a specific dimension. To that end, we conduct additional analyses of score distributions, metric correlations, and term frequencies in generated responses to help explain our observations.
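The metric-correlation analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of Spearman rank correlation, the metric names (`coherence`, `engagement`, `fluency`), and the score values are all assumptions made for illustration. The idea is that if two metrics rank the same responses almost identically, they may not be measuring distinct dimensions.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pairwise_spearman(scores):
    """Map each unordered pair of metric names to its Spearman correlation."""
    return {
        (a, b): spearman(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical per-response scores; names and values are illustrative only.
scores = {
    "coherence": [0.2, 0.5, 0.6, 0.9, 0.95],
    "engagement": [0.1, 0.4, 0.7, 0.8, 0.99],
    "fluency": [0.9, 0.3, 0.8, 0.2, 0.5],
}

for pair, rho in pairwise_spearman(scores).items():
    print(pair, round(rho, 3))
```

In this toy data, `coherence` and `engagement` order the responses identically, which is the kind of pattern that would prompt a closer look at whether two metrics are truly independent. Rank correlation is used here because it is insensitive to each metric's scale, but that choice is an assumption of the sketch.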