Models Know Models Best: Evaluation via Model-Preferred Formats
The work addresses evaluation inconsistencies that matter to LLM researchers, though the contribution is incremental because it builds on existing format analysis.
The paper tackles the inconsistent performance of Large Language Models (LLMs) on multiple-choice tasks that arises from differences between symbol-based and cloze-style evaluation formats. It introduces a dynamic format-alignment strategy driven by model-preference signals and reports substantial, consistent improvements in zero-shot accuracy across benchmarks.
Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
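The difference between the two formats, and the idea of choosing one per instance, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `logprob` is a hypothetical callable returning log p(continuation | prompt) for some model, `prefers_symbol` is a hypothetical stand-in for the lightweight classifier, and the prompt template and function names are invented for the example.

```python
from string import ascii_uppercase
from typing import Callable, Sequence

# Hypothetical scorer: returns log p(continuation | prompt) under some LLM.
LogProbFn = Callable[[str, str], float]
# Hypothetical router: True means "use the symbol-based format for this item".
# In the paper this role is played by a trained classifier; here it is a stub.
RouterFn = Callable[[str, Sequence[str]], bool]


def _argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first index wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def cloze_predict(question: str, options: Sequence[str], logprob: LogProbFn) -> int:
    """Cloze style: score each option's text as a continuation of the question.

    No length normalization is applied in this sketch.
    """
    return _argmax([logprob(question, " " + opt) for opt in options])


def symbol_predict(question: str, options: Sequence[str], logprob: LogProbFn) -> int:
    """Symbol style: list the options with labels, then score only the labels."""
    labels = ascii_uppercase[: len(options)]
    listing = "\n".join(f"{lab}. {opt}" for lab, opt in zip(labels, options))
    prompt = f"{question}\n{listing}\nAnswer:"  # assumed template
    return _argmax([logprob(prompt, " " + lab) for lab in labels])


def aligned_predict(
    question: str,
    options: Sequence[str],
    logprob: LogProbFn,
    prefers_symbol: RouterFn,
) -> int:
    """Pick the evaluation format per instance, then predict with it."""
    chooser = symbol_predict if prefers_symbol(question, options) else cloze_predict
    return chooser(question, options, logprob)
```

Passing the scorer and the router as callables keeps the sketch independent of any particular model or classifier: the same item can be evaluated under both formats, and the router only decides which of the two predictions is used.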