HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
HumorRank provides a scalable and interpretable benchmark for researchers and developers working on humor generation in LLMs, though the contribution is incremental because it builds on existing evaluation methods.
The paper tackled the challenge of evaluating humor in large language models by introducing HumorRank, a tournament-based framework that produced statistically grounded rankings of nine models, showing that humor quality depends on comedic mechanisms rather than just model scale.
Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.
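To make the aggregation step concrete, the following is a minimal sketch of how Bradley-Terry strengths can be fit from pairwise win counts. It is not the paper's implementation: the fitting procedure here is the standard minorization-maximization (MM) update, chosen for brevity, and the Adaptive Swiss pairing and GTVH-based judging are not shown. The model names "A", "B", "C", the win counts, and the function name `fit_bradley_terry` are all hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins maps (winner, loser) -> number of pairwise judgments won.
    Assumes every model has at least one win and one loss; otherwise
    the maximum likelihood estimate is not finite.
    """
    players = sorted({p for pair in wins for p in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # comparisons per unordered pair
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        games[frozenset((winner, loser))] += count

    strength = {p: 1.0 for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            denom = 0.0
            for j in players:
                if i == j:
                    continue
                n_ij = games.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only defined up to scale, so fix their sum.
        scale = len(players) / sum(updated.values())
        updated = {p: v * scale for p, v in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in players)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical toy data: three models, ten judgments per pair.
toy_wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strengths = fit_bradley_terry(toy_wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under the Bradley-Terry model, the probability that model i beats model j is strength[i] / (strength[i] + strength[j]), so a single fitted scale orders all models even when not every pair has been compared equally often. That property is what makes the model a natural fit for tournament formats in which pairings are chosen adaptively.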