LG | AI | CL | Jun 15, 2024

Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

arXiv:2406.10522v2 | 17 citations
AI Analysis

This work addresses the challenge of developing and benchmarking AI for creative tasks such as humor generation, which matters for advancing multimodal AI applications. Its contribution is incremental in that it builds on existing preference-based fine-tuning methods.

The authors tackle the problem of evaluating and improving AI-generated humorous captions by creating a large-scale multimodal preference dataset: over 250 million human ratings on more than 2.2 million captions from The New Yorker's cartoon caption contest. They find that current fine-tuning methods such as RLHF and DPO have limitations on this task, and that state-of-the-art models such as GPT4 and Claude underperform top human contestants in humor generation.

We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.
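To make the idea of preference data concrete, here is a minimal sketch of how per-caption crowd ratings could be turned into (chosen, rejected) pairs of the kind that preference-based methods such as DPO consume. It is not the authors' pipeline: the `RatedCaption` fields, the `build_preference_pairs` helper, and the `min_gap` and `min_ratings` thresholds are all hypothetical, and the sample captions and scores are invented for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class RatedCaption(NamedTuple):
    # Hypothetical record layout; the released dataset may use other fields.
    text: str
    mean_rating: float  # assumed: average crowd score, higher means funnier
    num_ratings: int    # assumed: number of ratings behind that average


def build_preference_pairs(captions, min_gap=0.5, min_ratings=10):
    """Return (chosen, rejected) caption-text pairs for a single cartoon.

    A pair is kept only if both captions have at least `min_ratings`
    ratings and their mean ratings differ by at least `min_gap`.
    """
    reliable = [c for c in captions if c.num_ratings >= min_ratings]
    pairs = []
    for a, b in combinations(reliable, 2):
        if abs(a.mean_rating - b.mean_rating) < min_gap:
            continue  # too close to call; skip ambiguous comparisons
        chosen, rejected = (a, b) if a.mean_rating > b.mean_rating else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


# Invented example data for one cartoon.
captions = [
    RatedCaption("caption A", 1.9, 400),
    RatedCaption("caption B", 1.2, 350),
    RatedCaption("caption C", 1.8, 5),    # dropped: too few ratings
    RatedCaption("caption D", 1.3, 500),
]
pairs = build_preference_pairs(captions)
for chosen, rejected in pairs:
    print(f"prefer {chosen!r} over {rejected!r}")
```

Filtering on rating count and score gap is one plausible way to reduce label noise before fine-tuning; the thresholds here are arbitrary and would need tuning against the real data.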

Code Implementations: 1 repo