CL, AI | Apr 16, 2025

Evaluating the Diversity and Quality of LLM Generated Content

Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani

arXiv:2504.12522v1 | 27.753 citations | h-index: 37

Originality: Highly original

AI Analysis

The paper addresses a dilemma for applications that require diverse yet high-quality outputs, such as creative assistance or synthetic data generation. It does so by revealing a distinction between diversity of form and diversity of content that traditional metrics overlook.

The study examines the reported loss of diversity in preference-tuned LLMs by introducing a framework that measures effective semantic diversity. It finds that preference-tuned models produce greater effective semantic diversity than SFT or base models because they generate more high-quality outputs overall, and that smaller models are more parameter-efficient at generating unique content.

Recent work suggests that preference-tuning techniques--including Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO--reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds--which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models--especially those trained via RL--exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity--revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
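
The idea of "diversity among outputs that meet quality thresholds" can be illustrated with a minimal sketch. This is not the paper's metric, whose exact definition is not given on this page. It assumes each sampled output already has a numeric quality score and a label identifying its meaning (for example, a cluster ID). The function name, parameters, and the normalization by the full sampling budget are all hypothetical choices made for illustration.

```python
from typing import Hashable, Sequence


def effective_semantic_diversity(
    qualities: Sequence[float],
    semantic_ids: Sequence[Hashable],
    threshold: float,
) -> float:
    """Illustrative score: distinct acceptable meanings per sampled output.

    Assumption (not from the paper): semantic_ids[i] labels the meaning of
    output i, and outputs with quality below `threshold` contribute nothing.
    """
    if len(qualities) != len(semantic_ids):
        raise ValueError("qualities and semantic_ids must have the same length")
    if not qualities:
        return 0.0
    # Keep only the semantic labels of outputs that meet the quality threshold.
    passing = {sid for q, sid in zip(qualities, semantic_ids) if q >= threshold}
    # Normalize by the full sampling budget so low-quality samples lower the score.
    return len(passing) / len(qualities)


# Hypothetical model A: four distinct meanings, but only one output passes.
model_a = effective_semantic_diversity([0.9, 0.2, 0.1, 0.3], ["a", "b", "c", "d"], 0.5)
# Hypothetical model B: fewer distinct meanings overall, but three outputs pass.
model_b = effective_semantic_diversity([0.9, 0.8, 0.7, 0.2], ["a", "a", "b", "c"], 0.5)
print(model_a, model_b)
```

In this toy setup model B scores higher than model A even though its raw outputs are less varied, which mirrors the abstract's claim that the gain comes from generating more high-quality outputs overall rather than from more variety among them.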
