Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

arXiv:2605.2649288.1

Predicted impact: top 38% in CL · last 90 days
Originality: Incremental advance

AI Analysis

Identifies a critical failure in LLM story generation for users and developers, showing that alignment data disproportionately reduces output diversity.

LLM-generated stories exhibit extremely low lexical diversity, with 11 words appearing in 88.3% of stories across four models, likely due to small preference datasets used in alignment.

LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens rarely occur in published literature or in pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are themselves infrequent relative to the average post-training story; many post-training stories instead contain references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
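The coverage figure in the abstract can be illustrated with a minimal sketch. It rests on several assumptions: the statistic is read as the share of stories containing at least one word from the set; the word list below is only the subset named in the abstract, not the full set of 11; and the function names, tokenizer, and toy stories are hypothetical and are not the authors' code or data.

```python
import re

# Illustrative subset only: the abstract names these examples
# but does not list all 11 words.
MARKER_WORDS = {"elias", "mara", "elara", "lighthouse", "clockmaker", "librarian"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def marker_coverage(stories, markers=MARKER_WORDS):
    """Fraction of stories containing at least one marker word.

    Assumed reading of the abstract's statistic. Matching is exact,
    so a plural such as "lighthouses" would need its own entry.
    """
    if not stories:
        return 0.0
    hits = sum(1 for story in stories if tokenize(story) & markers)
    return hits / len(stories)


# Hypothetical toy stories, not samples from the paper.
toy_stories = [
    "Elias climbed the lighthouse stairs at dusk.",
    "The librarian shelved the last book.",
    "A dragon slept under the mountain.",
    "Mara wound the old clock.",
]
coverage = marker_coverage(toy_stories)
```

On the toy input, three of the four stories contain a marker word. Applying the same function to stories grouped by model would give a per-model comparison of the kind the abstract describes.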
