CL | Oct 21, 2025

Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models

Atharvan Dogra, Soumya Suvra Ghosal, Ameet Deshpande, Ashwin Kalyan, Dinesh Manocha

arXiv:2510.18454v1 | 2.7 | h-index: 14

Originality: Incremental advance

AI Analysis

The paper reveals a safety risk in LLMs used for creative applications: engagement-driven optimization can amplify bias and toxicity.

The study found that optimizing language models for humor increases harmful content, with stereotypical or toxic jokes receiving 10-21% higher humor scores and appearing up to 28% more often in outputs rated as funny.

Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Using humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. We supplement this by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores, which increase further under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human-perceived funniness judgments shows that LLM satire increases stereotypicality and, typically, toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain $10\%$ to $21\%$ in mean humor score; stereotypical jokes appear $11\%$ to $28\%$ more often among the jokes marked funny by an LLM-based metric and up to $10\%$ more often in generations perceived as funny by humans.
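The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the toy numbers are invented for illustration only, and it assumes that the "gain in mean humor score" is a relative difference between group means, that "how expected a punchline is" is measured as surprisal, and that "predictive uncertainty" is Shannon entropy. The paper may define these differently.

```python
import math

def relative_gain(harmful_scores, benign_scores):
    """Relative difference in mean humor score (assumed definition)."""
    mean_harmful = sum(harmful_scores) / len(harmful_scores)
    mean_benign = sum(benign_scores) / len(benign_scores)
    return (mean_harmful - mean_benign) / mean_benign

def surprisal_bits(probability):
    """Surprisal of an outcome in bits; lower means more expected."""
    return -math.log2(probability)

def entropy_bits(distribution):
    """Shannon entropy in bits of a discrete predictive distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

# Toy, made-up numbers; these are NOT results from the paper.
harmful = [6.0, 6.0]   # hypothetical humor scores for harmful jokes
benign = [5.0, 5.0]    # hypothetical humor scores for benign jokes
print(f"relative gain: {relative_gain(harmful, benign):.0%}")
print(f"surprisal of p=0.25: {surprisal_bits(0.25)} bits")
print(f"entropy of a fair coin: {entropy_bits([0.5, 0.5])} bits")
```

Under these assumptions, a wider predictive distribution at the punchline gives a larger `entropy_bits` value, while a punchline the model finds more expected gives a smaller `surprisal_bits` value, so the two signals can move independently.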
